Hi everybody,
Is DVFS for the memory bus really working on the Odroid XU3/4 board?
Using a simple microbenchmark that does only memory accesses, memory DVFS
seems to not be working properly:
The microbenchmark does pointer chasing by following indices in an array.
The indices are set to follow a random pattern (defeating the prefetcher),
forcing RAM accesses.
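For illustration, the core of such a pointer chase looks roughly like the
following minimal C sketch (the idea only, not the actual code from the
repository; init_cycle() and chase() are made-up names):

#include <stddef.h>
#include <stdlib.h>

/* Build a single random cycle over 0..n-1 (Sattolo's algorithm), so the
 * chase visits every slot exactly once before repeating. */
static void init_cycle(size_t *arr, size_t n)
{
        size_t i, j, tmp;

        for (i = 0; i < n; i++)
                arr[i] = i;
        for (i = n - 1; i > 0; i--) {
                j = (size_t)rand() % i;  /* j in [0, i-1], never i */
                tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
        }
}

/* Each load depends on the previous one, so the loads serialize and,
 * with an array much larger than the last-level cache, almost every
 * access goes to RAM. */
static size_t chase(const size_t *arr, size_t steps)
{
        size_t i = 0;

        while (steps--)
                i = arr[i];
        return i;  /* returned so the loop is not optimized away */
}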
git clone https://github.com/wwilly/benchmark.git \
&& cd benchmark \
&& source env.sh \
&& ./bench_build.sh \
&& bash source/scripts/test_dvfs_mem.sh
Python 3, cmake and sudo rights are required.
Results:
DVFS CPU with performance governor
mem_gov = simple_ondemand at 165000000 Hz in idle, which should be bumped
when the benchmark is running.
- on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
- on the big cluster it takes 4.76556 s to run (980.343 c per memory access).
While forcing DVFS memory bus to use performance governor,
mem_gov = performance at 825000000 Hz in idle,
- on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
- on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
The kernel used is the latest 5.7.5 stable with the default exynos_defconfig.
Cheers,
Willy
On Tue, 23 Jun 2020 at 18:47, Willy Wolff <[email protected]> wrote:
>
> Hi everybody,
>
> Is DVFS for the memory bus really working on the Odroid XU3/4 board?
> Using a simple microbenchmark that does only memory accesses, memory DVFS
> seems to not be working properly:
>
> The microbenchmark does pointer chasing by following indices in an array.
> The indices are set to follow a random pattern (defeating the prefetcher),
> forcing RAM accesses.
>
> git clone https://github.com/wwilly/benchmark.git \
> && cd benchmark \
> && source env.sh \
> && ./bench_build.sh \
> && bash source/scripts/test_dvfs_mem.sh
>
> Python 3, cmake and sudo rights are required.
>
> Results:
> DVFS CPU with performance governor
> mem_gov = simple_ondemand at 165000000 Hz in idle, which should be bumped
> when the benchmark is running.
> - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
> - on the big cluster it takes 4.76556 s to run (980.343 c per memory access).
>
> While forcing DVFS memory bus to use performance governor,
> mem_gov = performance at 825000000 Hz in idle,
> - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
> - on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
>
> The kernel used is the latest 5.7.5 stable with the default exynos_defconfig.
Thanks for the report. A few thoughts:
1. What does trans_stat say? Besides the DMC driver you can also check
all the other devfreq devices (e.g. wcore) - maybe the devfreq events
(nocp) are not properly assigned?
2. Try running the measurement for ~1 minute or longer. The counters
might have some delay (which would probably require fixing, but the
point is to narrow down the problem).
3. What do you understand by "mem_gov"? Which device is it?
Best regards,
Krzysztof
On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:
> On Tue, 23 Jun 2020 at 18:47, Willy Wolff <[email protected]> wrote:
> >
> > Hi everybody,
> >
> > Is DVFS for the memory bus really working on the Odroid XU3/4 board?
> > Using a simple microbenchmark that does only memory accesses, memory DVFS
> > seems to not be working properly:
> >
> > The microbenchmark does pointer chasing by following indices in an array.
> > The indices are set to follow a random pattern (defeating the prefetcher),
> > forcing RAM accesses.
> >
> > git clone https://github.com/wwilly/benchmark.git \
> > && cd benchmark \
> > && source env.sh \
> > && ./bench_build.sh \
> > && bash source/scripts/test_dvfs_mem.sh
> >
> > Python 3, cmake and sudo rights are required.
> >
> > Results:
> > DVFS CPU with performance governor
> > mem_gov = simple_ondemand at 165000000 Hz in idle, which should be bumped
> > when the benchmark is running.
> > - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
> > - on the big cluster it takes 4.76556 s to run (980.343 c per memory access).
> >
> > While forcing DVFS memory bus to use performance governor,
> > mem_gov = performance at 825000000 Hz in idle,
> > - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
> > - on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
> >
> > The kernel used is the latest 5.7.5 stable with the default exynos_defconfig.
>
> Thanks for the report. A few thoughts:
> 1. What does trans_stat say? Besides the DMC driver you can also check
> all the other devfreq devices (e.g. wcore) - maybe the devfreq events
> (nocp) are not properly assigned?
> 2. Try running the measurement for ~1 minute or longer. The counters
> might have some delay (which would probably require fixing, but the
> point is to narrow down the problem).
> 3. What do you understand by "mem_gov"? Which device is it?
+Cc Lukasz who was working on this.
I just ran memtester and ondemand more or less works (at least it ramps
up):
Before:
/sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
From : To
: 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
* 165000000: 0 0 0 0 0 0 0 0 1795950
206000000: 1 0 0 0 0 0 0 0 4770
275000000: 0 1 0 0 0 0 0 0 15540
413000000: 0 0 1 0 0 0 0 0 20780
543000000: 0 0 0 1 0 0 0 1 10760
633000000: 0 0 0 0 2 0 0 0 10310
728000000: 0 0 0 0 0 0 0 0 0
825000000: 0 0 0 0 0 2 0 0 25920
Total transition : 9
$ sudo memtester 1G
During memtester:
/sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
From : To
: 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
165000000: 0 0 0 0 0 0 0 1 1801490
206000000: 1 0 0 0 0 0 0 0 4770
275000000: 0 1 0 0 0 0 0 0 15540
413000000: 0 0 1 0 0 0 0 0 20780
543000000: 0 0 0 1 0 0 0 2 11090
633000000: 0 0 0 0 3 0 0 0 17210
728000000: 0 0 0 0 0 0 0 0 0
* 825000000: 0 0 0 0 0 3 0 0 169020
Total transition : 13
However, after killing memtester it stays at 633 MHz for a very long time
and does not slow down. This is indeed weird...
Best regards,
Krzysztof
Hi Krzysztof,
Thanks for looking at it.
mem_gov is /sys/class/devfreq/10c20000.memory-controller/governor
Here are some numbers after increasing the running time:
Running using simple_ondemand:
Before:
From : To
: 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
* 165000000: 0 0 0 0 0 0 0 4 4528600
206000000: 5 0 0 0 0 0 0 0 57780
275000000: 0 5 0 0 0 0 0 0 50060
413000000: 0 0 5 0 0 0 0 0 46240
543000000: 0 0 0 5 0 0 0 0 48970
633000000: 0 0 0 0 5 0 0 0 47330
728000000: 0 0 0 0 0 0 0 0 0
825000000: 0 0 0 0 0 5 0 0 331300
Total transition : 34
After:
From : To
: 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
* 165000000: 0 0 0 0 0 0 0 4 5098890
206000000: 5 0 0 0 0 0 0 0 57780
275000000: 0 5 0 0 0 0 0 0 50060
413000000: 0 0 5 0 0 0 0 0 46240
543000000: 0 0 0 5 0 0 0 0 48970
633000000: 0 0 0 0 5 0 0 0 47330
728000000: 0 0 0 0 0 0 0 0 0
825000000: 0 0 0 0 0 5 0 0 331300
Total transition : 34
With a running time of:
LITTLE => 283.699 s (680.877 c per mem access)
big => 284.47 s (975.327 c per mem access)
And when I set to the performance governor:
Before:
From : To
: 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
165000000: 0 0 0 0 0 0 0 5 5099040
206000000: 5 0 0 0 0 0 0 0 57780
275000000: 0 5 0 0 0 0 0 0 50060
413000000: 0 0 5 0 0 0 0 0 46240
543000000: 0 0 0 5 0 0 0 0 48970
633000000: 0 0 0 0 5 0 0 0 47330
728000000: 0 0 0 0 0 0 0 0 0
* 825000000: 0 0 0 0 0 5 0 0 331350
Total transition : 35
After:
From : To
: 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
165000000: 0 0 0 0 0 0 0 5 5099040
206000000: 5 0 0 0 0 0 0 0 57780
275000000: 0 5 0 0 0 0 0 0 50060
413000000: 0 0 5 0 0 0 0 0 46240
543000000: 0 0 0 5 0 0 0 0 48970
633000000: 0 0 0 0 5 0 0 0 47330
728000000: 0 0 0 0 0 0 0 0 0
* 825000000: 0 0 0 0 0 5 0 0 472980
Total transition : 35
With a running time of:
LITTLE: 68.8428 s (165.223 c per mem access)
big: 71.3268 s (244.549 c per mem access)
I see some transitions, but they do not occur during the benchmark.
I haven't dived into the code, but maybe the heuristic behind it is not
well defined? If you know how it works, that would be helpful before I
dive into it.
I ran your test as well, and indeed, it seems to work for a large chunk
of memory, and there is some delay before a transition is made (it seems
to be around 10 s). When you kill memtester, it reduces the frequency
stepwise every ~10 s.
Note that the timings shown above account for the critical path, and the
code loops on reads only; there are no writes in the critical path.
Maybe memtester does writes and the devfreq heuristic uses only write
info?
Cheers,
Willy
On 2020-06-23-21-11-29, Krzysztof Kozlowski wrote:
> On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:
> > On Tue, 23 Jun 2020 at 18:47, Willy Wolff <[email protected]> wrote:
> > >
> > > Hi everybody,
> > >
> > > Is DVFS for the memory bus really working on the Odroid XU3/4 board?
> > > Using a simple microbenchmark that does only memory accesses, memory DVFS
> > > seems to not be working properly:
> > >
> > > The microbenchmark does pointer chasing by following indices in an array.
> > > The indices are set to follow a random pattern (defeating the prefetcher),
> > > forcing RAM accesses.
> > >
> > > git clone https://github.com/wwilly/benchmark.git \
> > > && cd benchmark \
> > > && source env.sh \
> > > && ./bench_build.sh \
> > > && bash source/scripts/test_dvfs_mem.sh
> > >
> > > Python 3, cmake and sudo rights are required.
> > >
> > > Results:
> > > DVFS CPU with performance governor
> > > mem_gov = simple_ondemand at 165000000 Hz in idle, which should be bumped
> > > when the benchmark is running.
> > > - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
> > > - on the big cluster it takes 4.76556 s to run (980.343 c per memory access).
> > >
> > > While forcing DVFS memory bus to use performance governor,
> > > mem_gov = performance at 825000000 Hz in idle,
> > > - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
> > > - on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
> > >
> > > The kernel used is the latest 5.7.5 stable with the default exynos_defconfig.
> >
> > Thanks for the report. A few thoughts:
> > 1. What does trans_stat say? Besides the DMC driver you can also check
> > all the other devfreq devices (e.g. wcore) - maybe the devfreq events
> > (nocp) are not properly assigned?
> > 2. Try running the measurement for ~1 minute or longer. The counters
> > might have some delay (which would probably require fixing, but the
> > point is to narrow down the problem).
> > 3. What do you understand by "mem_gov"? Which device is it?
>
> +Cc Lukasz who was working on this.
>
> I just ran memtester and ondemand more or less works (at least it ramps
> up):
>
> Before:
> /sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
> From : To
> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> * 165000000: 0 0 0 0 0 0 0 0 1795950
> 206000000: 1 0 0 0 0 0 0 0 4770
> 275000000: 0 1 0 0 0 0 0 0 15540
> 413000000: 0 0 1 0 0 0 0 0 20780
> 543000000: 0 0 0 1 0 0 0 1 10760
> 633000000: 0 0 0 0 2 0 0 0 10310
> 728000000: 0 0 0 0 0 0 0 0 0
> 825000000: 0 0 0 0 0 2 0 0 25920
> Total transition : 9
>
>
> $ sudo memtester 1G
>
> During memtester:
> /sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
> From : To
> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> 165000000: 0 0 0 0 0 0 0 1 1801490
> 206000000: 1 0 0 0 0 0 0 0 4770
> 275000000: 0 1 0 0 0 0 0 0 15540
> 413000000: 0 0 1 0 0 0 0 0 20780
> 543000000: 0 0 0 1 0 0 0 2 11090
> 633000000: 0 0 0 0 3 0 0 0 17210
> 728000000: 0 0 0 0 0 0 0 0 0
> * 825000000: 0 0 0 0 0 3 0 0 169020
> Total transition : 13
>
> However, after killing memtester it stays at 633 MHz for a very long time
> and does not slow down. This is indeed weird...
>
> Best regards,
> Krzysztof
On Wed, Jun 24, 2020 at 10:01:17AM +0200, Willy Wolff wrote:
> Hi Krzysztof,
> Thanks for looking at it.
>
> mem_gov is /sys/class/devfreq/10c20000.memory-controller/governor
>
> Here are some numbers after increasing the running time:
>
> Running using simple_ondemand:
> Before:
> From : To
> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> * 165000000: 0 0 0 0 0 0 0 4 4528600
> 206000000: 5 0 0 0 0 0 0 0 57780
> 275000000: 0 5 0 0 0 0 0 0 50060
> 413000000: 0 0 5 0 0 0 0 0 46240
> 543000000: 0 0 0 5 0 0 0 0 48970
> 633000000: 0 0 0 0 5 0 0 0 47330
> 728000000: 0 0 0 0 0 0 0 0 0
> 825000000: 0 0 0 0 0 5 0 0 331300
> Total transition : 34
>
>
> After:
> From : To
> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> * 165000000: 0 0 0 0 0 0 0 4 5098890
> 206000000: 5 0 0 0 0 0 0 0 57780
> 275000000: 0 5 0 0 0 0 0 0 50060
> 413000000: 0 0 5 0 0 0 0 0 46240
> 543000000: 0 0 0 5 0 0 0 0 48970
> 633000000: 0 0 0 0 5 0 0 0 47330
> 728000000: 0 0 0 0 0 0 0 0 0
> 825000000: 0 0 0 0 0 5 0 0 331300
> Total transition : 34
>
> With a running time of:
> LITTLE => 283.699 s (680.877 c per mem access)
> big => 284.47 s (975.327 c per mem access)
I see there were no transitions during your memory test.
>
> And when I set to the performance governor:
> Before:
> From : To
> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> 165000000: 0 0 0 0 0 0 0 5 5099040
> 206000000: 5 0 0 0 0 0 0 0 57780
> 275000000: 0 5 0 0 0 0 0 0 50060
> 413000000: 0 0 5 0 0 0 0 0 46240
> 543000000: 0 0 0 5 0 0 0 0 48970
> 633000000: 0 0 0 0 5 0 0 0 47330
> 728000000: 0 0 0 0 0 0 0 0 0
> * 825000000: 0 0 0 0 0 5 0 0 331350
> Total transition : 35
>
> After:
> From : To
> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> 165000000: 0 0 0 0 0 0 0 5 5099040
> 206000000: 5 0 0 0 0 0 0 0 57780
> 275000000: 0 5 0 0 0 0 0 0 50060
> 413000000: 0 0 5 0 0 0 0 0 46240
> 543000000: 0 0 0 5 0 0 0 0 48970
> 633000000: 0 0 0 0 5 0 0 0 47330
> 728000000: 0 0 0 0 0 0 0 0 0
> * 825000000: 0 0 0 0 0 5 0 0 472980
> Total transition : 35
>
> With a running time of:
> LITTLE: 68.8428 s (165.223 c per mem access)
> big: 71.3268 s (244.549 c per mem access)
>
>
> I see some transitions, but they do not occur during the benchmark.
> I haven't dived into the code, but maybe the heuristic behind it is not
> well defined? If you know how it works, that would be helpful before I
> dive into it.
Sorry, I don't know that much. It seems it counts the time between
overflows of the DMC perf events and, based on this, bumps up the
frequency.
Maybe your test does not fit well into the current formula? Maybe the
formula has some drawbacks...
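For reference, the decision logic in
drivers/devfreq/governor_simpleondemand.c boils down to roughly the
following (a paraphrase with the default thresholds, not verbatim, and
details differ between kernel versions):

/* busy_time/total_time come from the device's perf counters; the
 * defaults are upthreshold = 90, downdifferential = 5. */
if (stat->busy_time * 100 > stat->total_time * 90) {
        *freq = DEVFREQ_MAX_FREQ;        /* busy enough: jump to max */
} else if (stat->busy_time * 100 > stat->total_time * (90 - 5)) {
        *freq = stat->current_frequency; /* keep the current OPP */
} else {
        /* otherwise scale the frequency proportionally to the load */
        u64 a = stat->busy_time * stat->current_frequency;

        a = div_u64(a, stat->total_time) * 100;
        *freq = div_u64(a, 90 - 5 / 2);  /* integer math: divisor 88 */
}

So if the counters report a low busy_time, or are simply not sampled
while the benchmark runs, the governor keeps the low OPP.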
>
> I ran your test as well, and indeed, it seems to work for a large chunk
> of memory, and there is some delay before a transition is made (it
> seems to be around 10 s). When you kill memtester, it reduces the
> frequency stepwise every ~10 s.
>
> Note that the timings shown above account for the critical path, and
> the code loops on reads only; there are no writes in the critical path.
> Maybe memtester does writes and the devfreq heuristic uses only write
> info?
>
You mentioned that you want to defeat the prefetcher to get direct
access to RAM. But the prefetcher also accesses the RAM; it does not get
the contents out of thin air. Although this is unrelated to the problem,
because your pattern should trigger ondemand as well.
Best regards,
Krzysztof
On 2020-06-24-10-14-38, Krzysztof Kozlowski wrote:
> On Wed, Jun 24, 2020 at 10:01:17AM +0200, Willy Wolff wrote:
> > Hi Krzysztof,
> > Thanks for looking at it.
> >
> > mem_gov is /sys/class/devfreq/10c20000.memory-controller/governor
> >
> > Here are some numbers after increasing the running time:
> >
> > Running using simple_ondemand:
> > Before:
> > From : To
> > : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> > * 165000000: 0 0 0 0 0 0 0 4 4528600
> > 206000000: 5 0 0 0 0 0 0 0 57780
> > 275000000: 0 5 0 0 0 0 0 0 50060
> > 413000000: 0 0 5 0 0 0 0 0 46240
> > 543000000: 0 0 0 5 0 0 0 0 48970
> > 633000000: 0 0 0 0 5 0 0 0 47330
> > 728000000: 0 0 0 0 0 0 0 0 0
> > 825000000: 0 0 0 0 0 5 0 0 331300
> > Total transition : 34
> >
> >
> > After:
> > From : To
> > : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> > * 165000000: 0 0 0 0 0 0 0 4 5098890
> > 206000000: 5 0 0 0 0 0 0 0 57780
> > 275000000: 0 5 0 0 0 0 0 0 50060
> > 413000000: 0 0 5 0 0 0 0 0 46240
> > 543000000: 0 0 0 5 0 0 0 0 48970
> > 633000000: 0 0 0 0 5 0 0 0 47330
> > 728000000: 0 0 0 0 0 0 0 0 0
> > 825000000: 0 0 0 0 0 5 0 0 331300
> > Total transition : 34
> >
> > With a running time of:
> > LITTLE => 283.699 s (680.877 c per mem access)
> > big => 284.47 s (975.327 c per mem access)
>
> I see there were no transitions during your memory test.
>
> >
> > And when I set to the performance governor:
> > Before:
> > From : To
> > : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> > 165000000: 0 0 0 0 0 0 0 5 5099040
> > 206000000: 5 0 0 0 0 0 0 0 57780
> > 275000000: 0 5 0 0 0 0 0 0 50060
> > 413000000: 0 0 5 0 0 0 0 0 46240
> > 543000000: 0 0 0 5 0 0 0 0 48970
> > 633000000: 0 0 0 0 5 0 0 0 47330
> > 728000000: 0 0 0 0 0 0 0 0 0
> > * 825000000: 0 0 0 0 0 5 0 0 331350
> > Total transition : 35
> >
> > After:
> > From : To
> > : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> > 165000000: 0 0 0 0 0 0 0 5 5099040
> > 206000000: 5 0 0 0 0 0 0 0 57780
> > 275000000: 0 5 0 0 0 0 0 0 50060
> > 413000000: 0 0 5 0 0 0 0 0 46240
> > 543000000: 0 0 0 5 0 0 0 0 48970
> > 633000000: 0 0 0 0 5 0 0 0 47330
> > 728000000: 0 0 0 0 0 0 0 0 0
> > * 825000000: 0 0 0 0 0 5 0 0 472980
> > Total transition : 35
> >
> > With a running time of:
> > LITTLE: 68.8428 s (165.223 c per mem access)
> > big: 71.3268 s (244.549 c per mem access)
> >
> >
> > I see some transitions, but they do not occur during the benchmark.
> > I haven't dived into the code, but maybe the heuristic behind it is
> > not well defined? If you know how it works, that would be helpful
> > before I dive into it.
>
> Sorry, I don't know that much. It seems it counts the time between
> overflows of the DMC perf events and, based on this, bumps up the
> frequency.
>
> Maybe your test does not fit well into the current formula? Maybe the
> formula has some drawbacks...
OK, I will read the code then.
>
> >
> > I ran your test as well, and indeed, it seems to work for a large
> > chunk of memory, and there is some delay before a transition is made
> > (it seems to be around 10 s). When you kill memtester, it reduces the
> > frequency stepwise every ~10 s.
> >
> > Note that the timings shown above account for the critical path, and
> > the code loops on reads only; there are no writes in the critical
> > path. Maybe memtester does writes and the devfreq heuristic uses only
> > write info?
> >
> You mentioned that you want to defeat the prefetcher to get direct
> access to RAM. But the prefetcher also accesses the RAM; it does not
> get the contents out of thin air. Although this is unrelated to the
> problem, because your pattern should trigger ondemand as well.
Yes, obviously. I was just describing the microbenchmark and the memory
access pattern a bit. I was suggesting that a random pattern breaks the
effectiveness of the prefetcher, so we have a worst-case situation on
the memory bus.
>
> Best regards,
> Krzysztof
Hi Krzysztof and Willy
On 6/23/20 8:11 PM, Krzysztof Kozlowski wrote:
> On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:
>> On Tue, 23 Jun 2020 at 18:47, Willy Wolff <[email protected]> wrote:
>>>
>>> Hi everybody,
>>>
>>> Is DVFS for the memory bus really working on the Odroid XU3/4 board?
>>> Using a simple microbenchmark that does only memory accesses, memory DVFS
>>> seems to not be working properly:
>>>
>>> The microbenchmark does pointer chasing by following indices in an array.
>>> The indices are set to follow a random pattern (defeating the prefetcher),
>>> forcing RAM accesses.
>>>
>>> git clone https://github.com/wwilly/benchmark.git \
>>> && cd benchmark \
>>> && source env.sh \
>>> && ./bench_build.sh \
>>> && bash source/scripts/test_dvfs_mem.sh
>>>
>>> Python 3, cmake and sudo rights are required.
>>>
>>> Results:
>>> DVFS CPU with performance governor
>>> mem_gov = simple_ondemand at 165000000 Hz in idle, which should be bumped
>>> when the benchmark is running.
>>> - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
>>> - on the big cluster it takes 4.76556 s to run (980.343 c per memory access).
>>>
>>> While forcing DVFS memory bus to use performance governor,
>>> mem_gov = performance at 825000000 Hz in idle,
>>> - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
>>> - on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
>>>
>>> The kernel used is the latest 5.7.5 stable with the default exynos_defconfig.
>>
>> Thanks for the report. A few thoughts:
>> 1. What does trans_stat say? Besides the DMC driver you can also check
>> all the other devfreq devices (e.g. wcore) - maybe the devfreq events
>> (nocp) are not properly assigned?
>> 2. Try running the measurement for ~1 minute or longer. The counters
>> might have some delay (which would probably require fixing, but the
>> point is to narrow down the problem).
>> 3. What do you understand by "mem_gov"? Which device is it?
>
> +Cc Lukasz who was working on this.
Thanks Krzysztof for adding me here.
>
> I just ran memtester and ondemand more or less works (at least it ramps
> up):
>
> Before:
> /sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
> From : To
> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> * 165000000: 0 0 0 0 0 0 0 0 1795950
> 206000000: 1 0 0 0 0 0 0 0 4770
> 275000000: 0 1 0 0 0 0 0 0 15540
> 413000000: 0 0 1 0 0 0 0 0 20780
> 543000000: 0 0 0 1 0 0 0 1 10760
> 633000000: 0 0 0 0 2 0 0 0 10310
> 728000000: 0 0 0 0 0 0 0 0 0
> 825000000: 0 0 0 0 0 2 0 0 25920
> Total transition : 9
>
>
> $ sudo memtester 1G
>
> During memtester:
> /sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
> From : To
> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> 165000000: 0 0 0 0 0 0 0 1 1801490
> 206000000: 1 0 0 0 0 0 0 0 4770
> 275000000: 0 1 0 0 0 0 0 0 15540
> 413000000: 0 0 1 0 0 0 0 0 20780
> 543000000: 0 0 0 1 0 0 0 2 11090
> 633000000: 0 0 0 0 3 0 0 0 17210
> 728000000: 0 0 0 0 0 0 0 0 0
> * 825000000: 0 0 0 0 0 3 0 0 169020
> Total transition : 13
>
> However, after killing memtester it stays at 633 MHz for a very long time
> and does not slow down. This is indeed weird...
I had issues with the devfreq governor not being called by the devfreq
workqueue - the old DELAYED vs DEFERRED work discussions and my patches
for it [1]. If the CPU which scheduled the next work goes idle, the
devfreq workqueue is not kicked, so the devfreq governor won't check
the DMC status and won't decide to decrease the frequency based on low
busy_time.
The same applies to raising the frequency. Both are done by the
governor, but the workqueue must be scheduled periodically.
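To make the failure mode concrete, here is a minimal sketch of what the
devfreq monitor effectively does (names simplified; the real code lives
in drivers/devfreq/devfreq.c and uses its own devfreq_wq):

#include <linux/jiffies.h>
#include <linux/workqueue.h>

static struct delayed_work monitor_work;

static void monitor_fn(struct work_struct *work)
{
        /* read the counters, let the governor pick a frequency ... */
        queue_delayed_work(system_wq, &monitor_work,
                           msecs_to_jiffies(100));
}

static void monitor_start(void)
{
        /* DEFERRABLE: the underlying timer will not wake an idle CPU.
         * If the CPU this work last ran on goes idle, polling simply
         * stops until something else wakes that CPU up. */
        INIT_DEFERRABLE_WORK(&monitor_work, monitor_fn);
        queue_delayed_work(system_wq, &monitor_work,
                           msecs_to_jiffies(100));
}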
I couldn't do much with this back then. I gave an example showing that
this causes issues with the DMC [2]. There is also a description of
your situation of staying at 633 MHz for a long time:
' When it is missing opportunity
to change the frequency, it can either harm the performance or power
consumption, depending of the frequency the device stuck on.'
The patches were not accepted because they would cause CPU wake-ups
from idle, which increases the energy consumption. I know that there
were some other attempts, but I don't know their status.
I also hit this devfreq workqueue issue when I was working on thermal
cooling for devfreq. The device status was not updated because the
devfreq workqueue didn't check the device [3].
Let me investigate if that is the case.
Regards,
Lukasz
[1] https://lkml.org/lkml/2019/2/11/1146
[2] https://lkml.org/lkml/2019/2/12/383
[3]
https://lwn.net/ml/linux-kernel/[email protected]/
>
> Best regards,
> Krzysztof
>
Hi,
On 24.06.2020 12:32, Lukasz Luba wrote:
> Hi Krzysztof and Willy
>
> On 6/23/20 8:11 PM, Krzysztof Kozlowski wrote:
>> On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:
>>> On Tue, 23 Jun 2020 at 18:47, Willy Wolff <[email protected]> wrote:
>>>>
>>>> Hi everybody,
>>>>
>>>> Is DVFS for the memory bus really working on the Odroid XU3/4 board?
>>>> Using a simple microbenchmark that does only memory accesses, memory DVFS
>>>> seems to not be working properly:
>>>>
>>>> The microbenchmark does pointer chasing by following indices in an array.
>>>> The indices are set to follow a random pattern (defeating the prefetcher),
>>>> forcing RAM accesses.
>>>>
>>>> git clone https://github.com/wwilly/benchmark.git \
>>>> && cd benchmark \
>>>> && source env.sh \
>>>> && ./bench_build.sh \
>>>> && bash source/scripts/test_dvfs_mem.sh
>>>>
>>>> Python 3, cmake and sudo rights are required.
>>>>
>>>> Results:
>>>> DVFS CPU with performance governor
>>>> mem_gov = simple_ondemand at 165000000 Hz in idle, which should be bumped
>>>> when the benchmark is running.
>>>> - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
>>>> - on the big cluster it takes 4.76556 s to run (980.343 c per memory access).
>>>>
>>>> While forcing DVFS memory bus to use performance governor,
>>>> mem_gov = performance at 825000000 Hz in idle,
>>>> - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
>>>> - on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
>>>>
>>>> The kernel used is the latest 5.7.5 stable with the default exynos_defconfig.
>>>
>>> Thanks for the report. A few thoughts:
>>> 1. What does trans_stat say? Besides the DMC driver you can also check
>>> all the other devfreq devices (e.g. wcore) - maybe the devfreq events
>>> (nocp) are not properly assigned?
>>> 2. Try running the measurement for ~1 minute or longer. The counters
>>> might have some delay (which would probably require fixing, but the
>>> point is to narrow down the problem).
>>> 3. What do you understand by "mem_gov"? Which device is it?
>>
>> +Cc Lukasz who was working on this.
>
> Thanks Krzysztof for adding me here.
>
>>
>> I just ran memtester and ondemand more or less works (at least it ramps
>> up):
>>
>> Before:
>> /sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
>> From : To
>> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
>> * 165000000: 0 0 0 0 0 0 0 0 1795950
>> 206000000: 1 0 0 0 0 0 0 0 4770
>> 275000000: 0 1 0 0 0 0 0 0 15540
>> 413000000: 0 0 1 0 0 0 0 0 20780
>> 543000000: 0 0 0 1 0 0 0 1 10760
>> 633000000: 0 0 0 0 2 0 0 0 10310
>> 728000000: 0 0 0 0 0 0 0 0 0
>> 825000000: 0 0 0 0 0 2 0 0 25920
>> Total transition : 9
>>
>>
>> $ sudo memtester 1G
>>
>> During memtester:
>> /sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
>> From : To
>> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
>> 165000000: 0 0 0 0 0 0 0 1 1801490
>> 206000000: 1 0 0 0 0 0 0 0 4770
>> 275000000: 0 1 0 0 0 0 0 0 15540
>> 413000000: 0 0 1 0 0 0 0 0 20780
>> 543000000: 0 0 0 1 0 0 0 2 11090
>> 633000000: 0 0 0 0 3 0 0 0 17210
>> 728000000: 0 0 0 0 0 0 0 0 0
>> * 825000000: 0 0 0 0 0 3 0 0 169020
>> Total transition : 13
>>
>> However, after killing memtester it stays at 633 MHz for a very long time
>> and does not slow down. This is indeed weird...
>
> I had issues with the devfreq governor not being called by the devfreq
> workqueue - the old DELAYED vs DEFERRED work discussions and my patches
> for it [1]. If the CPU which scheduled the next work goes idle, the
> devfreq workqueue is not kicked, so the devfreq governor won't check
> the DMC status and won't decide to decrease the frequency based on low
> busy_time.
> The same applies to raising the frequency. Both are done by the
> governor, but the workqueue must be scheduled periodically.
>
> I couldn't do much with this back then. I gave an example showing that
> this causes issues with the DMC [2]. There is also a description of
> your situation of staying at 633 MHz for a long time:
> ' When it is missing opportunity
> to change the frequency, it can either harm the performance or power
> consumption, depending of the frequency the device stuck on.'
>
> The patches were not accepted because they would cause CPU wake-ups
> from idle, which increases the energy consumption. I know that there
> were some other attempts, but I don't know their status.
>
> I also hit this devfreq workqueue issue when I was working on thermal
> cooling for devfreq. The device status was not updated because the
> devfreq workqueue didn't check the device [3].
>
> Let me investigate if that is the case.
>
> Regards,
> Lukasz
>
> [1] https://lkml.org/lkml/2019/2/11/1146
> [2] https://lkml.org/lkml/2019/2/12/383
> [3] https://lwn.net/ml/linux-kernel/[email protected]/
and here was another attempt to fix the wq: "PM / devfreq: add possibility for delayed work"
https://lkml.org/lkml/2019/12/9/486
--
Best regards,
Kamil Konieczny
Samsung R&D Institute Poland
On Wed, Jun 24, 2020 at 01:18:42PM +0200, Kamil Konieczny wrote:
> Hi,
>
> On 24.06.2020 12:32, Lukasz Luba wrote:
> > Hi Krzysztof and Willy
> >
> > On 6/23/20 8:11 PM, Krzysztof Kozlowski wrote:
> >> On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:
> >>> On Tue, 23 Jun 2020 at 18:47, Willy Wolff <[email protected]> wrote:
> >>>>
> >>>> Hi everybody,
> >>>>
> >>>> Is DVFS for the memory bus really working on the Odroid XU3/4 board?
> >>>> Using a simple microbenchmark that does only memory accesses, memory DVFS
> >>>> seems to not be working properly:
> >>>>
> >>>> The microbenchmark does pointer chasing by following indices in an array.
> >>>> The indices are set to follow a random pattern (defeating the prefetcher),
> >>>> forcing RAM accesses.
> >>>>
> >>>> git clone https://github.com/wwilly/benchmark.git \
> >>>> && cd benchmark \
> >>>> && source env.sh \
> >>>> && ./bench_build.sh \
> >>>> && bash source/scripts/test_dvfs_mem.sh
> >>>>
> >>>> Python 3, cmake and sudo rights are required.
> >>>>
> >>>> Results:
> >>>> DVFS CPU with performance governor
> >>>> mem_gov = simple_ondemand at 165000000 Hz in idle, which should be bumped
> >>>> when the benchmark is running.
> >>>> - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
> >>>> - on the big cluster it takes 4.76556 s to run (980.343 c per memory access).
> >>>>
> >>>> While forcing DVFS memory bus to use performance governor,
> >>>> mem_gov = performance at 825000000 Hz in idle,
> >>>> - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
> >>>> - on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
> >>>>
> >>>> The kernel used is the latest 5.7.5 stable with the default exynos_defconfig.
> >>>
> >>> Thanks for the report. A few thoughts:
> >>> 1. What does trans_stat say? Besides the DMC driver you can also check
> >>> all the other devfreq devices (e.g. wcore) - maybe the devfreq events
> >>> (nocp) are not properly assigned?
> >>> 2. Try running the measurement for ~1 minute or longer. The counters
> >>> might have some delay (which would probably require fixing, but the
> >>> point is to narrow down the problem).
> >>> 3. What do you understand by "mem_gov"? Which device is it?
> >>
> >> +Cc Lukasz who was working on this.
> >
> > Thanks Krzysztof for adding me here.
> >
> >>
> >> I just ran memtester and ondemand more or less works (at least it ramps
> >> up):
> >>
> >> Before:
> >> /sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
> >> From : To
> >> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> >> * 165000000: 0 0 0 0 0 0 0 0 1795950
> >> 206000000: 1 0 0 0 0 0 0 0 4770
> >> 275000000: 0 1 0 0 0 0 0 0 15540
> >> 413000000: 0 0 1 0 0 0 0 0 20780
> >> 543000000: 0 0 0 1 0 0 0 1 10760
> >> 633000000: 0 0 0 0 2 0 0 0 10310
> >> 728000000: 0 0 0 0 0 0 0 0 0
> >> 825000000: 0 0 0 0 0 2 0 0 25920
> >> Total transition : 9
> >>
> >>
> >> $ sudo memtester 1G
> >>
> >> During memtester:
> >> /sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
> >> From : To
> >> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
> >> 165000000: 0 0 0 0 0 0 0 1 1801490
> >> 206000000: 1 0 0 0 0 0 0 0 4770
> >> 275000000: 0 1 0 0 0 0 0 0 15540
> >> 413000000: 0 0 1 0 0 0 0 0 20780
> >> 543000000: 0 0 0 1 0 0 0 2 11090
> >> 633000000: 0 0 0 0 3 0 0 0 17210
> >> 728000000: 0 0 0 0 0 0 0 0 0
> >> * 825000000: 0 0 0 0 0 3 0 0 169020
> >> Total transition : 13
> >>
> >> However, after killing memtester it stays at 633 MHz for a very long time
> >> and does not slow down. This is indeed weird...
> >
> > I had issues with the devfreq governor not being called by the devfreq
> > workqueue - the old DELAYED vs DEFERRED work discussions and my patches
> > for it [1]. If the CPU which scheduled the next work goes idle, the
> > devfreq workqueue is not kicked, so the devfreq governor won't check
> > the DMC status and won't decide to decrease the frequency based on low
> > busy_time.
> > The same applies to raising the frequency. Both are done by the
> > governor, but the workqueue must be scheduled periodically.
> >
> > I couldn't do much with this back then. I gave an example showing that
> > this causes issues with the DMC [2]. There is also a description of
> > your situation of staying at 633 MHz for a long time:
> > ' When it is missing opportunity
> > to change the frequency, it can either harm the performance or power
> > consumption, depending of the frequency the device stuck on.'
> >
> > The patches were not accepted because they would cause CPU wake-ups
> > from idle, which increases the energy consumption. I know that there
> > were some other attempts, but I don't know their status.
> >
> > I also hit this devfreq workqueue issue when I was working on thermal
> > cooling for devfreq. The device status was not updated because the
> > devfreq workqueue didn't check the device [3].
> >
> > Let me investigate if that is the case.
> >
> > Regards,
> > Lukasz
> >
> > [1] https://lkml.org/lkml/2019/2/11/1146
> > [2] https://lkml.org/lkml/2019/2/12/383
> > [3] https://lwn.net/ml/linux-kernel/[email protected]/
>
> and here was another attempt to fix the wq: "PM / devfreq: add possibility for delayed work"
>
> https://lkml.org/lkml/2019/12/9/486
My case clearly showed wrong behavior. The system was idle but not
sleeping - network working, SSH connection ongoing. Therefore at least
one CPU was not idle and could have adjusted the devfreq/DMC... but this
did not happen. The system stayed at the 633 MHz OPP for about a minute.
Not waking up idle processors - OK... so why not use the power-efficient
workqueue? It exists exactly for this purpose - waking up from time to
time on whatever CPU to do the necessary job.
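A minimal sketch of that idea (hypothetical names; the 100 ms period is
made up):

#include <linux/jiffies.h>
#include <linux/workqueue.h>

static struct delayed_work monitor_work;

static void monitor_fn(struct work_struct *work)
{
        /* poll the counters, run the governor ... */

        /* system_power_efficient_wq behaves as WQ_UNBOUND when
         * workqueue.power_efficient (or the Kconfig default) is set,
         * so the work runs on whatever CPU happens to be awake. */
        queue_delayed_work(system_power_efficient_wq, &monitor_work,
                           msecs_to_jiffies(100));
}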
Best regards,
Krzysztof
On 6/24/20 1:06 PM, Krzysztof Kozlowski wrote:
> On Wed, Jun 24, 2020 at 01:18:42PM +0200, Kamil Konieczny wrote:
>> Hi,
>>
>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>> Hi Krzysztof and Willy
>>>
>>> On 6/23/20 8:11 PM, Krzysztof Kozlowski wrote:
>>>> On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:
>>>>> On Tue, 23 Jun 2020 at 18:47, Willy Wolff <[email protected]> wrote:
>>>>>>
>>>>>> Hi everybody,
>>>>>>
>>>>>> Is DVFS for the memory bus really working on the Odroid XU3/4 board?
>>>>>> Using a simple microbenchmark that does only memory accesses, memory DVFS
>>>>>> seems to not be working properly:
>>>>>>
>>>>>> The microbenchmark does pointer chasing by following indices in an array.
>>>>>> The indices are set to follow a random pattern (defeating the prefetcher),
>>>>>> forcing RAM accesses.
>>>>>>
>>>>>> git clone https://github.com/wwilly/benchmark.git \
>>>>>> && cd benchmark \
>>>>>> && source env.sh \
>>>>>> && ./bench_build.sh \
>>>>>> && bash source/scripts/test_dvfs_mem.sh
>>>>>>
>>>>>> Python 3, cmake and sudo rights are required.
>>>>>>
>>>>>> Results:
>>>>>> DVFS CPU with performance governor
>>>>>> mem_gov = simple_ondemand at 165000000 Hz in idle, which should be bumped
>>>>>> when the benchmark is running.
>>>>>> - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
>>>>>> - on the big cluster it takes 4.76556 s to run (980.343 c per memory access).
>>>>>>
>>>>>> While forcing DVFS memory bus to use performance governor,
>>>>>> mem_gov = performance at 825000000 Hz in idle,
>>>>>> - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
>>>>>> - on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
>>>>>>
>>>>>> The kernel used is the latest 5.7.5 stable with the default exynos_defconfig.
>>>>>
>>>>> Thanks for the report. A few thoughts:
>>>>> 1. What does trans_stat say? Besides the DMC driver you can also check
>>>>> all the other devfreq devices (e.g. wcore) - maybe the devfreq events
>>>>> (nocp) are not properly assigned?
>>>>> 2. Try running the measurement for ~1 minute or longer. The counters
>>>>> might have some delay (which would probably require fixing, but the
>>>>> point is to narrow down the problem).
>>>>> 3. What do you understand by "mem_gov"? Which device is it?
>>>>
>>>> +Cc Lukasz who was working on this.
>>>
>>> Thanks Krzysztof for adding me here.
>>>
>>>>
>>>> I just ran memtester and ondemand more or less works (at least it ramps
>>>> up):
>>>>
>>>> Before:
>>>> /sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
>>>> From : To
>>>> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
>>>> * 165000000: 0 0 0 0 0 0 0 0 1795950
>>>> 206000000: 1 0 0 0 0 0 0 0 4770
>>>> 275000000: 0 1 0 0 0 0 0 0 15540
>>>> 413000000: 0 0 1 0 0 0 0 0 20780
>>>> 543000000: 0 0 0 1 0 0 0 1 10760
>>>> 633000000: 0 0 0 0 2 0 0 0 10310
>>>> 728000000: 0 0 0 0 0 0 0 0 0
>>>> 825000000: 0 0 0 0 0 2 0 0 25920
>>>> Total transition : 9
>>>>
>>>>
>>>> $ sudo memtester 1G
>>>>
>>>> During memtester:
>>>> /sys/class/devfreq/10c20000.memory-controller$ cat trans_stat
>>>> From : To
>>>> : 165000000 206000000 275000000 413000000 543000000 633000000 728000000 825000000 time(ms)
>>>> 165000000: 0 0 0 0 0 0 0 1 1801490
>>>> 206000000: 1 0 0 0 0 0 0 0 4770
>>>> 275000000: 0 1 0 0 0 0 0 0 15540
>>>> 413000000: 0 0 1 0 0 0 0 0 20780
>>>> 543000000: 0 0 0 1 0 0 0 2 11090
>>>> 633000000: 0 0 0 0 3 0 0 0 17210
>>>> 728000000: 0 0 0 0 0 0 0 0 0
>>>> * 825000000: 0 0 0 0 0 3 0 0 169020
>>>> Total transition : 13
>>>>
>>>> However, after killing memtester it stays at 633 MHz for a very long time
>>>> and does not slow down. This is indeed weird...
>>>
>>> I had issues with the devfreq governor not being called by the devfreq
>>> workqueue - the old DELAYED vs DEFERRED work discussions and my patches
>>> for it [1]. If the CPU which scheduled the next work goes idle, the
>>> devfreq workqueue is not kicked, so the devfreq governor won't check
>>> the DMC status and won't decide to decrease the frequency based on low
>>> busy_time.
>>> The same applies to raising the frequency. Both are done by the
>>> governor, but the workqueue must be scheduled periodically.
>>>
>>> I couldn't do much with this back then. I gave an example showing that
>>> this causes issues with the DMC [2]. There is also a description of
>>> your situation of staying at 633 MHz for a long time:
>>> ' When it is missing opportunity
>>> to change the frequency, it can either harm the performance or power
>>> consumption, depending of the frequency the device stuck on.'
>>>
>>> The patches were not accepted because they would cause CPU wake-ups
>>> from idle, which increases the energy consumption. I know that there
>>> were some other attempts, but I don't know their status.
>>>
>>> I also hit this devfreq workqueue issue when I was working on thermal
>>> cooling for devfreq. The device status was not updated because the
>>> devfreq workqueue didn't check the device [3].
>>>
>>> Let me investigate if that is the case.
>>>
>>> Regards,
>>> Lukasz
>>>
>>> [1] https://lkml.org/lkml/2019/2/11/1146
>>> [2] https://lkml.org/lkml/2019/2/12/383
>>> [3] https://lwn.net/ml/linux-kernel/[email protected]/
>>
>> and here was another attempt to fix the wq: "PM / devfreq: add possibility for delayed work"
>>
>> https://lkml.org/lkml/2019/12/9/486
>
> My case clearly showed wrong behavior. The system was idle but not
> sleeping - network working, SSH connection ongoing. Therefore at least
> one CPU was not idle and could have adjusted the devfreq/DMC... but
> this did not happen. The system stayed at the 633 MHz OPP for about a
> minute.
>
> Not waking up idle processors - OK... so why not use the
> power-efficient workqueue? It exists exactly for this purpose - waking
> up from time to time on whatever CPU to do the necessary job.
IIRC I've done this experiment, still keeping INIT_DEFERRABLE_WORK()
in devfreq and just applying patch [1]. It uses system_wq, which should
be the same as system_power_efficient_wq when
CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set (our case).
This didn't solve the issue for the deferred work. That's why patch 2/2,
following patch 1/2 [1], was needed.
The deferrable work uses TIMER_DEFERRABLE in its initialization, and
this is the problem. When the deferrable work was queued on a CPU and
that CPU then went idle, the work was not migrated to some other CPU.
The former CPU is also not woken up, according to the documentation [2].
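For reference, the difference between the two initializers is literally
one timer flag (paraphrasing include/linux/workqueue.h):

#define INIT_DELAYED_WORK(_work, _func) \
        __INIT_DELAYED_WORK(_work, _func, 0)
#define INIT_DEFERRABLE_WORK(_work, _func) \
        __INIT_DELAYED_WORK(_work, _func, TIMER_DEFERRABLE)

and a TIMER_DEFERRABLE timer is documented as working normally while
the system is busy, but not causing a CPU to come out of idle just to
service it.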
That's why Kamil's approach should be continued, IMHO. It gives more
control over important devices like the bus, DMC, and GPU, whose
utilization does not strictly correspond to CPU utilization (which
might be low or even 0, with the CPU put into idle).
I think Kamil was also pointing out some other issues, not only the DMC
(buses, probably), but I realized it too late to help him.
Regards,
Lukasz
[1]
https://lore.kernel.org/lkml/[email protected]/
[2] https://elixir.bootlin.com/linux/latest/source/include/linux/timer.h#L40
>
> Best regards,
> Krzysztof
>
On Wed, Jun 24, 2020 at 02:03:03PM +0100, Lukasz Luba wrote:
>
>
> On 6/24/20 1:06 PM, Krzysztof Kozlowski wrote:
> > My case clearly showed wrong behavior. The system was idle but not
> > sleeping - network working, SSH connection ongoing. Therefore at
> > least one CPU was not idle and could have adjusted the devfreq/DMC...
> > but this did not happen. The system stayed at the 633 MHz OPP for
> > about a minute.
> >
> > Not waking up idle processors - OK... so why not use the
> > power-efficient workqueue? It exists exactly for this purpose -
> > waking up from time to time on whatever CPU to do the necessary job.
>
> IIRC I've done this experiment, still keeping INIT_DEFERRABLE_WORK()
> in devfreq and just applying patch [1]. It uses system_wq, which
> should be the same as system_power_efficient_wq when
> CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set (our case).
> This didn't solve the issue for the deferred work. That's why patch
> 2/2, following patch 1/2 [1], was needed.
>
> The deferrable work uses TIMER_DEFERRABLE in its initialization, and
> this is the problem. When the deferrable work was queued on a CPU and
> that CPU then went idle, the work was not migrated to some other CPU.
> The former CPU is also not woken up, according to the documentation
> [2].
Yes, you need either the workqueue.power_efficient kernel parameter or
the CONFIG option to actually enable it. But at least it could then run
on any CPU.
Another solution is to use WQ_UNBOUND directly.
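Something like this sketch (illustrative only; devfreq_wq is the name
the framework already uses for its own workqueue):

/* WQ_UNBOUND work items are not tied to the CPU that queued them, so
 * an idle queuing CPU does not pin the monitor to itself. */
devfreq_wq = alloc_workqueue("devfreq_wq", WQ_UNBOUND | WQ_FREEZABLE, 0);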
> That's why Kamil's approach should be continued, IMHO. It gives more
> control over important devices like the bus, DMC, and GPU, whose
> utilization does not strictly correspond to CPU utilization (which
> might be low or even 0, with the CPU put into idle).
>
> I think Kamil was also pointing out some other issues, not only the
> DMC (buses, probably), but I realized it too late to help him.
This should not be a configurable option. Why would someone prefer one
over the other and decide about this at build or run time? Instead it
should just be *right* all the time. Always.
The argument that we want to save power, so we will not wake up any
CPU, is ridiculous if because of it the system stays in a high-power
mode.
If the system is idle and memory is going to be idle, someone should be
woken up to save more power and slow down the memory controller.
If the system is idle but memory is going to be busy, the currently busy
CPU (which performs some memory-intensive job) could do the work and
ramp up the devfreq performance.
Best regards,
Krzysztof
On 6/24/20 2:13 PM, Krzysztof Kozlowski wrote:
> On Wed, Jun 24, 2020 at 02:03:03PM +0100, Lukasz Luba wrote:
>>
>>
>> On 6/24/20 1:06 PM, Krzysztof Kozlowski wrote:
>>> My case clearly showed wrong behavior. The system was idle but not
>>> sleeping - network working, SSH connection ongoing. Therefore at
>>> least one CPU was not idle and could have adjusted the
>>> devfreq/DMC... but this did not happen. The system stayed at the
>>> 633 MHz OPP for about a minute.
>>>
>>> Not waking up idle processors - OK... so why not use the
>>> power-efficient workqueue? It exists exactly for this purpose -
>>> waking up from time to time on whatever CPU to do the necessary job.
>>
>> IIRC I've done this experiment, still keeping INIT_DEFERRABLE_WORK()
>> in devfreq and just applying patch [1]. It uses system_wq, which
>> should be the same as system_power_efficient_wq when
>> CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set (our case).
>> This didn't solve the issue for the deferred work. That's why patch
>> 2/2, following patch 1/2 [1], was needed.
>>
>> The deferrable work uses TIMER_DEFERRABLE in its initialization, and
>> this is the problem. When the deferrable work was queued on a CPU and
>> that CPU then went idle, the work was not migrated to some other CPU.
>> The former CPU is also not woken up, according to the documentation
>> [2].
>
> Yes, you need either the workqueue.power_efficient kernel parameter or
> the CONFIG option to actually enable it. But at least it could then
> run on any CPU.
>
> Another solution is to use WQ_UNBOUND directly.
>
>> That's why Kamil's approach should be continued, IMHO. It gives more
>> control over important devices like the bus, DMC, and GPU, whose
>> utilization does not strictly correspond to CPU utilization (which
>> might be low or even 0, with the CPU put into idle).
>>
>> I think Kamil was also pointing out some other issues, not only the
>> DMC (buses, probably), but I realized it too late to help him.
>
> This should not be a configurable option. Why would someone prefer one
> over the other and decide about this at build or run time? Instead it
> should just be *right* all the time. Always.
I had the same opinion, as you can see in my explanation of those
patches, but I failed. That's why I agree with Kamil's approach: it had
a higher chance of getting into mainline and fixing at least some of
the use cases.
>
> The argument that we want to save power, so we will not wake up any
> CPU, is ridiculous if because of it the system stays in a high-power
> mode.
>
> If the system is idle and memory is going to be idle, someone should
> be woken up to save more power and slow down the memory controller.
>
> If the system is idle but memory is going to be busy, the currently
> busy CPU (which performs some memory-intensive job) could do the work
> and ramp up the devfreq performance.
I agree. I think this devfreq mechanism was designed in times when
there were 1 or 2 CPUs in the system. After a while we got ~8, and not
all of them are used. This scenario was probably not widely tested on
mainline platforms.
That is good material for improvements, for someone who has the time
and energy.
Regards,
Lukasz
>
> Best regards,
> Krzysztof
>
Hi All,
On 24.06.2020 12:32, Lukasz Luba wrote:
> I had issues with the devfreq governor not being called by the devfreq
> workqueue - the old DELAYED vs DEFERRED work discussions and my patches
> for it [1]. If the CPU which scheduled the next work goes idle, the
> devfreq workqueue is not kicked, so the devfreq governor won't check
> the DMC status and won't decide to decrease the frequency based on low
> busy_time.
> The same applies to raising the frequency. Both are done by the
> governor, but the workqueue must be scheduled periodically.
As I have been working on resolving the video mixer IOMMU fault issue
described here: https://patchwork.kernel.org/patch/10861757
I did some investigation of the devfreq operation, mostly on Odroid U3.
My conclusions are similar to what Lukasz says above. I would like to
add that the broken scheduling of the performance counter reads and the
devfreq updates seems to have one more serious implication. In each
call, which should normally happen periodically at a fixed interval, we
stop the counters, read the counter values and start the counters
again. But if the period between calls becomes long enough to let any
of the counters overflow, we will get wrong performance measurement
results. My observation is that the workqueue job can be suspended for
several seconds, and conditions for counter overflow occur sooner or
later, depending among other things on the CPU load.
A wrong bus load measurement can lead to setting too low an
interconnect bus clock frequency, and then bad things happen in
peripheral devices.
I agree the workqueue issue needs to be fixed. I have some WIP code
that uses the performance counter overflow interrupts instead of SW
polling, and with that the interconnect bus clock control seems to work
much better.
--
Regards,
Sylwester
Hi Sylwester,
On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
> Hi All,
>
> On 24.06.2020 12:32, Lukasz Luba wrote:
>> I had issues with the devfreq governor not being called by the
>> devfreq workqueue - the old DELAYED vs DEFERRED work discussions and
>> my patches for it [1]. If the CPU which scheduled the next work goes
>> idle, the devfreq workqueue is not kicked, so the devfreq governor
>> won't check the DMC status and won't decide to decrease the frequency
>> based on low busy_time.
>> The same applies to raising the frequency. Both are done by the
>> governor, but the workqueue must be scheduled periodically.
>
> As I have been working on resolving the video mixer IOMMU fault issue
> described here: https://patchwork.kernel.org/patch/10861757
> I did some investigation of the devfreq operation, mostly on Odroid U3.
>
> My conclusions are similar to what Lukasz says above. I would like to
> add that the broken scheduling of the performance counter reads and
> the devfreq updates seems to have one more serious implication. In
> each call, which should normally happen periodically at a fixed
> interval, we stop the counters, read the counter values and start the
> counters again. But if the period between calls becomes long enough to
> let any of the counters overflow, we will get wrong performance
> measurement results. My observation is that the workqueue job can be
> suspended for several seconds, and conditions for counter overflow
> occur sooner or later, depending among other things on the CPU load.
> A wrong bus load measurement can lead to setting too low an
> interconnect bus clock frequency, and then bad things happen in
> peripheral devices.
>
> I agree the workqueue issue needs to be fixed. I have some WIP code
> that uses the performance counter overflow interrupts instead of SW
> polling, and with that the interconnect bus clock control seems to
> work much better.
>
Thank you for sharing your use case and investigation results. I think
we are reaching a decent number of developers to maybe address this
issue: 'the workqueue issue needs to be fixed'.
I have faced this devfreq workqueue issue ~5 times on different
platforms.
Regarding the 'performance counter overflow interrupts', there is one
thing worth keeping in mind: variable utilization and frequency.
For example, in order for the algorithm to conclude that the device
should increase or decrease the frequency, we fix the period of
observation, e.g. to 500 ms. That can cause a long delay if the
utilization of the device suddenly drops. For example, we set an
overflow threshold to some value, e.g. 1000, and we know that at
1000 MHz and full utilization (100%) the counter will reach that
threshold after 500 ms (which we want, because we don't want too many
interrupts per second). If utilization suddenly drops to 2% (i.e. from
5 GB/s to 250 MB/s - and what if it drops to 25 MB/s?!), the counter
will reach the threshold only after 50 * 500 ms = 25 s. It is impossible
for the counters alone to predict the next utilization and adjust the
threshold.
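The arithmetic above as a tiny helper (illustrative only):

/* Threshold sized so that 100% utilization overflows in 500 ms: the
 * time to the next overflow IRQ grows inversely with utilization. */
static unsigned long time_to_overflow_ms(unsigned int util_pct)
{
        return 500UL * 100 / util_pct;  /* 100% -> 500 ms, 2% -> 25 s */
}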
To address that, we still need another mechanism (like a watchdog)
which is triggered just to check whether the threshold needs
adjustment. This mechanism can be a local timer in the driver, or a
framework timer running a kind of 'for loop' over all devices of this
type (like the scheduled workqueue). In both cases there will be
interrupts, timers (even in workqueues) and scheduling in the system.
The approach of forcing developers to implement their own local
watchdog timers (or workqueues) in drivers is IMHO wrong, and that's
why we have frameworks.
Regards,
Lukasz
Hi Lukasz,
On 25.06.2020 12:02, Lukasz Luba wrote:
> Hi Sylwester,
>
> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>> Hi All,
>>
>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>> I had issues with the devfreq governor not being called by the
>>> devfreq workqueue - the old DELAYED vs DEFERRED work discussions and
>>> my patches for it [1]. If the CPU which scheduled the next work goes
>>> idle, the devfreq workqueue is not kicked, so the devfreq governor
>>> won't check the DMC status and won't decide to decrease the
>>> frequency based on low busy_time.
>>> The same applies to raising the frequency. Both are done by the
>>> governor, but the workqueue must be scheduled periodically.
>>
>> As I have been working on resolving the video mixer IOMMU fault issue
>> described here: https://patchwork.kernel.org/patch/10861757
>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>
>> My conclusions are similar to what Lukasz says above. I would like to
>> add that the broken scheduling of the performance counter reads and
>> the devfreq updates seems to have one more serious implication. In
>> each call, which should normally happen periodically at a fixed
>> interval, we stop the counters, read the counter values and start the
>> counters again. But if the period between calls becomes long enough
>> to let any of the counters overflow, we will get wrong performance
>> measurement results. My observation is that the workqueue job can be
>> suspended for several seconds, and conditions for counter overflow
>> occur sooner or later, depending among other things on the CPU load.
>> A wrong bus load measurement can lead to setting too low an
>> interconnect bus clock frequency, and then bad things happen in
>> peripheral devices.
>>
>> I agree the workqueue issue needs to be fixed. I have some WIP code
>> that uses the performance counter overflow interrupts instead of SW
>> polling, and with that the interconnect bus clock control seems to
>> work much better.
>>
>
> Thank you for sharing your use case and investigation results. I
> think we are reaching a decent number of developers to maybe address
> this issue: 'the workqueue issue needs to be fixed'.
> I have faced this devfreq workqueue issue ~5 times on different
> platforms.
>
> Regarding the 'performance counter overflow interrupts', there is one
> thing worth keeping in mind: variable utilization and frequency.
> For example, in order for the algorithm to conclude that the device
> should increase or decrease the frequency, we fix the period of
> observation, e.g. to 500 ms. That can cause a long delay if the
> utilization of the device suddenly drops. For example, we set an
> overflow threshold to some value, e.g. 1000, and we know that at
> 1000 MHz and full utilization (100%) the counter will reach that
> threshold after 500 ms (which we want, because we don't want too many
> interrupts per second). If utilization suddenly drops to 2% (i.e. from
> 5 GB/s to 250 MB/s - and what if it drops to 25 MB/s?!), the counter
> will reach the threshold only after 50 * 500 ms = 25 s. It is
> impossible for the counters alone to predict the next utilization and
> adjust the threshold. [...]
irq triggers for underflow and overflow, so driver can adjust freq
--
Best regards,
Kamil Konieczny
Samsung R&D Institute Poland
On 6/25/20 12:30 PM, Kamil Konieczny wrote:
> Hi Lukasz,
>
> On 25.06.2020 12:02, Lukasz Luba wrote:
>> Hi Sylwester,
>>
>> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>>> Hi All,
>>>
>>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>>> I had issues with devfreq governor which wasn't called by devfreq
>>>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>>>> for it [1]. If the CPU which scheduled the next work went idle, the
>>>> devfreq workqueue will not be kicked and devfreq governor won't check
>>>> DMC status and will not decide to decrease the frequency based on low
>>>> busy_time.
>>>> The same applies for going up with the frequency. They both are
>>>> done by the governor but the workqueue must be scheduled periodically.
>>>
>>> As I have been working on resolving the video mixer IOMMU fault issue
>>> described here: https://patchwork.kernel.org/patch/10861757
>>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>>
>>> My conclusions are similar to what Lukasz says above. I would like to add
>>> that broken scheduling of the performance counters read and the devfreq
>>> updates seems to have one more serious implication. In each call, which
>>> normally should happen periodically with fixed interval we stop the counters,
>>> read counter values and start the counters again. But if period between
>>> calls becomes long enough to let any of the counters overflow, we will
>>> get wrong performance measurement results. My observations are that
>>> the workqueue job can be suspended for several seconds and conditions for
>>> the counter overflow occur sooner or later, depending among others
>>> on the CPUs load.
>>> Wrong bus load measurement can lead to setting too low interconnect bus
>>> clock frequency and then bad things happen in peripheral devices.
>>>
>>> I agree the workqueue issue needs to be fixed. I have some WIP code to use
>>> the performance counters overflow interrupts instead of SW polling and with
>>> that the interconnect bus clock control seems to work much better.
>>>
>>
>> Thank you for sharing your use case and investigation results. I think
>> we are reaching a decent number of developers to maybe address this
>> issue: 'workqueue issue needs to be fixed'.
>> I have been facing this devfreq workqueue issue ~5 times in different
>> platforms.
>>
>> Regarding the 'performance counters overflow interrupts' there is one
>> thing worth to keep in mind: variable utilization and frequency.
>> For example, in order to make a conclusion in algorithm deciding that
>> the device should increase or decrease the frequency, we fix the period
>> of observation, i.e. to 500ms. That can cause the long delay if the
>> utilization of the device suddenly drops. For example we set an
>> overflow threshold to value i.e. 1000 and we know that at 1000MHz
>> and full utilization (100%) the counter will reach that threshold
>> after 500ms (which we want, because we don't want too many interrupts
>> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
>> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
>> threshold after 50*500ms = 25s. It is impossible just for the counters
>> to predict next utilization and adjust the threshold. [...]
>
> irq triggers for underflow and overflow, so driver can adjust freq
>
Probably possible on some platforms; it depends on how many PMU registers
are available, what information can be assigned to them, and the type of
interrupt. A lot of hassle and still platform and device specific.
Also, drivers should not adjust the frequency; governors (different types
of them, with different settings that they can handle) should do it.
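
To make the split concrete, this is roughly the shape of the governor
interface (abridged from include/linux/devfreq.h; several fields are
omitted):

struct devfreq_governor {
        const char name[DEVFREQ_NAME_LEN];
        /* a governor may drive itself from interrupts, in which case
         * the framework skips the polling workqueue entirely */
        const bool interrupt_driven;
        /* the governor, not the driver, picks the next frequency */
        int (*get_target_freq)(struct devfreq *this, unsigned long *freq);
        /* start/stop/suspend/resume events from the framework */
        int (*event_handler)(struct devfreq *devfreq,
                             unsigned int event, void *data);
};
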
What the framework can do is take this responsibility and provide a
generic way to monitor the devices (or stop if they are suspended).
That should work nicely with the governors, which try to predict the
next best frequency. From my experience, the more the intervals at which
the governors are called fluctuate, the more odd decisions they make.
That's why I think having a predictable interval, e.g. 100ms, is
something desirable. Tuning the governors is easier in this case,
statistics are easier to trace and interpret, the solution is not too
platform specific, etc.
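
As a minimal sketch of that DELAYED vs DEFERRED difference (poll_fn and
the 100ms period are made up; this is not the actual devfreq code):

#include <linux/workqueue.h>
#include <linux/jiffies.h>

static struct delayed_work poll_work;

static void poll_fn(struct work_struct *work)
{
        /* stop counters, read them, let the governor decide, re-arm */
        schedule_delayed_work(&poll_work, msecs_to_jiffies(100));
}

static void poll_start(bool deferrable)
{
        if (deferrable)
                /* deferrable timer: will not wake an idle CPU, so the
                 * poll can stall for seconds - the problem above */
                INIT_DEFERRABLE_WORK(&poll_work, poll_fn);
        else
                /* regular timer: fires on time even on an idle CPU,
                 * giving the predictable interval governors want */
                INIT_DELAYED_WORK(&poll_work, poll_fn);

        schedule_delayed_work(&poll_work, msecs_to_jiffies(100));
}
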
Kamil do you have plans to refresh and push your next version of the
workqueue solution?
Regards,
Lukasz
On 25.06.2020 14:02, Lukasz Luba wrote:
>
>
> On 6/25/20 12:30 PM, Kamil Konieczny wrote:
>> Hi Lukasz,
>>
>> On 25.06.2020 12:02, Lukasz Luba wrote:
>>> Hi Sylwester,
>>>
>>> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>>>> Hi All,
>>>>
>>>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>>>> I had issues with devfreq governor which wasn't called by devfreq
>>>>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>>>>> for it [1]. If the CPU which scheduled the next work went idle, the
>>>>> devfreq workqueue will not be kicked and devfreq governor won't check
>>>>> DMC status and will not decide to decrease the frequency based on low
>>>>> busy_time.
>>>>> The same applies for going up with the frequency. They both are
>>>>> done by the governor but the workqueue must be scheduled periodically.
>>>>
>>>> As I have been working on resolving the video mixer IOMMU fault issue
>>>> described here: https://patchwork.kernel.org/patch/10861757
>>>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>>>
>>>> My conclusions are similar to what Lukasz says above. I would like to add
>>>> that broken scheduling of the performance counters read and the devfreq
>>>> updates seems to have one more serious implication. In each call, which
>>>> normally should happen periodically with fixed interval we stop the counters,
>>>> read counter values and start the counters again. But if period between
>>>> calls becomes long enough to let any of the counters overflow, we will
>>>> get wrong performance measurement results. My observations are that
>>>> the workqueue job can be suspended for several seconds and conditions for
>>>> the counter overflow occur sooner or later, depending among others
>>>> on the CPUs load.
>>>> Wrong bus load measurement can lead to setting too low interconnect bus
>>>> clock frequency and then bad things happen in peripheral devices.
>>>>
>>>> I agree the workqueue issue needs to be fixed. I have some WIP code to use
>>>> the performance counters overflow interrupts instead of SW polling and with
>>>> that the interconnect bus clock control seems to work much better.
>>>>
>>>
>>> Thank you for sharing your use case and investigation results. I think
>>> we are reaching a decent number of developers to maybe address this
>>> issue: 'workqueue issue needs to be fixed'.
>>> I have been facing this devfreq workqueue issue ~5 times in different
>>> platforms.
>>>
>>> Regarding the 'performance counters overflow interrupts' there is one
>>> thing worth to keep in mind: variable utilization and frequency.
>>> For example, in order to make a conclusion in algorithm deciding that
>>> the device should increase or decrease the frequency, we fix the period
>>> of observation, i.e. to 500ms. That can cause the long delay if the
>>> utilization of the device suddenly drops. For example we set an
>>> overflow threshold to value i.e. 1000 and we know that at 1000MHz
>>> and full utilization (100%) the counter will reach that threshold
>>> after 500ms (which we want, because we don't want too many interrupts
>>> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
>>> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
>>> threshold after 50*500ms = 25s. It is impossible just for the counters
>>> to predict next utilization and adjust the threshold. [...]
>>
>> irq triggers for underflow and overflow, so driver can adjust freq
>>
>
> Probably possible on some platforms; it depends on how many PMU registers
> are available, what information can be assigned to them, and the type of
> interrupt. A lot of hassle and still platform and device specific.
> Also, drivers should not adjust the frequency; governors (different types
> of them, with different settings that they can handle) should do it.
>
> What the framework can do is take this responsibility and provide a
> generic way to monitor the devices (or stop if they are suspended).
> That should work nicely with the governors, which try to predict the
> next best frequency. From my experience, the more the intervals at which
> the governors are called fluctuate, the more odd decisions they make.
> That's why I think having a predictable interval, e.g. 100ms, is
> something desirable. Tuning the governors is easier in this case,
> statistics are easier to trace and interpret, the solution is not too
> platform specific, etc.
>
> Kamil do you have plans to refresh and push your next version of the
> workqueue solution?
I do not, as Bartek takes over my work,
+CC Bartek
--
Best regards,
Kamil Konieczny
Samsung R&D Institute Poland
On 6/25/20 2:12 PM, Kamil Konieczny wrote:
> On 25.06.2020 14:02, Lukasz Luba wrote:
>>
>>
>> On 6/25/20 12:30 PM, Kamil Konieczny wrote:
>>> Hi Lukasz,
>>>
>>> On 25.06.2020 12:02, Lukasz Luba wrote:
>>>> Hi Sylwester,
>>>>
>>>> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>>>>> Hi All,
>>>>>
>>>>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>>>>> I had issues with devfreq governor which wasn't called by devfreq
>>>>>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>>>>>> for it [1]. If the CPU which scheduled the next work went idle, the
>>>>>> devfreq workqueue will not be kicked and devfreq governor won't check
>>>>>> DMC status and will not decide to decrease the frequency based on low
>>>>>> busy_time.
>>>>>> The same applies for going up with the frequency. They both are
>>>>>> done by the governor but the workqueue must be scheduled periodically.
>>>>>
>>>>> As I have been working on resolving the video mixer IOMMU fault issue
>>>>> described here: https://patchwork.kernel.org/patch/10861757
>>>>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>>>>
>>>>> My conclusions are similar to what Lukasz says above. I would like to add
>>>>> that broken scheduling of the performance counters read and the devfreq
>>>>> updates seems to have one more serious implication. In each call, which
>>>>> normally should happen periodically with fixed interval we stop the counters,
>>>>> read counter values and start the counters again. But if period between
>>>>> calls becomes long enough to let any of the counters overflow, we will
>>>>> get wrong performance measurement results. My observations are that
>>>>> the workqueue job can be suspended for several seconds and conditions for
>>>>> the counter overflow occur sooner or later, depending among others
>>>>> on the CPUs load.
>>>>> Wrong bus load measurement can lead to setting too low interconnect bus
>>>>> clock frequency and then bad things happen in peripheral devices.
>>>>>
>>>>> I agree the workqueue issue needs to be fixed. I have some WIP code to use
>>>>> the performance counters overflow interrupts instead of SW polling and with
>>>>> that the interconnect bus clock control seems to work much better.
>>>>>
>>>>
>>>> Thank you for sharing your use case and investigation results. I think
>>>> we are reaching a decent number of developers to maybe address this
>>>> issue: 'workqueue issue needs to be fixed'.
>>>> I have been facing this devfreq workqueue issue ~5 times in different
>>>> platforms.
>>>>
>>>> Regarding the 'performance counters overflow interrupts' there is one
>>>> thing worth to keep in mind: variable utilization and frequency.
>>>> For example, in order to make a conclusion in algorithm deciding that
>>>> the device should increase or decrease the frequency, we fix the period
>>>> of observation, i.e. to 500ms. That can cause the long delay if the
>>>> utilization of the device suddenly drops. For example we set an
>>>> overflow threshold to value i.e. 1000 and we know that at 1000MHz
>>>> and full utilization (100%) the counter will reach that threshold
>>>> after 500ms (which we want, because we don't want too many interrupts
>>>> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
>>>> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
>>>> threshold after 50*500ms = 25s. It is impossible just for the counters
>>>> to predict next utilization and adjust the threshold. [...]
>>>
>>> irq triggers for underflow and overflow, so driver can adjust freq
>>>
>>
>> Probably possible on some platforms; it depends on how many PMU registers
>> are available, what information can be assigned to them, and the type of
>> interrupt. A lot of hassle and still platform and device specific.
>> Also, drivers should not adjust the frequency; governors (different types
>> of them, with different settings that they can handle) should do it.
>>
>> What the framework can do is take this responsibility and provide a
>> generic way to monitor the devices (or stop if they are suspended).
>> That should work nicely with the governors, which try to predict the
>> next best frequency. From my experience, the more the intervals at which
>> the governors are called fluctuate, the more odd decisions they make.
>> That's why I think having a predictable interval, e.g. 100ms, is
>> something desirable. Tuning the governors is easier in this case,
>> statistics are easier to trace and interpret, the solution is not too
>> platform specific, etc.
>>
>> Kamil do you have plans to refresh and push your next version of the
>> workqueue solution?
>
> I do not, as Bartek takes over my work,
> +CC Bartek
Hi Lukasz,
As you remember, in January Chanwoo proposed another idea (to allow
selecting the workqueue type by the devfreq device driver):
"I'm developing the RFC patch and then I'll send it as soon as possible."
(https://lore.kernel.org/linux-pm/[email protected]/)
"After posting my suggestion, we can discuss it"
(https://lore.kernel.org/linux-pm/[email protected]/)
so we have been waiting on the patch to be posted..
Similarly we have been waiting on (any) feedback for exynos-bus/nocp
fixes for Exynos5422 support (which have been posted by Kamil also in
January):
https://lore.kernel.org/linux-pm/[email protected]/
Considering the above, and how hard it has been to push changes
through the review/merge process last year, we are close to giving up
when it comes to upstream devfreq contributions. Sylwester is still
working on exynos-bus & interconnect integration (a continuation of
Artur Swigon's work from last year) & related issues (IRQ support for
PPMU), but I'm seriously considering putting it all on hold..
Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics
Hi Lukasz,
On 25.06.2020 12:02, Lukasz Luba wrote:
> Regarding the 'performance counters overflow interrupts' there is one
> thing worth to keep in mind: variable utilization and frequency.
> For example, in order to make a conclusion in algorithm deciding that
> the device should increase or decrease the frequency, we fix the period
> of observation, i.e. to 500ms. That can cause the long delay if the
> utilization of the device suddenly drops. For example we set an
> overflow threshold to value i.e. 1000 and we know that at 1000MHz
> and full utilization (100%) the counter will reach that threshold
> after 500ms (which we want, because we don't want too many interrupts
> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
> threshold after 50*500ms = 25s. It is impossible just for the counters
> to predict next utilization and adjust the threshold.
Agreed, that's the case when we use just the performance counter (PMCNT)
overflow interrupts. In my experiments I used the (total) cycle counter
(CCNT) overflow interrupts. As that counter is clocked at a fixed rate
between devfreq updates, it can be used as a timer by pre-loading it with
an initial value depending on the current bus frequency. But we could as
well use some reliable system timer mechanism to generate periodic events.
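
A rough sketch of that pre-load computation (assuming a 32-bit
up-counting CCNT; the helper name is made up):

#include <linux/types.h>
#include <linux/math64.h>

/* pre-load value so the counter overflows after period_ms at freq_hz */
static u32 ccnt_preload(u64 freq_hz, u32 period_ms)
{
        u64 cycles = div_u64(freq_hz * period_ms, 1000);

        return (u32)(0x100000000ULL - cycles);
}

/* e.g. a 400MHz bus clock and a 100ms period give 40,000,000 cycles,
 * so the counter would be pre-loaded with 2^32 - 40,000,000 */
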
I was hoping to use the cycle counter to generate low-frequency monitor
events and the actual performance counter overflow interrupts to detect
any sudden changes of utilization. However, it seems it cannot be done
with performance counter HW as simple as on Exynos4412.
It looks like on Exynos5422 we have all that is needed; there is more
flexibility in selecting the counter source signal, e.g. each counter
can be a clock cycle counter or can count various bus events related to
actual utilization. Moreover, we could configure the counter gating period,
and alarm interrupts are available for when the counter value drops below
a configured MIN threshold or exceeds a configured MAX value.
So it should be possible to configure the HW to generate the utilization
monitoring events without excessive continuous CPU intervention.
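
For illustration, a sketch of deriving such an alarm band (names and
the 10%/90% band are made up; the real PPMU register layout differs):

#include <linux/types.h>

/* MIN/MAX alarm thresholds for one counter gating period, derived from
 * the event count expected at 100% bus utilization */
static void alarm_band(u32 count_at_full_load, u32 *min_thr, u32 *max_thr)
{
        *min_thr = count_at_full_load / 10;      /* alarm below ~10% load */
        *max_thr = count_at_full_load / 10 * 9;  /* alarm above ~90% load */
}
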
But I'm rather not going to work on the Exynos5422 SoC support at the moment.
> To address that, we still need another mechanism (like a watchdog)
> which is triggered just to check whether the threshold needs adjustment.
> This mechanism can be a local timer in the driver or a framework
> timer running a kind of 'for loop' over all devices of this type (like
> the scheduled workqueue). In both cases the system will see
> interrupts, timers (even in workqueues) and scheduling.
> The approach of forcing developers to implement their own local watchdog
> timers (or workqueues) in drivers is IMHO wrong, and that's why we have
> frameworks.
Yes, it should also be possible for the framework to use the counter alarm
events where the hardware is advanced enough, in order to avoid excessive
SW polling.
--
Regards,
Sylwester
Hi,
Sorry for the late reply; because of a personal issue I could not check email.
On 6/26/20 8:22 PM, Bartlomiej Zolnierkiewicz wrote:
>
> On 6/25/20 2:12 PM, Kamil Konieczny wrote:
>> On 25.06.2020 14:02, Lukasz Luba wrote:
>>>
>>>
>>> On 6/25/20 12:30 PM, Kamil Konieczny wrote:
>>>> Hi Lukasz,
>>>>
>>>> On 25.06.2020 12:02, Lukasz Luba wrote:
>>>>> Hi Sylwester,
>>>>>
>>>>> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>>>>>> I had issues with devfreq governor which wasn't called by devfreq
>>>>>>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>>>>>>> for it [1]. If the CPU which scheduled the next work went idle, the
>>>>>>> devfreq workqueue will not be kicked and devfreq governor won't check
>>>>>>> DMC status and will not decide to decrease the frequency based on low
>>>>>>> busy_time.
>>>>>>> The same applies for going up with the frequency. They both are
>>>>>>> done by the governor but the workqueue must be scheduled periodically.
>>>>>>
>>>>>> As I have been working on resolving the video mixer IOMMU fault issue
>>>>>> described here: https://patchwork.kernel.org/patch/10861757
>>>>>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>>>>>
>>>>>> My conclusions are similar to what Lukasz says above. I would like to add
>>>>>> that broken scheduling of the performance counters read and the devfreq
>>>>>> updates seems to have one more serious implication. In each call, which
>>>>>> normally should happen periodically with fixed interval we stop the counters,
>>>>>> read counter values and start the counters again. But if period between
>>>>>> calls becomes long enough to let any of the counters overflow, we will
>>>>>> get wrong performance measurement results. My observations are that
>>>>>> the workqueue job can be suspended for several seconds and conditions for
>>>>>> the counter overflow occur sooner or later, depending among others
>>>>>> on the CPUs load.
>>>>>> Wrong bus load measurement can lead to setting too low interconnect bus
>>>>>> clock frequency and then bad things happen in peripheral devices.
>>>>>>
>>>>>> I agree the workqueue issue needs to be fixed. I have some WIP code to use
>>>>>> the performance counters overflow interrupts instead of SW polling and with
>>>>>> that the interconnect bus clock control seems to work much better.
>>>>>>
>>>>>
>>>>> Thank you for sharing your use case and investigation results. I think
>>>>> we are reaching a decent number of developers to maybe address this
>>>>> issue: 'workqueue issue needs to be fixed'.
>>>>> I have been facing this devfreq workqueue issue ~5 times in different
>>>>> platforms.
>>>>>
>>>>> Regarding the 'performance counters overflow interrupts' there is one
>>>>> thing worth to keep in mind: variable utilization and frequency.
>>>>> For example, in order to make a conclusion in algorithm deciding that
>>>>> the device should increase or decrease the frequency, we fix the period
>>>>> of observation, i.e. to 500ms. That can cause the long delay if the
>>>>> utilization of the device suddenly drops. For example we set an
>>>>> overflow threshold to value i.e. 1000 and we know that at 1000MHz
>>>>> and full utilization (100%) the counter will reach that threshold
>>>>> after 500ms (which we want, because we don't want too many interrupts
>>>>> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
>>>>> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
>>>>> threshold after 50*500ms = 25s. It is impossible just for the counters
>>>>> to predict next utilization and adjust the threshold. [...]
>>>>
>>>> irq triggers for underflow and overflow, so driver can adjust freq
>>>>
>>>
>>> Probably possible on some platforms; it depends on how many PMU registers
>>> are available, what information can be assigned to them, and the type of
>>> interrupt. A lot of hassle and still platform and device specific.
>>> Also, drivers should not adjust the frequency; governors (different types
>>> of them, with different settings that they can handle) should do it.
>>>
>>> What the framework can do is take this responsibility and provide a
>>> generic way to monitor the devices (or stop if they are suspended).
>>> That should work nicely with the governors, which try to predict the
>>> next best frequency. From my experience, the more the intervals at which
>>> the governors are called fluctuate, the more odd decisions they make.
>>> That's why I think having a predictable interval, e.g. 100ms, is
>>> something desirable. Tuning the governors is easier in this case,
>>> statistics are easier to trace and interpret, the solution is not too
>>> platform specific, etc.
>>>
>>> Kamil do you have plans to refresh and push your next version of the
>>> workqueue solution?
>>
>> I do not, as Bartek takes over my work,
>> +CC Bartek
>
> Hi Lukasz,
>
> As you remember, in January Chanwoo proposed another idea (to allow
> selecting the workqueue type by the devfreq device driver):
>
> "I'm developing the RFC patch and then I'll send it as soon as possible."
> (https://lore.kernel.org/linux-pm/[email protected]/)
>
> "After posting my suggestion, we can discuss it"
> (https://lore.kernel.org/linux-pm/[email protected]/)
>
> so we have been waiting on the patch to be posted..
Sorry for this. I'll send it within a few days.
>
> Similarly we have been waiting on (any) feedback for exynos-bus/nocp
> fixes for Exynos5422 support (which have been posted by Kamil also in
> January):
>
> https://lore.kernel.org/linux-pm/[email protected]/
>
> Considering the above, and how hard it has been to push changes
> through the review/merge process last year, we are close to giving up
> when it comes to upstream devfreq contributions. Sylwester is still
> working on exynos-bus & interconnect integration (a continuation of
> Artur Swigon's work from last year) & related issues (IRQ support for
> PPMU), but I'm seriously considering putting it all on hold..
Sylwester's patches (originally Artur Swigon's patches) were reviewed
and I agreed with this approach for devfreq/interconnect. It still needs
review from the interconnect maintainer.
>
> Best regards,
> --
> Bartlomiej Zolnierkiewicz
> Samsung R&D Institute Poland
> Samsung Electronics
>
>
--
Best Regards,
Chanwoo Choi
Samsung Electronics
Hi Sylwester,
On 6/25/20 12:11 AM, Sylwester Nawrocki wrote:
> Hi All,
>
> On 24.06.2020 12:32, Lukasz Luba wrote:
>> I had issues with devfreq governor which wasn't called by devfreq
>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>> for it [1]. If the CPU which scheduled the next work went idle, the
>> devfreq workqueue will not be kicked and devfreq governor won't check
>> DMC status and will not decide to decrease the frequency based on low
>> busy_time.
>> The same applies for going up with the frequency. They both are
>> done by the governor but the workqueue must be scheduled periodically.
>
> As I have been working on resolving the video mixer IOMMU fault issue
> described here: https://patchwork.kernel.org/patch/10861757
> I did some investigation of the devfreq operation, mostly on Odroid U3.
>
> My conclusions are similar to what Lukasz says above. I would like to add
> that broken scheduling of the performance counters read and the devfreq
> updates seems to have one more serious implication. In each call, which
> normally should happen periodically with fixed interval we stop the counters,
> read counter values and start the counters again. But if period between
> calls becomes long enough to let any of the counters overflow, we will
> get wrong performance measurement results. My observations are that
> the workqueue job can be suspended for several seconds and conditions for
> the counter overflow occur sooner or later, depending among others
> on the CPUs load.
> Wrong bus load measurement can lead to setting too low interconnect bus
> clock frequency and then bad things happen in peripheral devices.
>
> I agree the workqueue issue needs to be fixed. I have some WIP code to use
> the performance counters overflow interrupts instead of SW polling and with
It is a good way to resolve the overflow issue.
> that the interconnect bus clock control seems to work much better.
>
--
Best Regards,
Chanwoo Choi
Samsung Electronics
On 6/26/20 12:22 PM, Bartlomiej Zolnierkiewicz wrote:
>
> On 6/25/20 2:12 PM, Kamil Konieczny wrote:
>> On 25.06.2020 14:02, Lukasz Luba wrote:
>>>
>>>
>>> On 6/25/20 12:30 PM, Kamil Konieczny wrote:
[snip]
>>>
>>> Kamil do you have plans to refresh and push your next version of the
>>> workqueue solution?
>>
>> I do not, as Bartek takes over my work,
>> +CC Bartek
>
> Hi Lukasz,
Hi Bartek,
>
> As you remember, in January Chanwoo proposed another idea (to allow
> selecting the workqueue type by the devfreq device driver):
>
> "I'm developing the RFC patch and then I'll send it as soon as possible."
> (https://lore.kernel.org/linux-pm/[email protected]/)
>
> "After posting my suggestion, we can discuss it"
> (https://lore.kernel.org/linux-pm/[email protected]/)
>
> so we have been waiting on the patch to be posted..
>
> Similarly we have been waiting on (any) feedback for exynos-bus/nocp
> fixes for Exynos5422 support (which have been posted by Kamil also in
> January):
>
> https://lore.kernel.org/linux-pm/[email protected]/
>
> Considering the above, and how hard it has been to push changes
> through the review/merge process last year, we are close to giving up
> when it comes to upstream devfreq contributions. Sylwester is still
> working on exynos-bus & interconnect integration (a continuation of
> Artur Swigon's work from last year) & related issues (IRQ support for
> PPMU), but I'm seriously considering putting it all on hold..
Thank you for the detailed explanation and update. I see. Anyway, if you
or Sylwester need some help with this devfreq workqueue, I offer my time
as a reviewer and tester.
The more generic the solution you propose, the better for all platforms.
Regards,
Lukasz
>
> Best regards,
> --
> Bartlomiej Zolnierkiewicz
> Samsung R&D Institute Poland
> Samsung Electronics
>
On 6/26/20 6:50 PM, Sylwester Nawrocki wrote:
> Hi Lukasz,
>
> On 25.06.2020 12:02, Lukasz Luba wrote:
>> Regarding the 'performance counters overflow interrupts' there is one
>> thing worth to keep in mind: variable utilization and frequency.
>> For example, in order to make a conclusion in algorithm deciding that
>> the device should increase or decrease the frequency, we fix the period
>> of observation, i.e. to 500ms. That can cause the long delay if the
>> utilization of the device suddenly drops. For example we set an
>> overflow threshold to value i.e. 1000 and we know that at 1000MHz
>> and full utilization (100%) the counter will reach that threshold
>> after 500ms (which we want, because we don't want too many interrupts
>> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
>> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
>> threshold after 50*500ms = 25s. It is impossible just for the counters
>> to predict next utilization and adjust the threshold.
>
> Agreed, that's the case when we use just the performance counter (PMCNT)
> overflow interrupts. In my experiments I used the (total) cycle counter
> (CCNT) overflow interrupts. As that counter is clocked at a fixed rate
> between devfreq updates, it can be used as a timer by pre-loading it with
> an initial value depending on the current bus frequency. But we could as
> well use some reliable system timer mechanism to generate periodic events.
> I was hoping to use the cycle counter to generate low-frequency monitor
> events and the actual performance counter overflow interrupts to detect
> any sudden changes of utilization. However, it seems it cannot be done
> with performance counter HW as simple as on Exynos4412.
> It looks like on Exynos5422 we have all that is needed; there is more
> flexibility in selecting the counter source signal, e.g. each counter
> can be a clock cycle counter or can count various bus events related to
> actual utilization. Moreover, we could configure the counter gating period,
> and alarm interrupts are available for when the counter value drops below
> a configured MIN threshold or exceeds a configured MAX value.
I see. I don't have the TRM for Exynos5422, so I couldn't check that. I also
have to keep in mind other platforms which might not have this feature.
>
> So it should be possible to configure the HW to generate the utilization
> monitoring events without excessive continuous CPU intervention.
I agree, that would be desirable, especially for low load in the system.
> But I'm rather not going to work on the Exynos5422 SoC support at the moment.
I see.
>
>> To address that, we still need another mechanism (like a watchdog)
>> which is triggered just to check whether the threshold needs adjustment.
>> This mechanism can be a local timer in the driver or a framework
>> timer running a kind of 'for loop' over all devices of this type (like
>> the scheduled workqueue). In both cases the system will see
>> interrupts, timers (even in workqueues) and scheduling.
>> The approach of forcing developers to implement their own local watchdog
>> timers (or workqueues) in drivers is IMHO wrong, and that's why we have
>> frameworks.
>
> Yes, it should also be possible for the framework to use the counter alarm
> events where the hardware is advanced enough, in order to avoid excessive
> SW polling.
Looks promising, but that would need more plumbing, I assume.
Regards,
Lukasz
>
> --
> Regards,
> Sylwester
>
Hi Chanwoo,
On 6/29/20 2:43 AM, Chanwoo Choi wrote:
> Hi,
>
> Sorry for the late reply; because of a personal issue I could not check email.
I hope you are good now.
>
> On 6/26/20 8:22 PM, Bartlomiej Zolnierkiewicz wrote:
>>
>> On 6/25/20 2:12 PM, Kamil Konieczny wrote:
>>> On 25.06.2020 14:02, Lukasz Luba wrote:
>>>>
>>>>
>>>> On 6/25/20 12:30 PM, Kamil Konieczny wrote:
>>>>> Hi Lukasz,
>>>>>
>>>>> On 25.06.2020 12:02, Lukasz Luba wrote:
>>>>>> Hi Sylwester,
>>>>>>
>>>>>> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>>>>>>> I had issues with devfreq governor which wasn't called by devfreq
>>>>>>>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>>>>>>>> for it [1]. If the CPU which scheduled the next work went idle, the
>>>>>>>> devfreq workqueue will not be kicked and devfreq governor won't check
>>>>>>>> DMC status and will not decide to decrease the frequency based on low
>>>>>>>> busy_time.
>>>>>>>> The same applies for going up with the frequency. They both are
>>>>>>>> done by the governor but the workqueue must be scheduled periodically.
>>>>>>>
>>>>>>> As I have been working on resolving the video mixer IOMMU fault issue
>>>>>>> described here: https://patchwork.kernel.org/patch/10861757
>>>>>>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>>>>>>
>>>>>>> My conclusions are similar to what Lukasz says above. I would like to add
>>>>>>> that broken scheduling of the performance counters read and the devfreq
>>>>>>> updates seems to have one more serious implication. In each call, which
>>>>>>> normally should happen periodically with fixed interval we stop the counters,
>>>>>>> read counter values and start the counters again. But if period between
>>>>>>> calls becomes long enough to let any of the counters overflow, we will
>>>>>>> get wrong performance measurement results. My observations are that
>>>>>>> the workqueue job can be suspended for several seconds and conditions for
>>>>>>> the counter overflow occur sooner or later, depending among others
>>>>>>> on the CPUs load.
>>>>>>> Wrong bus load measurement can lead to setting too low interconnect bus
>>>>>>> clock frequency and then bad things happen in peripheral devices.
>>>>>>>
>>>>>>> I agree the workqueue issue needs to be fixed. I have some WIP code to use
>>>>>>> the performance counters overflow interrupts instead of SW polling and with
>>>>>>> that the interconnect bus clock control seems to work much better.
>>>>>>>
>>>>>>
>>>>>> Thank you for sharing your use case and investigation results. I think
>>>>>> we are reaching a decent number of developers to maybe address this
>>>>>> issue: 'workqueue issue needs to be fixed'.
>>>>>> I have been facing this devfreq workqueue issue ~5 times in different
>>>>>> platforms.
>>>>>>
>>>>>> Regarding the 'performance counters overflow interrupts' there is one
>>>>>> thing worth to keep in mind: variable utilization and frequency.
>>>>>> For example, in order to make a conclusion in algorithm deciding that
>>>>>> the device should increase or decrease the frequency, we fix the period
>>>>>> of observation, i.e. to 500ms. That can cause the long delay if the
>>>>>> utilization of the device suddenly drops. For example we set an
>>>>>> overflow threshold to value i.e. 1000 and we know that at 1000MHz
>>>>>> and full utilization (100%) the counter will reach that threshold
>>>>>> after 500ms (which we want, because we don't want too many interrupts
>>>>>> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
>>>>>> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
>>>>>> threshold after 50*500ms = 25s. It is impossible just for the counters
>>>>>> to predict next utilization and adjust the threshold. [...]
>>>>>
>>>>> irq triggers for underflow and overflow, so driver can adjust freq
>>>>>
>>>>
>>>> Probably possible on some platforms; it depends on how many PMU registers
>>>> are available, what information can be assigned to them, and the type of
>>>> interrupt. A lot of hassle and still platform and device specific.
>>>> Also, drivers should not adjust the frequency; governors (different types
>>>> of them, with different settings that they can handle) should do it.
>>>>
>>>> What the framework can do is take this responsibility and provide a
>>>> generic way to monitor the devices (or stop if they are suspended).
>>>> That should work nicely with the governors, which try to predict the
>>>> next best frequency. From my experience, the more the intervals at which
>>>> the governors are called fluctuate, the more odd decisions they make.
>>>> That's why I think having a predictable interval, e.g. 100ms, is
>>>> something desirable. Tuning the governors is easier in this case,
>>>> statistics are easier to trace and interpret, the solution is not too
>>>> platform specific, etc.
>>>>
>>>> Kamil do you have plans to refresh and push your next version of the
>>>> workqueue solution?
>>>
>>> I do not, as Bartek takes over my work,
>>> +CC Bartek
>>
>> Hi Lukasz,
>>
>> As you remember, in January Chanwoo proposed another idea (to allow
>> selecting the workqueue type by the devfreq device driver):
>>
>> "I'm developing the RFC patch and then I'll send it as soon as possible."
>> (https://lore.kernel.org/linux-pm/[email protected]/)
>>
>> "After posting my suggestion, we can discuss it"
>> (https://lore.kernel.org/linux-pm/[email protected]/)
>>
>> so we have been waiting on the patch to be posted..
>
> Sorry for this. I'll send it within a few days.
Feel free to add me on CC; I can review and test the patches if you like.
Stay safe and healthy.
Regards,
Lukasz
On 2020-06-29-12-52-10, Lukasz Luba wrote:
> Hi Chanwoo,
>
> On 6/29/20 2:43 AM, Chanwoo Choi wrote:
> > Hi,
> >
> > Sorry for the late reply; because of a personal issue I could not check email.
>
> I hope you are good now.
>
> >
> > On 6/26/20 8:22 PM, Bartlomiej Zolnierkiewicz wrote:
> > >
> > > On 6/25/20 2:12 PM, Kamil Konieczny wrote:
> > > > On 25.06.2020 14:02, Lukasz Luba wrote:
> > > > >
> > > > >
> > > > > On 6/25/20 12:30 PM, Kamil Konieczny wrote:
> > > > > > Hi Lukasz,
> > > > > >
> > > > > > On 25.06.2020 12:02, Lukasz Luba wrote:
> > > > > > > Hi Sylwester,
> > > > > > >
> > > > > > > On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > On 24.06.2020 12:32, Lukasz Luba wrote:
> > > > > > > > > I had issues with devfreq governor which wasn't called by devfreq
> > > > > > > > > workqueue. The old DELAYED vs DEFERRED work discussions and my patches
> > > > > > > > > for it [1]. If the CPU which scheduled the next work went idle, the
> > > > > > > > > devfreq workqueue will not be kicked and devfreq governor won't check
> > > > > > > > > DMC status and will not decide to decrease the frequency based on low
> > > > > > > > > busy_time.
> > > > > > > > > The same applies for going up with the frequency. They both are
> > > > > > > > > done by the governor but the workqueue must be scheduled periodically.
> > > > > > > >
> > > > > > > > As I have been working on resolving the video mixer IOMMU fault issue
> > > > > > > > described here: https://patchwork.kernel.org/patch/10861757
> > > > > > > > I did some investigation of the devfreq operation, mostly on Odroid U3.
> > > > > > > >
> > > > > > > > My conclusions are similar to what Lukasz says above. I would like to add
> > > > > > > > that broken scheduling of the performance counters read and the devfreq
> > > > > > > > updates seems to have one more serious implication. In each call, which
> > > > > > > > normally should happen periodically with fixed interval we stop the counters,
> > > > > > > > read counter values and start the counters again. But if period between
> > > > > > > > calls becomes long enough to let any of the counters overflow, we will
> > > > > > > > get wrong performance measurement results. My observations are that
> > > > > > > > the workqueue job can be suspended for several seconds and conditions for
> > > > > > > > the counter overflow occur sooner or later, depending among others
> > > > > > > > on the CPUs load.
> > > > > > > > Wrong bus load measurement can lead to setting too low interconnect bus
> > > > > > > > clock frequency and then bad things happen in peripheral devices.
> > > > > > > >
> > > > > > > > I agree the workqueue issue needs to be fixed. I have some WIP code to use
> > > > > > > > the performance counters overflow interrupts instead of SW polling and with
> > > > > > > > that the interconnect bus clock control seems to work much better.
> > > > > > > >
> > > > > > >
> > > > > > > Thank you for sharing your use case and investigation results. I think
> > > > > > > we are reaching a decent number of developers to maybe address this
> > > > > > > issue: 'workqueue issue needs to be fixed'.
> > > > > > > I have been facing this devfreq workqueue issue ~5 times in different
> > > > > > > platforms.
> > > > > > >
> > > > > > > Regarding the 'performance counters overflow interrupts' there is one
> > > > > > > thing worth to keep in mind: variable utilization and frequency.
> > > > > > > For example, in order to make a conclusion in algorithm deciding that
> > > > > > > the device should increase or decrease the frequency, we fix the period
> > > > > > > of observation, i.e. to 500ms. That can cause the long delay if the
> > > > > > > utilization of the device suddenly drops. For example we set an
> > > > > > > overflow threshold to value i.e. 1000 and we know that at 1000MHz
> > > > > > > and full utilization (100%) the counter will reach that threshold
> > > > > > > after 500ms (which we want, because we don't want too many interrupts
> > > > > > > per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
> > > > > > > to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
> > > > > > > threshold after 50*500ms = 25s. It is impossible just for the counters
> > > > > > > to predict next utilization and adjust the threshold. [...]
> > > > > >
> > > > > > irq triggers for underflow and overflow, so driver can adjust freq
> > > > > >
> > > > >
> > > > > Probably possible on some platforms; it depends on how many PMU registers
> > > > > are available, what information can be assigned to them, and the type of
> > > > > interrupt. A lot of hassle and still platform and device specific.
> > > > > Also, drivers should not adjust the frequency; governors (different types
> > > > > of them, with different settings that they can handle) should do it.
> > > > >
> > > > > What the framework can do is take this responsibility and provide a
> > > > > generic way to monitor the devices (or stop if they are suspended).
> > > > > That should work nicely with the governors, which try to predict the
> > > > > next best frequency. From my experience, the more the intervals at which
> > > > > the governors are called fluctuate, the more odd decisions they make.
> > > > > That's why I think having a predictable interval, e.g. 100ms, is
> > > > > something desirable. Tuning the governors is easier in this case,
> > > > > statistics are easier to trace and interpret, the solution is not too
> > > > > platform specific, etc.
> > > > >
> > > > > Kamil do you have plans to refresh and push your next version of the
> > > > > workqueue solution?
> > > >
> > > > I do not, as Bartek takes over my work,
> > > > +CC Bartek
> > >
> > > Hi Lukasz,
> > >
> > > As you remember, in January Chanwoo proposed another idea (to allow
> > > selecting the workqueue type by the devfreq device driver):
> > >
> > > "I'm developing the RFC patch and then I'll send it as soon as possible."
> > > (https://lore.kernel.org/linux-pm/[email protected]/)
> > >
> > > "After posting my suggestion, we can discuss it"
> > > (https://lore.kernel.org/linux-pm/[email protected]/)
> > >
> > > so we have been waiting on the patch to be posted..
> >
> > Sorry for this. I'll send it within a few days.
>
>
> Feel free to add me on CC; I can review and test the patches if you like.
Please CC me too.
>
> Stay safe and healthy.
>
> Regards,
> Lukasz
>
Cheers,
Willy