LinuxLists.cc - [PATCH v2] clk: mediatek: clk-mux: Add .determine

2022-10-11 14:08:10

Subject: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
to the parent"), the clk_rate_request is .. as the title says, not
forwarded anymore to the parent: this produces an issue with the
MediaTek clock MUX driver during GPU DVFS on MT8195, but not on
MT8192 or others.

This is because, differently from others, like MT8192 where all of
the clocks in the MFG parents tree are of mtk_mux type, but in the
parent tree of MT8195's MFG clock, we have one mtk_mux clock and
one (clk framework generic) mux clock, like so:

names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
types: mtk_gate -> mux -> mtk_mux (or) mtk_pll

To solve this issue and also keep the GPU DVFS clocks code working
as expected, wire up a .determine_rate() callback for the mtk_mux
ops; for that, the standard clk_mux_determine_rate_flags() was used
as it was possible to.

This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
MT8192 Spherion and MT8195 Tomato; no regressions were seen.

For the sake of some more documentation about this issue here's the
trace of it:

[ 12.211587] ------------[ cut here ]------------
[ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
[ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
snd_sof_utils fuse ipv6
[ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
[ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
[ 12.211617] Workqueue: devfreq_wq devfreq_monitor
[ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
[ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
[ 12.211627] sp : ffff80000893b8e0
[ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
[ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
[ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
[ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
[ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
[ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
[ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
[ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
[ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
[ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
[ 12.211669] Call trace:
[ 12.211670] clk_core_init_rate_req+0x84/0x90
[ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
[ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
[ 12.211677] clk_mux_determine_rate+0x1c/0x30
[ 12.211680] clk_core_determine_round_nolock+0x74/0x130
[ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
[ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
[ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
[ 12.211688] clk_set_rate+0x40/0x94
[ 12.211691] _opp_config_clk_single+0x38/0xa0
[ 12.211693] _set_opp+0x1b0/0x500
[ 12.211695] dev_pm_opp_set_rate+0x120/0x290
[ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
[ 12.211705] devfreq_set_target+0x8c/0x2d0
[ 12.211707] devfreq_update_target+0xcc/0xf4
[ 12.211708] devfreq_monitor+0x40/0x1d0
[ 12.211710] process_one_work+0x294/0x664
[ 12.211712] worker_thread+0x7c/0x45c
[ 12.211713] kthread+0x104/0x110
[ 12.211716] ret_from_fork+0x10/0x20
[ 12.211718] irq event stamp: 7102
[ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
[ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
[ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
[ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
[ 12.211730] ---[ end trace 0000000000000000 ]---

Fixes: 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests to the parent")
Signed-off-by: AngeloGioacchino Del Regno <[email protected]>
---

Changes in v2:
- Removed changes not relevant to this commit (ugh, sorry!)

drivers/clk/mediatek/clk-mux.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/drivers/clk/mediatek/clk-mux.c b/drivers/clk/mediatek/clk-mux.c
index 4421e4859257..ba1720b9e231 100644
--- a/drivers/clk/mediatek/clk-mux.c
+++ b/drivers/clk/mediatek/clk-mux.c
@@ -129,9 +129,18 @@ static int mtk_clk_mux_set_parent_setclr_lock(struct clk_hw *hw, u8 index)
return 0;
}

+static int mtk_clk_mux_determine_rate(struct clk_hw *hw,
+ struct clk_rate_request *req)
+{
+ struct mtk_clk_mux *mux = to_mtk_clk_mux(hw);
+
+ return clk_mux_determine_rate_flags(hw, req, mux->data->flags);
+}
+
const struct clk_ops mtk_mux_clr_set_upd_ops = {
.get_parent = mtk_clk_mux_get_parent,
.set_parent = mtk_clk_mux_set_parent_setclr_lock,
+ .determine_rate = mtk_clk_mux_determine_rate,
};
EXPORT_SYMBOL_GPL(mtk_mux_clr_set_upd_ops);

@@ -141,6 +150,7 @@ const struct clk_ops mtk_mux_gate_clr_set_upd_ops = {
.is_enabled = mtk_clk_mux_is_enabled,
.get_parent = mtk_clk_mux_get_parent,
.set_parent = mtk_clk_mux_set_parent_setclr_lock,
+ .determine_rate = mtk_clk_mux_determine_rate,
};
EXPORT_SYMBOL_GPL(mtk_mux_gate_clr_set_upd_ops);

--
2.37.2

2022-10-12 08:58:25

by Maxime Ripard

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Hi,

On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
> Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
> to the parent"), the clk_rate_request is .. as the title says, not
> forwarded anymore to the parent:

It's not entirely true, the rate request should still be forwarded, but
we don't pass the same instance of clk_rate_request anymore.

> this produces an issue with the MediaTek clock MUX driver during GPU
> DVFS on MT8195, but not on MT8192 or others.
>
> This is because, differently from others, like MT8192 where all of
> the clocks in the MFG parents tree are of mtk_mux type, but in the
> parent tree of MT8195's MFG clock, we have one mtk_mux clock and
> one (clk framework generic) mux clock, like so:
>
> names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
> types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
>
> To solve this issue and also keep the GPU DVFS clocks code working
> as expected, wire up a .determine_rate() callback for the mtk_mux
> ops; for that, the standard clk_mux_determine_rate_flags() was used
> as it was possible to.

It probably fixes things indeed, but I'm a bit worried that it just
works around the actual issue instead of fixing the actual bug...

> This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
> MT8192 Spherion and MT8195 Tomato; no regressions were seen.
>
> For the sake of some more documentation about this issue here's the
> trace of it:
>
> [ 12.211587] ------------[ cut here ]------------
> [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
> [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
> panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
> cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
> snd_sof_utils fuse ipv6
> [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
> [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
> [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
> [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
> [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
> [ 12.211627] sp : ffff80000893b8e0
> [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
> [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
> [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
> [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
> [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
> [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
> [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
> [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
> [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
> [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
> [ 12.211669] Call trace:
> [ 12.211670] clk_core_init_rate_req+0x84/0x90
> [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
> [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
> [ 12.211677] clk_mux_determine_rate+0x1c/0x30
> [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
> [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
> [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
> [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
> [ 12.211688] clk_set_rate+0x40/0x94
> [ 12.211691] _opp_config_clk_single+0x38/0xa0
> [ 12.211693] _set_opp+0x1b0/0x500
> [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
> [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
> [ 12.211705] devfreq_set_target+0x8c/0x2d0
> [ 12.211707] devfreq_update_target+0xcc/0xf4
> [ 12.211708] devfreq_monitor+0x40/0x1d0
> [ 12.211710] process_one_work+0x294/0x664
> [ 12.211712] worker_thread+0x7c/0x45c
> [ 12.211713] kthread+0x104/0x110
> [ 12.211716] ret_from_fork+0x10/0x20
> [ 12.211718] irq event stamp: 7102
> [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
> [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
> [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
> [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
> [ 12.211730] ---[ end trace 0000000000000000 ]---

... Indeed, you shouldn't hit that warning at all. It happens in
clk_core_round_rate_nolock, which takes (before your patch) the
CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
you mentioned, and will call clk_core_forward_rate_req() now, that in
turn calls clk_core_init_rate_nolock().

I think the warning you hit is because core->parent is NULL, which is
passed to clk_core_forward_rate_req() as the parent argument, and we'll
call clk_core_init_rate_req() with parent set as the core argument.

In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
which is what you hit here I think.

This is different to the previous behavior that was calling
clk_core_round_rate_nolock() with core->parent directly, and
clk_core_round_rate_nolock() if its core argument is NULL will set
req->rate to 0 and bail out without returning an error.

Now, your patch probably works because now that you provide a
determine_rate implementation, clk_core_can_round() returns true and
you'll take a different branch in clk_core_round_rate_nolock(), avoiding
that issue entirely.

Does that patch work better (on top of next-20221012)?

diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index c3c3f8c07258..b831a4227236 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -1459,12 +1459,15 @@ static void clk_core_init_rate_req(struct clk_core * const core,
{
struct clk_core *parent;

- if (WARN_ON(!core || !req))
+ if (WARN_ON(!req))
return;

memset(req, 0, sizeof(*req));
-
req->rate = rate;
+
+ if (!core)
+ return;
+
clk_core_get_boundaries(core, &req->min_rate, &req->max_rate);

parent = core->parent;

Thanks!
Maxime

2022-10-12 09:56:26

by Maxime Ripard

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

On Wed, Oct 12, 2022 at 11:09:59AM +0200, AngeloGioacchino Del Regno wrote:
> Il 12/10/22 10:55, Maxime Ripard ha scritto:
> > Hi,
> >
> > On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
> > > Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
> > > to the parent"), the clk_rate_request is .. as the title says, not
> > > forwarded anymore to the parent:
> >
> > It's not entirely true, the rate request should still be forwarded, but
> > we don't pass the same instance of clk_rate_request anymore.
> >
> > > this produces an issue with the MediaTek clock MUX driver during GPU
> > > DVFS on MT8195, but not on MT8192 or others.
> > >
> > > This is because, differently from others, like MT8192 where all of
> > > the clocks in the MFG parents tree are of mtk_mux type, but in the
> > > parent tree of MT8195's MFG clock, we have one mtk_mux clock and
> > > one (clk framework generic) mux clock, like so:
> > >
> > > names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
> > > types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
> > >
> > > To solve this issue and also keep the GPU DVFS clocks code working
> > > as expected, wire up a .determine_rate() callback for the mtk_mux
> > > ops; for that, the standard clk_mux_determine_rate_flags() was used
> > > as it was possible to.
> >
> > It probably fixes things indeed, but I'm a bit worried that it just
> > works around the actual issue instead of fixing the actual bug...
> >
> > > This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
> > > MT8192 Spherion and MT8195 Tomato; no regressions were seen.
> > >
> > > For the sake of some more documentation about this issue here's the
> > > trace of it:
> > >
> > > [ 12.211587] ------------[ cut here ]------------
> > > [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
> > > [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
> > > panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
> > > cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
> > > snd_sof_utils fuse ipv6
> > > [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
> > > [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
> > > [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
> > > [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
> > > [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
> > > [ 12.211627] sp : ffff80000893b8e0
> > > [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
> > > [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
> > > [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
> > > [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
> > > [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
> > > [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
> > > [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
> > > [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
> > > [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
> > > [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
> > > [ 12.211669] Call trace:
> > > [ 12.211670] clk_core_init_rate_req+0x84/0x90
> > > [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
> > > [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
> > > [ 12.211677] clk_mux_determine_rate+0x1c/0x30
> > > [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
> > > [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
> > > [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
> > > [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
> > > [ 12.211688] clk_set_rate+0x40/0x94
> > > [ 12.211691] _opp_config_clk_single+0x38/0xa0
> > > [ 12.211693] _set_opp+0x1b0/0x500
> > > [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
> > > [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
> > > [ 12.211705] devfreq_set_target+0x8c/0x2d0
> > > [ 12.211707] devfreq_update_target+0xcc/0xf4
> > > [ 12.211708] devfreq_monitor+0x40/0x1d0
> > > [ 12.211710] process_one_work+0x294/0x664
> > > [ 12.211712] worker_thread+0x7c/0x45c
> > > [ 12.211713] kthread+0x104/0x110
> > > [ 12.211716] ret_from_fork+0x10/0x20
> > > [ 12.211718] irq event stamp: 7102
> > > [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
> > > [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
> > > [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
> > > [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
> > > [ 12.211730] ---[ end trace 0000000000000000 ]---
> >
> > ... Indeed, you shouldn't hit that warning at all. It happens in
> > clk_core_round_rate_nolock, which takes (before your patch) the
> > CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
> > you mentioned, and will call clk_core_forward_rate_req() now, that in
> > turn calls clk_core_init_rate_nolock().
> >
> > I think the warning you hit is because core->parent is NULL, which is
> > passed to clk_core_forward_rate_req() as the parent argument, and we'll
> > call clk_core_init_rate_req() with parent set as the core argument.
> >
> > In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
> > which is what you hit here I think.
> >
> > This is different to the previous behavior that was calling
> > clk_core_round_rate_nolock() with core->parent directly, and
> > clk_core_round_rate_nolock() if its core argument is NULL will set
> > req->rate to 0 and bail out without returning an error.
> >
> > Now, your patch probably works because now that you provide a
> > determine_rate implementation, clk_core_can_round() returns true and
> > you'll take a different branch in clk_core_round_rate_nolock(), avoiding
> > that issue entirely.
> >
> > Does that patch work better (on top of next-20221012)?
>
> I admit I didn't go too deep in the research, as my brain processed that as
> "this is a mux clock, not really different from a standard mux, this callback
> is missing, that's not optimal"... then that fixed it and called it a day.
>
> I should've prolonged my research for a better understanding of what was
> actually going on.

No worries :)

> What you said actually opened my mind and, with little surprise, your patch
> works as good as mine - no warnings and the clock scales as expected!

I'm actually wondering if you didn't encounter two issues. What kernel
were you testing before? If it's older than today's next
(next-20221012), you're likely missing

https://lore.kernel.org/linux-clk/[email protected]/

Which is likely to be what fixed the clock scaling. And my patch only
fixed the warning. Could you test next-20221012? If I'm right, you
should only get the warning.

> I still think that the mtk-mux driver should get a determine_rate callback but,
> at this point, that's going to have an entirely different commit description...

Yeah, it might, but as you said it's a separate discussion

> Please go on and send your patch: if you want, please remember to add me to
> the Cc's, so that I can give you my R-b tag in a timely manner.

Thanks!
Maxime

2022-10-12 10:05:21

by AngeloGioacchino Del Regno

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Il 12/10/22 10:55, Maxime Ripard ha scritto:
> Hi,
>
> On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
>> Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
>> to the parent"), the clk_rate_request is .. as the title says, not
>> forwarded anymore to the parent:
>
> It's not entirely true, the rate request should still be forwarded, but
> we don't pass the same instance of clk_rate_request anymore.
>
>> this produces an issue with the MediaTek clock MUX driver during GPU
>> DVFS on MT8195, but not on MT8192 or others.
>>
>> This is because, differently from others, like MT8192 where all of
>> the clocks in the MFG parents tree are of mtk_mux type, but in the
>> parent tree of MT8195's MFG clock, we have one mtk_mux clock and
>> one (clk framework generic) mux clock, like so:
>>
>> names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
>> types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
>>
>> To solve this issue and also keep the GPU DVFS clocks code working
>> as expected, wire up a .determine_rate() callback for the mtk_mux
>> ops; for that, the standard clk_mux_determine_rate_flags() was used
>> as it was possible to.
>
> It probably fixes things indeed, but I'm a bit worried that it just
> works around the actual issue instead of fixing the actual bug...
>
>> This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
>> MT8192 Spherion and MT8195 Tomato; no regressions were seen.
>>
>> For the sake of some more documentation about this issue here's the
>> trace of it:
>>
>> [ 12.211587] ------------[ cut here ]------------
>> [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
>> [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
>> panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
>> cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
>> snd_sof_utils fuse ipv6
>> [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
>> [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
>> [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
>> [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
>> [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
>> [ 12.211627] sp : ffff80000893b8e0
>> [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
>> [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
>> [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
>> [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
>> [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
>> [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
>> [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
>> [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
>> [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
>> [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
>> [ 12.211669] Call trace:
>> [ 12.211670] clk_core_init_rate_req+0x84/0x90
>> [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
>> [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
>> [ 12.211677] clk_mux_determine_rate+0x1c/0x30
>> [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
>> [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
>> [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
>> [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
>> [ 12.211688] clk_set_rate+0x40/0x94
>> [ 12.211691] _opp_config_clk_single+0x38/0xa0
>> [ 12.211693] _set_opp+0x1b0/0x500
>> [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
>> [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
>> [ 12.211705] devfreq_set_target+0x8c/0x2d0
>> [ 12.211707] devfreq_update_target+0xcc/0xf4
>> [ 12.211708] devfreq_monitor+0x40/0x1d0
>> [ 12.211710] process_one_work+0x294/0x664
>> [ 12.211712] worker_thread+0x7c/0x45c
>> [ 12.211713] kthread+0x104/0x110
>> [ 12.211716] ret_from_fork+0x10/0x20
>> [ 12.211718] irq event stamp: 7102
>> [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
>> [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
>> [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
>> [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
>> [ 12.211730] ---[ end trace 0000000000000000 ]---
>
> ... Indeed, you shouldn't hit that warning at all. It happens in
> clk_core_round_rate_nolock, which takes (before your patch) the
> CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
> you mentioned, and will call clk_core_forward_rate_req() now, that in
> turn calls clk_core_init_rate_nolock().
>
> I think the warning you hit is because core->parent is NULL, which is
> passed to clk_core_forward_rate_req() as the parent argument, and we'll
> call clk_core_init_rate_req() with parent set as the core argument.
>
> In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
> which is what you hit here I think.
>
> This is different to the previous behavior that was calling
> clk_core_round_rate_nolock() with core->parent directly, and
> clk_core_round_rate_nolock() if its core argument is NULL will set
> req->rate to 0 and bail out without returning an error.
>
> Now, your patch probably works because now that you provide a
> determine_rate implementation, clk_core_can_round() returns true and
> you'll take a different branch in clk_core_round_rate_nolock(), avoiding
> that issue entirely.
>
> Does that patch work better (on top of next-20221012)?

Hello Maxime,

I admit I didn't go too deep in the research, as my brain processed that as
"this is a mux clock, not really different from a standard mux, this callback
is missing, that's not optimal"... then that fixed it and called it a day.

I should've prolonged my research for a better understanding of what was
actually going on.
What you said actually opened my mind and, with little surprise, your patch
works as good as mine - no warnings and the clock scales as expected!

I still think that the mtk-mux driver should get a determine_rate callback but,
at this point, that's going to have an entirely different commit description...

Please go on and send your patch: if you want, please remember to add me to
the Cc's, so that I can give you my R-b tag in a timely manner.

Cheers!
Angelo

>
> diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
> index c3c3f8c07258..b831a4227236 100644
> --- a/drivers/clk/clk.c
> +++ b/drivers/clk/clk.c
> @@ -1459,12 +1459,15 @@ static void clk_core_init_rate_req(struct clk_core * const core,
> {
> struct clk_core *parent;
>
> - if (WARN_ON(!core || !req))
> + if (WARN_ON(!req))
> return;
>
> memset(req, 0, sizeof(*req));
> -
> req->rate = rate;
> +
> + if (!core)
> + return;
> +
> clk_core_get_boundaries(core, &req->min_rate, &req->max_rate);
>
> parent = core->parent;
>
> Thanks!
> Maxime

2022-10-12 10:49:47

by AngeloGioacchino Del Regno

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Il 12/10/22 11:40, Maxime Ripard ha scritto:
> On Wed, Oct 12, 2022 at 11:09:59AM +0200, AngeloGioacchino Del Regno wrote:
>> Il 12/10/22 10:55, Maxime Ripard ha scritto:
>>> Hi,
>>>
>>> On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
>>>> Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
>>>> to the parent"), the clk_rate_request is .. as the title says, not
>>>> forwarded anymore to the parent:
>>>
>>> It's not entirely true, the rate request should still be forwarded, but
>>> we don't pass the same instance of clk_rate_request anymore.
>>>
>>>> this produces an issue with the MediaTek clock MUX driver during GPU
>>>> DVFS on MT8195, but not on MT8192 or others.
>>>>
>>>> This is because, differently from others, like MT8192 where all of
>>>> the clocks in the MFG parents tree are of mtk_mux type, but in the
>>>> parent tree of MT8195's MFG clock, we have one mtk_mux clock and
>>>> one (clk framework generic) mux clock, like so:
>>>>
>>>> names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
>>>> types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
>>>>
>>>> To solve this issue and also keep the GPU DVFS clocks code working
>>>> as expected, wire up a .determine_rate() callback for the mtk_mux
>>>> ops; for that, the standard clk_mux_determine_rate_flags() was used
>>>> as it was possible to.
>>>
>>> It probably fixes things indeed, but I'm a bit worried that it just
>>> works around the actual issue instead of fixing the actual bug...
>>>
>>>> This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
>>>> MT8192 Spherion and MT8195 Tomato; no regressions were seen.
>>>>
>>>> For the sake of some more documentation about this issue here's the
>>>> trace of it:
>>>>
>>>> [ 12.211587] ------------[ cut here ]------------
>>>> [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
>>>> [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
>>>> panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
>>>> cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
>>>> snd_sof_utils fuse ipv6
>>>> [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
>>>> [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
>>>> [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
>>>> [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>> [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
>>>> [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
>>>> [ 12.211627] sp : ffff80000893b8e0
>>>> [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
>>>> [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
>>>> [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
>>>> [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
>>>> [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
>>>> [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
>>>> [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
>>>> [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
>>>> [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
>>>> [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
>>>> [ 12.211669] Call trace:
>>>> [ 12.211670] clk_core_init_rate_req+0x84/0x90
>>>> [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
>>>> [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
>>>> [ 12.211677] clk_mux_determine_rate+0x1c/0x30
>>>> [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
>>>> [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
>>>> [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
>>>> [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
>>>> [ 12.211688] clk_set_rate+0x40/0x94
>>>> [ 12.211691] _opp_config_clk_single+0x38/0xa0
>>>> [ 12.211693] _set_opp+0x1b0/0x500
>>>> [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
>>>> [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
>>>> [ 12.211705] devfreq_set_target+0x8c/0x2d0
>>>> [ 12.211707] devfreq_update_target+0xcc/0xf4
>>>> [ 12.211708] devfreq_monitor+0x40/0x1d0
>>>> [ 12.211710] process_one_work+0x294/0x664
>>>> [ 12.211712] worker_thread+0x7c/0x45c
>>>> [ 12.211713] kthread+0x104/0x110
>>>> [ 12.211716] ret_from_fork+0x10/0x20
>>>> [ 12.211718] irq event stamp: 7102
>>>> [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
>>>> [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
>>>> [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
>>>> [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
>>>> [ 12.211730] ---[ end trace 0000000000000000 ]---
>>>
>>> ... Indeed, you shouldn't hit that warning at all. It happens in
>>> clk_core_round_rate_nolock, which takes (before your patch) the
>>> CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
>>> you mentioned, and will call clk_core_forward_rate_req() now, that in
>>> turn calls clk_core_init_rate_nolock().
>>>
>>> I think the warning you hit is because core->parent is NULL, which is
>>> passed to clk_core_forward_rate_req() as the parent argument, and we'll
>>> call clk_core_init_rate_req() with parent set as the core argument.
>>>
>>> In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
>>> which is what you hit here I think.
>>>
>>> This is different to the previous behavior that was calling
>>> clk_core_round_rate_nolock() with core->parent directly, and
>>> clk_core_round_rate_nolock() if its core argument is NULL will set
>>> req->rate to 0 and bail out without returning an error.
>>>
>>> Now, your patch probably works because now that you provide a
>>> determine_rate implementation, clk_core_can_round() returns true and
>>> you'll take a different branch in clk_core_round_rate_nolock(), avoiding
>>> that issue entirely.
>>>
>>> Does that patch work better (on top of next-20221012)?
>>
>> I admit I didn't go too deep in the research, as my brain processed that as
>> "this is a mux clock, not really different from a standard mux, this callback
>> is missing, that's not optimal"... then that fixed it and called it a day.
>>
>> I should've prolonged my research for a better understanding of what was
>> actually going on.
>
> No worries :)
>
>> What you said actually opened my mind and, with little surprise, your patch
>> works as good as mine - no warnings and the clock scales as expected!
>
> I'm actually wondering if you didn't encounter two issues. What kernel
> were you testing before? If it's older than today's next
> (next-20221012), you're likely missing
>

I was testing next-20221011.

> https://lore.kernel.org/linux-clk/[email protected]/
>
> Which is likely to be what fixed the clock scaling. And my patch only
> fixed the warning. Could you test next-20221012? If I'm right, you
> should only get the warning.
>

No, I am getting the same situation even after rebasing over next-20221012, without
any of the two patches (determine_rate() for mtk mux, nor the one you shared for
clk.c), when the warning happens, I get very slow GPU operation and the same "nice"
timeout:

[ 27.785514] panfrost 13000000.gpu: gpu sched timeout, js=1, config=0x7b00,
status=0x0, head=0xa1cb180, tail=0xa1cb180, sched_job=00000000f07d39e3

...so I'm not encountering the same issue that you've fixed with the patches that
got merged in next-20221012.

Of course, as you were expecting, the warning is also still there and still
the same:

[ 27.750504] WARNING: CPU: 4 PID: 164 at drivers/clk/clk.c:1462
clk_core_init_rate_req+0x84/0x90

Cheers,
Angelo

>> I still think that the mtk-mux driver should get a determine_rate callback but,
>> at this point, that's going to have an entirely different commit description...
>
> Yeah, it might, but as you said it's a separate discussion
>
>> Please go on and send your patch: if you want, please remember to add me to
>> the Cc's, so that I can give you my R-b tag in a timely manner.
>
> Thanks!
> Maxime

2022-10-12 12:21:09

by Maxime Ripard

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

On Wed, Oct 12, 2022 at 11:57:15AM +0200, AngeloGioacchino Del Regno wrote:
> Il 12/10/22 11:40, Maxime Ripard ha scritto:
> > On Wed, Oct 12, 2022 at 11:09:59AM +0200, AngeloGioacchino Del Regno wrote:
> > > Il 12/10/22 10:55, Maxime Ripard ha scritto:
> > > > Hi,
> > > >
> > > > On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
> > > > > Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
> > > > > to the parent"), the clk_rate_request is .. as the title says, not
> > > > > forwarded anymore to the parent:
> > > >
> > > > It's not entirely true, the rate request should still be forwarded, but
> > > > we don't pass the same instance of clk_rate_request anymore.
> > > >
> > > > > this produces an issue with the MediaTek clock MUX driver during GPU
> > > > > DVFS on MT8195, but not on MT8192 or others.
> > > > >
> > > > > This is because, differently from others, like MT8192 where all of
> > > > > the clocks in the MFG parents tree are of mtk_mux type, but in the
> > > > > parent tree of MT8195's MFG clock, we have one mtk_mux clock and
> > > > > one (clk framework generic) mux clock, like so:
> > > > >
> > > > > names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
> > > > > types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
> > > > >
> > > > > To solve this issue and also keep the GPU DVFS clocks code working
> > > > > as expected, wire up a .determine_rate() callback for the mtk_mux
> > > > > ops; for that, the standard clk_mux_determine_rate_flags() was used
> > > > > as it was possible to.
> > > >
> > > > It probably fixes things indeed, but I'm a bit worried that it just
> > > > works around the actual issue instead of fixing the actual bug...
> > > >
> > > > > This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
> > > > > MT8192 Spherion and MT8195 Tomato; no regressions were seen.
> > > > >
> > > > > For the sake of some more documentation about this issue here's the
> > > > > trace of it:
> > > > >
> > > > > [ 12.211587] ------------[ cut here ]------------
> > > > > [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
> > > > > [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
> > > > > panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
> > > > > cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
> > > > > snd_sof_utils fuse ipv6
> > > > > [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
> > > > > [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
> > > > > [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
> > > > > [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
> > > > > [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
> > > > > [ 12.211627] sp : ffff80000893b8e0
> > > > > [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
> > > > > [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
> > > > > [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
> > > > > [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
> > > > > [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
> > > > > [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
> > > > > [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
> > > > > [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
> > > > > [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
> > > > > [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
> > > > > [ 12.211669] Call trace:
> > > > > [ 12.211670] clk_core_init_rate_req+0x84/0x90
> > > > > [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
> > > > > [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
> > > > > [ 12.211677] clk_mux_determine_rate+0x1c/0x30
> > > > > [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
> > > > > [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
> > > > > [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
> > > > > [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
> > > > > [ 12.211688] clk_set_rate+0x40/0x94
> > > > > [ 12.211691] _opp_config_clk_single+0x38/0xa0
> > > > > [ 12.211693] _set_opp+0x1b0/0x500
> > > > > [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
> > > > > [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
> > > > > [ 12.211705] devfreq_set_target+0x8c/0x2d0
> > > > > [ 12.211707] devfreq_update_target+0xcc/0xf4
> > > > > [ 12.211708] devfreq_monitor+0x40/0x1d0
> > > > > [ 12.211710] process_one_work+0x294/0x664
> > > > > [ 12.211712] worker_thread+0x7c/0x45c
> > > > > [ 12.211713] kthread+0x104/0x110
> > > > > [ 12.211716] ret_from_fork+0x10/0x20
> > > > > [ 12.211718] irq event stamp: 7102
> > > > > [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
> > > > > [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
> > > > > [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
> > > > > [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
> > > > > [ 12.211730] ---[ end trace 0000000000000000 ]---
> > > >
> > > > ... Indeed, you shouldn't hit that warning at all. It happens in
> > > > clk_core_round_rate_nolock, which takes (before your patch) the
> > > > CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
> > > > you mentioned, and will call clk_core_forward_rate_req() now, that in
> > > > turn calls clk_core_init_rate_nolock().
> > > >
> > > > I think the warning you hit is because core->parent is NULL, which is
> > > > passed to clk_core_forward_rate_req() as the parent argument, and we'll
> > > > call clk_core_init_rate_req() with parent set as the core argument.
> > > >
> > > > In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
> > > > which is what you hit here I think.
> > > >
> > > > This is different to the previous behavior that was calling
> > > > clk_core_round_rate_nolock() with core->parent directly, and
> > > > clk_core_round_rate_nolock() if its core argument is NULL will set
> > > > req->rate to 0 and bail out without returning an error.
> > > >
> > > > Now, your patch probably works because now that you provide a
> > > > determine_rate implementation, clk_core_can_round() returns true and
> > > > you'll take a different branch in clk_core_round_rate_nolock(), avoiding
> > > > that issue entirely.
> > > >
> > > > Does that patch work better (on top of next-20221012)?
> > >
> > > I admit I didn't go too deep in the research, as my brain processed that as
> > > "this is a mux clock, not really different from a standard mux, this callback
> > > is missing, that's not optimal"... then that fixed it and called it a day.
> > >
> > > I should've prolonged my research for a better understanding of what was
> > > actually going on.
> >
> > No worries :)
> >
> > > What you said actually opened my mind and, with little surprise, your patch
> > > works as good as mine - no warnings and the clock scales as expected!
> >
> > I'm actually wondering if you didn't encounter two issues. What kernel
> > were you testing before? If it's older than today's next
> > (next-20221012), you're likely missing
> >
>
> I was testing next-20221011.
>
> > https://lore.kernel.org/linux-clk/[email protected]/
> >
> > Which is likely to be what fixed the clock scaling. And my patch only
> > fixed the warning. Could you test next-20221012? If I'm right, you
> > should only get the warning.
> >
>
> No, I am getting the same situation even after rebasing over next-20221012, without
> any of the two patches (determine_rate() for mtk mux, nor the one you shared for
> clk.c), when the warning happens, I get very slow GPU operation and the same "nice"
> timeout:
>
> [ 27.785514] panfrost 13000000.gpu: gpu sched timeout, js=1,
> config=0x7b00, status=0x0, head=0xa1cb180, tail=0xa1cb180,
> sched_job=00000000f07d39e3
>
> ...so I'm not encountering the same issue that you've fixed with the patches that
> got merged in next-20221012.
>
> Of course, as you were expecting, the warning is also still there and still
> the same:
>
> [ 27.750504] WARNING: CPU: 4 PID: 164 at drivers/clk/clk.c:1462
> clk_core_init_rate_req+0x84/0x90

Ok. I'm still a bit unsure why it actually fixes anything.

In the current next, clk_core_init_rate_req (through
clk_core_forward_rate_req) will bail out right away, but the patch will
clear the request and set the requested rate.

I don't think the req->rate assignment changes anything because our next
call will be to clk_core_round_rate_nolock that will clear it in the
!core path, but it might just be that we don't initialize the request
and end up with some garbage data in it?

Could you test, on top of next-20221012,

diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index c3c3f8c07258..ffbee00bd7cf 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -1545,6 +1545,8 @@ static int clk_core_round_rate_nolock(struct clk_core *core,
if (core->flags & CLK_SET_RATE_PARENT) {
struct clk_rate_request parent_req;

+ memset(&parent_req, 0, sizeof(parent_req));
+
clk_core_forward_rate_req(core, req, core->parent, &parent_req, req->rate);
ret = clk_core_round_rate_nolock(core->parent, &parent_req);
if (ret)

Thanks!
Maxime

2022-10-12 12:47:23

by AngeloGioacchino Del Regno

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Il 12/10/22 13:48, Maxime Ripard ha scritto:
> On Wed, Oct 12, 2022 at 11:57:15AM +0200, AngeloGioacchino Del Regno wrote:
>> Il 12/10/22 11:40, Maxime Ripard ha scritto:
>>> On Wed, Oct 12, 2022 at 11:09:59AM +0200, AngeloGioacchino Del Regno wrote:
>>>> Il 12/10/22 10:55, Maxime Ripard ha scritto:
>>>>> Hi,
>>>>>
>>>>> On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
>>>>>> Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
>>>>>> to the parent"), the clk_rate_request is .. as the title says, not
>>>>>> forwarded anymore to the parent:
>>>>>
>>>>> It's not entirely true, the rate request should still be forwarded, but
>>>>> we don't pass the same instance of clk_rate_request anymore.
>>>>>
>>>>>> this produces an issue with the MediaTek clock MUX driver during GPU
>>>>>> DVFS on MT8195, but not on MT8192 or others.
>>>>>>
>>>>>> This is because, differently from others, like MT8192 where all of
>>>>>> the clocks in the MFG parents tree are of mtk_mux type, but in the
>>>>>> parent tree of MT8195's MFG clock, we have one mtk_mux clock and
>>>>>> one (clk framework generic) mux clock, like so:
>>>>>>
>>>>>> names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
>>>>>> types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
>>>>>>
>>>>>> To solve this issue and also keep the GPU DVFS clocks code working
>>>>>> as expected, wire up a .determine_rate() callback for the mtk_mux
>>>>>> ops; for that, the standard clk_mux_determine_rate_flags() was used
>>>>>> as it was possible to.
>>>>>
>>>>> It probably fixes things indeed, but I'm a bit worried that it just
>>>>> works around the actual issue instead of fixing the actual bug...
>>>>>
>>>>>> This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
>>>>>> MT8192 Spherion and MT8195 Tomato; no regressions were seen.
>>>>>>
>>>>>> For the sake of some more documentation about this issue here's the
>>>>>> trace of it:
>>>>>>
>>>>>> [ 12.211587] ------------[ cut here ]------------
>>>>>> [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
>>>>>> [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
>>>>>> panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
>>>>>> cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
>>>>>> snd_sof_utils fuse ipv6
>>>>>> [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
>>>>>> [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
>>>>>> [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
>>>>>> [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>> [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
>>>>>> [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
>>>>>> [ 12.211627] sp : ffff80000893b8e0
>>>>>> [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
>>>>>> [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
>>>>>> [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
>>>>>> [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
>>>>>> [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
>>>>>> [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
>>>>>> [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
>>>>>> [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
>>>>>> [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
>>>>>> [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
>>>>>> [ 12.211669] Call trace:
>>>>>> [ 12.211670] clk_core_init_rate_req+0x84/0x90
>>>>>> [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
>>>>>> [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
>>>>>> [ 12.211677] clk_mux_determine_rate+0x1c/0x30
>>>>>> [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
>>>>>> [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
>>>>>> [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
>>>>>> [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
>>>>>> [ 12.211688] clk_set_rate+0x40/0x94
>>>>>> [ 12.211691] _opp_config_clk_single+0x38/0xa0
>>>>>> [ 12.211693] _set_opp+0x1b0/0x500
>>>>>> [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
>>>>>> [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
>>>>>> [ 12.211705] devfreq_set_target+0x8c/0x2d0
>>>>>> [ 12.211707] devfreq_update_target+0xcc/0xf4
>>>>>> [ 12.211708] devfreq_monitor+0x40/0x1d0
>>>>>> [ 12.211710] process_one_work+0x294/0x664
>>>>>> [ 12.211712] worker_thread+0x7c/0x45c
>>>>>> [ 12.211713] kthread+0x104/0x110
>>>>>> [ 12.211716] ret_from_fork+0x10/0x20
>>>>>> [ 12.211718] irq event stamp: 7102
>>>>>> [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
>>>>>> [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
>>>>>> [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
>>>>>> [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
>>>>>> [ 12.211730] ---[ end trace 0000000000000000 ]---
>>>>>
>>>>> ... Indeed, you shouldn't hit that warning at all. It happens in
>>>>> clk_core_round_rate_nolock, which takes (before your patch) the
>>>>> CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
>>>>> you mentioned, and will call clk_core_forward_rate_req() now, that in
>>>>> turn calls clk_core_init_rate_nolock().
>>>>>
>>>>> I think the warning you hit is because core->parent is NULL, which is
>>>>> passed to clk_core_forward_rate_req() as the parent argument, and we'll
>>>>> call clk_core_init_rate_req() with parent set as the core argument.
>>>>>
>>>>> In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
>>>>> which is what you hit here I think.
>>>>>
>>>>> This is different to the previous behavior that was calling
>>>>> clk_core_round_rate_nolock() with core->parent directly, and
>>>>> clk_core_round_rate_nolock() if its core argument is NULL will set
>>>>> req->rate to 0 and bail out without returning an error.
>>>>>
>>>>> Now, your patch probably works because now that you provide a
>>>>> determine_rate implementation, clk_core_can_round() returns true and
>>>>> you'll take a different branch in clk_core_round_rate_nolock(), avoiding
>>>>> that issue entirely.
>>>>>
>>>>> Does that patch work better (on top of next-20221012)?
>>>>
>>>> I admit I didn't go too deep in the research, as my brain processed that as
>>>> "this is a mux clock, not really different from a standard mux, this callback
>>>> is missing, that's not optimal"... then that fixed it and called it a day.
>>>>
>>>> I should've prolonged my research for a better understanding of what was
>>>> actually going on.
>>>
>>> No worries :)
>>>
>>>> What you said actually opened my mind and, with little surprise, your patch
>>>> works as good as mine - no warnings and the clock scales as expected!
>>>
>>> I'm actually wondering if you didn't encounter two issues. What kernel
>>> were you testing before? If it's older than today's next
>>> (next-20221012), you're likely missing
>>>
>>
>> I was testing next-20221011.
>>
>>> https://lore.kernel.org/linux-clk/[email protected]/
>>>
>>> Which is likely to be what fixed the clock scaling. And my patch only
>>> fixed the warning. Could you test next-20221012? If I'm right, you
>>> should only get the warning.
>>>
>>
>> No, I am getting the same situation even after rebasing over next-20221012, without
>> any of the two patches (determine_rate() for mtk mux, nor the one you shared for
>> clk.c), when the warning happens, I get very slow GPU operation and the same "nice"
>> timeout:
>>
>> [ 27.785514] panfrost 13000000.gpu: gpu sched timeout, js=1,
>> config=0x7b00, status=0x0, head=0xa1cb180, tail=0xa1cb180,
>> sched_job=00000000f07d39e3
>>
>> ...so I'm not encountering the same issue that you've fixed with the patches that
>> got merged in next-20221012.
>>
>> Of course, as you were expecting, the warning is also still there and still
>> the same:
>>
>> [ 27.750504] WARNING: CPU: 4 PID: 164 at drivers/clk/clk.c:1462
>> clk_core_init_rate_req+0x84/0x90
>
> Ok. I'm still a bit unsure why it actually fixes anything.
>
> In the current next, clk_core_init_rate_req (through
> clk_core_forward_rate_req) will bail out right away, but the patch will
> clear the request and set the requested rate.
>
> I don't think the req->rate assignment changes anything because our next
> call will be to clk_core_round_rate_nolock that will clear it in the
> !core path, but it might just be that we don't initialize the request
> and end up with some garbage data in it?
>
> Could you test, on top of next-20221012,

No that's not the case, this patch has no effect at all (yes I've tested it).

>
> diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
> index c3c3f8c07258..ffbee00bd7cf 100644
> --- a/drivers/clk/clk.c
> +++ b/drivers/clk/clk.c
> @@ -1545,6 +1545,8 @@ static int clk_core_round_rate_nolock(struct clk_core *core,
> if (core->flags & CLK_SET_RATE_PARENT) {
> struct clk_rate_request parent_req;
>
> + memset(&parent_req, 0, sizeof(parent_req));
> +
> clk_core_forward_rate_req(core, req, core->parent, &parent_req, req->rate);
> ret = clk_core_round_rate_nolock(core->parent, &parent_req);
> if (ret)
>
> Thanks!
> Maxime

2022-10-12 14:07:49

by Maxime Ripard

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

On Wed, Oct 12, 2022 at 02:14:39PM +0200, AngeloGioacchino Del Regno wrote:
> Il 12/10/22 13:48, Maxime Ripard ha scritto:
> > On Wed, Oct 12, 2022 at 11:57:15AM +0200, AngeloGioacchino Del Regno wrote:
> > > Il 12/10/22 11:40, Maxime Ripard ha scritto:
> > > > On Wed, Oct 12, 2022 at 11:09:59AM +0200, AngeloGioacchino Del Regno wrote:
> > > > > Il 12/10/22 10:55, Maxime Ripard ha scritto:
> > > > > > Hi,
> > > > > >
> > > > > > On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > > Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
> > > > > > > to the parent"), the clk_rate_request is .. as the title says, not
> > > > > > > forwarded anymore to the parent:
> > > > > >
> > > > > > It's not entirely true, the rate request should still be forwarded, but
> > > > > > we don't pass the same instance of clk_rate_request anymore.
> > > > > >
> > > > > > > this produces an issue with the MediaTek clock MUX driver during GPU
> > > > > > > DVFS on MT8195, but not on MT8192 or others.
> > > > > > >
> > > > > > > This is because, differently from others, like MT8192 where all of
> > > > > > > the clocks in the MFG parents tree are of mtk_mux type, but in the
> > > > > > > parent tree of MT8195's MFG clock, we have one mtk_mux clock and
> > > > > > > one (clk framework generic) mux clock, like so:
> > > > > > >
> > > > > > > names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
> > > > > > > types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
> > > > > > >
> > > > > > > To solve this issue and also keep the GPU DVFS clocks code working
> > > > > > > as expected, wire up a .determine_rate() callback for the mtk_mux
> > > > > > > ops; for that, the standard clk_mux_determine_rate_flags() was used
> > > > > > > as it was possible to.
> > > > > >
> > > > > > It probably fixes things indeed, but I'm a bit worried that it just
> > > > > > works around the actual issue instead of fixing the actual bug...
> > > > > >
> > > > > > > This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
> > > > > > > MT8192 Spherion and MT8195 Tomato; no regressions were seen.
> > > > > > >
> > > > > > > For the sake of some more documentation about this issue here's the
> > > > > > > trace of it:
> > > > > > >
> > > > > > > [ 12.211587] ------------[ cut here ]------------
> > > > > > > [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
> > > > > > > [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
> > > > > > > panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
> > > > > > > cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
> > > > > > > snd_sof_utils fuse ipv6
> > > > > > > [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
> > > > > > > [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
> > > > > > > [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
> > > > > > > [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
> > > > > > > [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
> > > > > > > [ 12.211627] sp : ffff80000893b8e0
> > > > > > > [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
> > > > > > > [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
> > > > > > > [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
> > > > > > > [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
> > > > > > > [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
> > > > > > > [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
> > > > > > > [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
> > > > > > > [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
> > > > > > > [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
> > > > > > > [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
> > > > > > > [ 12.211669] Call trace:
> > > > > > > [ 12.211670] clk_core_init_rate_req+0x84/0x90
> > > > > > > [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
> > > > > > > [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
> > > > > > > [ 12.211677] clk_mux_determine_rate+0x1c/0x30
> > > > > > > [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
> > > > > > > [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
> > > > > > > [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
> > > > > > > [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
> > > > > > > [ 12.211688] clk_set_rate+0x40/0x94
> > > > > > > [ 12.211691] _opp_config_clk_single+0x38/0xa0
> > > > > > > [ 12.211693] _set_opp+0x1b0/0x500
> > > > > > > [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
> > > > > > > [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
> > > > > > > [ 12.211705] devfreq_set_target+0x8c/0x2d0
> > > > > > > [ 12.211707] devfreq_update_target+0xcc/0xf4
> > > > > > > [ 12.211708] devfreq_monitor+0x40/0x1d0
> > > > > > > [ 12.211710] process_one_work+0x294/0x664
> > > > > > > [ 12.211712] worker_thread+0x7c/0x45c
> > > > > > > [ 12.211713] kthread+0x104/0x110
> > > > > > > [ 12.211716] ret_from_fork+0x10/0x20
> > > > > > > [ 12.211718] irq event stamp: 7102
> > > > > > > [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
> > > > > > > [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
> > > > > > > [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
> > > > > > > [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
> > > > > > > [ 12.211730] ---[ end trace 0000000000000000 ]---
> > > > > >
> > > > > > ... Indeed, you shouldn't hit that warning at all. It happens in
> > > > > > clk_core_round_rate_nolock, which takes (before your patch) the
> > > > > > CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
> > > > > > you mentioned, and will call clk_core_forward_rate_req() now, that in
> > > > > > turn calls clk_core_init_rate_nolock().
> > > > > >
> > > > > > I think the warning you hit is because core->parent is NULL, which is
> > > > > > passed to clk_core_forward_rate_req() as the parent argument, and we'll
> > > > > > call clk_core_init_rate_req() with parent set as the core argument.
> > > > > >
> > > > > > In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
> > > > > > which is what you hit here I think.
> > > > > >
> > > > > > This is different to the previous behavior that was calling
> > > > > > clk_core_round_rate_nolock() with core->parent directly, and
> > > > > > clk_core_round_rate_nolock() if its core argument is NULL will set
> > > > > > req->rate to 0 and bail out without returning an error.
> > > > > >
> > > > > > Now, your patch probably works because now that you provide a
> > > > > > determine_rate implementation, clk_core_can_round() returns true and
> > > > > > you'll take a different branch in clk_core_round_rate_nolock(), avoiding
> > > > > > that issue entirely.
> > > > > >
> > > > > > Does that patch work better (on top of next-20221012)?
> > > > >
> > > > > I admit I didn't go too deep in the research, as my brain processed that as
> > > > > "this is a mux clock, not really different from a standard mux, this callback
> > > > > is missing, that's not optimal"... then that fixed it and called it a day.
> > > > >
> > > > > I should've prolonged my research for a better understanding of what was
> > > > > actually going on.
> > > >
> > > > No worries :)
> > > >
> > > > > What you said actually opened my mind and, with little surprise, your patch
> > > > > works as good as mine - no warnings and the clock scales as expected!
> > > >
> > > > I'm actually wondering if you didn't encounter two issues. What kernel
> > > > were you testing before? If it's older than today's next
> > > > (next-20221012), you're likely missing
> > > >
> > >
> > > I was testing next-20221011.
> > >
> > > > https://lore.kernel.org/linux-clk/[email protected]/
> > > >
> > > > Which is likely to be what fixed the clock scaling. And my patch only
> > > > fixed the warning. Could you test next-20221012? If I'm right, you
> > > > should only get the warning.
> > > >
> > >
> > > No, I am getting the same situation even after rebasing over next-20221012, without
> > > any of the two patches (determine_rate() for mtk mux, nor the one you shared for
> > > clk.c), when the warning happens, I get very slow GPU operation and the same "nice"
> > > timeout:
> > >
> > > [ 27.785514] panfrost 13000000.gpu: gpu sched timeout, js=1,
> > > config=0x7b00, status=0x0, head=0xa1cb180, tail=0xa1cb180,
> > > sched_job=00000000f07d39e3
> > >
> > > ...so I'm not encountering the same issue that you've fixed with the patches that
> > > got merged in next-20221012.
> > >
> > > Of course, as you were expecting, the warning is also still there and still
> > > the same:
> > >
> > > [ 27.750504] WARNING: CPU: 4 PID: 164 at drivers/clk/clk.c:1462
> > > clk_core_init_rate_req+0x84/0x90
> >
> > Ok. I'm still a bit unsure why it actually fixes anything.
> >
> > In the current next, clk_core_init_rate_req (through
> > clk_core_forward_rate_req) will bail out right away, but the patch will
> > clear the request and set the requested rate.
> >
> > I don't think the req->rate assignment changes anything because our next
> > call will be to clk_core_round_rate_nolock that will clear it in the
> > !core path, but it might just be that we don't initialize the request
> > and end up with some garbage data in it?
> >
> > Could you test, on top of next-20221012,
>
> No that's not the case, this patch has no effect at all (yes I've tested it).

Ok. Too bad. I've tried to build a test case that was reproducing what
you experience, but I don't seem to be able to build one for now.

If we go back to your previous explanation on the clock tree, and what
we discussed then, there's still something a bit puzzling.

You were saying that are calling clk_set_rate on mfg_bg3d, which is a
mtk_gate, registered with mtk_clk_register_gate(), and with the
mtk_clk_gate_ops_setclr clk_ops. There's no determine_rate / round_rate
implementation which makes sense for a gate, and CLK_SET_RATE_PARENT is
set by mtk_clk_register_gate(). Everything's good.

The clk_set_rate call will thus be forwarded to its parent,
mfg_ck_fast_ref, which is a mux, registered with
devm_clk_hw_register_mux(), with CLK_SET_RATE_PARENT as well. Makes
sense too, any clk_set_rate() call will thus evaluate all of the
mfg_ck_fast_ref parents and will either adjust the rate of a parent, or
reparent.

mfg_ck_fast_ref then has two parents, top_mfg_core_tmp and mfgpll.
Judging from your patch, top_mfg_core_tmp is being used.
top_mfg_core_tmp is a mtk_mux, registered by mtk_clk_register_mux()
(through mtk_clk_register_muxes()), and using
mtk_mux_gate_clr_set_upd_ops. CLK_SET_RATE_PARENT is set by
mtk_clk_register_mux(). That flag still makes sense for a mux. However,
it's really not clear to me how a mux operates without a determine_rate
implementation to forward the rates to each of its parents.

It looks like we were relying before on the fact that the request was
indeed forwarded as is to the current parent. It looks like the parent
was not registered at all (core->parent being NULL), and then it was
working because of $REASON, possibly one of the muxes preferred to
switch to some other clock source.

I think we need to address this in multiple ways:

- If you have ftrace enabled on that platform, running with "tp_printk
trace_event=clk:*" in the kernel command line on a failing kernel would
be great, so that we can figure out what is happening exactly.

- We really need to merge your patch above as well;

- And, if we can, we should forbid to register a mux without a
determine_rate implementation. We have to take into account that some
muxes are going to be RO and won't need a determine_rate
implementation at all, but a clock with a set_parent and without a
determine_rate seems like a weird combination.

What do you think?
Maxime

Attachments:

(No filename) (12.62 kB)
signature.asc (235.00 B)
Download all attachments

2022-10-12 17:02:16

by Chen-Yu Tsai

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

On Thu, Oct 13, 2022 at 12:42 AM Maxime Ripard <[email protected]> wrote:
>
> On Wed, Oct 12, 2022 at 03:56:19PM +0200, Maxime Ripard wrote:
> > On Wed, Oct 12, 2022 at 02:14:39PM +0200, AngeloGioacchino Del Regno wrote:
> > > Il 12/10/22 13:48, Maxime Ripard ha scritto:
> > > > On Wed, Oct 12, 2022 at 11:57:15AM +0200, AngeloGioacchino Del Regno wrote:
> > > > > Il 12/10/22 11:40, Maxime Ripard ha scritto:
> > > > > > On Wed, Oct 12, 2022 at 11:09:59AM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > > Il 12/10/22 10:55, Maxime Ripard ha scritto:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > > > > Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
> > > > > > > > > to the parent"), the clk_rate_request is .. as the title says, not
> > > > > > > > > forwarded anymore to the parent:
> > > > > > > >
> > > > > > > > It's not entirely true, the rate request should still be forwarded, but
> > > > > > > > we don't pass the same instance of clk_rate_request anymore.
> > > > > > > >
> > > > > > > > > this produces an issue with the MediaTek clock MUX driver during GPU
> > > > > > > > > DVFS on MT8195, but not on MT8192 or others.
> > > > > > > > >
> > > > > > > > > This is because, differently from others, like MT8192 where all of
> > > > > > > > > the clocks in the MFG parents tree are of mtk_mux type, but in the
> > > > > > > > > parent tree of MT8195's MFG clock, we have one mtk_mux clock and
> > > > > > > > > one (clk framework generic) mux clock, like so:
> > > > > > > > >
> > > > > > > > > names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
> > > > > > > > > types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
> > > > > > > > >
> > > > > > > > > To solve this issue and also keep the GPU DVFS clocks code working
> > > > > > > > > as expected, wire up a .determine_rate() callback for the mtk_mux
> > > > > > > > > ops; for that, the standard clk_mux_determine_rate_flags() was used
> > > > > > > > > as it was possible to.
> > > > > > > >
> > > > > > > > It probably fixes things indeed, but I'm a bit worried that it just
> > > > > > > > works around the actual issue instead of fixing the actual bug...
> > > > > > > >
> > > > > > > > > This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
> > > > > > > > > MT8192 Spherion and MT8195 Tomato; no regressions were seen.
> > > > > > > > >
> > > > > > > > > For the sake of some more documentation about this issue here's the
> > > > > > > > > trace of it:
> > > > > > > > >
> > > > > > > > > [ 12.211587] ------------[ cut here ]------------
> > > > > > > > > [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
> > > > > > > > > [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
> > > > > > > > > panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
> > > > > > > > > cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
> > > > > > > > > snd_sof_utils fuse ipv6
> > > > > > > > > [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
> > > > > > > > > [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
> > > > > > > > > [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
> > > > > > > > > [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > > > [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
> > > > > > > > > [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
> > > > > > > > > [ 12.211627] sp : ffff80000893b8e0
> > > > > > > > > [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
> > > > > > > > > [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
> > > > > > > > > [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
> > > > > > > > > [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
> > > > > > > > > [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
> > > > > > > > > [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
> > > > > > > > > [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
> > > > > > > > > [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
> > > > > > > > > [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
> > > > > > > > > [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
> > > > > > > > > [ 12.211669] Call trace:
> > > > > > > > > [ 12.211670] clk_core_init_rate_req+0x84/0x90
> > > > > > > > > [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
> > > > > > > > > [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
> > > > > > > > > [ 12.211677] clk_mux_determine_rate+0x1c/0x30
> > > > > > > > > [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
> > > > > > > > > [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
> > > > > > > > > [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
> > > > > > > > > [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
> > > > > > > > > [ 12.211688] clk_set_rate+0x40/0x94
> > > > > > > > > [ 12.211691] _opp_config_clk_single+0x38/0xa0
> > > > > > > > > [ 12.211693] _set_opp+0x1b0/0x500
> > > > > > > > > [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
> > > > > > > > > [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
> > > > > > > > > [ 12.211705] devfreq_set_target+0x8c/0x2d0
> > > > > > > > > [ 12.211707] devfreq_update_target+0xcc/0xf4
> > > > > > > > > [ 12.211708] devfreq_monitor+0x40/0x1d0
> > > > > > > > > [ 12.211710] process_one_work+0x294/0x664
> > > > > > > > > [ 12.211712] worker_thread+0x7c/0x45c
> > > > > > > > > [ 12.211713] kthread+0x104/0x110
> > > > > > > > > [ 12.211716] ret_from_fork+0x10/0x20
> > > > > > > > > [ 12.211718] irq event stamp: 7102
> > > > > > > > > [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
> > > > > > > > > [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
> > > > > > > > > [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
> > > > > > > > > [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
> > > > > > > > > [ 12.211730] ---[ end trace 0000000000000000 ]---
> > > > > > > >
> > > > > > > > ... Indeed, you shouldn't hit that warning at all. It happens in
> > > > > > > > clk_core_round_rate_nolock, which takes (before your patch) the
> > > > > > > > CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
> > > > > > > > you mentioned, and will call clk_core_forward_rate_req() now, that in
> > > > > > > > turn calls clk_core_init_rate_nolock().
> > > > > > > >
> > > > > > > > I think the warning you hit is because core->parent is NULL, which is
> > > > > > > > passed to clk_core_forward_rate_req() as the parent argument, and we'll
> > > > > > > > call clk_core_init_rate_req() with parent set as the core argument.
> > > > > > > >
> > > > > > > > In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
> > > > > > > > which is what you hit here I think.
> > > > > > > >
> > > > > > > > This is different to the previous behavior that was calling
> > > > > > > > clk_core_round_rate_nolock() with core->parent directly, and
> > > > > > > > clk_core_round_rate_nolock() if its core argument is NULL will set
> > > > > > > > req->rate to 0 and bail out without returning an error.
> > > > > > > >
> > > > > > > > Now, your patch probably works because now that you provide a
> > > > > > > > determine_rate implementation, clk_core_can_round() returns true and
> > > > > > > > you'll take a different branch in clk_core_round_rate_nolock(), avoiding
> > > > > > > > that issue entirely.
> > > > > > > >
> > > > > > > > Does that patch work better (on top of next-20221012)?
> > > > > > >
> > > > > > > I admit I didn't go too deep in the research, as my brain processed that as
> > > > > > > "this is a mux clock, not really different from a standard mux, this callback
> > > > > > > is missing, that's not optimal"... then that fixed it and called it a day.
> > > > > > >
> > > > > > > I should've prolonged my research for a better understanding of what was
> > > > > > > actually going on.
> > > > > >
> > > > > > No worries :)
> > > > > >
> > > > > > > What you said actually opened my mind and, with little surprise, your patch
> > > > > > > works as good as mine - no warnings and the clock scales as expected!
> > > > > >
> > > > > > I'm actually wondering if you didn't encounter two issues. What kernel
> > > > > > were you testing before? If it's older than today's next
> > > > > > (next-20221012), you're likely missing
> > > > > >
> > > > >
> > > > > I was testing next-20221011.
> > > > >
> > > > > > https://lore.kernel.org/linux-clk/[email protected]/
> > > > > >
> > > > > > Which is likely to be what fixed the clock scaling. And my patch only
> > > > > > fixed the warning. Could you test next-20221012? If I'm right, you
> > > > > > should only get the warning.
> > > > > >
> > > > >
> > > > > No, I am getting the same situation even after rebasing over next-20221012, without
> > > > > any of the two patches (determine_rate() for mtk mux, nor the one you shared for
> > > > > clk.c), when the warning happens, I get very slow GPU operation and the same "nice"
> > > > > timeout:
> > > > >
> > > > > [ 27.785514] panfrost 13000000.gpu: gpu sched timeout, js=1,
> > > > > config=0x7b00, status=0x0, head=0xa1cb180, tail=0xa1cb180,
> > > > > sched_job=00000000f07d39e3
> > > > >
> > > > > ...so I'm not encountering the same issue that you've fixed with the patches that
> > > > > got merged in next-20221012.
> > > > >
> > > > > Of course, as you were expecting, the warning is also still there and still
> > > > > the same:
> > > > >
> > > > > [ 27.750504] WARNING: CPU: 4 PID: 164 at drivers/clk/clk.c:1462
> > > > > clk_core_init_rate_req+0x84/0x90
> > > >
> > > > Ok. I'm still a bit unsure why it actually fixes anything.
> > > >
> > > > In the current next, clk_core_init_rate_req (through
> > > > clk_core_forward_rate_req) will bail out right away, but the patch will
> > > > clear the request and set the requested rate.
> > > >
> > > > I don't think the req->rate assignment changes anything because our next
> > > > call will be to clk_core_round_rate_nolock that will clear it in the
> > > > !core path, but it might just be that we don't initialize the request
> > > > and end up with some garbage data in it?
> > > >
> > > > Could you test, on top of next-20221012,
> > >
> > > No that's not the case, this patch has no effect at all (yes I've tested it).
> >
> > Ok. Too bad. I've tried to build a test case that was reproducing what
> > you experience, but I don't seem to be able to build one for now.
> >
> > If we go back to your previous explanation on the clock tree, and what
> > we discussed then, there's still something a bit puzzling.
> >
> > You were saying that are calling clk_set_rate on mfg_bg3d, which is a
> > mtk_gate, registered with mtk_clk_register_gate(), and with the
> > mtk_clk_gate_ops_setclr clk_ops. There's no determine_rate / round_rate
> > implementation which makes sense for a gate, and CLK_SET_RATE_PARENT is
> > set by mtk_clk_register_gate(). Everything's good.
> >
> > The clk_set_rate call will thus be forwarded to its parent,
> > mfg_ck_fast_ref, which is a mux, registered with
> > devm_clk_hw_register_mux(), with CLK_SET_RATE_PARENT as well. Makes
> > sense too, any clk_set_rate() call will thus evaluate all of the
> > mfg_ck_fast_ref parents and will either adjust the rate of a parent, or
> > reparent.
> >
> > mfg_ck_fast_ref then has two parents, top_mfg_core_tmp and mfgpll.
> > Judging from your patch, top_mfg_core_tmp is being used.
> > top_mfg_core_tmp is a mtk_mux, registered by mtk_clk_register_mux()
> > (through mtk_clk_register_muxes()), and using
> > mtk_mux_gate_clr_set_upd_ops. CLK_SET_RATE_PARENT is set by
> > mtk_clk_register_mux(). That flag still makes sense for a mux. However,
> > it's really not clear to me how a mux operates without a determine_rate
> > implementation to forward the rates to each of its parents.
> >
> > It looks like we were relying before on the fact that the request was
> > indeed forwarded as is to the current parent. It looks like the parent
> > was not registered at all (core->parent being NULL), and then it was
> > working because of $REASON, possibly one of the muxes preferred to
> > switch to some other clock source.
> >
> > I think we need to address this in multiple ways:
> >
> > - If you have ftrace enabled on that platform, running with "tp_printk
> > trace_event=clk:*" in the kernel command line on a failing kernel would
> > be great, so that we can figure out what is happening exactly.
>
> I spent more time trying to create a setup similar to yours but couldn't
> find anything obviously wrong. In addition to the above, could you start
> a faulty kernel with that patch:
>
> diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
> index 884a58a98b6b..774953901a35 100644
> --- a/drivers/clk/clk.c
> +++ b/drivers/clk/clk.c
> @@ -262,6 +262,9 @@ EXPORT_SYMBOL_GPL(__clk_get_name);
>
> const char *clk_hw_get_name(const struct clk_hw *hw)
> {
> + if (!hw || !hw->core)
> + return "(null)";
> +
> return hw->core->name;
> }
> EXPORT_SYMBOL_GPL(clk_hw_get_name);
> @@ -1471,6 +1474,17 @@ static bool clk_core_can_round(struct clk_core * const core)
> return core->ops->determine_rate || core->ops->round_rate;
> }
>
> +static void clk_core_request_dump(const struct clk_core *core,
> + const struct clk_rate_request *req)
> +{
> + pr_crit("%s\n", core ? core->name : "(null)");
> + pr_crit("\trate %lu\n", req->rate);
> + pr_crit("\tmin %lu, max %lu\n", req->min_rate, req->max_rate);
> + pr_crit("\tbest parent %s, rate %lu\n",
> + clk_hw_get_name(req->best_parent_hw),
> + req->best_parent_rate);
> +}
> +
> static int clk_core_round_rate_nolock(struct clk_core *core,
> struct clk_rate_request *req)
> {
> @@ -2254,8 +2268,12 @@ static unsigned long clk_core_req_round_rate_nolock(struct clk_core *core,
>
> clk_core_init_rate_req(core, &req, req_rate);
>
> + clk_core_request_dump(core, &req);
> +
> ret = clk_core_round_rate_nolock(core, &req);
>
> + clk_core_request_dump(core, &req);
> +
> /* restore the protection */
> clk_core_rate_restore_protect(core, cnt);
>
>
> Alternatively, can I easily get a MT8195 device to test more easily?

The Acer Spin Chromebook 513, specifically CP513-2H, should be available
on Amazon in both the US and UK. No idea about France though. And it's
quite pricey.

ChenYu

2022-10-12 17:15:29

by Maxime Ripard

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

On Wed, Oct 12, 2022 at 03:56:19PM +0200, Maxime Ripard wrote:
> On Wed, Oct 12, 2022 at 02:14:39PM +0200, AngeloGioacchino Del Regno wrote:
> > Il 12/10/22 13:48, Maxime Ripard ha scritto:
> > > On Wed, Oct 12, 2022 at 11:57:15AM +0200, AngeloGioacchino Del Regno wrote:
> > > > Il 12/10/22 11:40, Maxime Ripard ha scritto:
> > > > > On Wed, Oct 12, 2022 at 11:09:59AM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > Il 12/10/22 10:55, Maxime Ripard ha scritto:
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > > > Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
> > > > > > > > to the parent"), the clk_rate_request is .. as the title says, not
> > > > > > > > forwarded anymore to the parent:
> > > > > > >
> > > > > > > It's not entirely true, the rate request should still be forwarded, but
> > > > > > > we don't pass the same instance of clk_rate_request anymore.
> > > > > > >
> > > > > > > > this produces an issue with the MediaTek clock MUX driver during GPU
> > > > > > > > DVFS on MT8195, but not on MT8192 or others.
> > > > > > > >
> > > > > > > > This is because, differently from others, like MT8192 where all of
> > > > > > > > the clocks in the MFG parents tree are of mtk_mux type, but in the
> > > > > > > > parent tree of MT8195's MFG clock, we have one mtk_mux clock and
> > > > > > > > one (clk framework generic) mux clock, like so:
> > > > > > > >
> > > > > > > > names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
> > > > > > > > types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
> > > > > > > >
> > > > > > > > To solve this issue and also keep the GPU DVFS clocks code working
> > > > > > > > as expected, wire up a .determine_rate() callback for the mtk_mux
> > > > > > > > ops; for that, the standard clk_mux_determine_rate_flags() was used
> > > > > > > > as it was possible to.
> > > > > > >
> > > > > > > It probably fixes things indeed, but I'm a bit worried that it just
> > > > > > > works around the actual issue instead of fixing the actual bug...
> > > > > > >
> > > > > > > > This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
> > > > > > > > MT8192 Spherion and MT8195 Tomato; no regressions were seen.
> > > > > > > >
> > > > > > > > For the sake of some more documentation about this issue here's the
> > > > > > > > trace of it:
> > > > > > > >
> > > > > > > > [ 12.211587] ------------[ cut here ]------------
> > > > > > > > [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
> > > > > > > > [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
> > > > > > > > panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
> > > > > > > > cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
> > > > > > > > snd_sof_utils fuse ipv6
> > > > > > > > [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
> > > > > > > > [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
> > > > > > > > [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
> > > > > > > > [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > > [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
> > > > > > > > [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
> > > > > > > > [ 12.211627] sp : ffff80000893b8e0
> > > > > > > > [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
> > > > > > > > [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
> > > > > > > > [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
> > > > > > > > [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
> > > > > > > > [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
> > > > > > > > [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
> > > > > > > > [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
> > > > > > > > [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
> > > > > > > > [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
> > > > > > > > [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
> > > > > > > > [ 12.211669] Call trace:
> > > > > > > > [ 12.211670] clk_core_init_rate_req+0x84/0x90
> > > > > > > > [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
> > > > > > > > [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
> > > > > > > > [ 12.211677] clk_mux_determine_rate+0x1c/0x30
> > > > > > > > [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
> > > > > > > > [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
> > > > > > > > [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
> > > > > > > > [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
> > > > > > > > [ 12.211688] clk_set_rate+0x40/0x94
> > > > > > > > [ 12.211691] _opp_config_clk_single+0x38/0xa0
> > > > > > > > [ 12.211693] _set_opp+0x1b0/0x500
> > > > > > > > [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
> > > > > > > > [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
> > > > > > > > [ 12.211705] devfreq_set_target+0x8c/0x2d0
> > > > > > > > [ 12.211707] devfreq_update_target+0xcc/0xf4
> > > > > > > > [ 12.211708] devfreq_monitor+0x40/0x1d0
> > > > > > > > [ 12.211710] process_one_work+0x294/0x664
> > > > > > > > [ 12.211712] worker_thread+0x7c/0x45c
> > > > > > > > [ 12.211713] kthread+0x104/0x110
> > > > > > > > [ 12.211716] ret_from_fork+0x10/0x20
> > > > > > > > [ 12.211718] irq event stamp: 7102
> > > > > > > > [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
> > > > > > > > [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
> > > > > > > > [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
> > > > > > > > [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
> > > > > > > > [ 12.211730] ---[ end trace 0000000000000000 ]---
> > > > > > >
> > > > > > > ... Indeed, you shouldn't hit that warning at all. It happens in
> > > > > > > clk_core_round_rate_nolock, which takes (before your patch) the
> > > > > > > CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
> > > > > > > you mentioned, and will call clk_core_forward_rate_req() now, that in
> > > > > > > turn calls clk_core_init_rate_nolock().
> > > > > > >
> > > > > > > I think the warning you hit is because core->parent is NULL, which is
> > > > > > > passed to clk_core_forward_rate_req() as the parent argument, and we'll
> > > > > > > call clk_core_init_rate_req() with parent set as the core argument.
> > > > > > >
> > > > > > > In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
> > > > > > > which is what you hit here I think.
> > > > > > >
> > > > > > > This is different to the previous behavior that was calling
> > > > > > > clk_core_round_rate_nolock() with core->parent directly, and
> > > > > > > clk_core_round_rate_nolock() if its core argument is NULL will set
> > > > > > > req->rate to 0 and bail out without returning an error.
> > > > > > >
> > > > > > > Now, your patch probably works because now that you provide a
> > > > > > > determine_rate implementation, clk_core_can_round() returns true and
> > > > > > > you'll take a different branch in clk_core_round_rate_nolock(), avoiding
> > > > > > > that issue entirely.
> > > > > > >
> > > > > > > Does that patch work better (on top of next-20221012)?
> > > > > >
> > > > > > I admit I didn't go too deep in the research, as my brain processed that as
> > > > > > "this is a mux clock, not really different from a standard mux, this callback
> > > > > > is missing, that's not optimal"... then that fixed it and called it a day.
> > > > > >
> > > > > > I should've prolonged my research for a better understanding of what was
> > > > > > actually going on.
> > > > >
> > > > > No worries :)
> > > > >
> > > > > > What you said actually opened my mind and, with little surprise, your patch
> > > > > > works as good as mine - no warnings and the clock scales as expected!
> > > > >
> > > > > I'm actually wondering if you didn't encounter two issues. What kernel
> > > > > were you testing before? If it's older than today's next
> > > > > (next-20221012), you're likely missing
> > > > >
> > > >
> > > > I was testing next-20221011.
> > > >
> > > > > https://lore.kernel.org/linux-clk/[email protected]/
> > > > >
> > > > > Which is likely to be what fixed the clock scaling. And my patch only
> > > > > fixed the warning. Could you test next-20221012? If I'm right, you
> > > > > should only get the warning.
> > > > >
> > > >
> > > > No, I am getting the same situation even after rebasing over next-20221012, without
> > > > any of the two patches (determine_rate() for mtk mux, nor the one you shared for
> > > > clk.c), when the warning happens, I get very slow GPU operation and the same "nice"
> > > > timeout:
> > > >
> > > > [ 27.785514] panfrost 13000000.gpu: gpu sched timeout, js=1,
> > > > config=0x7b00, status=0x0, head=0xa1cb180, tail=0xa1cb180,
> > > > sched_job=00000000f07d39e3
> > > >
> > > > ...so I'm not encountering the same issue that you've fixed with the patches that
> > > > got merged in next-20221012.
> > > >
> > > > Of course, as you were expecting, the warning is also still there and still
> > > > the same:
> > > >
> > > > [ 27.750504] WARNING: CPU: 4 PID: 164 at drivers/clk/clk.c:1462
> > > > clk_core_init_rate_req+0x84/0x90
> > >
> > > Ok. I'm still a bit unsure why it actually fixes anything.
> > >
> > > In the current next, clk_core_init_rate_req (through
> > > clk_core_forward_rate_req) will bail out right away, but the patch will
> > > clear the request and set the requested rate.
> > >
> > > I don't think the req->rate assignment changes anything because our next
> > > call will be to clk_core_round_rate_nolock that will clear it in the
> > > !core path, but it might just be that we don't initialize the request
> > > and end up with some garbage data in it?
> > >
> > > Could you test, on top of next-20221012,
> >
> > No that's not the case, this patch has no effect at all (yes I've tested it).
>
> Ok. Too bad. I've tried to build a test case that was reproducing what
> you experience, but I don't seem to be able to build one for now.
>
> If we go back to your previous explanation on the clock tree, and what
> we discussed then, there's still something a bit puzzling.
>
> You were saying that are calling clk_set_rate on mfg_bg3d, which is a
> mtk_gate, registered with mtk_clk_register_gate(), and with the
> mtk_clk_gate_ops_setclr clk_ops. There's no determine_rate / round_rate
> implementation which makes sense for a gate, and CLK_SET_RATE_PARENT is
> set by mtk_clk_register_gate(). Everything's good.
>
> The clk_set_rate call will thus be forwarded to its parent,
> mfg_ck_fast_ref, which is a mux, registered with
> devm_clk_hw_register_mux(), with CLK_SET_RATE_PARENT as well. Makes
> sense too, any clk_set_rate() call will thus evaluate all of the
> mfg_ck_fast_ref parents and will either adjust the rate of a parent, or
> reparent.
>
> mfg_ck_fast_ref then has two parents, top_mfg_core_tmp and mfgpll.
> Judging from your patch, top_mfg_core_tmp is being used.
> top_mfg_core_tmp is a mtk_mux, registered by mtk_clk_register_mux()
> (through mtk_clk_register_muxes()), and using
> mtk_mux_gate_clr_set_upd_ops. CLK_SET_RATE_PARENT is set by
> mtk_clk_register_mux(). That flag still makes sense for a mux. However,
> it's really not clear to me how a mux operates without a determine_rate
> implementation to forward the rates to each of its parents.
>
> It looks like we were relying before on the fact that the request was
> indeed forwarded as is to the current parent. It looks like the parent
> was not registered at all (core->parent being NULL), and then it was
> working because of $REASON, possibly one of the muxes preferred to
> switch to some other clock source.
>
> I think we need to address this in multiple ways:
>
> - If you have ftrace enabled on that platform, running with "tp_printk
> trace_event=clk:*" in the kernel command line on a failing kernel would
> be great, so that we can figure out what is happening exactly.

I spent more time trying to create a setup similar to yours but couldn't
find anything obviously wrong. In addition to the above, could you start
a faulty kernel with that patch:

diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index 884a58a98b6b..774953901a35 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -262,6 +262,9 @@ EXPORT_SYMBOL_GPL(__clk_get_name);

const char *clk_hw_get_name(const struct clk_hw *hw)
{
+ if (!hw || !hw->core)
+ return "(null)";
+
return hw->core->name;
}
EXPORT_SYMBOL_GPL(clk_hw_get_name);
@@ -1471,6 +1474,17 @@ static bool clk_core_can_round(struct clk_core * const core)
return core->ops->determine_rate || core->ops->round_rate;
}

+static void clk_core_request_dump(const struct clk_core *core,
+ const struct clk_rate_request *req)
+{
+ pr_crit("%s\n", core ? core->name : "(null)");
+ pr_crit("\trate %lu\n", req->rate);
+ pr_crit("\tmin %lu, max %lu\n", req->min_rate, req->max_rate);
+ pr_crit("\tbest parent %s, rate %lu\n",
+ clk_hw_get_name(req->best_parent_hw),
+ req->best_parent_rate);
+}
+
static int clk_core_round_rate_nolock(struct clk_core *core,
struct clk_rate_request *req)
{
@@ -2254,8 +2268,12 @@ static unsigned long clk_core_req_round_rate_nolock(struct clk_core *core,

clk_core_init_rate_req(core, &req, req_rate);

+ clk_core_request_dump(core, &req);
+
ret = clk_core_round_rate_nolock(core, &req);

+ clk_core_request_dump(core, &req);
+
/* restore the protection */
clk_core_rate_restore_protect(core, cnt);

Alternatively, can I easily get a MT8195 device to test more easily?

Thanks!
Maxime

Attachments:

(No filename) (14.33 kB)
signature.asc (235.00 B)
Download all attachments

2022-10-13 07:33:02

by Maxime Ripard

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

On Thu, Oct 13, 2022 at 12:52:41AM +0800, Chen-Yu Tsai wrote:
> On Thu, Oct 13, 2022 at 12:42 AM Maxime Ripard <[email protected]> wrote:
> >
> > On Wed, Oct 12, 2022 at 03:56:19PM +0200, Maxime Ripard wrote:
> > > On Wed, Oct 12, 2022 at 02:14:39PM +0200, AngeloGioacchino Del Regno wrote:
> > > > Il 12/10/22 13:48, Maxime Ripard ha scritto:
> > > > > On Wed, Oct 12, 2022 at 11:57:15AM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > Il 12/10/22 11:40, Maxime Ripard ha scritto:
> > > > > > > On Wed, Oct 12, 2022 at 11:09:59AM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > > > Il 12/10/22 10:55, Maxime Ripard ha scritto:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > > > > > Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
> > > > > > > > > > to the parent"), the clk_rate_request is .. as the title says, not
> > > > > > > > > > forwarded anymore to the parent:
> > > > > > > > >
> > > > > > > > > It's not entirely true, the rate request should still be forwarded, but
> > > > > > > > > we don't pass the same instance of clk_rate_request anymore.
> > > > > > > > >
> > > > > > > > > > this produces an issue with the MediaTek clock MUX driver during GPU
> > > > > > > > > > DVFS on MT8195, but not on MT8192 or others.
> > > > > > > > > >
> > > > > > > > > > This is because, differently from others, like MT8192 where all of
> > > > > > > > > > the clocks in the MFG parents tree are of mtk_mux type, but in the
> > > > > > > > > > parent tree of MT8195's MFG clock, we have one mtk_mux clock and
> > > > > > > > > > one (clk framework generic) mux clock, like so:
> > > > > > > > > >
> > > > > > > > > > names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
> > > > > > > > > > types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
> > > > > > > > > >
> > > > > > > > > > To solve this issue and also keep the GPU DVFS clocks code working
> > > > > > > > > > as expected, wire up a .determine_rate() callback for the mtk_mux
> > > > > > > > > > ops; for that, the standard clk_mux_determine_rate_flags() was used
> > > > > > > > > > as it was possible to.
> > > > > > > > >
> > > > > > > > > It probably fixes things indeed, but I'm a bit worried that it just
> > > > > > > > > works around the actual issue instead of fixing the actual bug...
> > > > > > > > >
> > > > > > > > > > This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
> > > > > > > > > > MT8192 Spherion and MT8195 Tomato; no regressions were seen.
> > > > > > > > > >
> > > > > > > > > > For the sake of some more documentation about this issue here's the
> > > > > > > > > > trace of it:
> > > > > > > > > >
> > > > > > > > > > [ 12.211587] ------------[ cut here ]------------
> > > > > > > > > > [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
> > > > > > > > > > [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
> > > > > > > > > > panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
> > > > > > > > > > cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
> > > > > > > > > > snd_sof_utils fuse ipv6
> > > > > > > > > > [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
> > > > > > > > > > [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
> > > > > > > > > > [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
> > > > > > > > > > [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > > > > [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
> > > > > > > > > > [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
> > > > > > > > > > [ 12.211627] sp : ffff80000893b8e0
> > > > > > > > > > [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
> > > > > > > > > > [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
> > > > > > > > > > [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
> > > > > > > > > > [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
> > > > > > > > > > [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
> > > > > > > > > > [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
> > > > > > > > > > [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
> > > > > > > > > > [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
> > > > > > > > > > [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
> > > > > > > > > > [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
> > > > > > > > > > [ 12.211669] Call trace:
> > > > > > > > > > [ 12.211670] clk_core_init_rate_req+0x84/0x90
> > > > > > > > > > [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
> > > > > > > > > > [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
> > > > > > > > > > [ 12.211677] clk_mux_determine_rate+0x1c/0x30
> > > > > > > > > > [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
> > > > > > > > > > [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
> > > > > > > > > > [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
> > > > > > > > > > [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
> > > > > > > > > > [ 12.211688] clk_set_rate+0x40/0x94
> > > > > > > > > > [ 12.211691] _opp_config_clk_single+0x38/0xa0
> > > > > > > > > > [ 12.211693] _set_opp+0x1b0/0x500
> > > > > > > > > > [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
> > > > > > > > > > [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
> > > > > > > > > > [ 12.211705] devfreq_set_target+0x8c/0x2d0
> > > > > > > > > > [ 12.211707] devfreq_update_target+0xcc/0xf4
> > > > > > > > > > [ 12.211708] devfreq_monitor+0x40/0x1d0
> > > > > > > > > > [ 12.211710] process_one_work+0x294/0x664
> > > > > > > > > > [ 12.211712] worker_thread+0x7c/0x45c
> > > > > > > > > > [ 12.211713] kthread+0x104/0x110
> > > > > > > > > > [ 12.211716] ret_from_fork+0x10/0x20
> > > > > > > > > > [ 12.211718] irq event stamp: 7102
> > > > > > > > > > [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
> > > > > > > > > > [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
> > > > > > > > > > [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
> > > > > > > > > > [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
> > > > > > > > > > [ 12.211730] ---[ end trace 0000000000000000 ]---
> > > > > > > > >
> > > > > > > > > ... Indeed, you shouldn't hit that warning at all. It happens in
> > > > > > > > > clk_core_round_rate_nolock, which takes (before your patch) the
> > > > > > > > > CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
> > > > > > > > > you mentioned, and will call clk_core_forward_rate_req() now, that in
> > > > > > > > > turn calls clk_core_init_rate_nolock().
> > > > > > > > >
> > > > > > > > > I think the warning you hit is because core->parent is NULL, which is
> > > > > > > > > passed to clk_core_forward_rate_req() as the parent argument, and we'll
> > > > > > > > > call clk_core_init_rate_req() with parent set as the core argument.
> > > > > > > > >
> > > > > > > > > In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
> > > > > > > > > which is what you hit here I think.
> > > > > > > > >
> > > > > > > > > This is different to the previous behavior that was calling
> > > > > > > > > clk_core_round_rate_nolock() with core->parent directly, and
> > > > > > > > > clk_core_round_rate_nolock() if its core argument is NULL will set
> > > > > > > > > req->rate to 0 and bail out without returning an error.
> > > > > > > > >
> > > > > > > > > Now, your patch probably works because now that you provide a
> > > > > > > > > determine_rate implementation, clk_core_can_round() returns true and
> > > > > > > > > you'll take a different branch in clk_core_round_rate_nolock(), avoiding
> > > > > > > > > that issue entirely.
> > > > > > > > >
> > > > > > > > > Does that patch work better (on top of next-20221012)?
> > > > > > > >
> > > > > > > > I admit I didn't go too deep in the research, as my brain processed that as
> > > > > > > > "this is a mux clock, not really different from a standard mux, this callback
> > > > > > > > is missing, that's not optimal"... then that fixed it and called it a day.
> > > > > > > >
> > > > > > > > I should've prolonged my research for a better understanding of what was
> > > > > > > > actually going on.
> > > > > > >
> > > > > > > No worries :)
> > > > > > >
> > > > > > > > What you said actually opened my mind and, with little surprise, your patch
> > > > > > > > works as good as mine - no warnings and the clock scales as expected!
> > > > > > >
> > > > > > > I'm actually wondering if you didn't encounter two issues. What kernel
> > > > > > > were you testing before? If it's older than today's next
> > > > > > > (next-20221012), you're likely missing
> > > > > > >
> > > > > >
> > > > > > I was testing next-20221011.
> > > > > >
> > > > > > > https://lore.kernel.org/linux-clk/[email protected]/
> > > > > > >
> > > > > > > Which is likely to be what fixed the clock scaling. And my patch only
> > > > > > > fixed the warning. Could you test next-20221012? If I'm right, you
> > > > > > > should only get the warning.
> > > > > > >
> > > > > >
> > > > > > No, I am getting the same situation even after rebasing over next-20221012, without
> > > > > > any of the two patches (determine_rate() for mtk mux, nor the one you shared for
> > > > > > clk.c), when the warning happens, I get very slow GPU operation and the same "nice"
> > > > > > timeout:
> > > > > >
> > > > > > [ 27.785514] panfrost 13000000.gpu: gpu sched timeout, js=1,
> > > > > > config=0x7b00, status=0x0, head=0xa1cb180, tail=0xa1cb180,
> > > > > > sched_job=00000000f07d39e3
> > > > > >
> > > > > > ...so I'm not encountering the same issue that you've fixed with the patches that
> > > > > > got merged in next-20221012.
> > > > > >
> > > > > > Of course, as you were expecting, the warning is also still there and still
> > > > > > the same:
> > > > > >
> > > > > > [ 27.750504] WARNING: CPU: 4 PID: 164 at drivers/clk/clk.c:1462
> > > > > > clk_core_init_rate_req+0x84/0x90
> > > > >
> > > > > Ok. I'm still a bit unsure why it actually fixes anything.
> > > > >
> > > > > In the current next, clk_core_init_rate_req (through
> > > > > clk_core_forward_rate_req) will bail out right away, but the patch will
> > > > > clear the request and set the requested rate.
> > > > >
> > > > > I don't think the req->rate assignment changes anything because our next
> > > > > call will be to clk_core_round_rate_nolock that will clear it in the
> > > > > !core path, but it might just be that we don't initialize the request
> > > > > and end up with some garbage data in it?
> > > > >
> > > > > Could you test, on top of next-20221012,
> > > >
> > > > No that's not the case, this patch has no effect at all (yes I've tested it).
> > >
> > > Ok. Too bad. I've tried to build a test case that was reproducing what
> > > you experience, but I don't seem to be able to build one for now.
> > >
> > > If we go back to your previous explanation on the clock tree, and what
> > > we discussed then, there's still something a bit puzzling.
> > >
> > > You were saying that are calling clk_set_rate on mfg_bg3d, which is a
> > > mtk_gate, registered with mtk_clk_register_gate(), and with the
> > > mtk_clk_gate_ops_setclr clk_ops. There's no determine_rate / round_rate
> > > implementation which makes sense for a gate, and CLK_SET_RATE_PARENT is
> > > set by mtk_clk_register_gate(). Everything's good.
> > >
> > > The clk_set_rate call will thus be forwarded to its parent,
> > > mfg_ck_fast_ref, which is a mux, registered with
> > > devm_clk_hw_register_mux(), with CLK_SET_RATE_PARENT as well. Makes
> > > sense too, any clk_set_rate() call will thus evaluate all of the
> > > mfg_ck_fast_ref parents and will either adjust the rate of a parent, or
> > > reparent.
> > >
> > > mfg_ck_fast_ref then has two parents, top_mfg_core_tmp and mfgpll.
> > > Judging from your patch, top_mfg_core_tmp is being used.
> > > top_mfg_core_tmp is a mtk_mux, registered by mtk_clk_register_mux()
> > > (through mtk_clk_register_muxes()), and using
> > > mtk_mux_gate_clr_set_upd_ops. CLK_SET_RATE_PARENT is set by
> > > mtk_clk_register_mux(). That flag still makes sense for a mux. However,
> > > it's really not clear to me how a mux operates without a determine_rate
> > > implementation to forward the rates to each of its parents.
> > >
> > > It looks like we were relying before on the fact that the request was
> > > indeed forwarded as is to the current parent. It looks like the parent
> > > was not registered at all (core->parent being NULL), and then it was
> > > working because of $REASON, possibly one of the muxes preferred to
> > > switch to some other clock source.
> > >
> > > I think we need to address this in multiple ways:
> > >
> > > - If you have ftrace enabled on that platform, running with "tp_printk
> > > trace_event=clk:*" in the kernel command line on a failing kernel would
> > > be great, so that we can figure out what is happening exactly.
> >
> > I spent more time trying to create a setup similar to yours but couldn't
> > find anything obviously wrong. In addition to the above, could you start
> > a faulty kernel with that patch:
> >
> > diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
> > index 884a58a98b6b..774953901a35 100644
> > --- a/drivers/clk/clk.c
> > +++ b/drivers/clk/clk.c
> > @@ -262,6 +262,9 @@ EXPORT_SYMBOL_GPL(__clk_get_name);
> >
> > const char *clk_hw_get_name(const struct clk_hw *hw)
> > {
> > + if (!hw || !hw->core)
> > + return "(null)";
> > +
> > return hw->core->name;
> > }
> > EXPORT_SYMBOL_GPL(clk_hw_get_name);
> > @@ -1471,6 +1474,17 @@ static bool clk_core_can_round(struct clk_core * const core)
> > return core->ops->determine_rate || core->ops->round_rate;
> > }
> >
> > +static void clk_core_request_dump(const struct clk_core *core,
> > + const struct clk_rate_request *req)
> > +{
> > + pr_crit("%s\n", core ? core->name : "(null)");
> > + pr_crit("\trate %lu\n", req->rate);
> > + pr_crit("\tmin %lu, max %lu\n", req->min_rate, req->max_rate);
> > + pr_crit("\tbest parent %s, rate %lu\n",
> > + clk_hw_get_name(req->best_parent_hw),
> > + req->best_parent_rate);
> > +}
> > +
> > static int clk_core_round_rate_nolock(struct clk_core *core,
> > struct clk_rate_request *req)
> > {
> > @@ -2254,8 +2268,12 @@ static unsigned long clk_core_req_round_rate_nolock(struct clk_core *core,
> >
> > clk_core_init_rate_req(core, &req, req_rate);
> >
> > + clk_core_request_dump(core, &req);
> > +
> > ret = clk_core_round_rate_nolock(core, &req);
> >
> > + clk_core_request_dump(core, &req);
> > +
> > /* restore the protection */
> > clk_core_rate_restore_protect(core, cnt);
> >
> >
> > Alternatively, can I easily get a MT8195 device to test more easily?
>
> The Acer Spin Chromebook 513, specifically CP513-2H, should be available
> on Amazon in both the US and UK. No idea about France though. And it's
> quite pricey.

Thanks, but I was more expecting some kind of SBC, and it's a bit too
expensive for me indeed :/

Maxime

Attachments:

(No filename) (15.89 kB)
signature.asc (235.00 B)
Download all attachments

2022-10-13 08:43:56

by Maxime Ripard

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Hi

Thanks again for testing, it's very appreciated

On Thu, Oct 13, 2022 at 10:13:04AM +0200, AngeloGioacchino Del Regno wrote:
> Il 13/10/22 09:19, Maxime Ripard ha scritto:
> > On Thu, Oct 13, 2022 at 12:52:41AM +0800, Chen-Yu Tsai wrote:
> > > On Thu, Oct 13, 2022 at 12:42 AM Maxime Ripard <[email protected]> wrote:
> > > >
> > > > On Wed, Oct 12, 2022 at 03:56:19PM +0200, Maxime Ripard wrote:
> > > > > On Wed, Oct 12, 2022 at 02:14:39PM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > Il 12/10/22 13:48, Maxime Ripard ha scritto:
> > > > > > > On Wed, Oct 12, 2022 at 11:57:15AM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > > > Il 12/10/22 11:40, Maxime Ripard ha scritto:
> > > > > > > > > On Wed, Oct 12, 2022 at 11:09:59AM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > > > > > Il 12/10/22 10:55, Maxime Ripard ha scritto:
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
> > > > > > > > > > > > Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
> > > > > > > > > > > > to the parent"), the clk_rate_request is .. as the title says, not
> > > > > > > > > > > > forwarded anymore to the parent:
> > > > > > > > > > >
> > > > > > > > > > > It's not entirely true, the rate request should still be forwarded, but
> > > > > > > > > > > we don't pass the same instance of clk_rate_request anymore.
> > > > > > > > > > >
> > > > > > > > > > > > this produces an issue with the MediaTek clock MUX driver during GPU
> > > > > > > > > > > > DVFS on MT8195, but not on MT8192 or others.
> > > > > > > > > > > >
> > > > > > > > > > > > This is because, differently from others, like MT8192 where all of
> > > > > > > > > > > > the clocks in the MFG parents tree are of mtk_mux type, but in the
> > > > > > > > > > > > parent tree of MT8195's MFG clock, we have one mtk_mux clock and
> > > > > > > > > > > > one (clk framework generic) mux clock, like so:
> > > > > > > > > > > >
> > > > > > > > > > > > names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
> > > > > > > > > > > > types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
> > > > > > > > > > > >
> > > > > > > > > > > > To solve this issue and also keep the GPU DVFS clocks code working
> > > > > > > > > > > > as expected, wire up a .determine_rate() callback for the mtk_mux
> > > > > > > > > > > > ops; for that, the standard clk_mux_determine_rate_flags() was used
> > > > > > > > > > > > as it was possible to.
> > > > > > > > > > >
> > > > > > > > > > > It probably fixes things indeed, but I'm a bit worried that it just
> > > > > > > > > > > works around the actual issue instead of fixing the actual bug...
> > > > > > > > > > >
> > > > > > > > > > > > This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
> > > > > > > > > > > > MT8192 Spherion and MT8195 Tomato; no regressions were seen.
> > > > > > > > > > > >
> > > > > > > > > > > > For the sake of some more documentation about this issue here's the
> > > > > > > > > > > > trace of it:
> > > > > > > > > > > >
> > > > > > > > > > > > [ 12.211587] ------------[ cut here ]------------
> > > > > > > > > > > > [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
> > > > > > > > > > > > [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
> > > > > > > > > > > > panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
> > > > > > > > > > > > cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
> > > > > > > > > > > > snd_sof_utils fuse ipv6
> > > > > > > > > > > > [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
> > > > > > > > > > > > [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
> > > > > > > > > > > > [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
> > > > > > > > > > > > [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > > > > > > [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
> > > > > > > > > > > > [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
> > > > > > > > > > > > [ 12.211627] sp : ffff80000893b8e0
> > > > > > > > > > > > [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
> > > > > > > > > > > > [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
> > > > > > > > > > > > [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
> > > > > > > > > > > > [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
> > > > > > > > > > > > [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
> > > > > > > > > > > > [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
> > > > > > > > > > > > [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
> > > > > > > > > > > > [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
> > > > > > > > > > > > [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
> > > > > > > > > > > > [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
> > > > > > > > > > > > [ 12.211669] Call trace:
> > > > > > > > > > > > [ 12.211670] clk_core_init_rate_req+0x84/0x90
> > > > > > > > > > > > [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
> > > > > > > > > > > > [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
> > > > > > > > > > > > [ 12.211677] clk_mux_determine_rate+0x1c/0x30
> > > > > > > > > > > > [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
> > > > > > > > > > > > [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
> > > > > > > > > > > > [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
> > > > > > > > > > > > [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
> > > > > > > > > > > > [ 12.211688] clk_set_rate+0x40/0x94
> > > > > > > > > > > > [ 12.211691] _opp_config_clk_single+0x38/0xa0
> > > > > > > > > > > > [ 12.211693] _set_opp+0x1b0/0x500
> > > > > > > > > > > > [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
> > > > > > > > > > > > [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
> > > > > > > > > > > > [ 12.211705] devfreq_set_target+0x8c/0x2d0
> > > > > > > > > > > > [ 12.211707] devfreq_update_target+0xcc/0xf4
> > > > > > > > > > > > [ 12.211708] devfreq_monitor+0x40/0x1d0
> > > > > > > > > > > > [ 12.211710] process_one_work+0x294/0x664
> > > > > > > > > > > > [ 12.211712] worker_thread+0x7c/0x45c
> > > > > > > > > > > > [ 12.211713] kthread+0x104/0x110
> > > > > > > > > > > > [ 12.211716] ret_from_fork+0x10/0x20
> > > > > > > > > > > > [ 12.211718] irq event stamp: 7102
> > > > > > > > > > > > [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
> > > > > > > > > > > > [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
> > > > > > > > > > > > [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
> > > > > > > > > > > > [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
> > > > > > > > > > > > [ 12.211730] ---[ end trace 0000000000000000 ]---
> > > > > > > > > > >
> > > > > > > > > > > ... Indeed, you shouldn't hit that warning at all. It happens in
> > > > > > > > > > > clk_core_round_rate_nolock, which takes (before your patch) the
> > > > > > > > > > > CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
> > > > > > > > > > > you mentioned, and will call clk_core_forward_rate_req() now, that in
> > > > > > > > > > > turn calls clk_core_init_rate_nolock().
> > > > > > > > > > >
> > > > > > > > > > > I think the warning you hit is because core->parent is NULL, which is
> > > > > > > > > > > passed to clk_core_forward_rate_req() as the parent argument, and we'll
> > > > > > > > > > > call clk_core_init_rate_req() with parent set as the core argument.
> > > > > > > > > > >
> > > > > > > > > > > In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
> > > > > > > > > > > which is what you hit here I think.
> > > > > > > > > > >
> > > > > > > > > > > This is different to the previous behavior that was calling
> > > > > > > > > > > clk_core_round_rate_nolock() with core->parent directly, and
> > > > > > > > > > > clk_core_round_rate_nolock() if its core argument is NULL will set
> > > > > > > > > > > req->rate to 0 and bail out without returning an error.
> > > > > > > > > > >
> > > > > > > > > > > Now, your patch probably works because now that you provide a
> > > > > > > > > > > determine_rate implementation, clk_core_can_round() returns true and
> > > > > > > > > > > you'll take a different branch in clk_core_round_rate_nolock(), avoiding
> > > > > > > > > > > that issue entirely.
> > > > > > > > > > >
> > > > > > > > > > > Does that patch work better (on top of next-20221012)?
> > > > > > > > > >
> > > > > > > > > > I admit I didn't go too deep in the research, as my brain processed that as
> > > > > > > > > > "this is a mux clock, not really different from a standard mux, this callback
> > > > > > > > > > is missing, that's not optimal"... then that fixed it and called it a day.
> > > > > > > > > >
> > > > > > > > > > I should've prolonged my research for a better understanding of what was
> > > > > > > > > > actually going on.
> > > > > > > > >
> > > > > > > > > No worries :)
> > > > > > > > >
> > > > > > > > > > What you said actually opened my mind and, with little surprise, your patch
> > > > > > > > > > works as good as mine - no warnings and the clock scales as expected!
> > > > > > > > >
> > > > > > > > > I'm actually wondering if you didn't encounter two issues. What kernel
> > > > > > > > > were you testing before? If it's older than today's next
> > > > > > > > > (next-20221012), you're likely missing
> > > > > > > > >
> > > > > > > >
> > > > > > > > I was testing next-20221011.
> > > > > > > >
> > > > > > > > > https://lore.kernel.org/linux-clk/[email protected]/
> > > > > > > > >
> > > > > > > > > Which is likely to be what fixed the clock scaling. And my patch only
> > > > > > > > > fixed the warning. Could you test next-20221012? If I'm right, you
> > > > > > > > > should only get the warning.
> > > > > > > > >
> > > > > > > >
> > > > > > > > No, I am getting the same situation even after rebasing over next-20221012, without
> > > > > > > > any of the two patches (determine_rate() for mtk mux, nor the one you shared for
> > > > > > > > clk.c), when the warning happens, I get very slow GPU operation and the same "nice"
> > > > > > > > timeout:
> > > > > > > >
> > > > > > > > [ 27.785514] panfrost 13000000.gpu: gpu sched timeout, js=1,
> > > > > > > > config=0x7b00, status=0x0, head=0xa1cb180, tail=0xa1cb180,
> > > > > > > > sched_job=00000000f07d39e3
> > > > > > > >
> > > > > > > > ...so I'm not encountering the same issue that you've fixed with the patches that
> > > > > > > > got merged in next-20221012.
> > > > > > > >
> > > > > > > > Of course, as you were expecting, the warning is also still there and still
> > > > > > > > the same:
> > > > > > > >
> > > > > > > > [ 27.750504] WARNING: CPU: 4 PID: 164 at drivers/clk/clk.c:1462
> > > > > > > > clk_core_init_rate_req+0x84/0x90
> > > > > > >
> > > > > > > Ok. I'm still a bit unsure why it actually fixes anything.
> > > > > > >
> > > > > > > In the current next, clk_core_init_rate_req (through
> > > > > > > clk_core_forward_rate_req) will bail out right away, but the patch will
> > > > > > > clear the request and set the requested rate.
> > > > > > >
> > > > > > > I don't think the req->rate assignment changes anything because our next
> > > > > > > call will be to clk_core_round_rate_nolock that will clear it in the
> > > > > > > !core path, but it might just be that we don't initialize the request
> > > > > > > and end up with some garbage data in it?
> > > > > > >
> > > > > > > Could you test, on top of next-20221012,
> > > > > >
> > > > > > No that's not the case, this patch has no effect at all (yes I've tested it).
> > > > >
> > > > > Ok. Too bad. I've tried to build a test case that was reproducing what
> > > > > you experience, but I don't seem to be able to build one for now.
> > > > >
> > > > > If we go back to your previous explanation on the clock tree, and what
> > > > > we discussed then, there's still something a bit puzzling.
> > > > >
> > > > > You were saying that are calling clk_set_rate on mfg_bg3d, which is a
> > > > > mtk_gate, registered with mtk_clk_register_gate(), and with the
> > > > > mtk_clk_gate_ops_setclr clk_ops. There's no determine_rate / round_rate
> > > > > implementation which makes sense for a gate, and CLK_SET_RATE_PARENT is
> > > > > set by mtk_clk_register_gate(). Everything's good.
> > > > >
> > > > > The clk_set_rate call will thus be forwarded to its parent,
> > > > > mfg_ck_fast_ref, which is a mux, registered with
> > > > > devm_clk_hw_register_mux(), with CLK_SET_RATE_PARENT as well. Makes
> > > > > sense too, any clk_set_rate() call will thus evaluate all of the
> > > > > mfg_ck_fast_ref parents and will either adjust the rate of a parent, or
> > > > > reparent.
> > > > >
> > > > > mfg_ck_fast_ref then has two parents, top_mfg_core_tmp and mfgpll.
> > > > > Judging from your patch, top_mfg_core_tmp is being used.
> > > > > top_mfg_core_tmp is a mtk_mux, registered by mtk_clk_register_mux()
> > > > > (through mtk_clk_register_muxes()), and using
> > > > > mtk_mux_gate_clr_set_upd_ops. CLK_SET_RATE_PARENT is set by
> > > > > mtk_clk_register_mux(). That flag still makes sense for a mux. However,
> > > > > it's really not clear to me how a mux operates without a determine_rate
> > > > > implementation to forward the rates to each of its parents.
> > > > >
> > > > > It looks like we were relying before on the fact that the request was
> > > > > indeed forwarded as is to the current parent. It looks like the parent
> > > > > was not registered at all (core->parent being NULL), and then it was
> > > > > working because of $REASON, possibly one of the muxes preferred to
> > > > > switch to some other clock source.
> > > > >
> > > > > I think we need to address this in multiple ways:
> > > > >
> > > > > - If you have ftrace enabled on that platform, running with "tp_printk
> > > > > trace_event=clk:*" in the kernel command line on a failing kernel would
> > > > > be great, so that we can figure out what is happening exactly.
> > > >
> > > > I spent more time trying to create a setup similar to yours but couldn't
> > > > find anything obviously wrong. In addition to the above, could you start
> > > > a faulty kernel with that patch:
> > > >
> > > > diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
> > > > index 884a58a98b6b..774953901a35 100644
> > > > --- a/drivers/clk/clk.c
> > > > +++ b/drivers/clk/clk.c
> > > > @@ -262,6 +262,9 @@ EXPORT_SYMBOL_GPL(__clk_get_name);
> > > >
> > > > const char *clk_hw_get_name(const struct clk_hw *hw)
> > > > {
> > > > + if (!hw || !hw->core)
> > > > + return "(null)";
> > > > +
> > > > return hw->core->name;
> > > > }
> > > > EXPORT_SYMBOL_GPL(clk_hw_get_name);
> > > > @@ -1471,6 +1474,17 @@ static bool clk_core_can_round(struct clk_core * const core)
> > > > return core->ops->determine_rate || core->ops->round_rate;
> > > > }
> > > >
> > > > +static void clk_core_request_dump(const struct clk_core *core,
> > > > + const struct clk_rate_request *req)
> > > > +{
> > > > + pr_crit("%s\n", core ? core->name : "(null)");
> > > > + pr_crit("\trate %lu\n", req->rate);
> > > > + pr_crit("\tmin %lu, max %lu\n", req->min_rate, req->max_rate);
> > > > + pr_crit("\tbest parent %s, rate %lu\n",
> > > > + clk_hw_get_name(req->best_parent_hw),
> > > > + req->best_parent_rate);
> > > > +}
> > > > +
> > > > static int clk_core_round_rate_nolock(struct clk_core *core,
> > > > struct clk_rate_request *req)
> > > > {
> > > > @@ -2254,8 +2268,12 @@ static unsigned long clk_core_req_round_rate_nolock(struct clk_core *core,
> > > >
> > > > clk_core_init_rate_req(core, &req, req_rate);
> > > >
> > > > + clk_core_request_dump(core, &req);
> > > > +
> > > > ret = clk_core_round_rate_nolock(core, &req);
> > > >
> > > > + clk_core_request_dump(core, &req);
> > > > +
> > > > /* restore the protection */
> > > > clk_core_rate_restore_protect(core, cnt);
> > > >
> > > >
> > > > Alternatively, can I easily get a MT8195 device to test more easily?
> > >
> > > The Acer Spin Chromebook 513, specifically CP513-2H, should be available
> > > on Amazon in both the US and UK. No idea about France though. And it's
> > > quite pricey.
> >
> > Thanks, but I was more expecting some kind of SBC, and it's a bit too
> > expensive for me indeed :/
> >
>
> Yeah that's the one - and it's exactly the device that I'm using here.
>
> Btw, I will avoid pasting the same trace over and over again and skip to
> the output that you added with the patch.
>
> [ 11.447922] panfrost 13000000.gpu: clock rate = 699999695
> [ 11.453402] panfrost 13000000.gpu: Looking up mali-supply from device tree
> [ 11.461842] mfg_bg3d
> [ 11.471067] rate 700000000
> [ 11.473857] min 0, max 18446744073709551615
> [ 11.478123] best parent mfg_ck_fast_ref, rate 699999695
>
> ...
>
> [ 11.826631] mfg_bg3d
> [ 11.829249] rate 699999695
> [ 11.832472] min 0, max 18446744073709551615
> [ 11.837159] best parent mfg_ck_fast_ref, rate 699999695
>
> ...
>
> [ 12.190365] mfg_bg3d
> [ 12.198676] rate 390000000
> [ 12.206494] min 0, max 18446744073709551615
> [ 12.236155] best parent mfg_ck_fast_ref, rate 699999695
>
> ...
>
> [ 12.592687] mfg_bg3d
> [ 12.592693] rate 390000000
> [ 12.592698] min 0, max 18446744073709551615
> [ 12.592702] best parent mfg_ck_fast_ref, rate 390000000

Yeah, so it all seems to make sense. The parent is still the same, and
the rate is equal or close enough.

It would be interesting to get the traces after the second
clk_core_request_dump() call to see what the CCF is doing, because
there's probably some difference there. Could you enable the traces like
I pointed above, and paste them with and without the patch?

Maxime

Attachments:

(No filename) (18.67 kB)
signature.asc (235.00 B)
Download all attachments

2022-10-13 09:04:22

by AngeloGioacchino Del Regno

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Il 13/10/22 09:19, Maxime Ripard ha scritto:
> On Thu, Oct 13, 2022 at 12:52:41AM +0800, Chen-Yu Tsai wrote:
>> On Thu, Oct 13, 2022 at 12:42 AM Maxime Ripard <[email protected]> wrote:
>>>
>>> On Wed, Oct 12, 2022 at 03:56:19PM +0200, Maxime Ripard wrote:
>>>> On Wed, Oct 12, 2022 at 02:14:39PM +0200, AngeloGioacchino Del Regno wrote:
>>>>> Il 12/10/22 13:48, Maxime Ripard ha scritto:
>>>>>> On Wed, Oct 12, 2022 at 11:57:15AM +0200, AngeloGioacchino Del Regno wrote:
>>>>>>> Il 12/10/22 11:40, Maxime Ripard ha scritto:
>>>>>>>> On Wed, Oct 12, 2022 at 11:09:59AM +0200, AngeloGioacchino Del Regno wrote:
>>>>>>>>> Il 12/10/22 10:55, Maxime Ripard ha scritto:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 11, 2022 at 03:55:48PM +0200, AngeloGioacchino Del Regno wrote:
>>>>>>>>>>> Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
>>>>>>>>>>> to the parent"), the clk_rate_request is .. as the title says, not
>>>>>>>>>>> forwarded anymore to the parent:
>>>>>>>>>>
>>>>>>>>>> It's not entirely true, the rate request should still be forwarded, but
>>>>>>>>>> we don't pass the same instance of clk_rate_request anymore.
>>>>>>>>>>
>>>>>>>>>>> this produces an issue with the MediaTek clock MUX driver during GPU
>>>>>>>>>>> DVFS on MT8195, but not on MT8192 or others.
>>>>>>>>>>>
>>>>>>>>>>> This is because, differently from others, like MT8192 where all of
>>>>>>>>>>> the clocks in the MFG parents tree are of mtk_mux type, but in the
>>>>>>>>>>> parent tree of MT8195's MFG clock, we have one mtk_mux clock and
>>>>>>>>>>> one (clk framework generic) mux clock, like so:
>>>>>>>>>>>
>>>>>>>>>>> names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
>>>>>>>>>>> types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
>>>>>>>>>>>
>>>>>>>>>>> To solve this issue and also keep the GPU DVFS clocks code working
>>>>>>>>>>> as expected, wire up a .determine_rate() callback for the mtk_mux
>>>>>>>>>>> ops; for that, the standard clk_mux_determine_rate_flags() was used
>>>>>>>>>>> as it was possible to.
>>>>>>>>>>
>>>>>>>>>> It probably fixes things indeed, but I'm a bit worried that it just
>>>>>>>>>> works around the actual issue instead of fixing the actual bug...
>>>>>>>>>>
>>>>>>>>>>> This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
>>>>>>>>>>> MT8192 Spherion and MT8195 Tomato; no regressions were seen.
>>>>>>>>>>>
>>>>>>>>>>> For the sake of some more documentation about this issue here's the
>>>>>>>>>>> trace of it:
>>>>>>>>>>>
>>>>>>>>>>> [ 12.211587] ------------[ cut here ]------------
>>>>>>>>>>> [ 12.211589] WARNING: CPU: 6 PID: 78 at drivers/clk/clk.c:1462 clk_core_init_rate_req+0x84/0x90
>>>>>>>>>>> [ 12.211593] Modules linked in: stp crct10dif_ce mtk_adsp_common llc rfkill snd_sof_xtensa_dsp
>>>>>>>>>>> panfrost(+) sbs_battery cros_ec_lid_angle cros_ec_sensors snd_sof_of
>>>>>>>>>>> cros_ec_sensors_core hid_multitouch cros_usbpd_logger snd_sof gpu_sched
>>>>>>>>>>> snd_sof_utils fuse ipv6
>>>>>>>>>>> [ 12.211614] CPU: 6 PID: 78 Comm: kworker/u16:2 Tainted: G W 6.0.0-next-20221011+ #58
>>>>>>>>>>> [ 12.211616] Hardware name: Acer Tomato (rev2) board (DT)
>>>>>>>>>>> [ 12.211617] Workqueue: devfreq_wq devfreq_monitor
>>>>>>>>>>> [ 12.211620] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>>>>>> [ 12.211622] pc : clk_core_init_rate_req+0x84/0x90
>>>>>>>>>>> [ 12.211625] lr : clk_core_forward_rate_req+0xa4/0xe4
>>>>>>>>>>> [ 12.211627] sp : ffff80000893b8e0
>>>>>>>>>>> [ 12.211628] x29: ffff80000893b8e0 x28: ffffdddf92f9b000 x27: ffff46a2c0e8bc05
>>>>>>>>>>> [ 12.211632] x26: ffff46a2c1041200 x25: 0000000000000000 x24: 00000000173eed80
>>>>>>>>>>> [ 12.211636] x23: ffff80000893b9c0 x22: ffff80000893b940 x21: 0000000000000000
>>>>>>>>>>> [ 12.211641] x20: ffff46a2c1039f00 x19: ffff46a2c1039f00 x18: 0000000000000000
>>>>>>>>>>> [ 12.211645] x17: 0000000000000038 x16: 000000000000d904 x15: 0000000000000003
>>>>>>>>>>> [ 12.211649] x14: ffffdddf9357ce48 x13: ffffdddf935e71c8 x12: 000000000004803c
>>>>>>>>>>> [ 12.211653] x11: 00000000a867d7ad x10: 00000000a867d7ad x9 : ffffdddf90c28df4
>>>>>>>>>>> [ 12.211657] x8 : ffffdddf9357a980 x7 : 0000000000000000 x6 : 0000000000000004
>>>>>>>>>>> [ 12.211661] x5 : ffffffffffffffc8 x4 : 00000000173eed80 x3 : ffff80000893b940
>>>>>>>>>>> [ 12.211665] x2 : 00000000173eed80 x1 : ffff80000893b940 x0 : 0000000000000000
>>>>>>>>>>> [ 12.211669] Call trace:
>>>>>>>>>>> [ 12.211670] clk_core_init_rate_req+0x84/0x90
>>>>>>>>>>> [ 12.211673] clk_core_round_rate_nolock+0xe8/0x10c
>>>>>>>>>>> [ 12.211675] clk_mux_determine_rate_flags+0x174/0x1f0
>>>>>>>>>>> [ 12.211677] clk_mux_determine_rate+0x1c/0x30
>>>>>>>>>>> [ 12.211680] clk_core_determine_round_nolock+0x74/0x130
>>>>>>>>>>> [ 12.211682] clk_core_round_rate_nolock+0x58/0x10c
>>>>>>>>>>> [ 12.211684] clk_core_round_rate_nolock+0xf4/0x10c
>>>>>>>>>>> [ 12.211686] clk_core_set_rate_nolock+0x194/0x2ac
>>>>>>>>>>> [ 12.211688] clk_set_rate+0x40/0x94
>>>>>>>>>>> [ 12.211691] _opp_config_clk_single+0x38/0xa0
>>>>>>>>>>> [ 12.211693] _set_opp+0x1b0/0x500
>>>>>>>>>>> [ 12.211695] dev_pm_opp_set_rate+0x120/0x290
>>>>>>>>>>> [ 12.211697] panfrost_devfreq_target+0x3c/0x50 [panfrost]
>>>>>>>>>>> [ 12.211705] devfreq_set_target+0x8c/0x2d0
>>>>>>>>>>> [ 12.211707] devfreq_update_target+0xcc/0xf4
>>>>>>>>>>> [ 12.211708] devfreq_monitor+0x40/0x1d0
>>>>>>>>>>> [ 12.211710] process_one_work+0x294/0x664
>>>>>>>>>>> [ 12.211712] worker_thread+0x7c/0x45c
>>>>>>>>>>> [ 12.211713] kthread+0x104/0x110
>>>>>>>>>>> [ 12.211716] ret_from_fork+0x10/0x20
>>>>>>>>>>> [ 12.211718] irq event stamp: 7102
>>>>>>>>>>> [ 12.211719] hardirqs last enabled at (7101): [<ffffdddf904ea5a0>] finish_task_switch.isra.0+0xec/0x2f0
>>>>>>>>>>> [ 12.211723] hardirqs last disabled at (7102): [<ffffdddf91794b74>] el1_dbg+0x24/0x90
>>>>>>>>>>> [ 12.211726] softirqs last enabled at (6716): [<ffffdddf90410be4>] __do_softirq+0x414/0x588
>>>>>>>>>>> [ 12.211728] softirqs last disabled at (6507): [<ffffdddf904171d8>] ____do_softirq+0x18/0x24
>>>>>>>>>>> [ 12.211730] ---[ end trace 0000000000000000 ]---
>>>>>>>>>>
>>>>>>>>>> ... Indeed, you shouldn't hit that warning at all. It happens in
>>>>>>>>>> clk_core_round_rate_nolock, which takes (before your patch) the
>>>>>>>>>> CLK_SET_RATE_PARENT branch. This indeed has been changed by the patch
>>>>>>>>>> you mentioned, and will call clk_core_forward_rate_req() now, that in
>>>>>>>>>> turn calls clk_core_init_rate_nolock().
>>>>>>>>>>
>>>>>>>>>> I think the warning you hit is because core->parent is NULL, which is
>>>>>>>>>> passed to clk_core_forward_rate_req() as the parent argument, and we'll
>>>>>>>>>> call clk_core_init_rate_req() with parent set as the core argument.
>>>>>>>>>>
>>>>>>>>>> In clk_core_init_rate_req(), the first thing we do is a WARN_ON(!core),
>>>>>>>>>> which is what you hit here I think.
>>>>>>>>>>
>>>>>>>>>> This is different to the previous behavior that was calling
>>>>>>>>>> clk_core_round_rate_nolock() with core->parent directly, and
>>>>>>>>>> clk_core_round_rate_nolock() if its core argument is NULL will set
>>>>>>>>>> req->rate to 0 and bail out without returning an error.
>>>>>>>>>>
>>>>>>>>>> Now, your patch probably works because now that you provide a
>>>>>>>>>> determine_rate implementation, clk_core_can_round() returns true and
>>>>>>>>>> you'll take a different branch in clk_core_round_rate_nolock(), avoiding
>>>>>>>>>> that issue entirely.
>>>>>>>>>>
>>>>>>>>>> Does that patch work better (on top of next-20221012)?
>>>>>>>>>
>>>>>>>>> I admit I didn't go too deep in the research, as my brain processed that as
>>>>>>>>> "this is a mux clock, not really different from a standard mux, this callback
>>>>>>>>> is missing, that's not optimal"... then that fixed it and called it a day.
>>>>>>>>>
>>>>>>>>> I should've prolonged my research for a better understanding of what was
>>>>>>>>> actually going on.
>>>>>>>>
>>>>>>>> No worries :)
>>>>>>>>
>>>>>>>>> What you said actually opened my mind and, with little surprise, your patch
>>>>>>>>> works as good as mine - no warnings and the clock scales as expected!
>>>>>>>>
>>>>>>>> I'm actually wondering if you didn't encounter two issues. What kernel
>>>>>>>> were you testing before? If it's older than today's next
>>>>>>>> (next-20221012), you're likely missing
>>>>>>>>
>>>>>>>
>>>>>>> I was testing next-20221011.
>>>>>>>
>>>>>>>> https://lore.kernel.org/linux-clk/[email protected]/
>>>>>>>>
>>>>>>>> Which is likely to be what fixed the clock scaling. And my patch only
>>>>>>>> fixed the warning. Could you test next-20221012? If I'm right, you
>>>>>>>> should only get the warning.
>>>>>>>>
>>>>>>>
>>>>>>> No, I am getting the same situation even after rebasing over next-20221012, without
>>>>>>> any of the two patches (determine_rate() for mtk mux, nor the one you shared for
>>>>>>> clk.c), when the warning happens, I get very slow GPU operation and the same "nice"
>>>>>>> timeout:
>>>>>>>
>>>>>>> [ 27.785514] panfrost 13000000.gpu: gpu sched timeout, js=1,
>>>>>>> config=0x7b00, status=0x0, head=0xa1cb180, tail=0xa1cb180,
>>>>>>> sched_job=00000000f07d39e3
>>>>>>>
>>>>>>> ...so I'm not encountering the same issue that you've fixed with the patches that
>>>>>>> got merged in next-20221012.
>>>>>>>
>>>>>>> Of course, as you were expecting, the warning is also still there and still
>>>>>>> the same:
>>>>>>>
>>>>>>> [ 27.750504] WARNING: CPU: 4 PID: 164 at drivers/clk/clk.c:1462
>>>>>>> clk_core_init_rate_req+0x84/0x90
>>>>>>
>>>>>> Ok. I'm still a bit unsure why it actually fixes anything.
>>>>>>
>>>>>> In the current next, clk_core_init_rate_req (through
>>>>>> clk_core_forward_rate_req) will bail out right away, but the patch will
>>>>>> clear the request and set the requested rate.
>>>>>>
>>>>>> I don't think the req->rate assignment changes anything because our next
>>>>>> call will be to clk_core_round_rate_nolock that will clear it in the
>>>>>> !core path, but it might just be that we don't initialize the request
>>>>>> and end up with some garbage data in it?
>>>>>>
>>>>>> Could you test, on top of next-20221012,
>>>>>
>>>>> No that's not the case, this patch has no effect at all (yes I've tested it).
>>>>
>>>> Ok. Too bad. I've tried to build a test case that was reproducing what
>>>> you experience, but I don't seem to be able to build one for now.
>>>>
>>>> If we go back to your previous explanation on the clock tree, and what
>>>> we discussed then, there's still something a bit puzzling.
>>>>
>>>> You were saying that are calling clk_set_rate on mfg_bg3d, which is a
>>>> mtk_gate, registered with mtk_clk_register_gate(), and with the
>>>> mtk_clk_gate_ops_setclr clk_ops. There's no determine_rate / round_rate
>>>> implementation which makes sense for a gate, and CLK_SET_RATE_PARENT is
>>>> set by mtk_clk_register_gate(). Everything's good.
>>>>
>>>> The clk_set_rate call will thus be forwarded to its parent,
>>>> mfg_ck_fast_ref, which is a mux, registered with
>>>> devm_clk_hw_register_mux(), with CLK_SET_RATE_PARENT as well. Makes
>>>> sense too, any clk_set_rate() call will thus evaluate all of the
>>>> mfg_ck_fast_ref parents and will either adjust the rate of a parent, or
>>>> reparent.
>>>>
>>>> mfg_ck_fast_ref then has two parents, top_mfg_core_tmp and mfgpll.
>>>> Judging from your patch, top_mfg_core_tmp is being used.
>>>> top_mfg_core_tmp is a mtk_mux, registered by mtk_clk_register_mux()
>>>> (through mtk_clk_register_muxes()), and using
>>>> mtk_mux_gate_clr_set_upd_ops. CLK_SET_RATE_PARENT is set by
>>>> mtk_clk_register_mux(). That flag still makes sense for a mux. However,
>>>> it's really not clear to me how a mux operates without a determine_rate
>>>> implementation to forward the rates to each of its parents.
>>>>
>>>> It looks like we were relying before on the fact that the request was
>>>> indeed forwarded as is to the current parent. It looks like the parent
>>>> was not registered at all (core->parent being NULL), and then it was
>>>> working because of $REASON, possibly one of the muxes preferred to
>>>> switch to some other clock source.
>>>>
>>>> I think we need to address this in multiple ways:
>>>>
>>>> - If you have ftrace enabled on that platform, running with "tp_printk
>>>> trace_event=clk:*" in the kernel command line on a failing kernel would
>>>> be great, so that we can figure out what is happening exactly.
>>>
>>> I spent more time trying to create a setup similar to yours but couldn't
>>> find anything obviously wrong. In addition to the above, could you start
>>> a faulty kernel with that patch:
>>>
>>> diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
>>> index 884a58a98b6b..774953901a35 100644
>>> --- a/drivers/clk/clk.c
>>> +++ b/drivers/clk/clk.c
>>> @@ -262,6 +262,9 @@ EXPORT_SYMBOL_GPL(__clk_get_name);
>>>
>>> const char *clk_hw_get_name(const struct clk_hw *hw)
>>> {
>>> + if (!hw || !hw->core)
>>> + return "(null)";
>>> +
>>> return hw->core->name;
>>> }
>>> EXPORT_SYMBOL_GPL(clk_hw_get_name);
>>> @@ -1471,6 +1474,17 @@ static bool clk_core_can_round(struct clk_core * const core)
>>> return core->ops->determine_rate || core->ops->round_rate;
>>> }
>>>
>>> +static void clk_core_request_dump(const struct clk_core *core,
>>> + const struct clk_rate_request *req)
>>> +{
>>> + pr_crit("%s\n", core ? core->name : "(null)");
>>> + pr_crit("\trate %lu\n", req->rate);
>>> + pr_crit("\tmin %lu, max %lu\n", req->min_rate, req->max_rate);
>>> + pr_crit("\tbest parent %s, rate %lu\n",
>>> + clk_hw_get_name(req->best_parent_hw),
>>> + req->best_parent_rate);
>>> +}
>>> +
>>> static int clk_core_round_rate_nolock(struct clk_core *core,
>>> struct clk_rate_request *req)
>>> {
>>> @@ -2254,8 +2268,12 @@ static unsigned long clk_core_req_round_rate_nolock(struct clk_core *core,
>>>
>>> clk_core_init_rate_req(core, &req, req_rate);
>>>
>>> + clk_core_request_dump(core, &req);
>>> +
>>> ret = clk_core_round_rate_nolock(core, &req);
>>>
>>> + clk_core_request_dump(core, &req);
>>> +
>>> /* restore the protection */
>>> clk_core_rate_restore_protect(core, cnt);
>>>
>>>
>>> Alternatively, can I easily get a MT8195 device to test more easily?
>>
>> The Acer Spin Chromebook 513, specifically CP513-2H, should be available
>> on Amazon in both the US and UK. No idea about France though. And it's
>> quite pricey.
>
> Thanks, but I was more expecting some kind of SBC, and it's a bit too
> expensive for me indeed :/
>

Yeah that's the one - and it's exactly the device that I'm using here.

Btw, I will avoid pasting the same trace over and over again and skip to
the output that you added with the patch.

[ 11.447922] panfrost 13000000.gpu: clock rate = 699999695
[ 11.453402] panfrost 13000000.gpu: Looking up mali-supply from device tree
[ 11.461842] mfg_bg3d
[ 11.471067] rate 700000000
[ 11.473857] min 0, max 18446744073709551615
[ 11.478123] best parent mfg_ck_fast_ref, rate 699999695

...

[ 11.826631] mfg_bg3d
[ 11.829249] rate 699999695
[ 11.832472] min 0, max 18446744073709551615
[ 11.837159] best parent mfg_ck_fast_ref, rate 699999695

...

[ 12.190365] mfg_bg3d
[ 12.198676] rate 390000000
[ 12.206494] min 0, max 18446744073709551615
[ 12.236155] best parent mfg_ck_fast_ref, rate 699999695

...

[ 12.592687] mfg_bg3d
[ 12.592693] rate 390000000
[ 12.592698] min 0, max 18446744073709551615
[ 12.592702] best parent mfg_ck_fast_ref, rate 390000000

Cheers

2022-10-14 20:01:21

by Stephen Boyd

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Quoting Maxime Ripard (2022-10-12 06:56:19)
>
> I think we need to address this in multiple ways:
>
> - If you have ftrace enabled on that platform, running with "tp_printk
> trace_event=clk:*" in the kernel command line on a failing kernel would
> be great, so that we can figure out what is happening exactly.
>
> - We really need to merge your patch above as well;
>
> - And, if we can, we should forbid to register a mux without a
> determine_rate implementation. We have to take into account that some
> muxes are going to be RO and won't need a determine_rate
> implementation at all, but a clock with a set_parent and without a
> determine_rate seems like a weird combination.
>

So should I apply this patch now on clk-next? Given this regression I'm
leaning towards not sending off the clk rate request this merge window
and letting it bake another cycle.

2022-10-14 21:24:32

by Stephen Boyd

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Quoting AngeloGioacchino Del Regno (2022-10-11 06:55:48)
> Since commit 262ca38f4b6e ("clk: Stop forwarding clk_rate_requests
> to the parent"), the clk_rate_request is .. as the title says, not
> forwarded anymore to the parent: this produces an issue with the
> MediaTek clock MUX driver during GPU DVFS on MT8195, but not on
> MT8192 or others.
>
> This is because, differently from others, like MT8192 where all of
> the clocks in the MFG parents tree are of mtk_mux type, but in the
> parent tree of MT8195's MFG clock, we have one mtk_mux clock and
> one (clk framework generic) mux clock, like so:
>
> names: mfg_bg3d -> mfg_ck_fast_ref -> top_mfg_core_tmp (or) mfgpll
> types: mtk_gate -> mux -> mtk_mux (or) mtk_pll
>
> To solve this issue and also keep the GPU DVFS clocks code working
> as expected, wire up a .determine_rate() callback for the mtk_mux
> ops; for that, the standard clk_mux_determine_rate_flags() was used
> as it was possible to.
>
> This commit was successfully tested on MT6795 Xperia M5, MT8173 Elm,
> MT8192 Spherion and MT8195 Tomato; no regressions were seen.
>
> For the sake of some more documentation about this issue here's the
> trace of it:
>

Applied to clk-next

2022-10-18 07:12:22

by Maxime Ripard

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Hi Stephen,

On Fri, Oct 14, 2022 at 12:36:50PM -0700, Stephen Boyd wrote:
> Quoting Maxime Ripard (2022-10-12 06:56:19)
> >
> > I think we need to address this in multiple ways:
> >
> > - If you have ftrace enabled on that platform, running with "tp_printk
> > trace_event=clk:*" in the kernel command line on a failing kernel would
> > be great, so that we can figure out what is happening exactly.
> >
> > - We really need to merge your patch above as well;
> >
> > - And, if we can, we should forbid to register a mux without a
> > determine_rate implementation. We have to take into account that some
> > muxes are going to be RO and won't need a determine_rate
> > implementation at all, but a clock with a set_parent and without a
> > determine_rate seems like a weird combination.
> >
>
> So should I apply this patch now on clk-next? Given this regression I'm
> leaning towards not sending off the clk rate request this merge window
> and letting it bake another cycle.

I saw that you sent the PR, thanks

We spent some time with Angelo yesterday debugging this, and it's still
not clear to me what happens.

He provided a significant amount of traces, and we first checked that
the round_rate part of set_rate was returning something meaningful, and
it does. The rate is fine, the parent is too, everything's great.

Next we checked the decisions by clk_calc_new_rates, and it does return
the proper top clock, and its proper rate.

Finally, we hooked into clk_change_rate() to see what kind of decision
it was enforcing, and it seems to be ok as well. It doesn't change
parent, and it sets the proper rate, in both cases.

There's still one thing we haven't checked: one of the clock in the tree
(the parent of the one we want to change the rate on, and it has
SET_RATE_PARENT) has a notifier. As we've had a bug recently over this
I've not ruled out that this could be a similar bug.

I don't really think it is though, since the notifier callback doesn't
use the data provided by the framework:
https://elixir.bootlin.com/linux/v6.1-rc1/source/drivers/clk/mediatek/clk-mux.c#L279

I've pushed a branch for Angelo to test, just to confirm.

So... yeah. I can't explain the regression at this point. Do you have an
idea?

The good news is, since you merged this patch the regression is
invisible now to that platform. We still could encounter it on another
platform, but maybe it will also have a more obvious setup to replicate?

Thanks!
Maxime

2022-10-28 00:33:04

by Stephen Boyd

[permalink] [raw]

Subject: Re: [PATCH v2] clk: mediatek: clk-mux: Add .determine_rate() callback

Quoting Maxime Ripard (2022-10-18 00:06:53)
>
> We spent some time with Angelo yesterday debugging this, and it's still
> not clear to me what happens.
>
> He provided a significant amount of traces, and we first checked that
> the round_rate part of set_rate was returning something meaningful, and
> it does. The rate is fine, the parent is too, everything's great.
>
> Next we checked the decisions by clk_calc_new_rates, and it does return
> the proper top clock, and its proper rate.
>
> Finally, we hooked into clk_change_rate() to see what kind of decision
> it was enforcing, and it seems to be ok as well. It doesn't change
> parent, and it sets the proper rate, in both cases.
>
> There's still one thing we haven't checked: one of the clock in the tree
> (the parent of the one we want to change the rate on, and it has
> SET_RATE_PARENT) has a notifier. As we've had a bug recently over this
> I've not ruled out that this could be a similar bug.
>
> I don't really think it is though, since the notifier callback doesn't
> use the data provided by the framework:
> https://elixir.bootlin.com/linux/v6.1-rc1/source/drivers/clk/mediatek/clk-mux.c#L279
>
> I've pushed a branch for Angelo to test, just to confirm.
>
> So... yeah. I can't explain the regression at this point. Do you have an
> idea?

I don't really know. If the removal of WARN_ON() helps then it sounds
like a console related problem where we hang the system trying to print
the warning to the console. Did you try replacing that case with a
trace_printk()?

>
> The good news is, since you merged this patch the regression is
> invisible now to that platform. We still could encounter it on another
> platform, but maybe it will also have a more obvious setup to replicate?
>

I guess that's good news :-/