Currently, the only way to boot a kernel with drivers built as modules on embedded
devices like HiKey 970 is to pass clk_ignore_unused on the kernel command line.
There are two separate issues:
1. the clk core calls clk_disable_unused() too early. By the time this
   function is called, only the built-in drivers have been probed/initialized.
   Drivers built as modules will only be probed afterwards.

   This causes a race condition and boot instability, as the clk core will try
   to disable clocks while the drivers built as modules are still being
   probed and initialized.

   I suspect that the same problem used to happen in the regulator core,
   as there is code that waits for 30 seconds before disabling unused
   regulators;
2. there are some gate clocks defined at HiKey 970 that should always be on,
as otherwise the system will hang, or the filesystem I/O will stop.
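
As a rough illustration of item 2, here is a toy user-space model (not kernel code) of what marking a gate clock as critical buys: the clock gets a permanent reference at registration, so the unused-clock sweep never closes it. This mirrors how the clk core treats CLK_IS_CRITICAL, but every name below (toy_gate, toy_register, toy_sweep, TOY_CLK_IS_CRITICAL) and the flag value are assumptions made up for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: the flag name and value are illustrative, not the kernel's. */
#define TOY_CLK_IS_CRITICAL (1u << 0)

struct toy_gate {
	const char *name;      /* hypothetical clock name */
	unsigned int flags;
	int enable_count;      /* references currently held */
	bool hw_on;            /* state of the hardware gate */
};

/* Registration: a critical gate gets a permanent reference up front. */
void toy_register(struct toy_gate *g)
{
	g->hw_on = true;       /* assume the firmware left every gate open */
	g->enable_count = (g->flags & TOY_CLK_IS_CRITICAL) ? 1 : 0;
}

/* Sweep: close every open gate that nobody holds a reference to. */
int toy_sweep(struct toy_gate *gates, size_t n)
{
	int closed = 0;

	for (size_t i = 0; i < n; i++) {
		if (gates[i].hw_on && gates[i].enable_count == 0) {
			gates[i].hw_on = false;
			closed++;
		}
	}
	return closed;
}
```

With one gate registered as critical and one registered without flags, a sweep closes only the second one; the critical gate stays open regardless of whether any driver ever claims it.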
PS:
I have already submitted 3 or 4 versions of patches for the HiKey 970 clock, but
they are all unreliable because of the race condition in the clk core described in (1).
Patch 1 solves the issue with the clk core.
Patch 2 solves the HiKey 970 specific issues.
Mauro Carvalho Chehab (2):
clk: wait for extra time before disabling unused clocks
clk: clk-hi3670: mark some clocks as CLK_IS_CRITICAL
drivers/clk/clk.c | 51 +++++++++++++++++++-----------
drivers/clk/hisilicon/clk-hi3670.c | 24 +++++++-------
2 files changed, 44 insertions(+), 31 deletions(-)
--
2.31.1
Hi Mauro,
On Thu, Oct 07, 2021 at 02:06:53PM +0200, Mauro Carvalho Chehab wrote:
> Currently, the only way to boot a kernel with drivers built as modules on embedded
> devices like HiKey 970 is to pass clk_ignore_unused on the kernel command line.
>
> There are two separate issues:
>
> 1. the clk core calls clk_disable_unused() too early. By the time this
> function is called, only the built-in drivers have been probed/initialized.
> Drivers built as modules will only be probed afterwards.
>
> This causes a race condition and boot instability, as the clk core will try
> to disable clocks while the drivers built as modules are still being
> probed and initialized.
So you are mentioning a "race" condition here but it is not mentioned in the
actual patch. If the issue you are seeing is because the clocks used by the
modules are disabled before they are probed, why can't they just enable the
clocks during the probe time?
Am I missing something?
Thanks,
Mani
>
> I suspect that the same problem used to happen at the regulator's core,
> as there's a code that waits for 30 seconds before disabling unused
> regulators;
>
> 2. there are some gate clocks defined at HiKey 970 that should always be on,
> as otherwise the system will hang, or the filesystem I/O will stop.
>
> Ps.:
> I submitted already 3 or 4 versions of patches for HiKey 970 clock, but
> they're all unreliable, due to the race conditions at the clk core due to (1).
>
> Patch 1 solves the issue with the clk core.
> Patch 2 solves the HiKey 970 specific issues.
>
> Mauro Carvalho Chehab (2):
> clk: wait for extra time before disabling unused clocks
> clk: clk-hi3670: mark some clocks as CLK_IS_CRITICAL
>
> drivers/clk/clk.c | 51 +++++++++++++++++++-----------
> drivers/clk/hisilicon/clk-hi3670.c | 24 +++++++-------
> 2 files changed, 44 insertions(+), 31 deletions(-)
>
> --
> 2.31.1
>
>
On Mon, 11 Oct 2021 11:47:18 +0530
Manivannan Sadhasivam <[email protected]> wrote:
> Hi Mauro,
>
> On Thu, Oct 07, 2021 at 02:06:53PM +0200, Mauro Carvalho Chehab wrote:
> > Currently, the only way to boot a kernel with drivers built as modules on embedded
> > devices like HiKey 970 is to pass clk_ignore_unused on the kernel command line.
> >
> > There are two separate issues:
> >
> > 1. the clk core calls clk_disable_unused() too early. By the time this
> > function is called, only the built-in drivers have been probed/initialized.
> > Drivers built as modules will only be probed afterwards.
> >
> > This causes a race condition and boot instability, as the clk core will try
> > to disable clocks while the drivers built as modules are still being
> > probed and initialized.
>
> So you are mentioning a "race" condition here but it is not mentioned in the
> actual patch.
Patch 1 explains it...
> If the issue you are seeing is because the clocks used by the
> modules are disabled before they are probed, why can't they just enable the
> clocks during the probe time?
>
> Am I missing something?
What happens is that such clocks are already enabled when the system boots
and, when they are disabled, very bad things happen, because that cuts
off clocks used by several parts of the system.
Most of the problems happen because the ARM SoC produces SError NMI
interrupts when some of these clocks are disabled, which leads to panic().
Disabling other clocks stops key components of the system that are not
directly tied to a driver but instead control some core part of the
device, making the SoC wait forever for an I/O event that will never
happen.
Disabling a small set of clocks makes the system unreliable, causing
drivers to fail to probe. That can either lead to panic() or break
support for a peripheral such as WiFi, USB and/or PCI.
The core issue is that clk_disable_unused() happens too early.
It is called at late_initcall_sync() time, which runs before the
probe/init code of drivers compiled as modules is called. So, what
happens is:
    BIOS enables clocks that are needed for the device to boot
        |
        +-> Linux starts booting
             |
             +-> builtin drivers are probed
             |
             +--------------------------------\
             |                                 |
             +-> late_initcall_sync() calls    +-> Modules start probing
             |   clk_disable_unused()          |
             |                                 +-> Some drivers are probed
             |                                 |   before their needed clks
             |                                 |   got disabled
             |                                 |
             +-> Clocks are disabled           |
             |                                 |
             +-> SError -> panic()             |
                                               \ (several drivers weren't
                                                  probed/initialized)
The only fix for that is to postpone clk_disable_unused() until
after all driver probe/init calls have run, or to disable it
completely.
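
To make the ordering problem concrete, here is a toy user-space model (not kernel code) that runs the boot sequence twice: once with the unused-clock sweep before the modular drivers probe, once after. All names (toy_clk, toy_probe_all, toy_disable_unused, toy_boot) and the clock list are made up for the sketch. The model only counts clocks gated while a consumer was still waiting to probe; on real hardware that window is where the SError or the hang would occur.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model: none of these names are kernel APIs. */
enum toy_consumer { TOY_NONE, TOY_BUILTIN, TOY_MODULE };

struct toy_clk {
	const char *name;           /* hypothetical clock name */
	enum toy_consumer consumer; /* who, if anyone, will claim the clock */
	int enable_count;           /* references held by probed drivers */
	bool hw_on;                 /* gate state; firmware leaves it on */
};

/* Probe every driver of the given kind; each one enables its clock. */
void toy_probe_all(struct toy_clk *clks, size_t n, enum toy_consumer kind)
{
	for (size_t i = 0; i < n; i++) {
		if (clks[i].consumer == kind) {
			clks[i].enable_count++;
			clks[i].hw_on = true;
		}
	}
}

/*
 * Gate every clock without references. Returns how many of the gated
 * clocks still had a consumer waiting to probe (the harmful case).
 */
int toy_disable_unused(struct toy_clk *clks, size_t n)
{
	int premature = 0;

	for (size_t i = 0; i < n; i++) {
		if (clks[i].enable_count > 0 || !clks[i].hw_on)
			continue;
		clks[i].hw_on = false;
		if (clks[i].consumer != TOY_NONE)
			premature++;
	}
	return premature;
}

/* Boot once, with the sweep either before or after the modules probe. */
int toy_boot(bool sweep_after_modules)
{
	struct toy_clk clks[] = {
		{ "uart",  TOY_BUILTIN, 0, true },
		{ "usb",   TOY_MODULE,  0, true },
		{ "pcie",  TOY_MODULE,  0, true },
		{ "spare", TOY_NONE,    0, true },
	};
	size_t n = sizeof(clks) / sizeof(clks[0]);
	int premature = 0;

	toy_probe_all(clks, n, TOY_BUILTIN);
	if (!sweep_after_modules)
		premature = toy_disable_unused(clks, n);
	toy_probe_all(clks, n, TOY_MODULE);
	if (sweep_after_modules)
		premature = toy_disable_unused(clks, n);
	return premature;
}
```

In this model toy_boot(false) reports two clocks gated under drivers that had not probed yet, while toy_boot(true) reports none: only the clock with no consumer at all is gated.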
The current distributions recommended at:
https://www.96boards.org/product/hikey970/
pass clk_ignore_unused as a boot parameter, which disables the call
to clk_disable_unused().
The only sane way to get rid of that is to fix the core so that
drivers can finish probe/init before clocks are disabled.
Note that the regulator logic that disables unused power lines
does the same: it waits for 30 seconds after late_initcall_sync()
before calling the Runtime PM suspend logic.
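
A fixed grace period is a heuristic: it helps exactly those drivers that probe before it expires. The toy helper below (user-space C, not kernel code; the function name and the probe times in the note are hypothetical) counts how many modular drivers would still probe after a deferred sweep has already run.

```c
#include <stddef.h>

/*
 * Toy model only. probe_ms[i] is the time, in milliseconds after
 * late_initcall_sync(), at which modular driver i probes. Returns how
 * many drivers probe after a sweep scheduled at sweep_delay_ms.
 */
int toy_late_probes(const unsigned int *probe_ms, size_t n,
		    unsigned int sweep_delay_ms)
{
	int late = 0;

	for (size_t i = 0; i < n; i++) {
		if (probe_ms[i] > sweep_delay_ms)
			late++;
	}
	return late;
}
```

For made-up probe times of 150, 900 and 4200 ms, an immediate sweep (delay 0) leaves all three drivers probing too late, a 1000 ms delay leaves one, and a 30000 ms delay leaves none.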
Regards,
Mauro
Stephen/Michael,
Gentle ping.
Regards,
Mauro
On Thu, 7 Oct 2021 14:06:53 +0200
Mauro Carvalho Chehab <[email protected]> wrote:
> Currently, the only way to boot a kernel with drivers built as modules on embedded
> devices like HiKey 970 is to pass clk_ignore_unused on the kernel command line.
>
> There are two separate issues:
>
> 1. the clk core calls clk_disable_unused() too early. By the time this
> function is called, only the built-in drivers have been probed/initialized.
> Drivers built as modules will only be probed afterwards.
>
> This causes a race condition and boot instability, as the clk core will try
> to disable clocks while the drivers built as modules are still being
> probed and initialized.
>
> I suspect that the same problem used to happen at the regulator's core,
> as there's a code that waits for 30 seconds before disabling unused
> regulators;
>
> 2. there are some gate clocks defined at HiKey 970 that should always be on,
> as otherwise the system will hang, or the filesystem I/O will stop.
>
> Ps.:
> I submitted already 3 or 4 versions of patches for HiKey 970 clock, but
> they're all unreliable, due to the race conditions at the clk core due to (1).
>
> Patch 1 solves the issue with the clk core.
> Patch 2 solves the HiKey 970 specific issues.
>
> Mauro Carvalho Chehab (2):
> clk: wait for extra time before disabling unused clocks
> clk: clk-hi3670: mark some clocks as CLK_IS_CRITICAL
>
> drivers/clk/clk.c | 51 +++++++++++++++++++-----------
> drivers/clk/hisilicon/clk-hi3670.c | 24 +++++++-------
> 2 files changed, 44 insertions(+), 31 deletions(-)
>