MIME-Version: 1.0
References: <20230227130542.GM32097@pengutronix.de> <Y/yt8nzVuO4sUg3v@sirena.org.uk>
In-Reply-To: <Y/yt8nzVuO4sUg3v@sirena.org.uk>
From:   Matti Vaittinen <mazziesaccount@gmail.com>
Date:   Mon, 27 Feb 2023 18:19:19 +0200
Message-ID: <CANhJrGMRoR5BfoTswmF8RtLbBd1RxQ77o_W4bNyAgEqZSzdU7A@mail.gmail.com>
Subject: Re: About regulator error events
To:     Mark Brown <broonie@kernel.org>
Cc:     Sascha Hauer <sha@pengutronix.de>, linux-kernel@vger.kernel.org,
        Liam Girdwood <lgirdwood@gmail.com>, kernel@pengutronix.de,
        linux-arm-kernel@lists.infradead.org
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

ma 27. helmik. 2023 klo 15.19 Mark Brown (broonie@kernel.org) kirjoitti:
>
> On Mon, Feb 27, 2023 at 02:05:42PM +0100, Sascha Hauer wrote:
> > I have a board here which has some current limited power switches on it
> > and I wonder if I can do something reasonable with the error interrupt
> > pins these switches have.
>
> Just noticed that Matti (who's been doing a bunch of work here) wasn't
> CCed so adding him.

Thanks. I just hoped I had more to say...


> > The devices do not have a communication channel, instead they only have
> > an enable pin and an error interrupt pin. See
> > https://www.diodes.com/assets/Datasheets/AP22652_53_52A_53A.pdf for a
> > datasheet.  The devices come in two variants, one goes into current
> > limiting mode in case of overcurrent and the other variant switches off
> > until it gets re-enabled again.
> >
> > At first sight it seemed logical to me to wire up the error interrupt
> > pins to REGULATOR_EVENT_OVER_CURRENT events. That was easy to do, but
> > now the question is: What can a regulator consumer do with these events?

This is a question I have asked too.

I was asked to create a driver for a ROHM BD9576 PMIC - which has the
usual configurable over-current, over-voltage, under-voltage and
over-temperature protection limits. (Well, the temperature limit is
fixed). This means that when limits are exceeded - the PMIC shuts down
the power outputs by hardware.

What was new is that the BD9576 also had configurable warning-level
limits (stricter than the protection limits) - and when these were
exceeded only a 'warning IRQ' was issued. This setup was requested
from ROHM by a customer - and the information I received was the
customer had a use-case where they wanted to do 'mitigation actions'
before things get more severely off. Unfortunately, I never received
the information about these 'mitigation actions' when I tried to ask
what those could be. I am under impression that either out HW
colleagues did not know the customer use-case in details, or that this
information was 'top secret' (which seems to be the case pretty often
:( )

> > The strategy I had in mind was to disable the regulator, enable it again
> > to see if the errors persists and if it does, permanently disable the
> > device.  Disabling the regulator only works though when there's only one
> > consumer.

If it is obvious that disabling the regulator is required for
preventing the damage, then it might be justified to use the
regulator_force_disable()? Now, the question when this is obvious is
hard. I think it is the board designer who should be evaluating this -
and as such, I would say that the information about the severity of
error should come from hardware properties - eg. from device-tree.
Now, I am not really sure but I have a feeling that ideally the
regulator driver (provider, not the consumer) should have this
information about the severity level in device-tree and select the use
of notifier flag based on this. If an ERROR level event is emitted, it
should mean the hardware has really failed and forced disable could be
justified. If a WARNING level event is sent, then the handling should
be more graceful - probably done by some very system specific driver.

My problem here is that I don't have the visibility or understanding
regarding current use of those events. Not sure if all the hell would
break loose if the regulators are forcibly shut down. By the very
least I would expect such a consumer handling to be disabled by
default - either via configs or via some runtime enable/disable
mechanism.

 With multiple consumers only the enable count decreases, but
> > the regulator itself stays enabled. This means implementing such a
> > policy at the consumer side is not generally possible. Implementing a
> > policy in the regulator core seems awkward as well, as a good strategy
> > likely differs between different consumers.
> >
> > A first good step might be to notify the user somehow. While we can get
> > the overcurrent status of a regulator from
> > /sys/class/regulator/*/over_current there doesn't seem to be any way to
> > get a regulator event in userspace, right?  Would patches changing that
> > be welcomed?
> >
> > There doesn't seem to be much prior art for handling regulator error
> > events in the kernel. It would be great to get some input what others do
> > in this situation, or to get some ideas what they would do if they had
> > the time to do so ;)

I did submit a talk proposal about these regulator notifications to
ELCE next summer. I don't really have much to say - but I do hope that
I could gain some insight exactly to the question: "[where / how]
these notifications [are / could be] used?" With a lot of luck the
talk proposal gets accepted - and with a bit of luck people in the
audience know more than I do ;)

I think it was qcom-labibb-regulator.c which does try to disable the
failing regulators and then call BUG_ON() if even that fails. Back
when I did implement the regulator irq_helper, I did decide to try
supporting similar logic for drivers which explicitly request failures
to handle these protection IRQs to be treated as fatal errors - except
that instead of calling BUG_ON() (which to my understanding does not
necessarily lead to reset on all configs) - I try to run the poweroff.
This, however, is probably not a good generic solution.

I am all for adding notification sending to user-space. Still, I am
not against a kernel-level protection either - that can be used in
cases where user-space or user-space handling is not available / to be
trusted for a reason or other. What should be known then in-kernel is
whether the handling should be left to user-space (or other consumers)
- or if the kernel driver should take some protective action(s).

Yep, I know this was not much - sorry :( As I said, I do lack the
insight to real-world use cases...

Yours,
    -- Matti

--

Matti Vaittinen
Linux kernel developer at ROHM Semiconductors
Oulu Finland

~~ When things go utterly wrong vim users can always type :help! ~~

Discuss - Estimate - Plan - Report and finally accomplish this:
void do_work(int time) __attribute__ ((const));