2020-09-07 16:28:24

by Markus Boehme

[permalink] [raw]
Subject: [PATCH 3/3] ipmi: Add timeout waiting for channel information

We have observed hosts with misbehaving BMCs that receive a Get Channel
Info command but don't respond. This leads to an indefinite wait in the
ipmi_msghandler's __scan_channels function, showing up as hung task
messages for modprobe.

Add a timeout waiting for the channel scan to complete. If the scan
fails to complete within that time, treat that like IPMI 1.0 and only
assume the presence of the primary IPMB channel at channel number 0.

Signed-off-by: Stefan Nuernberger <[email protected]>
Signed-off-by: Markus Boehme <[email protected]>
---
drivers/char/ipmi/ipmi_msghandler.c | 72 ++++++++++++++++++++-----------------
1 file changed, 39 insertions(+), 33 deletions(-)

diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
index 2a2e8b2..9de9ba6 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -3315,46 +3315,52 @@ channel_handler(struct ipmi_smi *intf, struct ipmi_recv_msg *msg)
*/
static int __scan_channels(struct ipmi_smi *intf, struct ipmi_device_id *id)
{
- int rv;
+ long rv;
+ unsigned int set;

- if (ipmi_version_major(id) > 1
- || (ipmi_version_major(id) == 1
- && ipmi_version_minor(id) >= 5)) {
- unsigned int set;
+ if (ipmi_version_major(id) == 1 && ipmi_version_minor(id) < 5) {
+ set = intf->curr_working_cset;
+ goto single_ipmb_channel;
+ }

- /*
- * Start scanning the channels to see what is
- * available.
- */
- set = !intf->curr_working_cset;
- intf->curr_working_cset = set;
- memset(&intf->wchannels[set], 0,
- sizeof(struct ipmi_channel_set));
-
- intf->null_user_handler = channel_handler;
- intf->curr_channel = 0;
- rv = send_channel_info_cmd(intf, 0);
- if (rv) {
- dev_warn(intf->si_dev,
- "Error sending channel information for channel 0, %d\n",
- rv);
- intf->null_user_handler = NULL;
- return -EIO;
- }
+ /*
+ * Start scanning the channels to see what is
+ * available.
+ */
+ set = !intf->curr_working_cset;
+ intf->curr_working_cset = set;
+ memset(&intf->wchannels[set], 0, sizeof(struct ipmi_channel_set));

- /* Wait for the channel info to be read. */
- wait_event(intf->waitq, intf->channels_ready);
+ intf->null_user_handler = channel_handler;
+ intf->curr_channel = 0;
+ rv = send_channel_info_cmd(intf, 0);
+ if (rv) {
+ dev_warn(intf->si_dev,
+ "Error sending channel information for channel 0, %ld\n",
+ rv);
intf->null_user_handler = NULL;
- } else {
- unsigned int set = intf->curr_working_cset;
+ return -EIO;
+ }

- /* Assume a single IPMB channel at zero. */
- intf->wchannels[set].c[0].medium = IPMI_CHANNEL_MEDIUM_IPMB;
- intf->wchannels[set].c[0].protocol = IPMI_CHANNEL_PROTOCOL_IPMB;
- intf->channel_list = intf->wchannels + set;
- intf->channels_ready = true;
+ /* Wait for the channel info to be read. */
+ rv = wait_event_timeout(intf->waitq, intf->channels_ready, 5 * HZ);
+ if (rv == 0) {
+ dev_warn(intf->si_dev,
+ "Timed out waiting for channel information. Assuming a single IPMB channel at 0.\n");
+ goto single_ipmb_channel;
}

+ goto out;
+
+single_ipmb_channel:
+ /* Assume a single IPMB channel at zero. */
+ intf->wchannels[set].c[0].medium = IPMI_CHANNEL_MEDIUM_IPMB;
+ intf->wchannels[set].c[0].protocol = IPMI_CHANNEL_PROTOCOL_IPMB;
+ intf->channel_list = intf->wchannels + set;
+ intf->channels_ready = true;
+
+out:
+ intf->null_user_handler = NULL;
return 0;
}

--
2.7.4


2020-09-08 00:35:30

by Corey Minyard

[permalink] [raw]
Subject: Re: [PATCH 3/3] ipmi: Add timeout waiting for channel information

On Mon, Sep 07, 2020 at 06:25:37PM +0200, Markus Boehme wrote:
> We have observed hosts with misbehaving BMCs that receive a Get Channel
> Info command but don't respond. This leads to an indefinite wait in the
> ipmi_msghandler's __scan_channels function, showing up as hung task
> messages for modprobe.
>
> Add a timeout waiting for the channel scan to complete. If the scan
> fails to complete within that time, treat that like IPMI 1.0 and only
> assume the presence of the primary IPMB channel at channel number 0.

This patch is a significant rewrite of the function. This makes me a
little uncomfortable. It's generally better to fix the bug in a minimal
patch. It would be easier to read if you had two patches, one to
restructure the code and one to fix the bug.

One comment inline, but it doesn't matter, because...

While thinking about this, I realized an issue with these patches.
There should be timers in the lower layers that detect that the BMC does
not respond and should return an error response. This is supposed to be
guaranteed by the lower layer, you shouldn't need timers here. In fact,
if you abort with a timer here, you should get a lower layer reponds
later, causing other issues.

So, this is wrong. If you are never getting a response, there is a bug
in the lower layer. If you are not getting the error response as
quickly as you would like, I'm not sure what to do about that.

The first patch, of course, is an obvious bug fix. I'll include that.

-corey

>
> Signed-off-by: Stefan Nuernberger <[email protected]>
> Signed-off-by: Markus Boehme <[email protected]>
> ---
> drivers/char/ipmi/ipmi_msghandler.c | 72 ++++++++++++++++++++-----------------
> 1 file changed, 39 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
> index 2a2e8b2..9de9ba6 100644
> --- a/drivers/char/ipmi/ipmi_msghandler.c
> +++ b/drivers/char/ipmi/ipmi_msghandler.c
> @@ -3315,46 +3315,52 @@ channel_handler(struct ipmi_smi *intf, struct ipmi_recv_msg *msg)
> */
> static int __scan_channels(struct ipmi_smi *intf, struct ipmi_device_id *id)
> {
> - int rv;
> + long rv;
> + unsigned int set;
>
> - if (ipmi_version_major(id) > 1
> - || (ipmi_version_major(id) == 1
> - && ipmi_version_minor(id) >= 5)) {
> - unsigned int set;
> + if (ipmi_version_major(id) == 1 && ipmi_version_minor(id) < 5) {

This is incorrect, it will not correctly handle IPMI 0.x BMCs. Yes,
they exist.

> + set = intf->curr_working_cset;
> + goto single_ipmb_channel;
> + }
>
> - /*
> - * Start scanning the channels to see what is
> - * available.
> - */
> - set = !intf->curr_working_cset;
> - intf->curr_working_cset = set;
> - memset(&intf->wchannels[set], 0,
> - sizeof(struct ipmi_channel_set));
> -
> - intf->null_user_handler = channel_handler;
> - intf->curr_channel = 0;
> - rv = send_channel_info_cmd(intf, 0);
> - if (rv) {
> - dev_warn(intf->si_dev,
> - "Error sending channel information for channel 0, %d\n",
> - rv);
> - intf->null_user_handler = NULL;
> - return -EIO;
> - }
> + /*
> + * Start scanning the channels to see what is
> + * available.
> + */
> + set = !intf->curr_working_cset;
> + intf->curr_working_cset = set;
> + memset(&intf->wchannels[set], 0, sizeof(struct ipmi_channel_set));
>
> - /* Wait for the channel info to be read. */
> - wait_event(intf->waitq, intf->channels_ready);
> + intf->null_user_handler = channel_handler;
> + intf->curr_channel = 0;
> + rv = send_channel_info_cmd(intf, 0);
> + if (rv) {
> + dev_warn(intf->si_dev,
> + "Error sending channel information for channel 0, %ld\n",
> + rv);
> intf->null_user_handler = NULL;
> - } else {
> - unsigned int set = intf->curr_working_cset;
> + return -EIO;
> + }
>
> - /* Assume a single IPMB channel at zero. */
> - intf->wchannels[set].c[0].medium = IPMI_CHANNEL_MEDIUM_IPMB;
> - intf->wchannels[set].c[0].protocol = IPMI_CHANNEL_PROTOCOL_IPMB;
> - intf->channel_list = intf->wchannels + set;
> - intf->channels_ready = true;
> + /* Wait for the channel info to be read. */
> + rv = wait_event_timeout(intf->waitq, intf->channels_ready, 5 * HZ);
> + if (rv == 0) {
> + dev_warn(intf->si_dev,
> + "Timed out waiting for channel information. Assuming a single IPMB channel at 0.\n");
> + goto single_ipmb_channel;
> }
>
> + goto out;
> +
> +single_ipmb_channel:
> + /* Assume a single IPMB channel at zero. */
> + intf->wchannels[set].c[0].medium = IPMI_CHANNEL_MEDIUM_IPMB;
> + intf->wchannels[set].c[0].protocol = IPMI_CHANNEL_PROTOCOL_IPMB;
> + intf->channel_list = intf->wchannels + set;
> + intf->channels_ready = true;
> +
> +out:
> + intf->null_user_handler = NULL;
> return 0;
> }
>
> --
> 2.7.4
>

2020-09-10 11:16:39

by Boehme, Markus

[permalink] [raw]
Subject: Re: [PATCH 3/3] ipmi: Add timeout waiting for channel information

Hey Corey, thanks for taking a look!

On Mon, 2020-09-07 at 19:34 -0500, Corey Minyard wrote:
> On Mon, Sep 07, 2020 at 06:25:37PM +0200, Markus Boehme wrote:
> >
> > We have observed hosts with misbehaving BMCs that receive a Get Channel
> > Info command but don't respond. This leads to an indefinite wait in the
> > ipmi_msghandler's __scan_channels function, showing up as hung task
> > messages for modprobe.
> >
> > Add a timeout waiting for the channel scan to complete. If the scan
> > fails to complete within that time, treat that like IPMI 1.0 and only
> > assume the presence of the primary IPMB channel at channel number 0.
> [...]
> While thinking about this, I realized an issue with these patches.
> There should be timers in the lower layers that detect that the BMC does
> not respond and should return an error response.  This is supposed to be
> guaranteed by the lower layer, you shouldn't need timers here.  In fact,
> if you abort with a timer here, you should get a lower layer reponds
> later, causing other issues.
>
> So, this is wrong.  If you are never getting a response, there is a bug
> in the lower layer.  If you are not getting the error response as
> quickly as you would like, I'm not sure what to do about that.
>

I see. I might not be able to get hold of more hosts behaving this way,
but if I do, I'll dig deeper into why the lower layer timeouts didn't
save us here. Thanks for the pointer.



> > Signed-off-by: Stefan Nuernberger <[email protected]>
> > Signed-off-by: Markus Boehme <[email protected]>
> > ---
> >  drivers/char/ipmi/ipmi_msghandler.c | 72 ++++++++++++++++++++-----------------
> >  1 file changed, 39 insertions(+), 33 deletions(-)
> >
> > diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
> > index 2a2e8b2..9de9ba6 100644
> > --- a/drivers/char/ipmi/ipmi_msghandler.c
> > +++ b/drivers/char/ipmi/ipmi_msghandler.c
> > @@ -3315,46 +3315,52 @@ channel_handler(struct ipmi_smi *intf, struct ipmi_recv_msg *msg)
> >   */
> >  static int __scan_channels(struct ipmi_smi *intf, struct ipmi_device_id *id)
> >  {
> > -     int rv;
> > +     long rv;
> > +     unsigned int set;
> >
> > -     if (ipmi_version_major(id) > 1
> > -                     || (ipmi_version_major(id) == 1
> > -                         && ipmi_version_minor(id) >= 5)) {
> > -             unsigned int set;
> > +     if (ipmi_version_major(id) == 1 && ipmi_version_minor(id) < 5) {
> This is incorrect, it will not correctly handle IPMI 0.x BMCs.  Yes,
> they exist.

Interesting! I wasn't aware of those. Searching the web doesn't turn up
much and the spec doesn't mention them either. Are these pre-release
implementations of the IPMI 1.0 spec or some kind of "IPMI light"?

Markus



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879


2020-10-07 18:47:16

by Corey Minyard

[permalink] [raw]
Subject: Re: [Openipmi-developer] [PATCH 3/3] ipmi: Add timeout waiting for channel information

On Thu, Sep 10, 2020 at 11:08:40AM +0000, Boehme, Markus via Openipmi-developer wrote:
> > > -                         && ipmi_version_minor(id) >= 5)) {
> > > -             unsigned int set;
> > > +     if (ipmi_version_major(id) == 1 && ipmi_version_minor(id) < 5) {
> > This is incorrect, it will not correctly handle IPMI 0.x BMCs.  Yes,
> > they exist.
>
> Interesting! I wasn't aware of those. Searching the web doesn't turn up
> much and the spec doesn't mention them either. Are these pre-release
> implementations of the IPMI 1.0 spec or some kind of "IPMI light"?

There was an 0.9 version of the spec that some machines implemented.
It's not really a "light" version, it's just a really early version. I
don't know how many machine out there still implement it, but I try to
keep them working if I can.

Thanks,

-corey

>
> Markus