2018-08-21 22:52:33

by Andrew Banman

[permalink] [raw]
Subject: [RFC] IPMI state machine regression

Dear IPMI supporters,

We observe a window in IPMI BT's opportunistic get capabilities request,
wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
This leaves IPMI hosed until you reload modules. Looks to happen after we call
schedule().

The problem is intermittent, but we observe ~50% fail rate on 4.18
kernels on our UV4 systems.

Here's the relevant snippet (the action starts at 120.367379):

[ 118.316764] IPMI System Interface driver.
[ 118.354246] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
[ 118.354249] ipmi_si: SMBIOS: mem 0xce4 regsize 1 spacing 1 irq 6
[ 118.354251] ipmi_si: Adding SMBIOS-specified bt state machine
[ 118.354311] ipmi_si IPI0001:00: ipmi_platform: probing via ACPI
[ 118.381164] ipmi_si IPI0001:00: [io 0x0ce4-0x0ce6] regsize 1 spacing 1 irq 6
[ 118.381167] ipmi_si: Adding ACPI-specified bt state machine
[ 118.382811] ipmi_si: Trying SMBIOS-specified bt state machine at mem address 0xce4, slave address 0x20, irq 6
[ 118.382815] ipmi_si dmi-ipmi-si.0: Could not set up I/O space
[ 118.382816] ipmi_si: Trying ACPI-specified bt state machine at i/o address 0xce4, slave address 0x0, irq 6
[ 118.382911] IPMI BT: flag reset [ ]
[ 118.383008] BT: XACTION [ ] TO=5000000 - 0
[ 118.383059] BT: WR_BYTES [ ] TO=5000000 - 0
[ 118.383272] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[ 118.388034] BT: WR_CONSUME [ ] TO=5000000 - 4000
[ 118.388059] BT: RD_WAIT [ ] TO=4996000 - 0
[ 118.396017] BT: RD_WAIT [ ] TO=4996000 - 4000
[ 118.492015] BT: RD_WAIT [ ] TO=4992000 - 4000
[ 118.520073] BT: RD_WAIT [ ] TO=4988000 - 4000
[ 118.528079] BT: RD_WAIT [ ] TO=4984000 - 4000
[ 118.540049] BT: RD_WAIT [ ] TO=4980000 - 4000
[ 118.548011] BT: RD_WAIT [ ] TO=4976000 - 4000
[ 118.564033] BT: RD_WAIT [ ] TO=4972000 - 4000
[ 118.572109] BT: RD_WAIT [ ] TO=4968000 - 4000
[ 118.580013] BT: RD_WAIT [ ] TO=4964000 - 4000
[ 118.588017] BT: RD_WAIT [ ] TO=4960000 - 4000
[ 118.608014] BT: RD_WAIT [ ] TO=4956000 - 4000
[ 118.628058] BT: RD_WAIT [ ] TO=4952000 - 4000
[ 118.636053] BT: RD_WAIT [ ] TO=4948000 - 4000
[ 118.648016] BT: RD_WAIT [ ] TO=4944000 - 4000
[ 118.672028] BT: RD_WAIT [ ] TO=4940000 - 4000
[ 118.684056] BT: RD_WAIT [ ] TO=4936000 - 4000
[ 118.696021] BT: RD_WAIT [ ] TO=4932000 - 4000
[ 118.708015] BT: RD_WAIT [ ] TO=4928000 - 4000
[ 118.720018] BT: RD_WAIT [ ] TO=4924000 - 4000
[ 118.732063] BT: RD_WAIT [ ] TO=4920000 - 4000
[ 118.744021] BT: RD_WAIT [ ] TO=4916000 - 4000
[ 118.752089] BT: RD_WAIT [ ] TO=4912000 - 4000
[ 118.764011] BT: RD_WAIT [ ] TO=4908000 - 4000
[ 118.784009] BT: RD_WAIT [ ] TO=4904000 - 4000
[ 118.808011] BT: RD_WAIT [ ] TO=4900000 - 4000
[ 118.832010] BT: RD_WAIT [ ] TO=4896000 - 4000
[ 118.844011] BT: RD_WAIT [ ] TO=4892000 - 4000
[ 118.860036] BT: RD_WAIT [ ] TO=4888000 - 4000
[ 118.868102] BT: RD_WAIT [ ] TO=4884000 - 4000
[ 118.876012] BT: RD_WAIT [ ] TO=4880000 - 4000
[ 118.904011] BT: RD_WAIT [ ] TO=4876000 - 4000
[ 118.924013] BT: RD_WAIT [ ] TO=4872000 - 4000
[ 118.948016] BT: RD_WAIT [ ] TO=4868000 - 4000
[ 118.984043] BT: RD_WAIT [ ] TO=4864000 - 4000
[ 118.992059] BT: RD_WAIT [ ] TO=4860000 - 4000
[ 118.995422] for 2000000
[ 118.995650]
[ 119.000026] BT: RD_WAIT [ ] TO=4856000 - 4000
[ 119.012018] BT: RD_WAIT [ ] TO=4852000 - 4000
[ 119.032010] BT: RD_WAIT [ ] TO=4848000 - 4000
[ 119.048013] BT: RD_WAIT [ ] TO=4844000 - 4000
[ 119.060013] BT: RD_WAIT [ ] TO=4840000 - 4000
[ 119.080014] BT: RD_WAIT [ ] TO=4836000 - 4000
[ 119.096021] BT: RD_WAIT [ ] TO=4832000 - 4000
[ 119.120009] BT: RD_WAIT [ ] TO=4828000 - 4000
[ 119.128017] BT: RD_WAIT [ ] TO=4824000 - 4000
[ 119.136011] BT: RD_WAIT [ ] TO=4820000 - 4000
[ 119.144013] BT: RD_WAIT [ ] TO=4816000 - 4000
[ 119.164008] BT: RD_WAIT [ ] TO=4812000 - 4000
[ 119.184010] BT: RD_WAIT [ ] TO=4808000 - 4000
[ 119.204015] BT: RD_WAIT [ ] TO=4804000 - 4000
[ 119.264034] BT: RD_WAIT [ ] TO=4800000 - 4000
[ 119.272013] BT: RD_WAIT [ ] TO=4796000 - 4000
[ 119.288013] BT: RD_WAIT [ ] TO=4792000 - 4000
[ 119.308011] BT: RD_WAIT [ ] TO=4788000 - 4000
[ 119.324011] BT: RD_WAIT [ ] TO=4784000 - 4000
[ 119.344010] BT: RD_WAIT [ ] TO=4780000 - 4000
[ 119.364027] BT: RD_WAIT [ ] TO=4776000 - 4000
[ 119.376011] BT: RD_WAIT [ ] TO=4772000 - 4000
[ 119.396010] BT: RD_WAIT [ ] TO=4768000 - 4000
[ 119.408013] BT: RD_WAIT [ ] TO=4764000 - 4000
[ 119.424012] BT: RD_WAIT [ B2H ] TO=4760000 - 4000
[ 119.434845] BT: CLEAR_B2H [ H_BUSY ] TO=4756000 - 0
[ 119.592204] BT: RD_BYTES [ H_BUSY ] TO=4756000 - 0
[ 119.650481] BT: XACTION [ ] TO=5000000 - 0
[ 119.676062] BT: WR_BYTES [ ] TO=5000000 - 0
[ 119.734596] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[ 119.740019] BT: WR_CONSUME [ ] TO=5000000 - 4000
[ 119.740036] BT: RD_WAIT [ ] TO=4996000 - 0
[ 119.748032] BT: RD_WAIT [ B2H ] TO=4996000 - 4000
[ 119.748101] BT: CLEAR_B2H [ H_BUSY ] TO=4992000 - 0
[ 119.748124] BT: RD_BYTES [ H_BUSY ] TO=4992000 - 0
[ 119.748348] BT: XACTION [ ] TO=5000000 - 0
[ 119.748382] BT: WR_BYTES [ ] TO=5000000 - 0
[ 119.748537] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[ 119.760013] BT: WR_CONSUME [ B2H ] TO=5000000 - 4000
[ 119.806271] BT: RD_WAIT [ B2H ] TO=4996000 - 0
[ 119.806282] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0
[ 119.816909] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0
[ 119.816944] BT: XACTION [ ] TO=5000000 - 0
[ 119.827061] BT: WR_BYTES [ ] TO=5000000 - 0
[ 119.827088] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[ 119.852024] BT: WR_CONSUME [ B2H ] TO=5000000 - 4000
[ 119.875752] BT: RD_WAIT [ B2H ] TO=4996000 - 0
[ 119.895099] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0
[ 119.991768] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0
[ 119.991805] BT: XACTION [ ] TO=5000000 - 0
[ 120.004993] BT: WR_BYTES [ ] TO=5000000 - 0
[ 120.027171] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[ 120.048024] BT: WR_CONSUME [ ] TO=5000000 - 4000
[ 120.072949] BT: RD_WAIT [ B2H ] TO=4996000 - 0
[ 120.085892] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0
[ 120.097374] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0
[ 120.097406] BT: XACTION [ ] TO=5000000 - 0
[ 120.125568] BT: WR_BYTES [ ] TO=5000000 - 0
[ 120.141827] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[ 120.264026] BT: WR_CONSUME [ B2H ] TO=5000000 - 4000
[ 120.287267] BT: RD_WAIT [ B2H ] TO=4996000 - 0
[ 120.287276] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0
[ 120.287280] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0
[ 120.287429] ipmi_si IPI0001:00: Using irq 6
[ 120.287447] **Interrupt: 1534434704.287445059
[ 120.287457] BT: XACTION [ ] TO=5000000 - 0
[ 120.287466] BT: WR_BYTES [ ] TO=5000000 - 0
[ 120.287500] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[ 120.297445] **Interrupt: 1534434704.297445691
[ 120.332534] BT: WR_CONSUME [ B2H ] TO=5000000 - 0
[ 120.332538] BT: RD_WAIT [ B2H ] TO=5000000 - 0
[ 120.342956] BT: CLEAR_B2H [ H_BUSY ] TO=5000000 - 0
[ 120.342969] BT: RD_BYTES [ H_BUSY ] TO=5000000 - 0
[ 120.343244] **Done: 1534434704.343244675
[ 120.367379] BT: IDLE [ ] TO=5000000 - 0
[ 120.367384] BT: CAP_BEGIN [ ] TO=5000000 - 0
[ 120.376634] **Timer: 1534434704.376633453
[ 120.376646] BT: XACTION [ ] TO=5000000 - 92000
[ 120.376683] BT: WR_BYTES [ ] TO=4908000 - 0
[ 120.376995] BT: WR_CONSUME [ H2B ] TO=4908000 - 0
[ 120.377025] Send: 18 08
[ 120.377027] **Enqueue: 1534434704.377028726
[ 120.377028] **Start2: 1534434704.377029859
[ 120.377061] BT: WR_CONSUME [ ] TO=4908000 - 0
[ 120.377137] BT: RD_WAIT [ ] TO=4908000 - 0
[ 120.377140] Recv:: 1c 08 d5
[ 120.377156] Send: 18 01
[ 120.377161] **Enqueue: 1534434704.377161047
[ 120.377164] **Start2: 1534434704.377164751
[ 120.377226] BT: RD_WAIT [ ] TO=4908000 - 0
[ 120.377237] Recv:: 1c 01 d5
[ 120.377242] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
[ 120.377252] ipmi_si IPI0001:00: Unable to get the device id: -5
[ 120.377252] ipmi_si IPI0001:00: Unable to register device: error -5
[ 120.569865] IPMI SSIF Interface driver

Also FYI,

Bad cleanup code causes a null pointer and adds a race condition; this
is addressed by "[PATCH] Remove redundant cleanup in ipmi_register_smi"
from Justin Ernst <[email protected]>, which I paste here:

diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
index 51832b8a2c62..3b0b50c4f064 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -3395,12 +3395,12 @@ int ipmi_register_smi(const struct ipmi_smi_handlers *handlers,

out:
if (rv) {
- ipmi_bmc_unregister(intf);
- list_del_rcu(&intf->link);
+ /*
+ * ipmi_unregister_smi must be called to clean up after
+ * failure. We unlock the mutex to allow ipmi_unregister_smi
+ * to lock it and perform cleanup.
+ */
mutex_unlock(&ipmi_interfaces_mutex);
- synchronize_srcu(&ipmi_interfaces_srcu);
- cleanup_srcu_struct(&intf->users_srcu);
- kref_put(&intf->refcount, intf_free);
} else {
/*
* Keep memory order straight for RCU readers. Make
--


2018-08-22 16:16:50

by Corey Minyard

[permalink] [raw]
Subject: Re: [RFC] IPMI state machine regression

On 08/21/2018 05:14 PM, Andrew Banman wrote:
> Dear IPMI supporters,
>
> We observe a window in IPMI BT's opportunistic get capabilities request,
> wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
> machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
> bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
> There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
> This leaves IPMI hosed until you reload modules. Looks to happen after we call
> schedule().

When was the latest kernel where this worked properly?  Also, what
hardware is this?

BTW, you can use the "hotmod" capability of the IPMI driver to add the
device
dynamically.

-corey

2018-08-22 16:39:21

by Andrew Banman

[permalink] [raw]
Subject: Re: [RFC] IPMI state machine regression

On Wed, Aug 22, 2018 at 11:14:52AM -0500, Corey Minyard wrote:
> On 08/21/2018 05:14 PM, Andrew Banman wrote:
> > Dear IPMI supporters,
> >
> > We observe a window in IPMI BT's opportunistic get capabilities request,
> > wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
> > machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
> > bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
> > There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
> > This leaves IPMI hosed until you reload modules. Looks to happen after we call
> > schedule().
>
> When was the latest kernel where this worked properly?? Also, what hardware
> is this?

This is UV4.

First known bad commit, but I am not sure if the timing issue predates
it:

commit aa9c9ab2443e3b9562c6c7cfc245a9e43b557d14
Author: Jeremy Kerr <[email protected]>
Date: Fri Aug 25 15:47:24 2017 +0800

ipmi: allow dynamic BMC version information

Hits less frequently with older kernels so I didn't see it until
recently when it became more frequent.

>
> BTW, you can use the "hotmod" capability of the IPMI driver to add the
> device
> dynamically.
>
> -corey

2018-08-23 16:31:58

by Corey Minyard

[permalink] [raw]
Subject: Re: [RFC] IPMI state machine regression

On 08/22/2018 11:23 AM, Andrew Banman wrote:
> On Wed, Aug 22, 2018 at 11:14:52AM -0500, Corey Minyard wrote:
>> On 08/21/2018 05:14 PM, Andrew Banman wrote:
>>> Dear IPMI supporters,
>>>
>>> We observe a window in IPMI BT's opportunistic get capabilities request,
>>> wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
>>> machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
>>> bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
>>> There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
>>> This leaves IPMI hosed until you reload modules. Looks to happen after we call
>>> schedule().
>> When was the latest kernel where this worked properly?  Also, what hardware
>> is this?
> This is UV4.
>
> First known bad commit, but I am not sure if the timing issue predates
> it:
>
> commit aa9c9ab2443e3b9562c6c7cfc245a9e43b557d14
> Author: Jeremy Kerr <[email protected]>
> Date: Fri Aug 25 15:47:24 2017 +0800
>
> ipmi: allow dynamic BMC version information
>
> Hits less frequently with older kernels so I didn't see it until
> recently when it became more frequent.

Ok, that's for the crash, which makes sense.  But that's an easy problem
to fix.
I would like a "Tested-by" on that, if you get to test it, though I was
able to
simulate various failures there to test it out.

So reading between the lines ("more frequent") I'm guessing that this still
happened with older kernels, but is becoming annoying with newer kernels.
I would guess recent changes causes it to happen more often due to changes
in the way the upper layer interacts with the lower layers, you will
have more
messages at startup, and the timing is somewhat different.

The BT code itself hasn't changed much in over 10 years.  Nothing that
looks like it would cause an issue like this.  So I would guess this is an
issue that has been around for a while.

I don't have any real hardware with a BT interface, just the one in qemu,
but I've never seen it there.

It actually looks like the state machine is working ok.  But the BMC is
responding to a "Get Device ID" command with:

Recv:: 1c 08 d5


That's an error response with D5, which is "Cannot execute command.
Command, or request parameter(s), not supported in present state."
That's an error response from your BMC.  That particular command
shouldn't ever respond with that error, so I think the bug here is
with your BMC.

-corey



2018-08-23 16:55:00

by Andrew Banman

[permalink] [raw]
Subject: Re: [RFC] IPMI state machine regression

On Thu, Aug 23, 2018 at 11:22:58AM -0500, Corey Minyard wrote:
> On 08/22/2018 11:23 AM, Andrew Banman wrote:
> > On Wed, Aug 22, 2018 at 11:14:52AM -0500, Corey Minyard wrote:
> > > On 08/21/2018 05:14 PM, Andrew Banman wrote:
> > > > Dear IPMI supporters,
> > > >
> > > > We observe a window in IPMI BT's opportunistic get capabilities request,
> > > > wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
> > > > machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
> > > > bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
> > > > There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
> > > > This leaves IPMI hosed until you reload modules. Looks to happen after we call
> > > > schedule().
> > > When was the latest kernel where this worked properly?? Also, what hardware
> > > is this?
> > This is UV4.
> >
> > First known bad commit, but I am not sure if the timing issue predates
> > it:
> >
> > commit aa9c9ab2443e3b9562c6c7cfc245a9e43b557d14
> > Author: Jeremy Kerr <[email protected]>
> > Date: Fri Aug 25 15:47:24 2017 +0800
> >
> > ipmi: allow dynamic BMC version information
> >
> > Hits less frequently with older kernels so I didn't see it until
> > recently when it became more frequent.
>
> Ok, that's for the crash, which makes sense.? But that's an easy problem to
> fix.
> I would like a "Tested-by" on that, if you get to test it, though I was able
> to
> simulate various failures there to test it out.

Testing it today. We'll respond to the patch you sent out. Thanks for
fixing that!

>
> So reading between the lines ("more frequent") I'm guessing that this still
> happened with older kernels, but is becoming annoying with newer kernels.
> I would guess recent changes causes it to happen more often due to changes
> in the way the upper layer interacts with the lower layers, you will have
> more
> messages at startup, and the timing is somewhat different.
>
> The BT code itself hasn't changed much in over 10 years.? Nothing that
> looks like it would cause an issue like this.? So I would guess this is an
> issue that has been around for a while.

That's what I suspect as well.

>
> I don't have any real hardware with a BT interface, just the one in qemu,
> but I've never seen it there.
>
> It actually looks like the state machine is working ok.? But the BMC is
> responding to a "Get Device ID" command with:
>
> Recv:: 1c 08 d5

That completion code is spoofed in after bt_start_transaction when bt->state
is in WR_CONSUME. This is pretty visible from the message trace, but here
is another with an explicit debug printk (below).

Maybe I'm totally reading this wrong, but I think that err gets passed
on to the upper layer:

start_transaction: return 0xD5 -> return_hosed_msg -> deliver_recv_msg
-> ipmi_smi_msg_received


[ 120.367379] BT: IDLE [ ] TO=5000000 - 0
[ 120.367384] BT: CAP_BEGIN [ ] TO=5000000 - 0
[ 120.376634] **Timer: 1534434704.376633453
[ 120.376646] BT: XACTION [ ] TO=5000000 - 92000
[ 120.376683] BT: WR_BYTES [ ] TO=4908000 - 0
[ 120.376995] BT: WR_CONSUME [ H2B ] TO=4908000 - 0
[ 120.377025] Send: 18 08
[ 120.377027] **Enqueue: 1534434704.377028726
[ 120.377028] **Start2: 1534434704.377029859
# Added DBG # [ 120.377028] bt_start_transaction returned IPMI_NOT_IN_MY_STATE_ERR 0xd5
# Added DBG # [ 120.377030] 18
# Added DBG # [ 120.377030] 08
# Added DBG # [ 120.377030]
[ 120.377061] BT: WR_CONSUME [ ] TO=4908000 - 0
[ 120.377137] BT: RD_WAIT [ ] TO=4908000 - 0
[ 120.377140] Recv:: 1c 08 d5
[ 120.377156] Send: 18 01
[ 120.377161] **Enqueue: 1534434704.377161047
[ 120.377164] **Start2: 1534434704.377164751
# Added DBG # [ 120.377165] bt_start_transaction returned IPMI_NOT_IN_MY_STATE_ERR 0xd5
# Added DBG # [ 120.377166] 18
# Added DBG # [ 120.377167] 01
# Added DBG # [ 120.377168]
[ 120.377226] BT: RD_WAIT [ ] TO=4908000 - 0
[ 120.377237] Recv:: 1c 01 d5
[ 120.377242] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
[ 120.377252] ipmi_si IPI0001:00: Unable to get the device id: -5
[ 120.377252] ipmi_si IPI0001:00: Unable to register device: error -5


diff --git a/drivers/char/ipmi/ipmi_bt_sm.c b/drivers/char/ipmi/ipmi_bt_sm.c
index d99d744..1be4de9 100644
--- a/drivers/char/ipmi/ipmi_bt_sm.c
+++ b/drivers/char/ipmi/ipmi_bt_sm.c
@@ -24,7 +24,7 @@
* value
*/

-static int bt_debug = 4; /* 0 == BT_DEBUG_OFF */
+static int bt_debug = 2; /* 0 == BT_DEBUG_OFF */

module_param(bt_debug, int, 0644);
MODULE_PARM_DESC(bt_debug, "debug bitmask, 1=enable, 2=messages, 4=states");
@@ -217,8 +217,14 @@ static int bt_start_transaction(struct si_sm_data *bt,
if (bt->state == BT_STATE_LONG_BUSY)
return IPMI_NODE_BUSY_ERR;

- if (bt->state != BT_STATE_IDLE)
+ if (bt->state != BT_STATE_IDLE) {
+ pr_crit("bt->state = %d : bt_start_transaction returned "
+ "IPMI_NOT_IN_MY_STATE_ERR 0xd5\n", bt->state);
+ for (i = 0; i < size; i ++)
+ printk(" %02x", data[i]);
+ printk("\n");
return IPMI_NOT_IN_MY_STATE_ERR;
+ }

if (bt_debug & BT_DEBUG_MSG) {
printk(KERN_WARNING "BT: +++++++++++++++++ New command\n");

>
>
> That's an error response with D5, which is "Cannot execute command.
> Command, or request parameter(s), not supported in present state."
> That's an error response from your BMC.? That particular command
> shouldn't ever respond with that error, so I think the bug here is
> with your BMC.

My BMC engineers and I cannot reproduce the problem running ipmi commands
direct from the BMC.

>
> -corey
>
>

Thank you,

Andrew

2018-08-23 19:08:40

by Corey Minyard

[permalink] [raw]
Subject: Re: [RFC] IPMI state machine regression

On 08/23/2018 11:52 AM, Andrew Banman wrote:
>
>> I don't have any real hardware with a BT interface, just the one in qemu,
>> but I've never seen it there.
>>
>> It actually looks like the state machine is working ok.  But the BMC is
>> responding to a "Get Device ID" command with:
>>
>> Recv:: 1c 08 d5
> That completion code is spoofed in after bt_start_transaction when bt->state
> is in WR_CONSUME. This is pretty visible from the message trace, but here
> is another with an explicit debug printk (below).
>
> Maybe I'm totally reading this wrong, but I think that err gets passed
> on to the upper layer:
>
> start_transaction: return 0xD5 -> return_hosed_msg -> deliver_recv_msg
> -> ipmi_smi_msg_received

Ah, yes, you are right.  I didn't write the BT state machine, so I'm not
terribly
familiar with how it works.  I'm pretty sure the author is no longer
with HP.

>
> [ 120.367379] BT: IDLE [ ] TO=5000000 - 0
> [ 120.367384] BT: CAP_BEGIN [ ] TO=5000000 - 0
> [ 120.376634] **Timer: 1534434704.376633453
> [ 120.376646] BT: XACTION [ ] TO=5000000 - 92000
> [ 120.376683] BT: WR_BYTES [ ] TO=4908000 - 0
> [ 120.376995] BT: WR_CONSUME [ H2B ] TO=4908000 - 0
> [ 120.377025] Send: 18 08
> [ 120.377027] **Enqueue: 1534434704.377028726
> [ 120.377028] **Start2: 1534434704.377029859

So what appears to be happening is that the BT state machine sends the
BT capabilities commands to the BMC.  While this is going on the upper
layer sends a messages, but the state machine can't send because it's
in the middle of another operation.

The "device" in qemu is almost instantaneous, so you would never
see it there.

So this is almost certainly due to the timing and messaging changes.
in the upper layer.  Another step may be necessary in initialization
to allow this to happen.

Let me poke at this a bit and I'll send another patch.

Thanks,

-corey

> # Added DBG # [ 120.377028] bt_start_transaction returned IPMI_NOT_IN_MY_STATE_ERR 0xd5
> # Added DBG # [ 120.377030] 18
> # Added DBG # [ 120.377030] 08
> # Added DBG # [ 120.377030]
> [ 120.377061] BT: WR_CONSUME [ ] TO=4908000 - 0
> [ 120.377137] BT: RD_WAIT [ ] TO=4908000 - 0
> [ 120.377140] Recv:: 1c 08 d5
> [ 120.377156] Send: 18 01
> [ 120.377161] **Enqueue: 1534434704.377161047
> [ 120.377164] **Start2: 1534434704.377164751
> # Added DBG # [ 120.377165] bt_start_transaction returned IPMI_NOT_IN_MY_STATE_ERR 0xd5
> # Added DBG # [ 120.377166] 18
> # Added DBG # [ 120.377167] 01
> # Added DBG # [ 120.377168]
> [ 120.377226] BT: RD_WAIT [ ] TO=4908000 - 0
> [ 120.377237] Recv:: 1c 01 d5
> [ 120.377242] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
> [ 120.377252] ipmi_si IPI0001:00: Unable to get the device id: -5
> [ 120.377252] ipmi_si IPI0001:00: Unable to register device: error -5
>
>