2021-05-21 20:06:59

by Verma, Vishal L

[permalink] [raw]
Subject: [BISECTED] nvme probe failure with v5.13-rc1

Hi,

I ran into this failure to probe an nvme device in an emulator
(simics). It looks like there is a ~60 second wait followed by a
timeout and a failure to boot (the root device is an nvme disk) with
these messages in the log:

[ 67.174010] nvme nvme0: I/O 5 QID 0 timeout, disable controller
[ 67.175793] nvme nvme0: Removing after probe failure status: -4

I bisected this to:
5befc7c26e5a ("nvme: implement non-mdts command limits")

It's not immediately obvious to me what's causing the problem.
Reverting the above commit fixes it. It is easily reproducible - I'd be
happy to provide more info about the emulated device or test out
patches or theories.

It is of course possible that the emulated device is behaving in some
non spec-compliant way, in which case I'd appreciate any help figuring
out what that is.

Thanks,
-Vishal


2021-05-21 20:19:56

by Keith Busch

[permalink] [raw]
Subject: Re: [BISECTED] nvme probe failure with v5.13-rc1

On Fri, May 21, 2021 at 05:00:29AM +0000, Verma, Vishal L wrote:
> Hi,
>
> I ran into this failure to probe an nvme device in an emulator
> (simics). It looks like there is a ~60 second wait followed by a
> timeout and a failure to boot (the root device is an nvme disk) with
> these messages in the log:
>
> [ 67.174010] nvme nvme0: I/O 5 QID 0 timeout, disable controller
> [ 67.175793] nvme nvme0: Removing after probe failure status: -4
>
> I bisected this to:
> 5befc7c26e5a ("nvme: implement non-mdts command limits")
>
> It's not immediately obvious to me what's causing the problem.
> Reverting the above commit fixes it. It is easily reproducible - I'd be
> happy to provide more info about the emulated device or test out
> patches or theories.
>
> It is of course possible that the emulated device is behaving in some
> non spec-compliant way, in which case I'd appreciate any help figuring
> out what that is.

Hi Vishal,

The patch you bisected to sends only a single Identify command, so it
sounds like that must be the command that times out. The controller is
not required to support this specific identify (CNS 0x6), but it is
required to produce a response. If the identify is unsupported, the
controller should respond with an appropriate error (Invalid Field In
Command), but it looks like the controller didn't respond at all.

So based on your observation, it sounds like the simics implementation
has an identify bug. The spec doesn't provide a way for the driver to
know ahead of time whether or not this identification is supported, so
the driver just has to try it and react to the status code. If the
implmenetation can't be fixed, then we'll need to quirk your device.

If you want to confirm for certain that the new identify is the source
of your timeout, you could try the following patch and the timeout
should go away:

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1a73eed61eee..b16d31d82606 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2711,7 +2711,7 @@ static int nvme_init_non_mdts_limits(struct nvme_ctrl *ctrl)
else
ctrl->max_zeroes_sectors = 0;

- if (nvme_ctrl_limited_cns(ctrl))
+ if (true || nvme_ctrl_limited_cns(ctrl))
return 0;

id = kzalloc(sizeof(*id), GFP_KERNEL);
--

2021-05-21 20:22:30

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [BISECTED] nvme probe failure with v5.13-rc1

On Fri, 2021-05-21 at 23:57 +0900, Keith Busch wrote:
> On Fri, May 21, 2021 at 05:00:29AM +0000, Verma, Vishal L wrote:
> > Hi,
> >
> > I ran into this failure to probe an nvme device in an emulator
> > (simics). It looks like there is a ~60 second wait followed by a
> > timeout and a failure to boot (the root device is an nvme disk) with
> > these messages in the log:
> >
> > [ 67.174010] nvme nvme0: I/O 5 QID 0 timeout, disable controller
> > [ 67.175793] nvme nvme0: Removing after probe failure status: -4
> >
> > I bisected this to:
> > 5befc7c26e5a ("nvme: implement non-mdts command limits")
> >
> > It's not immediately obvious to me what's causing the problem.
> > Reverting the above commit fixes it. It is easily reproducible - I'd be
> > happy to provide more info about the emulated device or test out
> > patches or theories.
> >
> > It is of course possible that the emulated device is behaving in some
> > non spec-compliant way, in which case I'd appreciate any help figuring
> > out what that is.
>
> Hi Vishal,
>
> The patch you bisected to sends only a single Identify command, so it
> sounds like that must be the command that times out. The controller is
> not required to support this specific identify (CNS 0x6), but it is
> required to produce a response. If the identify is unsupported, the
> controller should respond with an appropriate error (Invalid Field In
> Command), but it looks like the controller didn't respond at all.
>
> So based on your observation, it sounds like the simics implementation
> has an identify bug. The spec doesn't provide a way for the driver to
> know ahead of time whether or not this identification is supported, so
> the driver just has to try it and react to the status code. If the
> implmenetation can't be fixed, then we'll need to quirk your device.
>
> If you want to confirm for certain that the new identify is the source
> of your timeout, you could try the following patch and the timeout
> should go away:
>
> ---
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1a73eed61eee..b16d31d82606 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2711,7 +2711,7 @@ static int nvme_init_non_mdts_limits(struct nvme_ctrl *ctrl)
> else
> ctrl->max_zeroes_sectors = 0;
>
> - if (nvme_ctrl_limited_cns(ctrl))
> + if (true || nvme_ctrl_limited_cns(ctrl))
> return 0;
>
> id = kzalloc(sizeof(*id), GFP_KERNEL);
> --

Hi Keith,

Thanks for looking into it - yes with that the problem goes away.
Let me chat with the simics folks and see if I can get them to fix it.

2021-05-21 20:24:07

by Keith Busch

[permalink] [raw]
Subject: Re: [BISECTED] nvme probe failure with v5.13-rc1

On Fri, May 21, 2021 at 04:50:15PM +0000, Verma, Vishal L wrote:
>
> Thanks for looking into it - yes with that the problem goes away.
> Let me chat with the simics folks and see if I can get them to fix it.

Thanks for confirming. It sounds like the simics implementation is
missing a default case in its identify function's CNS check and simply
loses track of the command. I'd wager it probably also times out for any
reserved or unsupported CNS value.