2024-04-09 09:35:27

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v5 1/6] nvme: authentication error are always non-retryable

From: Hannes Reinecke <[email protected]>

Any authentication errors which are generated internally are always
non-retryable, so use negative error codes to ensure they are not
retried.

Signed-off-by: Hannes Reinecke <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/nvme/host/core.c | 6 +++---
drivers/nvme/host/fabrics.c | 25 +++++++++++++------------
drivers/nvme/host/nvme.h | 2 +-
3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 504dc352c458..66387bcca8ae 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -383,14 +383,14 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
if (likely(nvme_req(req)->status == 0))
return COMPLETE;

- if ((nvme_req(req)->status & 0x7ff) == NVME_SC_AUTH_REQUIRED)
- return AUTHENTICATE;
-
if (blk_noretry_request(req) ||
(nvme_req(req)->status & NVME_SC_DNR) ||
nvme_req(req)->retries >= nvme_max_retries)
return COMPLETE;

+ if ((nvme_req(req)->status & 0x7ff) == NVME_SC_AUTH_REQUIRED)
+ return AUTHENTICATE;
+
if (req->cmd_flags & REQ_NVME_MPATH) {
if (nvme_is_path_error(nvme_req(req)->status) ||
blk_queue_dying(req->q))
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 1f0ea1f32d22..0166f8a3a3eb 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -467,7 +467,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
if (result & NVME_CONNECT_AUTHREQ_ASCR) {
dev_warn(ctrl->device,
"qid 0: secure concatenation is not supported\n");
- ret = NVME_SC_AUTH_REQUIRED;
+ ret = -EOPNOTSUPP;
goto out_free_data;
}
/* Authentication required */
@@ -475,14 +475,14 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
if (ret) {
dev_warn(ctrl->device,
"qid 0: authentication setup failed\n");
- ret = NVME_SC_AUTH_REQUIRED;
goto out_free_data;
}
ret = nvme_auth_wait(ctrl, 0);
- if (ret)
+ if (ret) {
dev_warn(ctrl->device,
- "qid 0: authentication failed\n");
- else
+ "qid 0: authentication failed, error %d\n",
+ ret);
+ } else
dev_info(ctrl->device,
"qid 0: authenticated\n");
}
@@ -542,7 +542,7 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
if (result & NVME_CONNECT_AUTHREQ_ASCR) {
dev_warn(ctrl->device,
"qid 0: secure concatenation is not supported\n");
- ret = NVME_SC_AUTH_REQUIRED;
+ ret = -EOPNOTSUPP;
goto out_free_data;
}
/* Authentication required */
@@ -550,12 +550,13 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
if (ret) {
dev_warn(ctrl->device,
"qid %d: authentication setup failed\n", qid);
- ret = NVME_SC_AUTH_REQUIRED;
- } else {
- ret = nvme_auth_wait(ctrl, qid);
- if (ret)
- dev_warn(ctrl->device,
- "qid %u: authentication failed\n", qid);
+ goto out_free_data;
+ }
+ ret = nvme_auth_wait(ctrl, qid);
+ if (ret) {
+ dev_warn(ctrl->device,
+ "qid %u: authentication failed, error %d\n",
+ qid, ret);
}
}
out_free_data:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index d0ed64dc7380..9b8904a476b8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -1122,7 +1122,7 @@ static inline int nvme_auth_negotiate(struct nvme_ctrl *ctrl, int qid)
}
static inline int nvme_auth_wait(struct nvme_ctrl *ctrl, int qid)
{
- return NVME_SC_AUTH_REQUIRED;
+ return -EPROTONOSUPPORT;
}
static inline void nvme_auth_free(struct nvme_ctrl *ctrl) {};
#endif
--
2.44.0



2024-04-09 14:04:39

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable

On Tue, Apr 09, 2024 at 11:35:05AM +0200, Daniel Wagner wrote:
> From: Hannes Reinecke <[email protected]>
>
> Any authentication errors which are generated internally are always
> non-retryable, so use negative error codes to ensure they are not
> retried.

Thanks, this looks much better now:

Reviewed-by: Christoph Hellwig <[email protected]>

2024-04-09 20:26:19

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable



On 09/04/2024 12:35, Daniel Wagner wrote:
> From: Hannes Reinecke <[email protected]>
>
> Any authentication errors which are generated internally are always
> non-retryable, so use negative error codes to ensure they are not
> retried.

The patch title says that any authentication error is not retryable, and
the patch body says "authentication errors which are generated locally
are non-retryable" so which one is it?

2024-04-10 06:52:48

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable

On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>
>
> On 09/04/2024 12:35, Daniel Wagner wrote:
> > From: Hannes Reinecke <[email protected]>
> >
> > Any authentication errors which are generated internally are always
> > non-retryable, so use negative error codes to ensure they are not
> > retried.
>
> The patch title says that any authentication error is not retryable, and
> the patch body says "authentication errors which are generated locally
> are non-retryable" so which one is it?

Forgot to update the commit message. What about:

All authentication errors are non-retryable, so use negative error
codes to ensure they are not retried.

?

2024-04-10 10:26:30

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable



On 10/04/2024 9:52, Daniel Wagner wrote:
> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>
>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>> From: Hannes Reinecke <[email protected]>
>>>
>>> Any authentication errors which are generated internally are always
>>> non-retryable, so use negative error codes to ensure they are not
>>> retried.
>> The patch title says that any authentication error is not retryable, and
>> the patch body says "authentication errors which are generated locally
>> are non-retryable" so which one is it?
> Forgot to update the commit message. What about:
>
> All authentication errors are non-retryable, so use negative error
> codes to ensure they are not retried.
>
> ?

I have a question, what happens if nvmet updated its credentials (by the
admin) and in the period until
the host got his credentials updated, it happens to
disconnect/reconnect. It will see an authentication
error, so it will not retry and remove the controller altogether?

Sounds like an issue to me.

2024-04-10 12:06:09

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable

On 4/10/24 12:21, Sagi Grimberg wrote:
>
>
> On 10/04/2024 9:52, Daniel Wagner wrote:
>> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>>
>>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>>> From: Hannes Reinecke <[email protected]>
>>>>
>>>> Any authentication errors which are generated internally are always
>>>> non-retryable, so use negative error codes to ensure they are not
>>>> retried.
>>> The patch title says that any authentication error is not retryable, and
>>> the patch body says "authentication errors which are generated locally
>>> are non-retryable" so which one is it?
>> Forgot to update the commit message. What about:
>>
>>    All authentication errors are non-retryable, so use negative error
>>    codes to ensure they are not retried.
>>
>> ?
>
> I have a question, what happens if nvmet updated its credentials (by the
> admin) and in the period until the host got his credentials updated, it
> happens to disconnect/reconnect. It will see an authentication
> error, so it will not retry and remove the controller altogether?
>
> Sounds like an issue to me.

Usual thing: we cannot differentiate (on the host side) whether the
current PSK is _about_ to be replaced; how should the kernel
know that the admin will replace the PSK in the next minutes?

But that really is an issue with the standard. Currently there is no
way how a target could inform the initiator that the credentials have
been updated.

We would need to define a new status code for this.
In the meantime the safe operations model is to set a lifetime
for each PSK, and ensure that the PSK is updated on both sides
during the lifetime. With that there is a timeframe during which
both PSKs are available (on the target), and the older will expire
automatically once the lifetime limit is reached.

Cheers,

Hannes


2024-04-10 14:01:40

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable



On 10/04/2024 15:05, Hannes Reinecke wrote:
> On 4/10/24 12:21, Sagi Grimberg wrote:
>>
>>
>> On 10/04/2024 9:52, Daniel Wagner wrote:
>>> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>>>
>>>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>>>> From: Hannes Reinecke <[email protected]>
>>>>>
>>>>> Any authentication errors which are generated internally are always
>>>>> non-retryable, so use negative error codes to ensure they are not
>>>>> retried.
>>>> The patch title says that any authentication error is not
>>>> retryable, and
>>>> the patch body says "authentication errors which are generated locally
>>>> are non-retryable" so which one is it?
>>> Forgot to update the commit message. What about:
>>>
>>>    All authentication errors are non-retryable, so use negative error
>>>    codes to ensure they are not retried.
>>>
>>> ?
>>
>> I have a question, what happens if nvmet updated its credentials (by
>> the admin) and in the period until the host got his credentials
>> updated, it
>> happens to disconnect/reconnect. It will see an authentication
>> error, so it will not retry and remove the controller altogether?
>>
>> Sounds like an issue to me.
>
> Usual thing: we cannot differentiate (on the host side) whether the
> current PSK is _about_ to be replaced; how should the kernel
> know that the admin will replace the PSK in the next minutes?
>
> But that really is an issue with the standard. Currently there is no
> way how a target could inform the initiator that the credentials have
> been updated.

I'd say that the sane thing for the host to do in this case is to reconnect
until giving up with hope that it may work. This seems like a better
approach
than to abruptly remove the controller no?

>
> We would need to define a new status code for this.
> In the meantime the safe operations model is to set a lifetime
> for each PSK, and ensure that the PSK is updated on both sides
> during the lifetime. With that there is a timeframe during which
> both PSKs are available (on the target), and the older will expire
> automatically once the lifetime limit is reached.

That is a good solution, and will also prevent a loss of service until
the host credentials are updated as well.

But regardless I have a feeling that simply removing the controller upon
an authentication error is not the right thing to do here.

2024-04-11 07:13:18

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable

On 4/10/24 15:50, Sagi Grimberg wrote:
>
>
> On 10/04/2024 15:05, Hannes Reinecke wrote:
>> On 4/10/24 12:21, Sagi Grimberg wrote:
>>>
>>>
>>> On 10/04/2024 9:52, Daniel Wagner wrote:
>>>> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>>>>
>>>>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>>>>> From: Hannes Reinecke <[email protected]>
>>>>>>
>>>>>> Any authentication errors which are generated internally are always
>>>>>> non-retryable, so use negative error codes to ensure they are not
>>>>>> retried.
>>>>> The patch title says that any authentication error is not
>>>>> retryable, and
>>>>> the patch body says "authentication errors which are generated locally
>>>>> are non-retryable" so which one is it?
>>>> Forgot to update the commit message. What about:
>>>>
>>>>    All authentication errors are non-retryable, so use negative error
>>>>    codes to ensure they are not retried.
>>>>
>>>> ?
>>>
>>> I have a question, what happens if nvmet updated its credentials (by
>>> the admin) and in the period until the host got his credentials
>>> updated, it
>>> happens to disconnect/reconnect. It will see an authentication
>>> error, so it will not retry and remove the controller altogether?
>>>
>>> Sounds like an issue to me.
>>
>> Usual thing: we cannot differentiate (on the host side) whether the
>> current PSK is _about_ to be replaced; how should the kernel
>> know that the admin will replace the PSK in the next minutes?
>>
>> But that really is an issue with the standard. Currently there is no
>> way how a target could inform the initiator that the credentials have
>> been updated.
>
> I'd say that the sane thing for the host to do in this case is to reconnect
> until giving up with hope that it may work. This seems like a better
> approach
> than to abruptly remove the controller no?
>
>>
>> We would need to define a new status code for this.
>> In the meantime the safe operations model is to set a lifetime
>> for each PSK, and ensure that the PSK is updated on both sides
>> during the lifetime. With that there is a timeframe during which
>> both PSKs are available (on the target), and the older will expire
>> automatically once the lifetime limit is reached.
>
> That is a good solution, and will also prevent a loss of service until
> the host credentials are updated as well.
>
> But regardless I have a feeling that simply removing the controller upon
> an authentication error is not the right thing to do here.

Guess what; that's what I tried to do initially. But then Christoph
objected that we shouldn't generate NVMe status codes internally.
But if we can't do that then we'll have to invent yet another way to
return a retryable error, leading to even more band-aid.
So I am not quite sure how we could achieve that, short of making
_every_ error retryable...

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-04-11 08:44:18

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable



On 11/04/2024 10:11, Hannes Reinecke wrote:
> On 4/10/24 15:50, Sagi Grimberg wrote:
>>
>>
>> On 10/04/2024 15:05, Hannes Reinecke wrote:
>>> On 4/10/24 12:21, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 10/04/2024 9:52, Daniel Wagner wrote:
>>>>> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>>>>>
>>>>>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>>>>>> From: Hannes Reinecke <[email protected]>
>>>>>>>
>>>>>>> Any authentication errors which are generated internally are always
>>>>>>> non-retryable, so use negative error codes to ensure they are not
>>>>>>> retried.
>>>>>> The patch title says that any authentication error is not
>>>>>> retryable, and
>>>>>> the patch body says "authentication errors which are generated
>>>>>> locally
>>>>>> are non-retryable" so which one is it?
>>>>> Forgot to update the commit message. What about:
>>>>>
>>>>>    All authentication errors are non-retryable, so use negative error
>>>>>    codes to ensure they are not retried.
>>>>>
>>>>> ?
>>>>
>>>> I have a question, what happens if nvmet updated its credentials
>>>> (by the admin) and in the period until the host got his credentials
>>>> updated, it
>>>> happens to disconnect/reconnect. It will see an authentication
>>>> error, so it will not retry and remove the controller altogether?
>>>>
>>>> Sounds like an issue to me.
>>>
>>> Usual thing: we cannot differentiate (on the host side) whether the
>>> current PSK is _about_ to be replaced; how should the kernel
>>> know that the admin will replace the PSK in the next minutes?
>>>
>>> But that really is an issue with the standard. Currently there is no
>>> way how a target could inform the initiator that the credentials have
>>> been updated.
>>
>> I'd say that the sane thing for the host to do in this case is to
>> reconnect
>> until giving up with hope that it may work. This seems like a better
>> approach
>> than to abruptly remove the controller no?
>>
>>>
>>> We would need to define a new status code for this.
>>> In the meantime the safe operations model is to set a lifetime
>>> for each PSK, and ensure that the PSK is updated on both sides
>>> during the lifetime. With that there is a timeframe during which
>>> both PSKs are available (on the target), and the older will expire
>>> automatically once the lifetime limit is reached.
>>
>> That is a good solution, and will also prevent a loss of service until
>> the host credentials are updated as well.
>>
>> But regardless I have a feeling that simply removing the controller upon
>> an authentication error is not the right thing to do here.
>
> Guess what; that's what I tried to do initially. But then Christoph
> objected that we shouldn't generate NVMe status codes internally.
> But if we can't do that then we'll have to invent yet another way to
> return a retryable error, leading to even more band-aid.
> So I am not quite sure how we could achieve that, short of making
> _every_ error retryable...

So this whole thing is that you want to make the host to not reconnect
if the controller
sent a DNR and reconnect otherwise?

What are you returning today if the authentication failed? Am I reading
it right that you are
returning -ECONNREFUSED? I think that for the specific case of
credentials mismatch (that and only
that) you may want to return -EKEYREJECTED. That according to the
documentation (/* Key was rejected by service */)
is specific enough that perhaps we can treat it specially when asking
"should I reconnect?"

Thoughts?