2019-09-25 04:34:30

by Milan P. Gandhi

[permalink] [raw]
Subject: [PATCH] scsi: core: Log SCSI command age with errors

Couple of users had requested to print the SCSI command age along
with command failure errors. This is a small change, but allows
users to get more important information about the command that was
failed, it would help the users in debugging the command failures:

Signed-off-by: Milan P. Gandhi <[email protected]>
---
diff --git a/drivers/scsi/scsi_logging.c b/drivers/scsi/scsi_logging.c
index ecc5918e372a..ca2182bc53c6 100644
--- a/drivers/scsi/scsi_logging.c
+++ b/drivers/scsi/scsi_logging.c
@@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd *cmd, const char *msg,
const char *mlret_string = scsi_mlreturn_string(disposition);
const char *hb_string = scsi_hostbyte_string(cmd->result);
const char *db_string = scsi_driverbyte_string(cmd->result);
+ unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;

logbuf = scsi_log_reserve_buffer(&logbuf_len);
if (!logbuf)
@@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd *cmd, const char *msg,

if (db_string)
off += scnprintf(logbuf + off, logbuf_len - off,
- "driverbyte=%s", db_string);
+ "driverbyte=%s ", db_string);
else
off += scnprintf(logbuf + off, logbuf_len - off,
- "driverbyte=0x%02x", driver_byte(cmd->result));
+ "driverbyte=0x%02x ",
+ driver_byte(cmd->result));
+
+ off += scnprintf(logbuf + off, logbuf_len - off,
+ "cmd-age=%lus", cmd_age);
+
out_printk:
dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf);
scsi_log_release_buffer(logbuf);


2019-09-25 13:46:14

by Laurence Oberman

[permalink] [raw]
Subject: Re: [PATCH] scsi: core: Log SCSI command age with errors

On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote:
> Couple of users had requested to print the SCSI command age along
> with command failure errors. This is a small change, but allows
> users to get more important information about the command that was
> failed, it would help the users in debugging the command failures:
>
> Signed-off-by: Milan P. Gandhi <[email protected]>
> ---
> diff --git a/drivers/scsi/scsi_logging.c
> b/drivers/scsi/scsi_logging.c
> index ecc5918e372a..ca2182bc53c6 100644
> --- a/drivers/scsi/scsi_logging.c
> +++ b/drivers/scsi/scsi_logging.c
> @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
> const char *mlret_string = scsi_mlreturn_string(disposition);
> const char *hb_string = scsi_hostbyte_string(cmd->result);
> const char *db_string = scsi_driverbyte_string(cmd->result);
> + unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
>
> logbuf = scsi_log_reserve_buffer(&logbuf_len);
> if (!logbuf)
> @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
>
> if (db_string)
> off += scnprintf(logbuf + off, logbuf_len - off,
> - "driverbyte=%s", db_string);
> + "driverbyte=%s ", db_string);
> else
> off += scnprintf(logbuf + off, logbuf_len - off,
> - "driverbyte=0x%02x", driver_byte(cmd-
> >result));
> + "driverbyte=0x%02x ",
> + driver_byte(cmd->result));
> +
> + off += scnprintf(logbuf + off, logbuf_len - off,
> + "cmd-age=%lus", cmd_age);
> +
> out_printk:
> dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf);
> scsi_log_release_buffer(logbuf);
>

This looks to be a useful debug addition to me, and the code looks
correct.
I believe this has also been tested by Milan in our lab.

Reviewed-by: Laurence Oberman <[email protected]>

2019-09-25 13:57:18

by Milan P. Gandhi

[permalink] [raw]
Subject: Re: [PATCH] scsi: core: Log SCSI command age with errors

On 9/23/19 6:32 PM, Laurence Oberman wrote:
> On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote:
>> Couple of users had requested to print the SCSI command age along
>> with command failure errors. This is a small change, but allows
>> users to get more important information about the command that was
>> failed, it would help the users in debugging the command failures:
>>
>> Signed-off-by: Milan P. Gandhi <[email protected]>
>> ---
>> diff --git a/drivers/scsi/scsi_logging.c
>> b/drivers/scsi/scsi_logging.c
>> index ecc5918e372a..ca2182bc53c6 100644
>> --- a/drivers/scsi/scsi_logging.c
>> +++ b/drivers/scsi/scsi_logging.c
>> @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
>> *cmd, const char *msg,
>> const char *mlret_string = scsi_mlreturn_string(disposition);
>> const char *hb_string = scsi_hostbyte_string(cmd->result);
>> const char *db_string = scsi_driverbyte_string(cmd->result);
>> + unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
>>
>> logbuf = scsi_log_reserve_buffer(&logbuf_len);
>> if (!logbuf)
>> @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
>> *cmd, const char *msg,
>>
>> if (db_string)
>> off += scnprintf(logbuf + off, logbuf_len - off,
>> - "driverbyte=%s", db_string);
>> + "driverbyte=%s ", db_string);
>> else
>> off += scnprintf(logbuf + off, logbuf_len - off,
>> - "driverbyte=0x%02x", driver_byte(cmd-
>>> result));
>> + "driverbyte=0x%02x ",
>> + driver_byte(cmd->result));
>> +
>> + off += scnprintf(logbuf + off, logbuf_len - off,
>> + "cmd-age=%lus", cmd_age);
>> +
>> out_printk:
>> dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf);
>> scsi_log_release_buffer(logbuf);
>>
>
> This looks to be a useful debug addition to me, and the code looks
> correct.
> I believe this has also been tested by Milan in our lab.
>
> Reviewed-by: Laurence Oberman <[email protected]>
>
Yes, the patch was tested locally using scsi_debug as well as in real
storage issues caused by bad disks in customer environment.

Thanks,
Milan.

2019-09-26 09:44:55

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH] scsi: core: Log SCSI command age with errors

On 9/22/19 11:01 PM, Milan P. Gandhi wrote:
> + off += scnprintf(logbuf + off, logbuf_len - off,
> + "cmd-age=%lus", cmd_age);

Have you considered to change cmd-age into cmd_age? I'm afraid otherwise
someone might interpret the hyphen as a subtraction sign...

Thanks,

Bart.

2019-09-26 10:19:53

by Milan P. Gandhi

[permalink] [raw]
Subject: Re: [PATCH] scsi: core: Log SCSI command age with errors

On 9/25/19 9:15 PM, Bart Van Assche wrote:
> On 9/22/19 11:01 PM, Milan P. Gandhi wrote:
>> +??? off += scnprintf(logbuf + off, logbuf_len - off,
>> +???????????? "cmd-age=%lus", cmd_age);
>
> Have you considered to change cmd-age into cmd_age? I'm afraid otherwise someone might interpret the hyphen as a subtraction sign...

Thanks for your suggestion Bart.
Yes, it would be better to use cmd_age to avoid any confusion.
I will change it and send the updated patch.

Thanks,
Milan.

2019-09-27 15:30:19

by Martin Wilck

[permalink] [raw]
Subject: Re: [PATCH] scsi: core: Log SCSI command age with errors

On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote:
> Couple of users had requested to print the SCSI command age along
> with command failure errors. This is a small change, but allows
> users to get more important information about the command that was
> failed, it would help the users in debugging the command failures:
>
> Signed-off-by: Milan P. Gandhi <[email protected]>
> ---
> diff --git a/drivers/scsi/scsi_logging.c
> b/drivers/scsi/scsi_logging.c
> index ecc5918e372a..ca2182bc53c6 100644
> --- a/drivers/scsi/scsi_logging.c
> +++ b/drivers/scsi/scsi_logging.c
> @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
> const char *mlret_string = scsi_mlreturn_string(disposition);
> const char *hb_string = scsi_hostbyte_string(cmd->result);
> const char *db_string = scsi_driverbyte_string(cmd->result);
> + unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;

This comes down to pretty coarse time resolution, does it not? More
often than not, the time difference shown will be 0. I'd recommend at
least millisecond resolution.

>
> logbuf = scsi_log_reserve_buffer(&logbuf_len);
> if (!logbuf)
> @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
>
> if (db_string)
> off += scnprintf(logbuf + off, logbuf_len - off,
> - "driverbyte=%s", db_string);
> + "driverbyte=%s ", db_string);
> else
> off += scnprintf(logbuf + off, logbuf_len - off,
> - "driverbyte=0x%02x", driver_byte(cmd-
> >result));
> + "driverbyte=0x%02x ",
> + driver_byte(cmd->result));
> +
> + off += scnprintf(logbuf + off, logbuf_len - off,
> + "cmd-age=%lus", cmd_age);

This is certainly helpful in some situations. Yet I am unsure if it
should *always* be printed. I wouldn't say it's as important as the
other stuff scsi_print_result() provides. After all, by activating
MLQUEUE+MLCOMPLETE, the time on-wire can be extracted with better
accuracy from currently available logs.

So perhaps make this depend on a module parameter?

Also, we should carefully ponder if we want to put this on the same
line as the driver byte, as users may have created scripts for parsing
this output.

Regards,
Martin


2019-09-27 17:46:09

by Laurence Oberman

[permalink] [raw]
Subject: Re: [PATCH] scsi: core: Log SCSI command age with errors

On Fri, 2019-09-27 at 17:26 +0200, Martin Wilck wrote:
> On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote:
> > Couple of users had requested to print the SCSI command age along
> > with command failure errors. This is a small change, but allows
> > users to get more important information about the command that was
> > failed, it would help the users in debugging the command failures:
> >
> > Signed-off-by: Milan P. Gandhi <[email protected]>
> > ---
> > diff --git a/drivers/scsi/scsi_logging.c
> > b/drivers/scsi/scsi_logging.c
> > index ecc5918e372a..ca2182bc53c6 100644
> > --- a/drivers/scsi/scsi_logging.c
> > +++ b/drivers/scsi/scsi_logging.c
> > @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
> > *cmd, const char *msg,
> > const char *mlret_string = scsi_mlreturn_string(disposition);
> > const char *hb_string = scsi_hostbyte_string(cmd->result);
> > const char *db_string = scsi_driverbyte_string(cmd->result);
> > + unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
>
> This comes down to pretty coarse time resolution, does it not? More
> often than not, the time difference shown will be 0. I'd recommend at
> least millisecond resolution.
>
> >
> > logbuf = scsi_log_reserve_buffer(&logbuf_len);
> > if (!logbuf)
> > @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
> > *cmd, const char *msg,
> >
> > if (db_string)
> > off += scnprintf(logbuf + off, logbuf_len - off,
> > - "driverbyte=%s", db_string);
> > + "driverbyte=%s ", db_string);
> > else
> > off += scnprintf(logbuf + off, logbuf_len - off,
> > - "driverbyte=0x%02x", driver_byte(cmd-
> > > result));
> >
> > + "driverbyte=0x%02x ",
> > + driver_byte(cmd->result));
> > +
> > + off += scnprintf(logbuf + off, logbuf_len - off,
> > + "cmd-age=%lus", cmd_age);
>
> This is certainly helpful in some situations. Yet I am unsure if it
> should *always* be printed. I wouldn't say it's as important as the
> other stuff scsi_print_result() provides. After all, by activating
> MLQUEUE+MLCOMPLETE, the time on-wire can be extracted with better
> accuracy from currently available logs.
>
> So perhaps make this depend on a module parameter?
>
> Also, we should carefully ponder if we want to put this on the same
> line as the driver byte, as users may have created scripts for
> parsing
> this output.
>
> Regards,
> Martin
>
>
Hi Martin

Agreed about log extraction, but turning that on with a busy workload
in a production environment is not practical. We cant do it with
systems with 1000's of luns and 1000's of IOPS/sec.
Also second resolution is good enough for the debug we want to see.

Regards
Laurence

2019-09-30 08:43:00

by Martin Wilck

[permalink] [raw]
Subject: Re: [PATCH] scsi: core: Log SCSI command age with errors

On Fri, 2019-09-27 at 13:45 -0400, Laurence Oberman wrote:
>
> Hi Martin
>
> Agreed about log extraction, but turning that on with a busy workload
> in a production environment is not practical. We cant do it with
> systems with 1000's of luns and 1000's of IOPS/sec.
> Also second resolution is good enough for the debug we want to see.

I gather that you look at a specific problem where second resolution is
sufficient. For upstream, the generic usefulness should be considered,
and I don't think we can say today that better-than-second resolution
will never be useful, so I still vote for milliseconds.

Wrt the enablement of the option on highly loaded systems, I'm not sure
I understand. You need to enable SCSI logging anyway, don't you? Is it
an issue to have to set 2 sysfs values rather than just one?

Regards,
Martin

>
> Regards
> Laurence
>


2019-09-30 09:09:37

by Milan P. Gandhi

[permalink] [raw]
Subject: Re: [PATCH] scsi: core: Log SCSI command age with errors

On 9/30/19 2:12 PM, Martin Wilck wrote:
> On Fri, 2019-09-27 at 13:45 -0400, Laurence Oberman wrote:
>>
>> Hi Martin
>>
>> Agreed about log extraction, but turning that on with a busy workload
>> in a production environment is not practical. We cant do it with
>> systems with 1000's of luns and 1000's of IOPS/sec.
>> Also second resolution is good enough for the debug we want to see.
>
> I gather that you look at a specific problem where second resolution is
> sufficient. For upstream, the generic usefulness should be considered,
> and I don't think we can say today that better-than-second resolution
> will never be useful, so I still vote for milliseconds.

Ok, I will change it to ms.

> Wrt the enablement of the option on highly loaded systems, I'm not sure
> I understand. You need to enable SCSI logging anyway, don't you?

By default we keep the SCSI debug logging disabled or am I missing
something?

>Is it an issue to have to set 2 sysfs values rather than just one?

The idea here is to capture the above debug data even without
any user interventions to change any sysfs entries or to enable
debug logging on busy, critical production systems.

Also, we are not changing the existing text in SCSI command error log,
but we are only adding one single word at the end of message. Ideally
the user scripts are written to grep specific pattern from the logs.
Since we are not replacing any existing text from the logs, the
scripts should still work with this change as well.

Thanks,
Milan.

2019-09-30 09:48:31

by Martin Wilck

[permalink] [raw]
Subject: Re: [PATCH] scsi: core: Log SCSI command age with errors

Hello Milan,

On Mon, 2019-09-30 at 14:35 +0530, Milan P. Gandhi wrote:
> On 9/30/19 2:12 PM, Martin Wilck wrote:

> > Wrt the enablement of the option on highly loaded systems, I'm not
> > sure
> > I understand. You need to enable SCSI logging anyway, don't you?
>
> By default we keep the SCSI debug logging disabled or am I missing
> something?
>
> > Is it an issue to have to set 2 sysfs values rather than just one?
>
> The idea here is to capture the above debug data even without
> any user interventions to change any sysfs entries or to enable
> debug logging on busy, critical production systems.

So, you're looking at the scsi_io_completion() code path. In my
experience that isn't reliable for bug hunting because of the the
message rate limiting. Therefore I prefer using SCSI logging
MLCOMPLETE=1, where no rate limiting applies. But that's just a side
note, it depends on the case what's more useful.

Back to the cmd age output, IMO we're are on a thin line between
capturing useful information and keeping the logs neat. As I already
said, I'm not convinced that this information, as important it may be
for the case(s) you're currently investigating, has the same generic
degree of importance or usefulness as what's currently printed (the CDB
and the sense data). But OTOH, that's just a gut feeling, and I can't
claim to have the experience to make general statement on it. If noone
else has issues with this being printed by default, I'm not going
oppose it.

> Also, we are not changing the existing text in SCSI command error
> log,
> but we are only adding one single word at the end of message. Ideally
> the user scripts are written to grep specific pattern from the logs.
> Since we are not replacing any existing text from the logs, the
> scripts should still work with this change as well.

You are certainly aware that such scripts don't necessarily conform to
what kernel developers would consider "ideal" :-) But again, I just
wanted to raise the issue; if noone else thinks it matters, fine with
me.

Thanks
Martin