2022-01-26 22:32:26

by Yang Shi

[permalink] [raw]
Subject: [v5 PATCH] block: introduce block_rq_error tracepoint

Currently, rasdaemon uses the existing tracepoint block_rq_complete
and filters out non-error cases in order to capture block disk errors.

But there are a few problems with this approach:

1. Even kernel trace filter could do the filtering work, there is
still some overhead after we enable this tracepoint.

2. The filter is merely based on errno, which does not align with kernel
logic to check the errors for print_req_error().

3. block_rq_complete only provides dev major and minor to identify
the block device, it is not convenient to use in user-space.

So introduce a new tracepoint block_rq_error just for the error case
and provides the device name for convenience too. With this patch,
rasdaemon could switch to block_rq_error.

Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Reviewed-by: Steven Rostedt <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
The v3 patch was submitted in Feb 2020, and Steven reviewed the patch, but
it was not merged to upstream. See
https://lore.kernel.org/lkml/[email protected]/.

The problems fixed by that patch still exist and we do need it to make
disk error handling in rasdaemon easier. So this resurrected it and
continued the version number.

v4 --> v5:
* Report the actual block layer status code instead of the errno per
Christoph Hellwig.
v3 --> v4:
* Rebased to v5.17-rc1.
* Collected reviewed-by tag from Steven.

block/blk-mq.c | 4 +++-
include/trace/events/block.h | 41 ++++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f3bf3358a3bb..4ca72ea917d4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -789,8 +789,10 @@ bool blk_update_request(struct request *req, blk_status_t error,
#endif

if (unlikely(error && !blk_rq_is_passthrough(req) &&
- !(req->rq_flags & RQF_QUIET)))
+ !(req->rq_flags & RQF_QUIET))) {
+ trace_block_rq_error(req, error, nr_bytes);
blk_print_req_error(req, error);
+ }

blk_account_io_completion(req, nr_bytes);

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 27170e40e8c9..918b190718d5 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -144,6 +144,47 @@ TRACE_EVENT(block_rq_complete,
__entry->nr_sector, __entry->error)
);

+/**
+ * block_rq_error - block IO operation error reported by device driver
+ * @rq: block operations request
+ * @error: status code
+ * @nr_bytes: number of completed bytes
+ *
+ * The block_rq_error tracepoint event indicates that some portion
+ * of operation request has failed as reported by the device driver.
+ */
+TRACE_EVENT(block_rq_error,
+
+ TP_PROTO(struct request *rq, blk_status_t error, unsigned int nr_bytes),
+
+ TP_ARGS(rq, error, nr_bytes),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __string( name, rq->q->disk ? rq->q->disk->disk_name : "?")
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( int, error )
+ __array( char, rwbs, RWBS_LEN )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = rq->q->disk ? disk_devt(rq->q->disk) : 0;
+ __assign_str(name, rq->q->disk ? rq->q->disk->disk_name : "?");
+ __entry->sector = blk_rq_pos(rq);
+ __entry->nr_sector = nr_bytes >> 9;
+ __entry->error = blk_status_to_errno(error);
+
+ blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
+ ),
+
+ TP_printk("%d,%d %s %s %llu + %u [%d]",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __get_str(name), __entry->rwbs,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->error)
+);
+
DECLARE_EVENT_CLASS(block_rq,

TP_PROTO(struct request *rq),
--
2.26.3


2022-01-27 14:16:24

by Chaitanya Kulkarni

[permalink] [raw]
Subject: Re: [v5 PATCH] block: introduce block_rq_error tracepoint

On 1/26/22 10:51 AM, Yang Shi wrote:
> External email: Use caution opening links or attachments
>
>
> Currently, rasdaemon uses the existing tracepoint block_rq_complete
> and filters out non-error cases in order to capture block disk errors.
>
> But there are a few problems with this approach:
>
> 1. Even kernel trace filter could do the filtering work, there is
> still some overhead after we enable this tracepoint.
>
> 2. The filter is merely based on errno, which does not align with kernel
> logic to check the errors for print_req_error().
>
> 3. block_rq_complete only provides dev major and minor to identify
> the block device, it is not convenient to use in user-space.
>
> So introduce a new tracepoint block_rq_error just for the error case
> and provides the device name for convenience too. With this patch,
> rasdaemon could switch to block_rq_error.
>
> Cc: Jens Axboe <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Reviewed-by: Steven Rostedt <[email protected]>
> Signed-off-by: Cong Wang <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> The v3 patch was submitted in Feb 2020, and Steven reviewed the patch, but
> it was not merged to upstream. See
> https://lore.kernel.org/lkml/[email protected]/.
>
> The problems fixed by that patch still exist and we do need it to make
> disk error handling in rasdaemon easier. So this resurrected it and
> continued the version number.
>
> v4 --> v5:
> * Report the actual block layer status code instead of the errno per
> Christoph Hellwig.
> v3 --> v4:
> * Rebased to v5.17-rc1.
> * Collected reviewed-by tag from Steven.
>
> block/blk-mq.c | 4 +++-
> include/trace/events/block.h | 41 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f3bf3358a3bb..4ca72ea917d4 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -789,8 +789,10 @@ bool blk_update_request(struct request *req, blk_status_t error,
> #endif
>
> if (unlikely(error && !blk_rq_is_passthrough(req) &&
> - !(req->rq_flags & RQF_QUIET)))
> + !(req->rq_flags & RQF_QUIET))) {
> + trace_block_rq_error(req, error, nr_bytes);
> blk_print_req_error(req, error);
> + }
>
> blk_account_io_completion(req, nr_bytes);
>
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 27170e40e8c9..918b190718d5 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -144,6 +144,47 @@ TRACE_EVENT(block_rq_complete,
> __entry->nr_sector, __entry->error)
> );
>
> +/**
> + * block_rq_error - block IO operation error reported by device driver
> + * @rq: block operations request
> + * @error: status code
> + * @nr_bytes: number of completed bytes
> + *
> + * The block_rq_error tracepoint event indicates that some portion
> + * of operation request has failed as reported by the device driver.
> + */
> +TRACE_EVENT(block_rq_error,
> +
> + TP_PROTO(struct request *rq, blk_status_t error, unsigned int nr_bytes),
> +
> + TP_ARGS(rq, error, nr_bytes),
> +
> + TP_STRUCT__entry(
> + __field( dev_t, dev )
> + __string( name, rq->q->disk ? rq->q->disk->disk_name : "?")
> + __field( sector_t, sector )
> + __field( unsigned int, nr_sector )
> + __field( int, error )
> + __array( char, rwbs, RWBS_LEN )
> + ),
> +
> + TP_fast_assign(
> + __entry->dev = rq->q->disk ? disk_devt(rq->q->disk) : 0;
> + __assign_str(name, rq->q->disk ? rq->q->disk->disk_name : "?");

Did you verify that rq->q->disk NULL checks are must in this checkout
for blk_update_request() ?

> + __entry->sector = blk_rq_pos(rq);
> + __entry->nr_sector = nr_bytes >> 9;
> + __entry->error = blk_status_to_errno(error);
> +
> + blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
> + ),
> +
> + TP_printk("%d,%d %s %s %llu + %u [%d]",
> + MAJOR(__entry->dev), MINOR(__entry->dev),
> + __get_str(name), __entry->rwbs,

Since printing major number and minor number is sufficient to identify
the block device. We don't have a notion of printing disk_name for the
block tracepoints what makes this tracepoint special ?

> + (unsigned long long)__entry->sector,
> + __entry->nr_sector, __entry->error)
> +);
> +
> DECLARE_EVENT_CLASS(block_rq,
>
> TP_PROTO(struct request *rq),
> --
> 2.26.3
>

2022-01-27 17:14:39

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [v5 PATCH] block: introduce block_rq_error tracepoint

On Thu, Jan 27, 2022 at 08:02:26AM +0000, Chaitanya Kulkarni wrote:
> > + __field( unsigned int, nr_sector )
> > + __field( int, error )
> > + __array( char, rwbs, RWBS_LEN )
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->dev = rq->q->disk ? disk_devt(rq->q->disk) : 0;
> > + __assign_str(name, rq->q->disk ? rq->q->disk->disk_name : "?");
>
> Did you verify that rq->q->disk NULL checks are must in this checkout
> for blk_update_request() ?

Yes, we need this.

2022-01-27 17:15:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [v5 PATCH] block: introduce block_rq_error tracepoint

On Wed, Jan 26, 2022 at 10:51:53AM -0800, Yang Shi wrote:
> + __entry->dev = rq->q->disk ? disk_devt(rq->q->disk) : 0;
> + __assign_str(name, rq->q->disk ? rq->q->disk->disk_name : "?");

None f the other tracepoints has the disk name, why does this one need
it? And if you add it please avoid the overly long line.

> + __entry->sector = blk_rq_pos(rq);
> + __entry->nr_sector = nr_bytes >> 9;
> + __entry->error = blk_status_to_errno(error);

This still converts the block status to an errno.

2022-01-28 15:29:48

by Yang Shi

[permalink] [raw]
Subject: Re: [v5 PATCH] block: introduce block_rq_error tracepoint

On Thu, Jan 27, 2022 at 1:53 AM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Jan 26, 2022 at 10:51:53AM -0800, Yang Shi wrote:
> > + __entry->dev = rq->q->disk ? disk_devt(rq->q->disk) : 0;
> > + __assign_str(name, rq->q->disk ? rq->q->disk->disk_name : "?");
>
> None f the other tracepoints has the disk name, why does this one need
> it? And if you add it please avoid the overly long line.

I guess the disk name was added to ease some handling in userspace
tools. But if all other tracepoints don't have disk name shown, I
think I'd better follow the convention. I did overlook this when I
ported this patch. Will remove it.

>
> > + __entry->sector = blk_rq_pos(rq);
> > + __entry->nr_sector = nr_bytes >> 9;
> > + __entry->error = blk_status_to_errno(error);
>
> This still converts the block status to an errno.

I may misunderstand your comments. I just followed what
block_rq_complete tracepoint does. Or the linux-block community is
converting all tracepoints to show blk status code instead of
conventional errno?

And the userspace tool doesn't know blk status code and still expects
the conventional errno. For example, rasdaemon reads block_rq_complete
events now and have the below:

static const struct {
int error;
const char *name;
} blk_errors[] = {
{ -EOPNOTSUPP, "operation not supported error" },
{ -ETIMEDOUT, "timeout error" },
{ -ENOSPC, "critical space allocation error" },
{ -ENOLINK, "recoverable transport error" },
{ -EREMOTEIO, "critical target error" },
{ -EBADE, "critical nexus error" },
{ -ENODATA, "critical medium error" },
{ -EILSEQ, "protection error" },
{ -ENOMEM, "kernel resource error" },
{ -EBUSY, "device resource error" },
{ -EAGAIN, "nonblocking retry error" },
{ -EREMCHG, "dm internal retry error" },
{ -EIO, "I/O error" },
};

This patch aims to add block_rq_err tracepoint to replace
block_rq_complete in rasdaemon.

2022-01-28 15:29:59

by Yang Shi

[permalink] [raw]
Subject: Re: [v5 PATCH] block: introduce block_rq_error tracepoint

On Thu, Jan 27, 2022 at 12:02 AM Chaitanya Kulkarni
<[email protected]> wrote:
>
> On 1/26/22 10:51 AM, Yang Shi wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > Currently, rasdaemon uses the existing tracepoint block_rq_complete
> > and filters out non-error cases in order to capture block disk errors.
> >
> > But there are a few problems with this approach:
> >
> > 1. Even kernel trace filter could do the filtering work, there is
> > still some overhead after we enable this tracepoint.
> >
> > 2. The filter is merely based on errno, which does not align with kernel
> > logic to check the errors for print_req_error().
> >
> > 3. block_rq_complete only provides dev major and minor to identify
> > the block device, it is not convenient to use in user-space.
> >
> > So introduce a new tracepoint block_rq_error just for the error case
> > and provides the device name for convenience too. With this patch,
> > rasdaemon could switch to block_rq_error.
> >
> > Cc: Jens Axboe <[email protected]>
> > Cc: Christoph Hellwig <[email protected]>
> > Reviewed-by: Steven Rostedt <[email protected]>
> > Signed-off-by: Cong Wang <[email protected]>
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > The v3 patch was submitted in Feb 2020, and Steven reviewed the patch, but
> > it was not merged to upstream. See
> > https://lore.kernel.org/lkml/[email protected]/.
> >
> > The problems fixed by that patch still exist and we do need it to make
> > disk error handling in rasdaemon easier. So this resurrected it and
> > continued the version number.
> >
> > v4 --> v5:
> > * Report the actual block layer status code instead of the errno per
> > Christoph Hellwig.
> > v3 --> v4:
> > * Rebased to v5.17-rc1.
> > * Collected reviewed-by tag from Steven.
> >
> > block/blk-mq.c | 4 +++-
> > include/trace/events/block.h | 41 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 44 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f3bf3358a3bb..4ca72ea917d4 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -789,8 +789,10 @@ bool blk_update_request(struct request *req, blk_status_t error,
> > #endif
> >
> > if (unlikely(error && !blk_rq_is_passthrough(req) &&
> > - !(req->rq_flags & RQF_QUIET)))
> > + !(req->rq_flags & RQF_QUIET))) {
> > + trace_block_rq_error(req, error, nr_bytes);
> > blk_print_req_error(req, error);
> > + }
> >
> > blk_account_io_completion(req, nr_bytes);
> >
> > diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> > index 27170e40e8c9..918b190718d5 100644
> > --- a/include/trace/events/block.h
> > +++ b/include/trace/events/block.h
> > @@ -144,6 +144,47 @@ TRACE_EVENT(block_rq_complete,
> > __entry->nr_sector, __entry->error)
> > );
> >
> > +/**
> > + * block_rq_error - block IO operation error reported by device driver
> > + * @rq: block operations request
> > + * @error: status code
> > + * @nr_bytes: number of completed bytes
> > + *
> > + * The block_rq_error tracepoint event indicates that some portion
> > + * of operation request has failed as reported by the device driver.
> > + */
> > +TRACE_EVENT(block_rq_error,
> > +
> > + TP_PROTO(struct request *rq, blk_status_t error, unsigned int nr_bytes),
> > +
> > + TP_ARGS(rq, error, nr_bytes),
> > +
> > + TP_STRUCT__entry(
> > + __field( dev_t, dev )
> > + __string( name, rq->q->disk ? rq->q->disk->disk_name : "?")
> > + __field( sector_t, sector )
> > + __field( unsigned int, nr_sector )
> > + __field( int, error )
> > + __array( char, rwbs, RWBS_LEN )
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->dev = rq->q->disk ? disk_devt(rq->q->disk) : 0;
> > + __assign_str(name, rq->q->disk ? rq->q->disk->disk_name : "?");
>
> Did you verify that rq->q->disk NULL checks are must in this checkout
> for blk_update_request() ?
>
> > + __entry->sector = blk_rq_pos(rq);
> > + __entry->nr_sector = nr_bytes >> 9;
> > + __entry->error = blk_status_to_errno(error);
> > +
> > + blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
> > + ),
> > +
> > + TP_printk("%d,%d %s %s %llu + %u [%d]",
> > + MAJOR(__entry->dev), MINOR(__entry->dev),
> > + __get_str(name), __entry->rwbs,
>
> Since printing major number and minor number is sufficient to identify
> the block device. We don't have a notion of printing disk_name for the
> block tracepoints what makes this tracepoint special ?

Thanks for the comments, will remove it to follow the convention.

>
> > + (unsigned long long)__entry->sector,
> > + __entry->nr_sector, __entry->error)
> > +);
> > +
> > DECLARE_EVENT_CLASS(block_rq,
> >
> > TP_PROTO(struct request *rq),
> > --
> > 2.26.3
> >
>

2022-02-03 14:15:01

by Yang Shi

[permalink] [raw]
Subject: Re: [v5 PATCH] block: introduce block_rq_error tracepoint

On Thu, Jan 27, 2022 at 10:18 AM Yang Shi <[email protected]> wrote:
>
> On Thu, Jan 27, 2022 at 1:53 AM Christoph Hellwig <[email protected]> wrote:
> >
> > On Wed, Jan 26, 2022 at 10:51:53AM -0800, Yang Shi wrote:
> > > + __entry->dev = rq->q->disk ? disk_devt(rq->q->disk) : 0;
> > > + __assign_str(name, rq->q->disk ? rq->q->disk->disk_name : "?");
> >
> > None f the other tracepoints has the disk name, why does this one need
> > it? And if you add it please avoid the overly long line.
>
> I guess the disk name was added to ease some handling in userspace
> tools. But if all other tracepoints don't have disk name shown, I
> think I'd better follow the convention. I did overlook this when I
> ported this patch. Will remove it.
>
> >
> > > + __entry->sector = blk_rq_pos(rq);
> > > + __entry->nr_sector = nr_bytes >> 9;
> > > + __entry->error = blk_status_to_errno(error);
> >
> > This still converts the block status to an errno.
>
> I may misunderstand your comments. I just followed what
> block_rq_complete tracepoint does. Or the linux-block community is
> converting all tracepoints to show blk status code instead of
> conventional errno?
>
> And the userspace tool doesn't know blk status code and still expects
> the conventional errno. For example, rasdaemon reads block_rq_complete
> events now and have the below:
>
> static const struct {
> int error;
> const char *name;
> } blk_errors[] = {
> { -EOPNOTSUPP, "operation not supported error" },
> { -ETIMEDOUT, "timeout error" },
> { -ENOSPC, "critical space allocation error" },
> { -ENOLINK, "recoverable transport error" },
> { -EREMOTEIO, "critical target error" },
> { -EBADE, "critical nexus error" },
> { -ENODATA, "critical medium error" },
> { -EILSEQ, "protection error" },
> { -ENOMEM, "kernel resource error" },
> { -EBUSY, "device resource error" },
> { -EAGAIN, "nonblocking retry error" },
> { -EREMCHG, "dm internal retry error" },
> { -EIO, "I/O error" },
> };

Gently ping. Should I print blk status code instead of errno for this
trace point? But I really don't get why. And block_rq_complete
tracepoint does:

TP_fast_assign(
__entry->dev = rq->q->disk ? disk_devt(rq->q->disk) : 0;
__entry->sector = blk_rq_pos(rq);
__entry->nr_sector = nr_bytes >> 9;
__entry->error = blk_status_to_errno(error); <===
convert blk status code to errno

blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
__get_str(cmd)[0] = '\0';
),

>
> This patch aims to add block_rq_err tracepoint to replace
> block_rq_complete in rasdaemon.