From: Keith Busch <[email protected]> Sent: Friday, June 3, 2022 12:23 PM
>
> On Fri, Jun 03, 2022 at 10:56:01AM -0700, Michael Kelley wrote:
>
> This series looks good to me. Just one concern below that may amount to
> nothing.
>
> > +static void nvme_handle_aer_persistent_error(struct nvme_ctrl *ctrl)
> > +{
> > + u32 csts;
> > +
> > + trace_nvme_async_event(ctrl, NVME_AER_ERROR);
> > +
> > + if (ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts) != 0 ||
>
> The reg_read32() is non-blocking for pcie, so this is safe to call from that
> driver's irq handler. The other transports block on register reads, though, so
> they can't call this from an atomic context. The TCP context looks safe, but
> I'm not sure about RDMA or FC.
Good point. But even if the RDMA and FC contexts are safe, if a
persistent error is reported, the controller is already in trouble and
may not respond to a request to retrieve the CSTS anyway. Perhaps
we should just trust the AER error report and not bother checking
CSTS to decide whether to do the reset. We can still check ctrl->state
and skip the reset if there's already one in progress.
>
> > + nvme_should_reset(ctrl, csts)) {
> > + dev_warn(ctrl->device, "resetting controller due to AER\n");
> > + nvme_reset_ctrl(ctrl);
> > + }
> > +}
> > +
> > void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
> > volatile union nvme_result *res)
> > {
> > u32 result = le32_to_cpu(res->u32);
> > u32 aer_type = result & 0x07;
> > + u32 aer_subtype = (result & 0xff00) >> 8;
>
> Since the above mask + shift is duplicated with nvme_handle_aen_notice(), an
> inline helper function seems reasonable.
Yep. Will do in v3.
Michael
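
For reference, the kind of inline helper suggested above could look roughly
like the sketch below. The names aer_type() and aer_subtype() are
hypothetical (the thread does not say what v3 will call them), and the code
uses plain userspace types instead of the kernel's u32 so it can be compiled
on its own. The masks and the shift are the ones from the quoted patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names; the thread does not name the v3 helpers. */

/* Low three bits of the AER result: the event type (mask from the patch). */
static inline uint32_t aer_type(uint32_t result)
{
	return result & 0x07;
}

/* Bits 15:8 of the AER result: the event subtype (mask/shift from the patch). */
static inline uint32_t aer_subtype(uint32_t result)
{
	return (result & 0xff00) >> 8;
}

/* Small self-check: decode an example result value with both helpers. */
void aer_decode_example(void)
{
	uint32_t result = 0x00000300;	/* example value: type 0, subtype 3 */

	assert(aer_type(result) == 0);
	assert(aer_subtype(result) == 0x03);
}
```

With helpers like these, both nvme_complete_async_event() and
nvme_handle_aen_notice() could decode the result without repeating the mask
and shift.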
On Sat, Jun 04, 2022 at 02:28:11PM +0000, Michael Kelley (LINUX) wrote:
> > driver's irq handler. The other transports block on register reads, though, so
> > they can't call this from an atomic context. The TCP context looks safe, but
> > I'm not sure about RDMA or FC.
>
> Good point. But even if the RDMA and FC contexts are safe,
For RDMA this is typically called from softirq context, so it is indeed
not safe.
> if a
> persistent error is reported, the controller is already in trouble and
> may not respond to a request to retrieve the CSTS anyway. Perhaps
> we should just trust the AER error report and not bother checking
> CSTS to decide whether to do the reset. We can still check ctrl->state
> and skip the reset if there's already one in progress.
Yes, that might be a better option.
On Sat, Jun 04, 2022 at 02:28:11PM +0000, Michael Kelley (LINUX) wrote:
> Good point. But even if the RDMA and FC contexts are safe, if a
> persistent error is reported, the controller is already in trouble and
> may not respond to a request to retrieve the CSTS anyway. Perhaps
> we should just trust the AER error report and not bother checking
> CSTS to decide whether to do the reset. We can still check ctrl->state
> and skip the reset if there's already one in progress.
That sounds good to me. Christoph noted RDMA isn't safe to do this in the
callback anyway, and it's probably a bad idea in general to dispatch new
requests within another's completion: that may prevent reclaiming the only
available tag, and then deadlock.

So with that in mind, this AER persistent error handler could call
nvme_should_reset() with NVME_CSTS_CFS as a constant value for the csts
parameter.
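
A rough userspace sketch of that idea follows. Everything in it is a
stand-in, not the kernel's real definitions: struct ctrl, the state enum,
reset_ctrl() and should_reset() are simplified models of struct nvme_ctrl,
ctrl->state, nvme_reset_ctrl() and nvme_should_reset(), and CSTS_CFS assumes
the Controller Fatal Status bit is bit 1 of CSTS. The point is only that the
handler issues no register read and still skips the reset when one is
already in progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins; none of these are the kernel's real definitions. */
#define CSTS_CFS (1u << 1)	/* assumed: Controller Fatal Status bit in CSTS */

enum ctrl_state { CTRL_LIVE, CTRL_RESETTING, CTRL_CONNECTING };

struct ctrl {
	enum ctrl_state state;
	int resets_scheduled;	/* counts calls to reset_ctrl() */
};

/* Stand-in for nvme_reset_ctrl(): only records that a reset was requested. */
static void reset_ctrl(struct ctrl *c)
{
	c->state = CTRL_RESETTING;
	c->resets_scheduled++;
}

/*
 * Simplified stand-in for nvme_should_reset(): no reset while one is
 * already under way, otherwise reset only if CSTS reports a fatal status.
 */
static bool should_reset(const struct ctrl *c, uint32_t csts)
{
	if (c->state == CTRL_RESETTING || c->state == CTRL_CONNECTING)
		return false;
	return (csts & CSTS_CFS) != 0;
}

/* Handler that trusts the AER: no register read, CFS passed as a constant. */
void handle_aer_persistent_error(struct ctrl *c)
{
	if (should_reset(c, CSTS_CFS))
		reset_ctrl(c);
}

/* Small self-check: one reset from the live state, none while resetting. */
void aer_handler_example(void)
{
	struct ctrl c = { CTRL_LIVE, 0 };

	handle_aer_persistent_error(&c);
	assert(c.resets_scheduled == 1);

	handle_aer_persistent_error(&c);	/* already resetting: skipped */
	assert(c.resets_scheduled == 1);
}
```

Because the csts argument is a constant, the handler never dispatches a new
request from the completion path, so it avoids both the atomic-context
problem and the tag deadlock described above.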