Hi Christoph, Hi Keith,

I found a regression when recovering NVMe devices after a simulated PCI
error on s390, though I believe at least some POWER systems are affected
as well. I tracked this down to commit b98235d3a471 ("nvme-pci: harden
drive presence detect in nvme_dev_disable()"), which causes
nvme_start_freeze() not to be called before nvme_reset_work() calls
nvme_wait_freeze(), so the latter hangs forever. The detailed analysis is
in the commit message and is not too complex, but I'm not entirely sure
my proposed solution is the correct one.

The patch I'm sending here works for me and should, at the very least,
only affect platforms that use the explicit
driver->err_handler->slot_reset callback. As far as I understand, the
nvme_dev_disable() call in nvme_error_detected() still does the necessary
quiescing towards the upper layers, and I assume that nvme_start_freeze()
won't do anything useful while the controller is inaccessible, but I'm
not an expert in this. In particular, I'm not sure it makes sense to
start freezing the queues right after a reset.

Also note that I will be travelling for about 3 weeks starting July 14th
and won't have access to s390 machines or my work mail address, so
apologies in advance if I don't answer. Feel free to come up with your
own fix. Matt (on CC) might also be able to test fixes for this.

Best regards,
Niklas

Niklas Schnelle (1):
  nvme-pci: fix hang during error recovery when the PCI device is
    isolated

 drivers/nvme/host/pci.c | 1 +
 1 file changed, 1 insertion(+)

-- 
2.34.1