2017-03-01 21:52:12

by Bjorn Helgaas

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

On Sat, Feb 25, 2017 at 11:53:13PM -0700, Logan Gunthorpe wrote:
> Changes since v4:
>
> * Turns out pushing the pci release code into the device release
> function didn't work as I would have liked. If you try to unbind the
> device with an instance open, then you hit a kernel bug at
> drivers/pci/msi.c:371. (This didn't occur in v3.)

This is in free_msi_irqs():

368     for_each_pci_msi_entry(entry, dev)
369             if (entry->irq)
370                     for (i = 0; i < entry->nvec_used; i++)
371                             BUG_ON(irq_has_action(entry->irq + i));

I don't think this is indicating a bug in the PCI core (although I do
think a BUG_ON() here is an excessive response). I think it's an
indication that the driver didn't disconnect its ISR. Without more
details of the failure it's hard to tell if the BUG_ON is a symptom of
a problem in the driver or what.

An "alive" flag feels racy, and I can't tell if it's really the best
way to deal with this, or if it's just avoiding the issue. There must
be other drivers with the same cleanup issue -- do they handle it the
same way?

> To solve this, we've moved the pci release code back into the
> unregister function and reintroduced an alive flag. This time,
> however, the alive flag is protected by mrpc_mutex and we're very
> careful about what happens to devices still in use (they should
> all be released through the timeout path and an ENODEV error
> returned to userspace; while new commands are blocked with the
> same error).


2017-03-01 22:26:00

by Logan Gunthorpe

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver



On 01/03/17 02:41 PM, Bjorn Helgaas wrote:
> I don't think this is indicating a bug in the PCI core (although I do
> think a BUG_ON() here is an excessive response). I think it's an
> indication that the driver didn't disconnect its ISR. Without more
> details of the failure it's hard to tell if the BUG_ON is a symptom of
> a problem in the driver or what.

Yes, my assumption was that when you force an unbind on the PCI core,
it's designed to stop using the PCI device right away even if there are
users of it. Thus it becomes the driver's responsibility to handle
this situation.

> An "alive" flag feels racy, and I can't tell if it's really the best
> way to deal with this, or if it's just avoiding the issue. There must
> be other drivers with the same cleanup issue -- do they handle it the
> same way?

I haven't done a comprehensive search, but it's very common for people
to use (and this is what I've adopted again in v5):

devm_request_irq(&pdev->dev, ...)

In this way, the IRQs are released with the pci_dev (or often platform)
and thus the BUG_ON never hits. However, it means any user space program
waiting on an IRQ (like via a cdev call) will hang unless handled with
other means. Exactly what those means are seems driver specific and not
always obvious. I wouldn't be surprised if a lot of drivers get this
aspect wrong.

A couple examples I've looked at:

1) drivers/dax/dax.c uses an alive flag without any mutexes, atomics or
anything. So I don't know if it's racy or perhaps correct for other reasons.

2) drivers/char/hw_random has a drop_current_rng that looks like it
could easily be racy with the get_current_rng in the userspace flow.

3) A couple of drivers in drivers/char/tpm don't seem to have any
protection at all and appear as though they would continue to use io
operations even after the regions may get unmapped, because the char
device persists.

So I'm not sure where you'd find a driver that does it correctly and in
a simpler way.

Another thing: based on comments in [1], a lot of people don't seem to
realize that cdev instances can persist long after cdev_del so it's
probably very common for drivers to get this wrong.

Logan

[1] https://lists.01.org/pipermail/linux-nvdimm/2017-February/009001.html



>> To solve this, we've moved the pci release code back into the
>> unregister function and reintroduced an alive flag. This time,
>> however, the alive flag is protected by mrpc_mutex and we're very
>> careful about what happens to devices still in use (they should
>> all be released through the timeout path and an ENODEV error
>> returned to userspace; while new commands are blocked with the
>> same error).

2017-03-01 22:33:27

by Keith Busch

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

On Wed, Mar 01, 2017 at 03:41:20PM -0600, Bjorn Helgaas wrote:
> On Sat, Feb 25, 2017 at 11:53:13PM -0700, Logan Gunthorpe wrote:
> > Changes since v4:
> >
> > * Turns out pushing the pci release code into the device release
> > function didn't work as I would have liked. If you try to unbind the
> > device with an instance open, then you hit a kernel bug at
> > drivers/pci/msi.c:371. (This didn't occur in v3.)
>
> This is in free_msi_irqs():
>
> 368     for_each_pci_msi_entry(entry, dev)
> 369             if (entry->irq)
> 370                     for (i = 0; i < entry->nvec_used; i++)
> 371                             BUG_ON(irq_has_action(entry->irq + i));
>
> I don't think this is indicating a bug in the PCI core (although I do
> think a BUG_ON() here is an excessive response). I think it's an
> indication that the driver didn't disconnect its ISR. Without more
> details of the failure it's hard to tell if the BUG_ON is a symptom of
> a problem in the driver or what.
>
> An "alive" flag feels racy, and I can't tell if it's really the best
> way to deal with this, or if it's just avoiding the issue. There must
> be other drivers with the same cleanup issue -- do they handle it the
> same way?

I think this is from using the managed device resource API to request the
irq actions. The scope of the resource used to be tied to the pci_dev's
dev, but now it's the new switchtec class dev, which has a different
lifetime while open references exist, so it's not releasing the irq's.

One thing about the BUG_ON that is confusing me is how it's getting
to the BUG in free_msi_irqs in v4 or v5. I don't see any part releasing the
allocated ones. Maybe the devres API is harder to use than having the
driver manage all the resources...

2017-03-01 22:53:01

by Logan Gunthorpe

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

Hey,

Seems to me like an elegant solution would be to implement a 'cdev_kill'
function which could kill all the processes using a cdev. Thus, during
an unbind, a driver could call it and be sure that there are no users
left and it can safely allow the devres unwind to continue. Then no
difficult and racy 'alive' flags would be necessary and it would be much
easier on drivers.

However, I don't think any such thing exists at the moment and it's not
likely to be done in the near term. I'm reasonably confident in the
correctness of v5 of my driver (especially when compared to other
drivers) and unless someone can describe how it's wrong or suggest a
better solution, I'd rather see it merged as is. If and when a better approach
arrives I'd happily patch it to improve the situation.

Logan

On 01/03/17 03:24 PM, Logan Gunthorpe wrote:
>
>
> On 01/03/17 02:41 PM, Bjorn Helgaas wrote:
>> I don't think this is indicating a bug in the PCI core (although I do
>> think a BUG_ON() here is an excessive response). I think it's an
>> indication that the driver didn't disconnect its ISR. Without more
>> details of the failure it's hard to tell if the BUG_ON is a symptom of
>> a problem in the driver or what.
>
> Yes, my assumption was that when you force an unbind on the PCI core,
> it's designed to stop using the PCI device right away even if there are
> users of it. Thus it becomes the driver's responsibility to handle
> this situation.
>
>> An "alive" flag feels racy, and I can't tell if it's really the best
>> way to deal with this, or if it's just avoiding the issue. There must
>> be other drivers with the same cleanup issue -- do they handle it the
>> same way?
>
> I haven't done a comprehensive search, but it's very common for people
> to use (and this is what I've adopted again in v5):
>
> devm_request_irq(&pdev->dev, ...)
>
> In this way, the IRQs are released with the pci_dev (or often platform)
> and thus the BUG_ON never hits. However, it means any user space program
> waiting on an IRQ (like via a cdev call) will hang unless handled with
> other means. Exactly what those means are seems driver specific and not
> always obvious. I wouldn't be surprised if a lot of drivers get this
> aspect wrong.
>
> A couple examples I've looked at:
>
> 1) drivers/dax/dax.c uses an alive flag without any mutexes, atomics or
> anything. So I don't know if it's racy or perhaps correct for other reasons.
>
> 2) drivers/char/hw_random has a drop_current_rng that looks like it
> could easily be racy with the get_current_rng in the userspace flow.
>
> 3) A couple of drivers in drivers/char/tpm don't seem to have any
> protection at all and appear as though they would continue to use io
> operations even after the regions may get unmapped, because the char
> device persists.
>
> So I'm not sure where you'd find a driver that does it correctly and in
> a simpler way.
>
> Another thing: based on comments in [1], a lot of people don't seem to
> realize that cdev instances can persist long after cdev_del so it's
> probably very common for drivers to get this wrong.
>
> Logan
>
> [1] https://lists.01.org/pipermail/linux-nvdimm/2017-February/009001.html


2017-03-01 22:56:22

by Keith Busch

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

On Wed, Mar 01, 2017 at 03:37:03PM -0700, Logan Gunthorpe wrote:
> On 01/03/17 03:26 PM, Keith Busch wrote:
> > I think this is from using the managed device resource API to request the
> > irq actions. The scope of the resource used to be tied to the pci_dev's
> > dev, but now it's the new switchtec class dev, which has a different
> > lifetime while open references exist, so it's not releasing the irq's.
>
> The scope of the IRQ was originally tied to the pci_dev. Then in v4 I
> tied it to the switchtec device in order to try and keep using the pci
> device after unbind. This didn't work, so I switched it back to using
> the pci_dev. (This seems to be the way most drivers work anyway.)

Okay, I see. Was mistakenly looking at v4. The v5 looks right.

2017-03-01 23:11:47

by Logan Gunthorpe

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver



On 01/03/17 03:26 PM, Keith Busch wrote:
> I think this is from using the managed device resource API to request the
> irq actions. The scope of the resource used to be tied to the pci_dev's
> dev, but now it's the new switchtec class dev, which has a different
> lifetime while open references exist, so it's not releasing the irq's.

The scope of the IRQ was originally tied to the pci_dev. Then in v4 I
tied it to the switchtec device in order to try and keep using the pci
device after unbind. This didn't work, so I switched it back to using
the pci_dev. (This seems to be the way most drivers work anyway.)


> One thing about the BUG_ON that is confusing me is how it's getting
> to free_msi_irq's BUG in v4 or v5. I don't see any part releasing the
> allocated ones. Maybe the devres API is harder to use than having the
> driver manage all the resources...

free_msi_irqs seems to be called via pci_disable_device in pcim_release,
which devres will call during release of the PCI device and before all
the references to the pci_dev are freed (I tried adding an extra
get_device which gets put in the child device's release -- this didn't work):

[ 1079.845616] Call Trace:
[ 1079.845652] ? pcim_release+0x35/0x96
[ 1079.845691] ? release_nodes+0x15b/0x17c
[ 1079.845730] ? device_release_driver_internal+0x12d/0x1cb
[ 1079.845771] ? unbind_store+0x59/0x89
[ 1079.845809] ? kernfs_fop_write+0xe7/0x129
[ 1079.845847] ? __vfs_write+0x1c/0xa2
[ 1079.845885] ? kmem_cache_alloc+0xc5/0x131
[ 1079.845923] ? fput+0xd/0x7d
[ 1079.845958] ? filp_close+0x5a/0x61
[ 1079.845993] ? vfs_write+0xa2/0xe4
[ 1079.846028] ? SyS_write+0x48/0x73
[ 1079.846066] ? entry_SYSCALL_64_fastpath+0x13/0x94

v5 is correct because it registers the irqs against the pci_dev (with
devm_request_irq) and thus they get freed in time as part of the devres
unwind.

Logan


2017-03-01 23:13:17

by Logan Gunthorpe

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver



On 01/03/17 03:59 PM, Keith Busch wrote:
> Okay, I see. Was mistakenly looking at v4. The v5 looks right.

Great! Thanks for reviewing that for me.

Logan

2017-03-02 00:40:03

by Jason Gunthorpe

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

On Wed, Mar 01, 2017 at 03:49:04PM -0700, Logan Gunthorpe wrote:

> Seems to me like an elegant solution would be to implement a 'cdev_kill'
> function which could kill all the processes using a cdev. Thus, during
> an unbind, a driver could call it and be sure that there are no users
> left and it can safely allow the devres unwind to continue. Then no
> difficult and racy 'alive' flags would be necessary and it would be much
> easier on drivers.

That could help, but this would mean cdev would have to insert a shim
to grab locks around the various file ops.

> > 3) A couple of drivers in drivers/char/tpm don't seem to have any
> > protection at all and appear as though they would continue to use io
> > operations even after the regions may get unmapped, because the char
> > device persists.

AFAIK TPM is correct and has been robustly tested now. We have a 'vtpm'
driver that aggressively uses hot-unplug.

Firstly, TPM correctly tracks krefs on the 'chip', so the pointer is
always valid.

Next, anytime someone has a 'chip' pointer and wants to execute a TPM
operation they have to call tpm_try_get_ops and hold that lock for the
duration of their usage. Internally it does this:

        down_read(&chip->ops_sem);
        if (!chip->ops)
                goto out_lock; // FAILURE

In the cdev fops case this will cause the fop to return EPIPE if it
fails. cdev holds this read side lock while it is running TPM commands
inside the driver.

This meshes with this code called by tpm_chip_unregister():

        down_write(&chip->ops_sem);
        chip->ops = NULL;
        up_write(&chip->ops_sem);

tpm_chip_unregister uses this to provide a strong fence guarantee to
the driver. After tpm_chip_unregister returns:
- Nothing is running in a TPM ops callback currently
- Nothing will ever call a TPM ops callback in future

This allows the TPM drivers to be trivial: call tpm_chip_unregister(),
kill the IRQ, unmap the io resource, and unload the module code.

TPM drivers cannot associate HW resources with the 'chip' struct device,
as it is unwound (and its devm actions triggered) *during*
tpm_chip_unregister and so does not enjoy the strong fence guarantee.

infiniband has a similar 'strong fence' unregister function. I think
it is best-practice for subsystem design to provide that because
drivers will never get it right on their own :(

Both infiniband and TPM have the 'detach' flavour of unplug where the
user is not forced to close their open cdevs. For simpler cases you
can just wait for all cdevs to close with a simple rwsem, much like
sysfs does already during device_del.

Jason

2017-03-02 00:52:49

by Jason Gunthorpe

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver

On Wed, Mar 01, 2017 at 05:23:38PM -0700, Logan Gunthorpe wrote:

> > That could help, but this would mean cdev would have to insert a shim
> > to grab locks around the various file ops.
>
> Hmm, I was hoping something more along the lines of actually killing the
> processes instead of just shimming away fops.

That would probably make most cdev users unhappy; it is not what we
want in tpm or infiniband, for instance.

> > AFAIK TPM is correct and has been robustly tested now. We have a 'vtpm'
> > driver that aggressively uses hot-unplug.
>
> Switchtec is a bit more tricky because a) there's no upper level driver
> to handle things

Introducing a light split between 'the upper part that owns the cdev'
and 'the lower part that owns the hardware' makes things much easier
to understand in a driver, and it becomes clearer where, e.g., devm
actions should be linked (i.e. probably not to the cdev part).

> and b) userspace may be inside a wait_for_completion (via read or
> poll) that needs to be completed. If a so called 'cdev_kill' could
> actually just kill these processes it would be a bit easier.

For TPM, poll could be something like:

static unsigned int tpm_poll(struct file *filp,
                             struct poll_table_struct *wait)
{
        struct tpm_chip *chip = filp->private_data; /* assumed */
        unsigned int mask = 0;

        poll_wait(filp, &chip->poll_wait, wait);
        if (tpm_try_get_ops(chip)) {
                mask = chip->ops->driver_do_poll(...);
                tpm_put_ops(chip);
        } else
                mask = POLLIN | POLLRDHUP | POLLOUT | POLLERR | POLLHUP;
        return mask;
}

And we would trigger chip->poll_wait in the unregister.

wait_for_completion is similar: drop the rwsem while sleeping, add
'ops = NULL' to the sleeping condition test, trigger the wait on
unregister, then reacquire the rwsem and test ops on wake.

Jason

2017-03-02 01:11:31

by Logan Gunthorpe

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver



On 01/03/17 05:23 PM, Logan Gunthorpe wrote:
> Because of how poll works, I don't see how I can just hold a semaphore
> inside every fops call like the tpm code does.

On 01/03/17 04:58 PM, Jason Gunthorpe wrote:
> Both infiniband and TPM have the 'detach' flavour of unplug where the
> user is not forced to close their open cdevs. For simpler cases you
> can just wait for all cdevs to close with a simple rwsem, much like
> sysfs does already during device_del.

Though, after reading your email again, a hard fence on close sounds
like a good idea. Tomorrow I'll post a v6 that uses that approach.


Thanks,

Logan

2017-03-02 01:11:13

by Logan Gunthorpe

Subject: Re: [PATCH v5 0/4] New Microsemi PCI Switch Management Driver



On 01/03/17 04:58 PM, Jason Gunthorpe wrote:
> On Wed, Mar 01, 2017 at 03:49:04PM -0700, Logan Gunthorpe wrote:
>
>> Seems to me like an elegant solution would be to implement a 'cdev_kill'
>> function which could kill all the processes using a cdev. Thus, during
>> an unbind, a driver could call it and be sure that there are no users
>> left and it can safely allow the devres unwind to continue. Then no
>> difficult and racy 'alive' flags would be necessary and it would be much
>> easier on drivers.
>
> That could help, but this would mean cdev would have to insert a shim
> to grab locks around the various file ops.

Hmm, I was hoping something more along the lines of actually killing the
processes instead of just shimming away fops.

> AFAIK TPM is correct and has been robustly tested now. We have a 'vtpm'
> driver that aggressively uses hot-unplug.

Ah, thanks for the explanation of how that works. I didn't notice the
semaphore usage.

Switchtec is a bit more tricky because a) there's no upper level driver
to handle things and b) userspace may be inside a wait_for_completion
(via read or poll) that needs to be completed. If a so-called
'cdev_kill' could actually just kill these processes it would be a bit
easier.

Currently, in the Switchtec code, there's a timeout if the interrupt
doesn't fire (which occurs if the pci device has been torn down) and the
code will check an alive bit (under mutex protection) and error out if
it's not alive.

Because of how poll works, I don't see how I can just hold a semaphore
inside every fops call like the tpm code does.

Logan