2019-08-26 06:57:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace

On Thu, Aug 22, 2019 at 12:10:23PM -0700, Sagi Grimberg wrote:
>> You are correct that this information can be derived from sysfs, but the
>> main reason why we add these here, is because in udev rule we can't
>> just go ahead and start looking these up and parsing these..
>>
>> We could send the discovery aen with NVME_CTRL_NAME and have
>> then have systemd run something like:
>>
>> nvme connect-all -d nvme0 --sysfs
>>
>> and have nvme-cli retrieve all this stuff from sysfs?
>
> Actually that may be a problem.
>
> There could be a hypothetical case where after the event was fired
> and before it was handled, the discovery controller went away and
> came back again with a different controller instance, and the old
> instance is now a different discovery controller.
>
> This is why we need this information in the event. And we verify this
> information in sysfs in nvme-cli.

Well, that must be a usual issue with uevents, right? Don't we usually
have a increasing serial number for that or something?

If I look at other callers of kobject_uevent_env none throws in such
a huge context.

>
> Makes sense?
---end quoted text---


2019-08-26 08:00:48

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace

On Mon, Aug 26, 2019 at 08:56:39AM +0200, Christoph Hellwig wrote:
> On Thu, Aug 22, 2019 at 12:10:23PM -0700, Sagi Grimberg wrote:
> >> You are correct that this information can be derived from sysfs, but the
> >> main reason why we add these here, is because in udev rule we can't
> >> just go ahead and start looking these up and parsing these..
> >>
> >> We could send the discovery aen with NVME_CTRL_NAME and have
> >> then have systemd run something like:
> >>
> >> nvme connect-all -d nvme0 --sysfs
> >>
> >> and have nvme-cli retrieve all this stuff from sysfs?
> >
> > Actually that may be a problem.
> >
> > There could be a hypothetical case where after the event was fired
> > and before it was handled, the discovery controller went away and
> > came back again with a different controller instance, and the old
> > instance is now a different discovery controller.
> >
> > This is why we need this information in the event. And we verify this
> > information in sysfs in nvme-cli.
>
> Well, that must be a usual issue with uevents, right? Don't we usually
> have a increasing serial number for that or something?

Yes we do, userspace should use it to order events. Does udev not
handle that properly today?

thanks,

greg k-h

2019-08-29 18:23:32

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace


>>>> You are correct that this information can be derived from sysfs, but the
>>>> main reason why we add these here, is because in udev rule we can't
>>>> just go ahead and start looking these up and parsing these..
>>>>
>>>> We could send the discovery aen with NVME_CTRL_NAME and have
>>>> then have systemd run something like:
>>>>
>>>> nvme connect-all -d nvme0 --sysfs
>>>>
>>>> and have nvme-cli retrieve all this stuff from sysfs?
>>>
>>> Actually that may be a problem.
>>>
>>> There could be a hypothetical case where after the event was fired
>>> and before it was handled, the discovery controller went away and
>>> came back again with a different controller instance, and the old
>>> instance is now a different discovery controller.
>>>
>>> This is why we need this information in the event. And we verify this
>>> information in sysfs in nvme-cli.
>>
>> Well, that must be a usual issue with uevents, right? Don't we usually
>> have a increasing serial number for that or something?
>
> Yes we do, userspace should use it to order events. Does udev not
> handle that properly today?

The problem is not ordering of events, its really about the fact that
the chardev can be removed and reallocated for a different controller
(could be a completely different discovery controller) by the time
that userspace handles the event.

2019-08-30 05:56:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace

On Thu, Aug 29, 2019 at 11:21:02AM -0700, Sagi Grimberg wrote:
>> Yes we do, userspace should use it to order events. Does udev not
>> handle that properly today?
>
> The problem is not ordering of events, its really about the fact that
> the chardev can be removed and reallocated for a different controller
> (could be a completely different discovery controller) by the time
> that userspace handles the event.

The same is generally true for lot of kernel devices. We could reduce
the chance by using the idr cyclic allocator.

2019-08-30 06:22:10

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace

On Thu, Aug 29, 2019 at 11:21:02AM -0700, Sagi Grimberg wrote:
>
> > > > > You are correct that this information can be derived from sysfs, but the
> > > > > main reason why we add these here, is because in udev rule we can't
> > > > > just go ahead and start looking these up and parsing these..
> > > > >
> > > > > We could send the discovery aen with NVME_CTRL_NAME and have
> > > > > then have systemd run something like:
> > > > >
> > > > > nvme connect-all -d nvme0 --sysfs
> > > > >
> > > > > and have nvme-cli retrieve all this stuff from sysfs?
> > > >
> > > > Actually that may be a problem.
> > > >
> > > > There could be a hypothetical case where after the event was fired
> > > > and before it was handled, the discovery controller went away and
> > > > came back again with a different controller instance, and the old
> > > > instance is now a different discovery controller.
> > > >
> > > > This is why we need this information in the event. And we verify this
> > > > information in sysfs in nvme-cli.
> > >
> > > Well, that must be a usual issue with uevents, right? Don't we usually
> > > have a increasing serial number for that or something?
> >
> > Yes we do, userspace should use it to order events. Does udev not
> > handle that properly today?
>
> The problem is not ordering of events, its really about the fact that
> the chardev can be removed and reallocated for a different controller
> (could be a completely different discovery controller) by the time
> that userspace handles the event.

So? You will have gotten the remove and then new addition uevent in
order showing you this. So your userspace code knows that something
went away and then came back properly so you should be kept in sync.

thanks,

greg k-h

2019-08-30 18:09:33

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace


>>> Yes we do, userspace should use it to order events. Does udev not
>>> handle that properly today?
>>
>> The problem is not ordering of events, its really about the fact that
>> the chardev can be removed and reallocated for a different controller
>> (could be a completely different discovery controller) by the time
>> that userspace handles the event.
>
> The same is generally true for lot of kernel devices. We could reduce
> the chance by using the idr cyclic allocator.

Well, it was raised by Hannes and James, so I'll ask them respond here
because I don't mind having it this way. I personally think that this
is a better approach than having a cyclic idr allocator. In general, I
don't necessarily think that this is a good idea to have cyclic
controller enumerations if we don't absolutely have to...

2019-08-30 18:15:53

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace


>>>>>> You are correct that this information can be derived from sysfs, but the
>>>>>> main reason why we add these here, is because in udev rule we can't
>>>>>> just go ahead and start looking these up and parsing these..
>>>>>>
>>>>>> We could send the discovery aen with NVME_CTRL_NAME and have
>>>>>> then have systemd run something like:
>>>>>>
>>>>>> nvme connect-all -d nvme0 --sysfs
>>>>>>
>>>>>> and have nvme-cli retrieve all this stuff from sysfs?
>>>>>
>>>>> Actually that may be a problem.
>>>>>
>>>>> There could be a hypothetical case where after the event was fired
>>>>> and before it was handled, the discovery controller went away and
>>>>> came back again with a different controller instance, and the old
>>>>> instance is now a different discovery controller.
>>>>>
>>>>> This is why we need this information in the event. And we verify this
>>>>> information in sysfs in nvme-cli.
>>>>
>>>> Well, that must be a usual issue with uevents, right? Don't we usually
>>>> have a increasing serial number for that or something?
>>>
>>> Yes we do, userspace should use it to order events. Does udev not
>>> handle that properly today?
>>
>> The problem is not ordering of events, its really about the fact that
>> the chardev can be removed and reallocated for a different controller
>> (could be a completely different discovery controller) by the time
>> that userspace handles the event.
>
> So? You will have gotten the remove and then new addition uevent in
> order showing you this. So your userspace code knows that something
> went away and then came back properly so you should be kept in sync.

Still don't understand how this is ok...

I have /dev/nvme0 represents a network endpoint that I would discover
from, it is raising me an event to do a discovery operation (namely to
issue an ioctl to it) so my udev code calls a systemd script.

By the time I actually get to do that, /dev/nvme0 represents now a new
network endpoint (where the event is no longer relevant to). I would
rather the discovery to explicitly fail than to give me something
different, so we pass some arguments that we verify in the operation.

Its a stretch case, but it was raised by people as a potential issue.

2019-08-30 18:38:57

by James Smart

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace

On 8/30/2019 11:08 AM, Sagi Grimberg wrote:
>
>>>> Yes we do, userspace should use it to order events.  Does udev not
>>>> handle that properly today?
>>>
>>> The problem is not ordering of events, its really about the fact that
>>> the chardev can be removed and reallocated for a different controller
>>> (could be a completely different discovery controller) by the time
>>> that userspace handles the event.
>>
>> The same is generally true for lot of kernel devices.  We could reduce
>> the chance by using the idr cyclic allocator.
>
> Well, it was raised by Hannes and James, so I'll ask them respond here
> because I don't mind having it this way. I personally think that this
> is a better approach than having a cyclic idr allocator. In general, I
> don't necessarily think that this is a good idea to have cyclic
> controller enumerations if we don't absolutely have to...

We hit it right and left without the cyclic allocator, but that won't
necessarily remove it.

Perhaps we should have had a unique token assigned to the controller,
and have the event pass the name and the token.  The cli would then, if
the token is present, validate it via an ioctl before proceeding with
other ioctls.

Where all the connection arguments were added we due to the reuse issue
and then solving the question of how to verify and/or lookup the desired
controller, by using the shotgun approach rather than being very
pointed, which is what the name/token would do.

-- james

2019-08-30 21:08:24

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace


>>>>> Yes we do, userspace should use it to order events.  Does udev not
>>>>> handle that properly today?
>>>>
>>>> The problem is not ordering of events, its really about the fact that
>>>> the chardev can be removed and reallocated for a different controller
>>>> (could be a completely different discovery controller) by the time
>>>> that userspace handles the event.
>>>
>>> The same is generally true for lot of kernel devices.  We could reduce
>>> the chance by using the idr cyclic allocator.
>>
>> Well, it was raised by Hannes and James, so I'll ask them respond here
>> because I don't mind having it this way. I personally think that this
>> is a better approach than having a cyclic idr allocator. In general, I
>> don't necessarily think that this is a good idea to have cyclic
>> controller enumerations if we don't absolutely have to...
>
> We hit it right and left without the cyclic allocator, but that won't
> necessarily remove it.
>
> Perhaps we should have had a unique token assigned to the controller,
> and have the event pass the name and the token.  The cli would then, if
> the token is present, validate it via an ioctl before proceeding with
> other ioctls.
>
> Where all the connection arguments were added we due to the reuse issue
> and then solving the question of how to verify and/or lookup the desired
> controller, by using the shotgun approach rather than being very
> pointed, which is what the name/token would do.

This unique token is: trtype:traddr:trsvcid:host-traddr ...

2019-08-30 22:25:54

by James Smart

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace

On 8/30/2019 2:07 PM, Sagi Grimberg wrote:
>
>>>>>> Yes we do, userspace should use it to order events.  Does udev not
>>>>>> handle that properly today?
>>>>>
>>>>> The problem is not ordering of events, its really about the fact that
>>>>> the chardev can be removed and reallocated for a different controller
>>>>> (could be a completely different discovery controller) by the time
>>>>> that userspace handles the event.
>>>>
>>>> The same is generally true for lot of kernel devices.  We could reduce
>>>> the chance by using the idr cyclic allocator.
>>>
>>> Well, it was raised by Hannes and James, so I'll ask them respond here
>>> because I don't mind having it this way. I personally think that this
>>> is a better approach than having a cyclic idr allocator. In general, I
>>> don't necessarily think that this is a good idea to have cyclic
>>> controller enumerations if we don't absolutely have to...
>>
>> We hit it right and left without the cyclic allocator, but that won't
>> necessarily remove it.
>>
>> Perhaps we should have had a unique token assigned to the controller,
>> and have the event pass the name and the token.  The cli would then,
>> if the token is present, validate it via an ioctl before proceeding
>> with other ioctls.
>>
>> Where all the connection arguments were added we due to the reuse
>> issue and then solving the question of how to verify and/or lookup
>> the desired controller, by using the shotgun approach rather than
>> being very pointed, which is what the name/token would do.
>
> This unique token is: trtype:traddr:trsvcid:host-traddr ...

well yes :)  though rather verbose.   There is still a minute window as
we're comparing values in sysfs, prior to opening the device, so
technically something could change in that window between when we
checked sysfs and when we open'd.   We can certain check after we open
the device to solve that issue.

There is some elegance to a 32-bit token for the controller (can be an
incrementing value) passed in the event and used when servicing the
event that avoids a bunch of work.

Doing either of these would eliminate what Hannes liked - looking for
the discovery controller with those attributes. Although, I don't know
that looking for it is all that meaningful.

2019-09-02 19:35:11

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace

On Fri, Aug 30, 2019 at 11:14:39AM -0700, Sagi Grimberg wrote:
>
> > > > > > > You are correct that this information can be derived from sysfs, but the
> > > > > > > main reason why we add these here, is because in udev rule we can't
> > > > > > > just go ahead and start looking these up and parsing these..
> > > > > > >
> > > > > > > We could send the discovery aen with NVME_CTRL_NAME and have
> > > > > > > then have systemd run something like:
> > > > > > >
> > > > > > > nvme connect-all -d nvme0 --sysfs
> > > > > > >
> > > > > > > and have nvme-cli retrieve all this stuff from sysfs?
> > > > > >
> > > > > > Actually that may be a problem.
> > > > > >
> > > > > > There could be a hypothetical case where after the event was fired
> > > > > > and before it was handled, the discovery controller went away and
> > > > > > came back again with a different controller instance, and the old
> > > > > > instance is now a different discovery controller.
> > > > > >
> > > > > > This is why we need this information in the event. And we verify this
> > > > > > information in sysfs in nvme-cli.
> > > > >
> > > > > Well, that must be a usual issue with uevents, right? Don't we usually
> > > > > have a increasing serial number for that or something?
> > > >
> > > > Yes we do, userspace should use it to order events. Does udev not
> > > > handle that properly today?
> > >
> > > The problem is not ordering of events, its really about the fact that
> > > the chardev can be removed and reallocated for a different controller
> > > (could be a completely different discovery controller) by the time
> > > that userspace handles the event.
> >
> > So? You will have gotten the remove and then new addition uevent in
> > order showing you this. So your userspace code knows that something
> > went away and then came back properly so you should be kept in sync.
>
> Still don't understand how this is ok...
>
> I have /dev/nvme0 represents a network endpoint that I would discover
> from, it is raising me an event to do a discovery operation (namely to
> issue an ioctl to it) so my udev code calls a systemd script.
>
> By the time I actually get to do that, /dev/nvme0 represents now a new
> network endpoint (where the event is no longer relevant to). I would
> rather the discovery to explicitly fail than to give me something
> different, so we pass some arguments that we verify in the operation.
>
> Its a stretch case, but it was raised by people as a potential issue.

Ok, and how do you handle this same thing for something like /dev/sda ?
(hint, it isn't new, and is something we solved over a decade ago)

If you worry about stuff like this, use a persistant device naming
scheme and have your device node be pointed to by a symlink. Create
that symlink by using the information in the initial 'ADD' uevent.

That way, when userspace opens the device node, it "knows" exactly which
one it opens. It sounds like you have a bunch of metadata to describe
these uniquely, so pass that in the ADD event, not in some other crazy
non-standard manner.

thanks,

greg k-h

2019-09-04 01:37:38

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace


>> Still don't understand how this is ok...
>>
>> I have /dev/nvme0 represents a network endpoint that I would discover
>> from, it is raising me an event to do a discovery operation (namely to
>> issue an ioctl to it) so my udev code calls a systemd script.
>>
>> By the time I actually get to do that, /dev/nvme0 represents now a new
>> network endpoint (where the event is no longer relevant to). I would
>> rather the discovery to explicitly fail than to give me something
>> different, so we pass some arguments that we verify in the operation.
>>
>> Its a stretch case, but it was raised by people as a potential issue.
>
> Ok, and how do you handle this same thing for something like /dev/sda ?
> (hint, it isn't new, and is something we solved over a decade ago)
>
> If you worry about stuff like this, use a persistant device naming
> scheme and have your device node be pointed to by a symlink. Create
> that symlink by using the information in the initial 'ADD' uevent.
>
> That way, when userspace opens the device node, it "knows" exactly which
> one it opens. It sounds like you have a bunch of metadata to describe
> these uniquely, so pass that in the ADD event, not in some other crazy
> non-standard manner.

We could send these variables when adding the device and then validating
them using a rich-text-explanatory symlink. Seems slightly backwards to
me, but that would work too.

We create the char device using device_add (in nvme_init_subsystem),
I didn't find any way to append env variables to that ADD uevent.

Did you mean that we should add another flavor of device_add that
accepts char *envp_ext[]?

What exactly is the "standard manner" to pass these variables to
the chardev KOBJ_ADD uevent? All other examples I could find use
KOBJ_CHANGE to pass private stuff..

2019-09-04 05:26:48

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace

On Tue, Sep 03, 2019 at 06:35:30PM -0700, Sagi Grimberg wrote:
>
> > > Still don't understand how this is ok...
> > >
> > > I have /dev/nvme0 represents a network endpoint that I would discover
> > > from, it is raising me an event to do a discovery operation (namely to
> > > issue an ioctl to it) so my udev code calls a systemd script.
> > >
> > > By the time I actually get to do that, /dev/nvme0 represents now a new
> > > network endpoint (where the event is no longer relevant to). I would
> > > rather the discovery to explicitly fail than to give me something
> > > different, so we pass some arguments that we verify in the operation.
> > >
> > > Its a stretch case, but it was raised by people as a potential issue.
> >
> > Ok, and how do you handle this same thing for something like /dev/sda ?
> > (hint, it isn't new, and is something we solved over a decade ago)
> >
> > If you worry about stuff like this, use a persistant device naming
> > scheme and have your device node be pointed to by a symlink. Create
> > that symlink by using the information in the initial 'ADD' uevent.
> >
> > That way, when userspace opens the device node, it "knows" exactly which
> > one it opens. It sounds like you have a bunch of metadata to describe
> > these uniquely, so pass that in the ADD event, not in some other crazy
> > non-standard manner.
>
> We could send these variables when adding the device and then validating
> them using a rich-text-explanatory symlink. Seems slightly backwards to
> me, but that would work too.

That's the way the driver model is expected to work, instead of having
to do crazy device-specific stuff.

> We create the char device using device_add (in nvme_init_subsystem),
> I didn't find any way to append env variables to that ADD uevent.

You do that in your uevent or dev_uevent callback like all other
subsystems. Nothing "new" to do here, again, it's been working fine for
everyone else for well over a decade now :)

thanks,

greg k-h

2019-09-10 08:40:45

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace

On 8/31/19 12:24 AM, James Smart wrote:
> On 8/30/2019 2:07 PM, Sagi Grimberg wrote:
>>
>>>>>>> Yes we do, userspace should use it to order events.  Does udev not
>>>>>>> handle that properly today?
>>>>>>
>>>>>> The problem is not ordering of events, its really about the fact that
>>>>>> the chardev can be removed and reallocated for a different controller
>>>>>> (could be a completely different discovery controller) by the time
>>>>>> that userspace handles the event.
>>>>>
>>>>> The same is generally true for lot of kernel devices.  We could reduce
>>>>> the chance by using the idr cyclic allocator.
>>>>
>>>> Well, it was raised by Hannes and James, so I'll ask them respond here
>>>> because I don't mind having it this way. I personally think that this
>>>> is a better approach than having a cyclic idr allocator. In general, I
>>>> don't necessarily think that this is a good idea to have cyclic
>>>> controller enumerations if we don't absolutely have to...
>>>
>>> We hit it right and left without the cyclic allocator, but that won't
>>> necessarily remove it.
>>>
>>> Perhaps we should have had a unique token assigned to the controller,
>>> and have the event pass the name and the token.  The cli would then,
>>> if the token is present, validate it via an ioctl before proceeding
>>> with other ioctls.
>>>
>>> Where all the connection arguments were added we due to the reuse
>>> issue and then solving the question of how to verify and/or lookup
>>> the desired controller, by using the shotgun approach rather than
>>> being very pointed, which is what the name/token would do.
>>
>> This unique token is: trtype:traddr:trsvcid:host-traddr ...
>
> well yes :)  though rather verbose.   There is still a minute window as
> we're comparing values in sysfs, prior to opening the device, so
> technically something could change in that window between when we
> checked sysfs and when we open'd.   We can certain check after we open
> the device to solve that issue.
>
> There is some elegance to a 32-bit token for the controller (can be an
> incrementing value) passed in the event and used when servicing the
> event that avoids a bunch of work.
>
> Doing either of these would eliminate what Hannes liked - looking for
> the discovery controller with those attributes. Although, I don't know
> that looking for it is all that meaningful.
>
yeah, we do have this fundamental problem with uevents; they do refer to
a '/dev/nvmeX' device with those parameters, but this device might be
long gone (or have been reallocated) by the time the event is processed.

From my POV we have two choices here:
- rely on the transport options to find a matching controller (ignoring
the device name) and use that for sending out discovery requests. In the
end, it shouldn't really matter device we're using if the transport
options are identical. Although I'm not sure for RDMA; here we don't
necessarily have a host transport address, so we _might_ send the
discovery via the wrong controller in a CMIC enviroment.
- Match the options in nvme-cli, and just discard the event if it
doesn't match. Which is some additional coding in nvme-cli and might ran
afoul if we ever miss events.

I'd go for the second option; after considering the first introduces far
too many conditions rendering debugging impractical.

Cheers,

Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer