2003-06-12 18:57:33

by Steven Dake

Subject: [PATCH] udev enhancements to use kernel event queue

Folks,

I have been looking at the udev idea that Greg KH has developed.
Userland device enumeration definitely is the way to go, however, there
are some problems with using /sbin/hotplug to transmit device
enumeration events:

1) performance of fork/exec with lots of devices is very poor
2) lots of processes can be forked with no policy to control how many
are forked at once potentially leading to OOM
3) /sbin/hotplug events can occur out of order, eg: remove event occurs,
/sbin/hotplug sleeps waiting for something, insert event occurs and
completes immediately. Then the remove event completes, after the
insert, resulting in out-of-order completion and a broken /dev. I have
seen this several times with udev.
4) early device enumeration (character devices, etc) are missed because
/sbin/hotplug is not available during early device init.

To solve these problems, I've developed a driver and some minor
modifications to the lib/kobject.c to queue these events and allow a
user space program to read the events at its leisure. The driver
supports poll/select for efficient use of the processor.

The feature is called the System Device Enumeration Event Queue. I've
attached a tarball which includes udev-0.1 modified to use the SDEQ, a
kernel patch against linux 2.5.70 that implements the SDEQ, and a
test/library that tests the SDEQ functionality.

To jump right in try these steps:
1) apply the SDEQ patch
2) configure CONFIG_SDEQ on in config/general
3) make in the udev-0.1 directory
4) mkdir /newdevs (this is where the devices are enumerated)
5) reboot with new kernel
6) mount sysfs in /sys
7) mknod /dev/sdeq c 10 222 (this creates the device node that is used
to communicate with the SDEQ)
8) run ./udev in the udev-0.1 directory

Look in /newdevs for enumerated devices.

You can use fdisk on a disk device to see how dynamic device
creation/deletion works.
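
For a feel of the read side, here is a minimal sketch of a poll()-driven
consumer of the /dev/sdeq node created in step 7 above. The actual event
record format lives in the attached tarball and isn't reproduced here, so
this sketch only reports how many bytes each wakeup delivered:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	struct pollfd pfd;

	pfd.fd = open("/dev/sdeq", O_RDONLY);
	if (pfd.fd < 0) {
		perror("open /dev/sdeq");
		return 1;
	}
	pfd.events = POLLIN;

	for (;;) {
		ssize_t n;

		/* sleep until the kernel has queued at least one event */
		if (poll(&pfd, 1, -1) < 0)
			break;
		n = read(pfd.fd, buf, sizeof(buf));
		if (n <= 0)
			break;
		printf("got %zd bytes of event data\n", n);
	}
	close(pfd.fd);
	return 0;
}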

If it works for you or doesn't, or you like the idea or don't, I'd love
to hear about it.

Thanks
-steve


Attachments:
sdeq.tar.gz (19.69 kB)

2003-06-12 21:32:41

by Greg KH

Subject: Re: [PATCH] udev enhancements to use kernel event queue

Nice, you posted the same message and tarball to lkml that we have been
talking about privately for a few days.

Just to save time, and to bring everyone else up to speed, I'll just
include my responses and then yours to this thread and then finally
respond to your last message to me about this.

Let me know if I get any of the attribution wrong, this is going to be a
fun cut and paste...


On Thu, Jun 12, 2003 at 12:10:48PM -0700, Steven Dake wrote:
> Folks,
>
> I have been looking at the udev idea that Greg KH has developed.
> Userland device enumeration definitely is the way to go, however, there
> are some problems with using /sbin/hotplug to transmit device
> enumeration events:
>
> 1) performance of fork/exec with lots of devices is very poor

You later said your binary of udev was 500k. I said:
500k binary? Who's running anything that big? udev weighs in
around 6k right now! You have to add a _lot_ of functionality
to move up to 500k.

You responded:
Hmm, perhaps my binary was not built optimized when I looked at
it, I'll take another look.

> 2) lots of processes can be forked with no policy to control how many
> are forked at once potentially leading to OOM

I responded:
No, this is pure speculation. Have you actually tried this out?
I have. I've hotplugged pci drawers connected to huge numbers
of SCSI disks and never even seen a blip on system load. It
turns out that pretty much everything happens consecutively, with
a reasonable amount of time just spent probing the SCSI busses.
USB is pretty slow in enumerating their devices too, so I don't
see a real world application that has this problem. Do you have
any proof of this?

You responded:
I don't agree. I've timed bootup with 12 fibrechannel disks
with 4 partitions each with udev using the SDEQ patches. I've
then timed bootup and added in the time required to add those 12
fibrechannel disks (via a script) using /sbin/hotplug. The
script method (executing /sbin/hotplug 12*4 times with the
appropriate data hardcoded) takes roughly 1.25-1.75 seconds
longer. Since I'm using a chronograph, it's not very accurate,
but it is definitely slower to enumerate with /sbin/hotplug
using this methodology. That is only 12 disks. What about the
case of 1000 disks?

fork and exec are very expensive system calls, to be avoided at
all costs...

To which I responded:
Well, if you don't dynamically link a 500K file, perhaps it
would be a lot faster to start up :)

Anyway, who really cares? This is _not_ a time critical thing.
Who really cares that it takes 1 second longer to see the device
node for the newly plugged in device? And if you plug in 5000
devices all at once, you better be willing to wait the extra
minute or so for the system to calm down and make all of the
devices available to userspace.

And again you came back with:
System bootup time and time between failures are the two factors
used to determine availability. Reducing bootup time and
increasing time between failures both improve availability. Even
1 second is considerable when 5 9s is 30 seconds of downtime per
year. 1 more second is a considerable change in availability
when you run the equations.

I now respond:
Bootup time reduction would be a great thing to have. And I agree for
situations where 1 second is a considerable amount of time, it is
important. However, 99.99% of the people running Linux out there do
not have those problems. And even then, the 0.01% of the people who
do will do whatever they possibly can to keep from having to reboot
at all. I still say the time to plug in a new disk and have its device
node created is not measurably different between using /sbin/hotplug
and udev, and using your character node. Also, the ultimate solution
for these kinds of people is the in-kernel devfs, or the current
static /dev. If they use either of them, it's guaranteed to be faster
than either your implementation or mine.

> 3) /sbin/hotplug events can occur out of order, eg: remove event occurs,
> /sbin/hotplug sleeps waiting for something, insert event occurs and
> completes immediately. Then the remove event completes, after the
> insert, resulting in out-of-order completion and a broken /dev. I have
> seen this several times with udev.

I responded:
Yes this happens. I have a fix for this for udev itself. No
kernel changes are needed. I'll show it at OLS in July if you
want to see it :)

You responded:
There is a problem in udev that fork doesn't wait for the parent
to exit (mknod) which is solved by using the mknod system call.
But there is still another problem that /sbin/hotplug can
execute out of order. I'll be at OLS so I'll take a look at
your solution then.

So we agree to talk about this at OLS, fine.

> 4) early device enumeration (character devices, etc) are missed because
> /sbin/hotplug is not available during early device init.

I responded:
Then use early initramfs to catch this. I've done this and seen
it work already. If you don't want to do early init, walk the
sysfs tree yourself from userspace, it works just as well, and
we don't have to do any buffering in kernelspace at all.

You responded:
Initramfs still misses early events. That leaves walking sysfs.

I have considered and rejected using a user space program to
walk sysfs for this information. There are several problems
1) kobject device name is not present in sysfs (how do you know
if it's a character or block, then without some external mapping
db, yuck)
2) scan of sysfs is not atomic, resulting in potential lost
events and lost device enumerations

I responded:
You don't know this from hotplug either. You have to figure it
out from the kobject name either way. Actually it's quite easy,
only block devices show up in /sys/block, everything else is a
character device.

No, this is quite simple to fix:
- set up /sbin/hotplug udev handler after init has
started.
- start walking sysfs to "cold plug" devices.
- any new events that happen are caught by udev and
handled there, while you are walking the tree.
- no events are lost.

You responded:
Events could be handled twice in this situation, which is
definitely not desirable.

I now respond:
Doing a mknod() on a node that is already present doesn't harm anything.
And since you have to keep a database around of what devices are
present, and what you have called them, removing a device that has
already been removed is detectable. So handling the same event twice
isn't a problem.
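
As a rough illustration of the cold-plug walk described above (and of why a
duplicate event is harmless), here is a sketch that scans the 2.5-era
/sys/block layout, reads each major:minor from the per-device "dev" file,
and creates block nodes under /newdevs; the function name and error
handling are illustrative only, not udev's actual code:

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

static void coldplug_block_devices(const char *devdir)
{
	DIR *dir = opendir("/sys/block");
	struct dirent *d;

	if (!dir)
		return;
	while ((d = readdir(dir)) != NULL) {
		char path[256], node[256];
		unsigned int maj, minr;
		FILE *fp;

		if (d->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "/sys/block/%s/dev", d->d_name);
		fp = fopen(path, "r");
		if (!fp)
			continue;
		if (fscanf(fp, "%u:%u", &maj, &minr) == 2) {
			snprintf(node, sizeof(node), "%s/%s", devdir, d->d_name);
			/* mknod() on an existing node just fails with EEXIST,
			   so handling the same device twice is harmless */
			mknod(node, S_IFBLK | 0600, makedev(maj, minr));
		}
		fclose(fp);
	}
	closedir(dir);
}

int main(void)
{
	coldplug_block_devices("/newdevs");
	return 0;
}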

> To solve these problems, I've developed a driver and some minor
> modifications to the lib/kobject.c to queue these events and allow a
> user space program to read the events at its leisure. The driver
> supports poll/select for efficient use of the processor.
>
> The feature is called the System Device Enumeration Event Queue. I've
> attached a tarball which includes udev-0.1 modified to use the SDEQ, a
> kernel patch against linux 2.5.70 that implements the SDEQ, and a
> test/library that tests the SDEQ functionality.

I responded:
Ok, I looked at your code. To put it kindly, it will never work
properly. See the long thread on lkml by Inaky after I
announced udev. He tried to create something like what you have
done, but got a bit better implementation (yours has a number of
memory leaks, and permission problems...)

See that thread for why doing this in the kernel is not the way
to go. Please don't waste your time on this anymore, it's a
losing battle...

You responded:
It obviously works properly when using the udev application so I
am at a loss as to how you can make a claim that "it will never
work". Please explain your comments about memory leaks. Data
structures are allocated and then freed at appropriate times.
Even in the case of a crash during a transaction (get event,
clear events), there are no leaks. Please explain permission
problems. Permissions are controlled through policy in the
filesystem (/dev/sdeq) which fully controls access to the system
device enumeration event queue.

I did look at Inaky's code, and frankly it was too complicated
to solve the simple problem that needed to be solved. We also
have a full-blown event mechanism (for redundant system slot
compact pci support), but feel it's too heavy-weight to be of
general purpose use to the kernel. This particular
implementation is lightweight and solves a specific need with
real applications that can really use the functionality.

I have read the thread and at least 70% of the comments were "we
should not use /sbin/hotplug, but instead events should be
queued in the kernel". The thread raised all of the issues that
are solved with this kernel patch. Several core kernel
maintainers were the authors of the comments, so I'm not alone
in this belief.

You're obviously opposed to using character devices to queue
kobject creation/deletion events for some reason which I don't
understand. /sbin/hotplug is an inferior solution to queueing
events in the kernel. You're not the maintainer of the code this
feature is going into (which is the Linux driver model). I'd
really like to hear Pat's thoughts on this issue as he is the
one that has to live with the changes. If you want to continue
using /sbin/hotplug, with all of its numerous problems, for your
work, you're more than welcome to do so. No one is forcing you to
use sdeq; however, for those other vendors that may want to make
persistent device naming solutions, this is a core feature that
cannot easily be solved by poorly thought out hacks.

I will respond now:

Ok, permission problems can be solved by relying on the permission of
the device node, you are correct.

As for the memory issues, if no one ever reads from the character node,
it will constantly fill up with events, right? Shouldn't they be
eventually flushed out at some point in time?

Also, for trying to remove events, why is userspace allowed to remove
multiple events at once? I would think that a valid read would remove
that event; otherwise you can have two applications grab the same event,
which is not what I think you want to have happen.


And finally...

Yes, I know I'm not the maintainer of the code that this feature is
going into. But my position is that this code is not needed in the main
kernel tree, and I offer that opinion to the maintainer of this code.

I agree, most of the arguments in the previous udev thread talked about
out of order events, but I think I've gotten that solved now (I can wave
my hands about how it will be done, or I can show you working code at
OLS. I think I'll wait until OLS, as code matters, wavy hands don't.)

So, because of this, I still think that everything you are trying to do
can be successfully done from userspace today (or use devfsd today, it
will give you the same info!) And due to that, I do not think this code
is necessary to be in the kernel.

Thanks for reading this far, and my apologies if I got any of the above
quotes out of order or misquoted anyone, I just wanted to cut to the chase, and
not have to go through 4 more rounds of email to make it to this point
in the conversation again.

greg k-h

2003-06-12 21:32:59

by Patrick Mochel

Subject: Re: [PATCH] udev enhancements to use kernel event queue


*Sigh* You're asking for trouble, aren't you?

For the new spectators in the crowd, Steven, Greg and I have been on a
private email thread about this. This is the original email that he sent,
with all comments made to him completely disregarded.

> I have been looking at the udev idea that Greg KH has developed.
> Userland device enumeration definitely is the way to go, however, there
> are some problems with using /sbin/hotplug to transmit device
> enumeration events:
>
> 1) performance of fork/exec with lots of devices is very poor

Please define:

- 'lots'

- 'poor'


The performance of /sbin/hotplug doesn't matter. It's a blip on the radar.
If you have lots of devices (e.g. 1000 SCSI disks) the time it takes to
fork + exec /sbin/hotplug is nothing compared to what it would take to
spin up and probe each of the disks. If you don't have a lot of devices,
then you're going to be spending a lot less time running /sbin/hotplug. So
little that it's not going to make a difference, I'm willing to wager.

If you have accurate time measurements, then I would be interested in
seeing them. Though, I request that you test the following:

- The default /sbin/hotplug for Redhat 9 systems
- The default /sbin/hotplug for Montavista systems.
- Greg's mini-hotplug
- Your sdeq

Please test against an identical kernel, with and without your patch.
Please also post links to (or attach) the /sbin/hotplug binaries and
scripts that you tested with. And, please post the complete 'tree -d'
output of sysfs on the system(s) you tested on.

> 2) lots of processes can be forked with no policy to control how many
> are forked at once potentially leading to OOM

Have you seen this happen? Sure, it's theoretically possible, but I highly
doubt it will ever happen on a real system. Typically, if you have an
astronomical number of devices, then you'll have an incredible amount of
memory, too. Please post a bug report when you see this.

> 3) /sbin/hotplug events can occur out of order, eg: remove event occurs,
> /sbin/hotplug sleeps waiting for something, insert event occurs and
> completes immediately. Then the remove event completes, after the
> insert, resulting in out-of-order completion and a broken /dev. I have
> seen this several times with udev.

This is true, and I'll let Greg handle this one.

> 4) early device enumeration (character devices, etc) are missed because
> /sbin/hotplug is not available during early device init.

initramfs, as well as one of many cold-plugging solutions out there should
suffice for this. If they don't, we would welcome and appreciate your
effort in helping overcome these deficiencies by evolving the current
system rather than attempting a hostile takeover.

> To solve these problems, I've developed a driver and some minor
> modifications to the lib/kobject.c to queue these events and allow a
> user space program to read the events at its leisure. The driver
> supports poll/select for efficient use of the processor.

Listen, it's not going to fly. Greg is hotplug czar, and I won't take the
patches if he doesn't like them.

Secondly, you're just not going to replace hotplug. At least not now.
/sbin/hotplug works today and has no serious, provable issues, besides
events coming in out of order. Incredibly long boot times and OOM situations
just have not happened. And, we've agreed that we're not going to
implement an overdesigned solution to fix problems that aren't there yet.
Show us REAL bugs and we'll work on making what we have better.

Thirdly, /sbin/hotplug is an ASCII interface, providing flexibility in the
userspace agent implementations. Your character device (which is not
appreciated) forces the userspace tools to use select(2), poll(2), or
ioctl(2). No simple read(2)/write(2). Binary interfaces == Bad(tm). If you
have doubts, please read the threads concerning binary vs. ASCII
interfaces over the last two years before replying.

I'm not even going to go so far as to comment on the code, since conceptually
it's FITH.

Finally, I think you need a serious attitude readjustment. You've exceeded
all expectations in the level of jackass that you've achieved. You
completely disregarded Greg's comments in a private thread started from
the SAME EXACT email. You were pompous and arrogant to the person whose
code you're trying to replace. Then, you have the nerve to start over on
this list. Did you think we wouldn't notice? If anything, you've
guaranteed that I will never take your emails seriously again.

I really wish the device driver people at Montavista would pool their
collective partial clues and figure out WTF the rest of the world is doing.
Do you guys listen? Do you read email? Hello? McFly?

I'm frankly sick of you people wasting our time with these half-assed,
hare-brained hacks that you expect us to love and integrate. You never try
to evolve the current system; it's always a case of rewriting something
that at least works with a completely new interface. You don't listen to
our input, and you disappear after a few emails. Then, you try it all
again a few months later.

It's frustrating because you're obviously talented and have the
time+energy to help, but you never seem to attempt to align yourselves
with us or the rest of the kernel.

Please, either a) re-read the email threads, the papers, the magazine
articles, and/or the text file documentation and attempt to work with us;
or b) stop trying to get us to listen.


-pat



2003-06-12 21:53:49

by Andrew Morton

Subject: Re: [PATCH] udev enhancements to use kernel event queue

Greg KH <[email protected]> wrote:
>
> > 3) /sbin/hotplug events can occur out of order, eg: remove event occurs,
> > /sbin/hotplug sleeps waiting for something, insert event occurs and
> > completes immediately. Then the remove event completes, after the
> > insert, resulting in out-of-order completion and a broken /dev. I have
> > seen this several times with udev.
>
> I responded:
> Yes this happens. I have a fix for this for udev itself. No
> kernel changes are needed. I'll show it at OLS in July if you
> want to see it :)

This is a significantly crappy aspect of the /sbin/hotplug callout. I'd be
very interested in reading an outline of how you propose fixing it, without
waiting until OLS, thanks.

2003-06-12 22:13:57

by Oliver Neukum

Subject: Re: [PATCH] udev enhancements to use kernel event queue


> If it works for you or doesn't, or you like the idea or don't, I'd love
> to hear about it

+ default:
+ result = -EINVAL;
+ break;
+ }
+ return (result);

Must return ENOTTY.

+static int sdeq_open (struct inode *inode, struct file *file)
+{
+ MOD_INC_USE_COUNT;
+
+ return 0;
+}
+
+static int sdeq_release (struct inode *inode, struct file *file)
+{
+ MOD_DEC_USE_COUNT;
+
+ return (0);
+}

Wrong. release does not map to close()


Aside from that, what exactly are you trying to do?
You are not solving the fundamental device node reuse race,
yet you are making a further daemon necessary.
You are not addressing queue limits. The current hotplug
scheme does so, admittedly crudely by failing to spawn
a task, but considering the small numbers of events in
question here, for the time being we can live with that.

You can just as well add load control and error detection
to the current scheme. You fail to do so in your scheme.
You cannot queue events forever in unlimited numbers.

As for ordering, this is a real problem, but not fundamental.
You can make user space locking work. IMHO it will not be
pretty if done with shell scripts, but it can work.
There _is_ a basic problem with the kernel 'overtaking'
user space in its view of the device tree, but you cannot solve
that _at_ _all_ in user space.

In short, if you feel that the hotplug scheme is inadequate
for your needs, then write an industrial-strength devfs2.

Regards
Oliver

2003-06-12 22:35:40

by Greg KH

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, Jun 12, 2003 at 03:03:35PM -0700, Andrew Morton wrote:
> Greg KH <[email protected]> wrote:
> >
> > > 3) /sbin/hotplug events can occur out of order, eg: remove event occurs,
> > > /sbin/hotplug sleeps waiting for something, insert event occurs and
> > > completes immediately. Then the remove event completes, after the
> > > insert, resulting in out-of-order completion and a broken /dev. I have
> > > seen this several times with udev.
> >
> > I responded:
> > Yes this happens. I have a fix for this for udev itself. No
> > kernel changes are needed. I'll show it at OLS in July if you
> > want to see it :)
>
> This is a significantly crappy aspect of the /sbin/hotplug callout. I'd be
> very interested in reading an outline of how you propose fixing it, without
> waiting until OLS, thanks.

Sure, I knew someone would probably want to :)

Anyway, here's what I've come up with, feel free to shoot it full of
holes:

<handwaving>
- serialize the hotplug events in userspace:
- udev daemon running listening on named pipe
- small event generator kicked off from /sbin/hotplug
call to write event to udev pipe

This alone solves the major memory issues that people have
complained about, and allows us to keep a ram database of
present devices and their names, which a lot of people want to
have. It also makes the /sbin/hotplug binary even smaller
than 6k :)

- apply debounce on events:
- get event, delay Tbus amount of time.
- after time expires, check queue to see if we have any
other events for this device.
- if not, this is the only one, act on it.
- if so, delay Tbus amount of time again.
- continue delaying until no new events for this device
are present.
- count up events for this device, and throw away the
odd ones (even vs. odd, i.e. 2 adds and 1 remove
really mean add the device.)
- if both counts are even, then leave device at its
current state (added or removed) but check device
attributes to see if we had named it a "special" name.
If so, need to make sure "special" name is still
correct. If not, fix it.

Now the whole trick is coming up with the Tbus time limit :)

For all physical busses, it takes a decent amount of time to add or
remove a device (in the seconds for PCI and USB). It's pretty hard to
get these events out of order in the first place, except on a _very_
heavily loaded system (I've tried.) It's easier to get events out of
order for virtual devices (like scsi-debug). That's why a different
time value for different busses makes sense.

So if Tbus is too small, we do get events out of order; make Tbus too
big, and we start delaying too long and get a really deep queue.
So it's better to leave Tbus too big to be safe. More testing of proper
values is essential before I even start to claim that this will work for
all people, but I do think it is possible.

One other thing that I think will work is making it a sliding scale (if
we get an event, for example, for add, and the device is already there,
increase Tbus and throw it back on the queue (and don't delete the other
ones) and sleep again). This too needs a lot of testing.

The above sequence has seemed to work pretty well for me so far, but it
needs a lot more work and tweaking with real life loads.

I've also been sidetracked recently with other work, but should have
code to show by OLS...

</handwaving>
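
To make the debounce rules above concrete, here is one way to read them in
code: pair each add with a remove, act on whatever is left over, and if
nothing is left over just re-check the device's "special" name. The enum
and function names are illustrative and not part of udev:

#include <stdio.h>

enum debounce_action { DEBOUNCE_ADD, DEBOUNCE_REMOVE, DEBOUNCE_RECHECK_NAME };

static enum debounce_action debounce_decide(unsigned int adds, unsigned int removes)
{
	if (adds > removes)
		return DEBOUNCE_ADD;
	if (removes > adds)
		return DEBOUNCE_REMOVE;
	return DEBOUNCE_RECHECK_NAME;	/* net zero: state unchanged */
}

int main(void)
{
	/* the example above: 2 adds and 1 remove really mean "add the device" */
	printf("%d\n", debounce_decide(2, 1) == DEBOUNCE_ADD);
	return 0;
}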

thanks,

greg k-h

2003-06-12 22:42:01

by Andrew Morton

Subject: Re: [PATCH] udev enhancements to use kernel event queue

Greg KH <[email protected]> wrote:
>
> <handwaving>

heh.

Thought about adding a sequence number to the /sbin/hotplug argument list?

2003-06-12 22:46:48

by Greg KH

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, Jun 12, 2003 at 03:51:48PM -0700, Andrew Morton wrote:
> Greg KH <[email protected]> wrote:
> >
> > <handwaving>
>
> heh.
>
> Thought about adding a sequence number to the /sbin/hotplug argument list?

/me owes you a lot of beer at ols.

That's a world simpler than my proposal, I think I'll go write that
patch right now...

thanks,

greg k-h

2003-06-12 22:52:16

by Robert Love

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, 2003-06-12 at 15:50, Greg KH wrote:

> - serialize the hotplug events in userspace:
> - udev daemon running listening on named pipe
> - small event generator kicked off from /sbin/hotplug
> call to write event to udev pipe

What if you just passed a sequence number to /sbin/hotplug ?

Robert Love

2003-06-12 22:53:34

by Greg KH

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, Jun 12, 2003 at 04:02:47PM -0700, Greg KH wrote:
> On Thu, Jun 12, 2003 at 03:51:48PM -0700, Andrew Morton wrote:
> > Greg KH <[email protected]> wrote:
> > >
> > > <handwaving>
> >
> > heh.
> >
> > Thought about adding a sequence number to the /sbin/hotplug argument list?
>
> /me owes you a lot of beer at ols.
>
> That's a world simpler than my proposal, I think I'll go write that
> patch right now...

Pat, here's a patch to add a sequence number to kobject hotplug calls to
help userspace out a lot.

thanks,

greg k-h

# kobject: add sequence number to hotplug events to help userspace out.

diff -Nru a/lib/kobject.c b/lib/kobject.c
--- a/lib/kobject.c Thu Jun 12 16:05:06 2003
+++ b/lib/kobject.c Thu Jun 12 16:05:06 2003
@@ -100,6 +100,7 @@
 
 #define BUFFER_SIZE	1024	/* should be enough memory for the env */
 #define NUM_ENVP	32	/* number of env pointers */
+static unsigned long sequence_num;
 static void kset_hotplug(const char *action, struct kset *kset,
			 struct kobject *kobj)
 {
@@ -151,6 +152,10 @@
 
 	envp [i++] = scratch;
 	scratch += sprintf(scratch, "ACTION=%s", action) + 1;
+
+	envp [i++] = scratch;
+	scratch += sprintf(scratch, "SEQNUM=%ld", sequence_num) + 1;
+	++sequence_num;
 
 	kobj_path_length = get_kobj_path_length (kset, kobj);
 	kobj_path = kmalloc (kobj_path_length, GFP_KERNEL);

2003-06-12 22:59:28

by Patrick Mochel

Subject: Re: [PATCH] udev enhancements to use kernel event queue


> Pat, here's a patch to add a sequence number to kobject hotplug calls to
> help userspace out a lot.

Thanks, applied.


-pat

2003-06-12 23:01:58

by Robert Love

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, 2003-06-12 at 16:09, Greg KH wrote:

> +
> + envp [i++] = scratch;
> + scratch += sprintf(scratch, "SEQNUM=%ld", sequence_num) + 1;
> + ++sequence_num;

Since I do not see a lock in here, I think you need to use atomics?

As is, you can also race and have the same sequence_num value written
out twice.

Robert Love

2003-06-12 23:13:45

by Greg KH

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, Jun 12, 2003 at 04:16:03PM -0700, Robert Love wrote:
> On Thu, 2003-06-12 at 16:09, Greg KH wrote:
>
> > +
> > + envp [i++] = scratch;
> > + scratch += sprintf(scratch, "SEQNUM=%ld", sequence_num) + 1;
> > + ++sequence_num;
>
> Since I do not see a lock in here, I think you need to use atomics?
>
> As is, you can also race and have the same sequence_num value written
> out twice.

Doh, I ate too much for lunch today and need to wake up...

Pat, here's an updated patch.

thanks,

greg k-h


diff -Nru a/lib/kobject.c b/lib/kobject.c
--- a/lib/kobject.c Thu Jun 12 16:21:21 2003
+++ b/lib/kobject.c Thu Jun 12 16:21:21 2003
@@ -100,6 +100,8 @@
 
 #define BUFFER_SIZE	1024	/* should be enough memory for the env */
 #define NUM_ENVP	32	/* number of env pointers */
+static unsigned long sequence_num;
+static spinlock_t sequence_lock = SPIN_LOCK_UNLOCKED;
 static void kset_hotplug(const char *action, struct kset *kset,
			 struct kobject *kobj)
 {
@@ -151,6 +153,12 @@
 
 	envp [i++] = scratch;
 	scratch += sprintf(scratch, "ACTION=%s", action) + 1;
+
+	spin_lock(&sequence_lock);
+	envp [i++] = scratch;
+	scratch += sprintf(scratch, "SEQNUM=%ld", sequence_num) + 1;
+	++sequence_num;
+	spin_unlock(&sequence_lock);
 
 	kobj_path_length = get_kobj_path_length (kset, kobj);
 	kobj_path = kmalloc (kobj_path_length, GFP_KERNEL);

2003-06-12 23:11:11

by Patrick Mochel

Subject: Re: [PATCH] udev enhancements to use kernel event queue


On 12 Jun 2003, Robert Love wrote:

> On Thu, 2003-06-12 at 16:09, Greg KH wrote:
>
> > +
> > + envp [i++] = scratch;
> > + scratch += sprintf(scratch, "SEQNUM=%ld", sequence_num) + 1;
> > + ++sequence_num;
>
> Since I do not see a lock in here, I think you need to use atomics?
>
> As is, you can also race and have the same sequence_num value written
> out twice.

Yeah, how about this amended patch?


-pat

===== lib/kobject.c 1.25 vs edited =====
--- 1.25/lib/kobject.c Wed Jun 11 12:06:06 2003
+++ edited/lib/kobject.c Thu Jun 12 16:24:26 2003
@@ -100,6 +100,9 @@
 
 #define BUFFER_SIZE	1024	/* should be enough memory for the env */
 #define NUM_ENVP	32	/* number of env pointers */
+static unsigned long sequence_num;
+static spinlock_t sequence_lock = SPIN_LOCK_UNLOCKED;
+
 static void kset_hotplug(const char *action, struct kset *kset,
			 struct kobject *kobj)
 {
@@ -112,6 +115,7 @@
 	int kobj_path_length;
 	char *kobj_path = NULL;
 	char *name = NULL;
+	unsigned long seq;
 
 	/* If the kset has a filter operation, call it. If it returns
 	   failure, no hotplug event is required. */
@@ -151,6 +155,13 @@
 
 	envp [i++] = scratch;
 	scratch += sprintf(scratch, "ACTION=%s", action) + 1;
+
+	spin_lock(&sequence_lock);
+	seq = sequence_num++;
+	spin_unlock(&sequence_lock);
+
+	envp [i++] = scratch;
+	scratch += sprintf(scratch, "SEQNUM=%ld", seq) + 1;
 
 	kobj_path_length = get_kobj_path_length (kset, kobj);
 	kobj_path = kmalloc (kobj_path_length, GFP_KERNEL);
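
For context, a hypothetical sketch of how a userspace agent could use the
SEQNUM value the patch exports: handle events strictly in sequence and defer
any that arrive early. The expected counter would have to live in a
long-running daemon or in state shared between /sbin/hotplug invocations;
none of this is udev's actual code:

#include <stdio.h>

/* returns 1 if the event may be handled now, 0 if it has to wait for the
 * events with lower sequence numbers to finish first */
static int may_handle(unsigned long seqnum, unsigned long *expected)
{
	if (seqnum != *expected)
		return 0;
	(*expected)++;
	return 1;
}

int main(void)
{
	unsigned long expected = 0;

	/* events 0 and 2 arrive; 2 must be deferred until 1 has been seen */
	printf("event 0: %s\n", may_handle(0, &expected) ? "handled" : "deferred");
	printf("event 2: %s\n", may_handle(2, &expected) ? "handled" : "deferred");
	printf("event 1: %s\n", may_handle(1, &expected) ? "handled" : "deferred");
	printf("event 2: %s\n", may_handle(2, &expected) ? "handled" : "deferred");
	return 0;
}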

2003-06-12 23:17:34

by Greg KH

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, Jun 12, 2003 at 04:29:25PM -0700, Robert Love wrote:
> On Thu, 2003-06-12 at 16:25, Patrick Mochel wrote:
>
> > Yeah, how about this amended patch?
>
> Both this and Greg's look fine.
>
> I guess this is preferred, since the lock hold time is shorter :)

Heh, I don't care either way, it's up to Pat...

thanks for pointing this out,

greg k-h

2003-06-12 23:17:00

by Robert Love

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, 2003-06-12 at 16:25, Patrick Mochel wrote:

> Yeah, how about this amended patch?

Both this and Greg's look fine.

I guess this is preferred, since the lock hold time is shorter :)

> + spin_lock(&sequence_lock);
> + seq = sequence_num++;
> + spin_unlock(&sequence_lock);
> +
> + envp [i++] = scratch;
> + scratch += sprintf(scratch, "SEQNUM=%ld", seq) + 1;

Nice thinking. It is a shame we need a lock for this, but we don't have
an atomic_inc_and_return().

Robert Love

2003-06-12 23:20:15

by Patrick Mochel

Subject: Re: [PATCH] udev enhancements to use kernel event queue


> > + spin_lock(&sequence_lock);
> > + seq = sequence_num++;
> > + spin_unlock(&sequence_lock);
> > +
> > + envp [i++] = scratch;
> > + scratch += sprintf(scratch, "SEQNUM=%ld", seq) + 1;
>
> Nice thinking. It is a shame we need a lock for this, but we don't have
> an atomic_inc_and_return().

Those were my sentiments exactly:

16:21 * mochel searches for atomic_inc_and_read() :)

It seems like the following should work. Would someone mind commenting on
it?

Thanks,


-pat


===== include/asm/atomic.h 1.4 vs edited =====
--- 1.4/include/asm-i386/atomic.h Mon Feb 4 23:42:13 2002
+++ edited/include/asm/atomic.h Thu Jun 12 16:32:10 2003
@@ -110,6 +110,23 @@
 		:"m" (v->counter));
 }
 
+
+/**
+ * atomic_inc_and_read - increment atomic variable and return new value
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1. Note that the guaranteed
+ * useful range of an atomic_t is only 24 bits.
+ */
+static inline int atomic_inc_and_read(atomic_t *v)
+{
+	__asm__ __volatile__(
+		LOCK "incl %0"
+		:"=m" (v->counter)
+		:"m" (v->counter));
+	return v->counter;
+}
+
 /**
  * atomic_dec - decrement atomic variable
  * @v: pointer of type atomic_t

2003-06-12 23:26:44

by Robert Love

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, 2003-06-12 at 16:34, Patrick Mochel wrote:

> > Nice thinking. It is a shame we need a lock for this, but we don't
> > have an atomic_inc_and_return().
>
> Those were my sentiments exactly:

Heh.

> +static inline int atomic_inc_and_read(atomic_t *v)
> +{
> + __asm__ __volatile__(
> + LOCK "incl %0"
> + :"=m" (v->counter)
> + :"m" (v->counter));
> + return v->counter;
> +}

What prevents a race between the increment and the return (i.e. you
return v->counter after another person also increments it)? Only the
increment is guaranteed atomic.

I think you need to copy the result of the increment into a local
variable _inside_ of the LOCK and return that. Whether or not that will
work sanely on all architectures I dunno.

Robert Love

2003-06-12 23:31:50

by Paul Mackerras

Subject: Re: [PATCH] udev enhancements to use kernel event queue

Patrick Mochel writes:

> It seems like the following should work. Would someone mind commenting on
> it?

> +/**
> + * atomic_inc_and_read - increment atomic variable and return new value
> + * @v: pointer of type atomic_t
> + *
> + * Atomically increments @v by 1. Note that the guaranteed
> + * useful range of an atomic_t is only 24 bits.
> + */
> +static inline int atomic_inc_and_read(atomic_t *v)
> +{
> + __asm__ __volatile__(
> + LOCK "incl %0"
> + :"=m" (v->counter)
> + :"m" (v->counter));
> + return v->counter;
> +}

BZZZT. If another CPU is also doing atomic_inc_and_read you could end
up with both calls returning the same value.

You can't do atomic_inc_and_read on 386. You can on cpus that have
cmpxchg (e.g. later x86). You can also on machines with load-locked
and store-conditional instructions (alpha, ppc, probably most other
RISCs).

Paul.

2003-06-12 23:34:09

by Davide Libenzi

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, 12 Jun 2003, Patrick Mochel wrote:

>
> > > + spin_lock(&sequence_lock);
> > > + seq = sequence_num++;
> > > + spin_unlock(&sequence_lock);
> > > +
> > > + envp [i++] = scratch;
> > > + scratch += sprintf(scratch, "SEQNUM=%ld", seq) + 1;
> >
> > Nice thinking. It is a shame we need a lock for this, but we don't have
> > an atomic_inc_and_return().
>
> Those were my sentiments exactly:
>
> 16:21 * mochel searches for atomic_inc_and_read() :)
>
> It seems like the following should work. Would someone mind commenting on
> it?

It does not, Pat; look at the generated asm.



- Davide

2003-06-12 23:34:56

by Robert Love

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, 2003-06-12 at 16:40, Paul Mackerras wrote:

> BZZZT. If another CPU is also doing atomic_inc_and_read you could end
> up with both calls returning the same value.

That is what I thought. Damn.

> You can't do atomic_inc_and_read on 386. You can on cpus that have
> cmpxchg (e.g. later x86). You can also on machines with load-locked
> and store-conditional instructions (alpha, ppc, probably most other
> RISCs).

So this is doable on everything but old i386 chips... hrm.

Robert Love

2003-06-12 23:37:38

by Patrick Mochel

Subject: Re: [PATCH] udev enhancements to use kernel event queue


> You can't do atomic_inc_and_read on 386. You can on cpus that have
> cmpxchg (e.g. later x86). You can also on machines with load-locked
> and store-conditional instructions (alpha, ppc, probably most other
> RISCs).

Damn, so be it. Ho hum, back to the spinlock I go..


-pat

2003-06-13 07:03:40

by Christoph Hellwig

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, Jun 12, 2003 at 12:10:48PM -0700, Steven Dake wrote:
> Folks,
>
> I have been looking at the udev idea that Greg KH has developed.
> Userland device enumeration definitely is the way to go, however, there
> are some problems with using /sbin/hotplug to transmit device
> enumeration events:

Given the comments already on the list I won't repeat commenting
on your attitude, but if you really think the maintainers and everyone
else are totally clueless, you should probably start the mail with that
so everyone is aware of it.

So if you can prove that fork+exec really is a problem, I'd suggest you
go for an existing, standard interface instead. That would be netlink
and not some chardev ioctl hack.
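
For reference, a minimal sketch of what a userspace netlink listener for
such events could look like. The protocol number below is a made-up
placeholder and the payload is treated as an opaque string; no such netlink
family existed in 2.5.70, so this only illustrates the shape of the
interface Christoph is suggesting:

#include <linux/netlink.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NETLINK_HOTPLUG_EXAMPLE 17	/* placeholder protocol number */

int main(void)
{
	struct sockaddr_nl addr;
	char buf[2048];
	int fd;

	fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_HOTPLUG_EXAMPLE);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = 1;	/* join the (hypothetical) event multicast group */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("bind");
		return 1;
	}

	for (;;) {
		ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);

		if (n <= 0)
			break;
		buf[n] = '\0';
		printf("event: %s\n", buf);
	}
	close(fd);
	return 0;
}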

2003-06-13 08:29:19

by Alan

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Iau, 2003-06-12 at 23:03, Andrew Morton wrote:
> This is a significantly crappy aspect of the /sbin/hotplug callout. I'd be
> very interested in reading an outline of how you propose fixing it, without
> waiting until OLS, thanks.

Dave Miller posted a simple patch to allow netlink to be used for
kernel->user messages - hotplug/disk error/logging/whatever. I'm
surprised therefore that the whole thing is being regurgitated again.


2003-06-13 08:31:45

by Alan

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Gwe, 2003-06-13 at 00:40, Paul Mackerras wrote:
> You can't do atomic_inc_and_read on 386. You can on cpus that have

lock xaddl $1, [foo]

2003-06-13 09:00:15

by Olivier Galibert

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Fri, Jun 13, 2003 at 09:40:37AM +0100, Alan Cox wrote:
> Dave Miller posted a simple patch to allow netlink to be used for
> kernel->user messages - hotplug/disk error/logging/whatever. I'm
> surprised therefore that the whole thing is being regurgitated again.

"Netlink is not a reliable protocol." sez the manpage.

That sounds a little ridiculous for a local kernel<->user message
protocol. Having to manage acks on that kind of communication is both
painful and quasi-untestable.

OG.

2003-06-13 10:49:53

by Paul Mackerras

Subject: Re: [PATCH] udev enhancements to use kernel event queue

Alan Cox writes:

> lock xaddl $1, [foo]

On 386? I stand corrected. :)

Paul.

2003-06-13 10:53:45

by Herbert Xu

Subject: Re: [PATCH] udev enhancements to use kernel event queue

Paul Mackerras <[email protected]> wrote:
> Alan Cox writes:
>
>> lock xaddl $1, [foo]
>
> On 386? I stand corrected. :)

My instruction manual lists it as 486 only.
--
Debian GNU/Linux 3.0 is out! ( http://www.debian.org/ )
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2003-06-13 11:04:11

by David Schwartz

Subject: RE: [PATCH] udev enhancements to use kernel event queue


Paul Mackerras is said to have opined:

> Patrick Mochel writes:

> > +static inline int atomic_inc_and_read(atomic_t *v)
> > +{
> > + __asm__ __volatile__(
> > + LOCK "incl %0"
> > + :"=m" (v->counter)
> > + :"m" (v->counter));
> > + return v->counter;
> > +}

> BZZZT. If another CPU is also doing atomic_inc_and_read you could end
> up with both calls returning the same value.
>
> You can't do atomic_inc_and_read on 386. You can on cpus that have
> cmpxchg (e.g. later x86). You can also on machines with load-locked
> and store-conditional instructions (alpha, ppc, probably most other
> RISCs).

You can also do it with a conditional move instruction, but it's kind of
ugly. No help on a '386 though.

There are ways to do it that work on a 386, but they are all basically
equivalent to (or worse than) acquiring a spinlock, doing the deed, and then
releasing it.

You could also do (in pseudo-code):

top:
ret <- v->counter
inc ret
LOCK incl v->counter
cmp v->counter, ret
jz end
LOCK decl v->counter
jmp top:
end:
return ret

This does not strictly guarantee in order return values, but that's
meaningless without a lock anyway.

DS


2003-06-13 12:29:42

by Richard B. Johnson

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, 12 Jun 2003, Robert Love wrote:

> On Thu, 2003-06-12 at 16:40, Paul Mackerras wrote:
>
> > BZZZT. If another CPU is also doing atomic_inc_and_read you could end
> > up with both calls returning the same value.
>
> That is what I thought. Damn.
>
> > You can't do atomic_inc_and_read on 386. You can on cpus that have
> > cmpxchg (e.g. later x86). You can also on machines with load-locked
> > and store-conditional instructions (alpha, ppc, probably most other
> > RISCs).
>
> So this is doable on everything but old i386 chips... hrm.
>
> Robert Love
>

No! They do not return the same value. They just don't return the
final value, and the final value might not be important if the
atomic operations are used correctly.

The atomic stuff is to get rid of the read/modify/write problem where
you have a read *INTERRUPT* (other read, other modify, other write),
modify, write. Now the operand is nothing like expected. The
atomic operations guarantee that the memory operand will, in fact,
be completely modified as expected. Reading any value after such
modification does not necessarily return the final value because
somebody could modify it (again atomically), before you get
a chance to read it.

So, if you have N atomic_inc() and N atomic_dec() operations, the
value will be whatever it was before the first operation. In other
words, the value will be correct.

FYI, all memory modify operations, not using an intermediate register,
in 32-bit mode, of a longword or less, on ix86 machines are atomic,
even without the lock prefix.

like:

memory: .long 0
incl (memory)
incw (memory)
incb (memory)

This is because the CPU does not load the operand, modify it, then
write it back. It might "physically" do this in hardware, but the
physical operation cannot be interrupted until complete. So, in
principle, the lock instruction isn't necessary for things like above.

If you really need to know what the value of the memory operand
was at the instant it was modified, you need use the locked/exchange
instructions. Normally, you don't need to know the exact value
of some semaphore, etc., only that some conditions were true or
false at the instant of a test.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

2003-06-13 12:40:32

by Olivier Galibert

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Fri, Jun 13, 2003 at 08:44:52AM -0400, Richard B. Johnson wrote:
> FYI, all memory modify operations, not using an intermediate register,
> in 32-bit mode, of a longword or less, on ix86 machines are atomic,
> even without the lock prefix.

Unless you're on SMP. Of course 80386 SMP is not really what people
want to do, but people may compile a "universal" 386 SMP kernel and
run it on a later SMP box.

OG.

2003-06-13 13:20:28

by Alan

Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Gwe, 2003-06-13 at 12:00, Paul Mackerras wrote:
> Alan Cox writes:
>
> > lock xaddl $1, [foo]
>
> On 386? I stand corrected. :)

Turns out it's 486 and higher, so you win.
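
For concreteness, a standalone sketch of the xadd approach Alan points at
(486 and later only): xadd leaves the old value of the memory operand in the
source register, so old + 1 is the new sequence number. A plain int and an
explicit "lock" prefix stand in for the kernel's atomic_t and LOCK macro, so
this is an illustration rather than a patch:

#include <stdio.h>

static inline int inc_and_return(int *counter)
{
	int i = 1;

	__asm__ __volatile__(
		"lock; xaddl %0, %1"
		: "+r" (i), "+m" (*counter)
		: : "memory");
	return i + 1;	/* old value + 1 == the value we just stored */
}

int main(void)
{
	int seq = 0;

	printf("%d\n", inc_and_return(&seq));
	printf("%d\n", inc_and_return(&seq));
	printf("%d\n", inc_and_return(&seq));
	return 0;
}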

2003-06-13 15:33:11

by Davide Libenzi

Subject: RE: [PATCH] udev enhancements to use kernel event queue

On Fri, 13 Jun 2003, David Schwartz wrote:

> You could also do (in pseudo-code):
>
> top:
> ret <- v->counter
> inc ret
> LOCK incl v->counter
> cmp v->counter, ret
> jz end
> LOCK decl v->counter
> jmp top:
> end:
> return ret
>
> This does not strictly guarantee in order return values, but that's
> meaningless without a lock anyway.

It is even worse, since it can return the same values. Suppose counter is 0:

TASK0                                   TASK1

top:
ret <- v->counter      ; ret = 0
inc ret                ; ret -> 1
LOCK incl v->counter   ; counter -> 1

                                        top:
                                        ret <- v->counter      ; ret = 1
                                        inc ret                ; ret -> 2
                                        LOCK incl v->counter   ; counter -> 2
                                        cmp v->counter, ret    ; match !! got 2
                                        jz end
                                        ...

cmp v->counter, ret    ; ret == 1, counter == 2, no match
jz end
LOCK decl v->counter   ; counter -> 1
jmp top

Then TASK0 starts again with no interruption and gets 2. You need
instructions that do an atomic modify/compare to do this kind of trick.



- Davide

2003-06-13 15:37:27

by Steven Dake

Subject: Re: [PATCH] udev enhancements to use kernel event queue



Greg KH wrote:

>Nice, you posted the same message and tarball to lkml that we have been
>talking about privately for a few days.
>
>Just to save time, and to bring everyone else up to speed, I'll just
>include my responses and then yours to this thread and then finally
>respond to your last message to me about this.
>
>Let me know if I get any of the attribution wrong, this is going to be a
>fun cut and paste...
>
>
Thanks for bringing everyone up to speed. Everything looks correct from
a quoting standpoint.

>
>On Thu, Jun 12, 2003 at 12:10:48PM -0700, Steven Dake wrote:
>
>
>>Folks,
>>
>>I have been looking at the udev idea that Greg KH has developed.
>>Userland device enumeration definitely is the way to go, however, there
>>are some problems with using /sbin/hotplug to transmit device
>>enumeration events:
>>
>>1) performance of fork/exec with lots of devices is very poor
>>
>>
>
>You later said your binary of udev was 500k. I said:
> 500k binary? Who's running anything that big? udev weighs in
> around 6k right now! You have to add a _lot_ of functionality
> to move up to 500k.
>
>You responded:
> Hmm, perhaps my binary was not built optimized when I looked at
> it, I'll take another look.
>
>
>
>>2) lots of processes can be forked with no policy to control how many
>>are forked at once potentially leading to OOM
>>
>>
>
>I responded:
> No, this is pure speculation. Have you actually tried this out?
> I have. I've hotplugged pci drawers connected to huge numbers
> of SCSI disks and never even seen a blip on system load. It
> turns out that pretty much everything happens consecutively, with
> a reasonable amount of time just spent probing the SCSI busses.
> USB is pretty slow in enumerating their devices too, so I don't
> see a real world application that has this problem. Do you have
> any proof of this?
>
>You responded:
> I don't agree. I've timed bootup with 12 fibrechannel disks
> with 4 partitions each with udev using the SDEQ patches. I've
> then timed bootup and added in the time required to add those 12
> fibrechannel disks (via a script) using /sbin/hotplug. The
> script method (executing /sbin/hotplug 12*4 times with the
> appropriate data hardcoded) takes roughly 1.25-1.75 seconds
> longer. Since I'm using a chronograph, it's not very accurate,
> but it is definitely slower to enumerate with /sbin/hotplug
> using this methodology. That is only 12 disks. What about the
> case of 1000 disks?
>
> fork and exec are very expensive system calls, to be avoided at
> all costs...
>
>To which I responded:
> Well, if you don't dynamically link a 500K file, perhaps it
> would be a lot faster to start up :)
>
> Anyway, who really cares? This is _not_ a time critical thing.
> Who really cares that it takes 1 second longer to see the device
> node for the newly plugged in device? And if you plug in 5000
> devices all at once, you better be willing to wait the extra
> minute or so for the system to calm down and make all of the
> devices available to userspace.
>
>And again you came back with:
> System bootup time and time between failures are the two factors
> used to determine availability. Reducing bootup time and
> increasing time between failures both improve availability. Even
> 1 second is considerable when 5 9s is 30 seconds of downtime per
> year. 1 more second is a considerable change in availability
> when you run the equations.
>
>I now respond:
>Bootup time reduction would be a great thing to have. And I agree for
>situations where 1 second is a considerable amount of time, it is
>important. However, 99.99% of the people running Linux out there do
>not have those problems. And even then, the 0.01% of the people who
>do will do whatever they possibly can to keep from having to reboot
>at all. I still say the time to plug in a new disk and have its device
>node created is not measurably different between using /sbin/hotplug
>and udev, and using your character node. Also, the ultimate solution
>for these kinds of people is the in-kernel devfs, or the current
>static /dev. If they use either of them, it's guaranteed to be faster
>than either your implementation or mine.
>
>
I suppose it is possible that devfs could be faster; however, there are
significant policies that can never be implemented in devfs and must
be done in user space. For these types of applications, it makes
sense to provide the fastest mechanism available, even when it may only
improve boot performance by 1 second.

I understand what you mean by saying that 99.99% of users don't care
about availability. While those particular users may spend significant
amounts of time improving Linux, it is the remaining folks that care
about availability that are interested in investing money into R&D to
improve availability while also improving the feature set. It is this set
of folks (the people that do care about availability) that this patch
is targeted towards.

>
>
>>3) /sbin/hotplug events can occur out of order, eg: remove event occurs,
>>/sbin/hotplug sleeps waiting for something, insert event occurs and
>>completes immediately. Then the remove event completes, after the
>>insert, resulting in out-of-order completion and a broken /dev. I have
>>seen this several times with udev.
>>
>>
>
>I responded:
> Yes this happens. I have a fix for this for udev itself. No
> kernel changes are needed. I'll show it at OLS in July if you
> want to see it :)
>
>You responded:
> There is a problem in udev that fork doesn't wait for the parent
> to exit (mknod) which is solved by using the mknod system call.
> But there is still another problem that /sbin/hotplug can
> execute out of order. I'll be at OLS so I'll take a look at
> your solution then.
>
>So we agree to talk about this at OLS, fine.
>
>
>
>>4) early device enumeration (character devices, etc) are missed because
>>/sbin/hotplug is not available during early device init.
>>
>>
>
>I responded:
> Then use early initramfs to catch this. I've done this and seen
> it work already. If you don't want to do early init, walk the
> sysfs tree yourself from userspace, it works just as well, and
> we don't have to do any buffering in kernelspace at all.
>
>You responded:
> Initramfs still misses early events. That leaves walking sysfs.
>
> I have considered and rejected using a user space program to
> walk sysfs for this information. There are several problems
> 1) kobject device name is not present in sysfs (how do you know
> if it's a character or block, then without some external mapping
> db, yuck)
> 2) scan of sysfs is not atomic, resulting in potential lost
> events and lost device enumerations
>
>I responded:
> You don't know this from hotplug either. You have to figure it
> out from the kobject name either way. Actually it's quite easy,
> only block devices show up in /sys/block, everything else is a
> character device.
>
> No, this is quite simple to fix:
> - set up /sbin/hotplug udev handler after init has
> started.
> - start walking sysfs to "cold plug" devices.
> - any new events that happen are caught by udev and
> handled there, while you are walking the tree.
> - no events are lost.
>
>You responded:
> Events could be handled twice in this situation, which is
> definitely not desirable.
>
>I now respond:
>Doing a mknod() on a node that is already present doesn't harm anything.
>And since you have to keep a database around of what devices are
>present, and what you have called them, removing a device that has
>already been removed is detectable. So handling the same event twice
>isn't a problem.
>
>
I suppose you could handle events twice, but the ability to detect a
replacement without the OS removing the initial kobject is then lost.
This is a real case which should be handled.

>
>
>>To solve these problems, I've developed a driver and some minor
>>modifications to the lib/kobject.c to queue these events and allow a
>>user space program to read the events at its leisure. The driver
>>supports poll/select for efficient use of the processor.
>>
>>The feature is called the System Device Enumeration Event Queue. I've
>>attached a tarball which includes udev-0.1 modified to use the SDEQ, a
>>kernel patch against linux 2.5.70 that implements the SDEQ, and a
>>test/library that tests the SDEQ functionality.
>>
>>
>
>I responded:
> Ok, I looked at your code. To put it kindly, it will never work
> properly. See the long thread on lkml by Inaky after I
> announced udev. He tried to create something like what you have
> done, but got a bit better implementation (yours has a number of
> memory leaks, and permission problems...)
>
> See that thread for why doing this in the kernel is not the way
> to go. Please don't waste your time on this anymore, it's a
> losing battle...
>
>You responded:
> It obviously works properly when using the udev application so I
> am at a loss as to how you can make a claim that "it will never
> work". Please explain your comments about memory leaks. Data
> structures are allocated and then freed at appropriate times.
> Even in the case of a crash during a transaction (get event,
> clear events), there are no leaks. Please explain permission
> problems. Permissions are controlled through policy in the
> filesystem (/dev/sdeq) which fully controls access to the system
> device enumeration event queue.
>
> I did look at Inaky's code, and frankly it was too complicated
> to solve the simple problem that needed to be solved. We also
> have a full-blown event mechanism (for redundant system slot
> compact pci support), but feel it's too heavy-weight to be of
> general purpose use to the kernel. This particular
> implementation is lightweight and solves a specific need with
> real applications that can really use the functionality.
>
> I have read the thread and at least 70% of the comments were "we
> should not use /sbin/hotplug, but instead events should be
> queued in the kernel". The thread raised all of the issues that
> are solved with this kernel patch. Several core kernel
> maintainers were the authors of the comments, so I'm not alone
> in this belief.
>
> You're obviously opposed to using character devices to queue
> kobject creation/deletion events for some reason which I don't
> understand. /sbin/hotplug is an inferior solution to queueing
> events in the kernel. You're not the maintainer of the code this
> feature is going into (which is the Linux driver model). I'd
> really like to hear Pat's thoughts on this issue as he is the
> one that has to live with the changes. If you want to continue
> using /sbin/hotplug, with all of its numerous problems, for your
> work, you're more than welcome to do so. No one is forcing you to
> use sdeq; however, for those other vendors that may want to make
> persistent device naming solutions, this is a core feature that
> cannot easily be solved by poorly thought out hacks.
>
>I will respond now:
>
>Ok, permission problems can be solved by relying on the permission of
>the device node, you are correct.
>
>As for the memory issues, if no one ever reads from the character node,
>it will constantly fill up with events, right? Shouldn't they be
>eventually flushed out at some point in time?
>
>
This is a problem... I wasn't quite sure how to handle this. The two
choices are to include timeouts in events (after a certain amount of
time, events are timed out and freed) or to allow only a certain number
of events, after which events at the front of the queue are flushed.

The reality though, is that the user will be running the daemon to clear
out the events. If they don't, then they get what they deserve :)

>Also for trying to remove events, why is userspace allowed to remove a
>multiple of events at once? I would think that a valid read would
>remove that event, you can have two applications grab the same event,
>which is not what I think you want to have happen.
>
I probably should have explained why this decision was made. I believe
it is desirable to be able to handle multiple events at once. The
process is: user reads all events on event queue, user processes events,
user deletes all events atomically. This solves the problem of what
happens if the daemon crashes during deletion of events, or during
processing of events. The daemon will be able to totally recover and
process events properly.
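
A minimal sketch of that daemon loop, assuming a poll()-able /dev/sdeq and
two placeholder requests (SDEQ_GET_EVENTS, SDEQ_CLEAR_EVENTS; the names and
numbers are made up for illustration and are not the real sdeq interface):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Placeholder request numbers; the real sdeq interface may differ. */
#define SDEQ_GET_EVENTS    _IO('S', 1)
#define SDEQ_CLEAR_EVENTS  _IO('S', 2)

int main(void)
{
    static char buf[65536];
    struct pollfd pfd;
    int fd = open("/dev/sdeq", O_RDONLY);

    if (fd < 0) {
        perror("open /dev/sdeq");
        return 1;
    }
    pfd.fd = fd;
    pfd.events = POLLIN;

    for (;;) {
        /* Sleep until the driver says events are queued. */
        if (poll(&pfd, 1, -1) < 0)
            break;

        /* 1. Fetch every queued event; the queue keeps its copy. */
        if (ioctl(fd, SDEQ_GET_EVENTS, buf) < 0)
            break;

        /* 2. Process the batch (mknod/unlink device nodes, ...).  If the
         *    daemon crashes here, the events are still queued and will
         *    simply be fetched again after a restart. */

        /* 3. Only when the whole batch has been handled, delete it in
         *    one atomic step. */
        if (ioctl(fd, SDEQ_CLEAR_EVENTS) < 0)
            break;
    }
    close(fd);
    return 0;
}

The point is simply that the destructive step happens once, after processing,
so a crash anywhere before it leaves the queue intact.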

>
>And finally...
>
>Yes, I know I'm not the maintainer of the code that this feature is
>going into. But my position is that this code is not needed in the main
>kernel tree, and offer that opinion to the maintainer of this code.
>
>I agree, most of the arguments in the previous udev thread talked about
>out of order events, but I think I've gotten that solved now (I can wave
>my hands about how it will be done, or I can show you working code at
>OLS. I think I'll wait until OLS, as code matters, wavy hands don't.)
>
>So, because of this, I still think that everything you are trying to do
>can be successfully done from userspace today (or use devfsd today, it
>will give you the same info!) And due to that, I do not think this code
>is necessary to be in the kernel.
>
>
Everything could possibly be done in userspace, but performance is poor,
and I am still concerned about the out-of-order events issue.

>Thanks for reading this far, and my apologies if I got any of the above
>quotes out of order or misspoken, I just wanted to cut to the chase, and
>not have to go through 4 more rounds of email to make it to this point
>in the conversation again.
>
>
Thanks greg
-steve

>greg k-h
>
>
>
>
>

2003-06-13 15:51:56

by Steven Dake

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

Alan,
Thanks, I didn't see that patch come by. I'll have a look.

-steve

Alan Cox wrote:

>On Iau, 2003-06-12 at 23:03, Andrew Morton wrote:
>
>
>>This is a significantly crappy aspect of the /sbin/hotplug callout. I'd be
>>very interested in reading an outline of how you propose fixing it, without
>>waiting until OLS, thanks.
>>
>>
>
>Dave Miller posted a simple patch to allow netlink to be used for
>kernel->user messages - hotplug/disk error/logging/whatever. I'm
>surprised therefore that the whole thing is being regurgitated again.
>
>
>
>
>
>

2003-06-13 15:53:06

by Steven Dake

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue



Oliver Neukum wrote:

>>If it works for you or doesn't or you like the idea or don't, I've love
>>to hear about it
>>
>>
>
>+ default:
>+ result = -EINVAL;
>+ break;
>+ }
>+ return (result);
>
>Must return ENOTTY.
>
>+static int sdeq_open (struct inode *inode, struct file *file)
>+{
>+ MOD_INC_USE_COUNT;
>+
>+ return 0;
>+}
>+
>+static int sdeq_release (struct inode *inode, struct file *file)
>+{
>+ MOD_DEC_USE_COUNT;
>+
>+ return (0);
>+}
>
>Wrong. release does not map to close()
>
>
>
Hmm, not sure where I got that from; I'll fix it, thanks.

>Aside from that, what exactly are you trying to do?
>You are not solving the fundamental device node reuse race,
>yet you are making necessary a further demon.
>
>
For device enumeration, I see a daemon as necessary. The main goal of
this work is to solve the out-of-order execution of /sbin/hotplug and
improve performance of the system during device enumeration with
significant numbers of devices (200 disks, 4 partitions each). Boot
time with this scheme appears, in my rudimentary tests, to be faster on
the order of 1-2 seconds for the case of just 12 disks. I
would imagine 200 disks (which I don't have a good way to test, as I
don't have 200 disks:) would provide better speed gains during bootup.
This compares greg's original udev to this patched udev binary.

>You are not addressing queue limits. The current hotplug
>scheme does so, admittedly crudely by failing to spawn
>a task, but considering the small numbers of events in
>question here, for the time being we can live with that.
>
>You can just as well add load control and error detection
>to the current scheme. You fail to do so in your scheme.
>You cannot queue events forever in unlimited numbers.
>
>
I agree there should be some way of limiting events. I'll add this set
of code.

>As for ordering, this is a real problem, but not fundamental.
>You can make user space locking work. IMHO it will not be
>pretty if done with shell scripts, but it can work.
>There _is_ a basic problem with the kernel 'overtaking'
>user space in its view of the device tree, but you cannot solve
>that _at_ _all_ in user space.
>
>In short, if you feel that the hotplug scheme is inadequate
>for your needs, then write industry strength devfs2.
>
>
devfs is not appropriate as it does not allow for complex policy with
external attributes that the kernel is unaware of. For example, let's
take the situation where a policy must access a cluster-wide manager to
determine some information before it can make a policy decision. For
that to occur, there must be sockets, and hopefully libc, which puts the
entire thing in user space. Who would want to write policies in the
kernel? uck.

> Regards
> Oliver
>
>
>
>
Thanks
-steve

2003-06-13 16:18:29

by Steven Dake

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue



Patrick Mochel wrote:

>*Sigh* You're asking for trouble, aren't you?
>
>For the new spectators in the crowd, Steven, Greg and I have been on a
>private email thread about this. This is the original email that he sent,
>with all comments made to him completely disregarded..
>
>
>
>>I have been looking at the udev idea that Greg KH has developed.
>>Userland device enumeration definately is the way to go, however, there
>>are some problems with using /sbin/hotplug to transmit device
>>enumeration events:
>>
>>1) performance of fork/exec with lots of devices is very poor
>>
>>
>
>Please define:
>
>- 'lots'
>
>
I tested 12 devices.

>- 'poor'
>
80% slower for enumeration.

>
>
>The performance of /sbin/hotplug doesn't matter. It's a blip on the radar.
>If you have lots of devices (e.g. 1000 SCSI disks) the time it takes to
>fork + exec /sbin/hotplug is nothing compared to what it would take to
>spin up and probe each of the disks. If you don't have a lot of devices,
>then you're going to be spending a lot less time running /sbin/hotplug. So
>little that it's not going to make a difference, I'm willing to wager.
>
>If you have accurate time measurements, then I would be interested in
>seeing them. Though, I request that you test the following:
>
>- The default /sbin/hotplug for Redhat 9 systems
>- The default /sbin/hotplug for Montavista systems.
>- Greg's mini-hotplug
>- Your sdeq
>
>Please test against an identical kernel, with and without your patch.
>Please also post links to (or attach) the /sbin/hotplug binaries and
>scripts that you tested with. And, please post the complete 'tree -d'
>output of sysfs on the system(s) you tested on.
>
>
>
>>2) lots of processes can be forked with no policy to control how many
>>are forked at once potentially leading to OOM
>>
>>
>
>Have you seen this happen? Sure, it's theoretically possible, but I highly
>doubt it will ever happen on a real system. Typically, if you have an
>astronomical number of devices, then you'll have an incredible amount of
>memory, too. Please post a bug report when you see this.
>
>
I have not seen this happen, however, it seems clear that it would
happen if /sbin/hotplug were called a sufficient number of times.

>
>
>>3) /sbin/hotplug events can occur out of order, eg: remove event occurs,
>>/sbin/hotplug sleeps waiting for something, insert event occurs and
>>completes immediately. Then the remove event completes, after the
>>insert, resulting in out-of-order completion and a broken /dev. I have
>>seen this several times with udev.
>>
>>
>
>This is true, and I'll let Greg handle this one.
>
>
>
>>4) early device enumeration (character devices, etc) are missed because
>>/sbin/hotplug is not available during early device init.
>>
>>
>
>initramfs, as well as one of many cold-plugging solutions out there should
>suffice for this. If they don't, we would welcome and appreciate your
>effort in helping overcome these deficiencies by evolving the current
>system rather than attempting a hostile takeover.
>
>
I am not attempting a hostile takeover. The patch provides an ADDITIONAL
solution to what is already available which significantly improves
performance.

>
>
>>To solve these problems, I've developed a driver and some minor
>>modifications to the lib/kobject.c to queue these events and allow a
>>user space program to read the events at its leisure. The driver
>>supports poll/select for effecient use of the processor.
>>
>>
>
>Listen, it's not going to fly. Greg is hotplug czar, and I won't take the
>patches if he doesn't like them.
>
>
This is not a hotplug issue, it is strictly a system device enumeration
issue.

>Secondly, you're just not going to replace hotplug. At least not now.
>/sbin/hotplug works today and has no serious, provable issues, besides
>events coming in out of order. Incredibly long boot times, OOM situations
>just have not happened. And, we've agreed that we're not going to
>implement an overdesigned solution to fix problems that aren't there yet.
>Show us REAL bugs and we'll work on making what we have better.
>
>
I have no intention of replacing hotplug. The patch is in addition to
hotplug, and uses the good infrastructure Greg and yourself have already
put in place.

>Thirdly, /sbin/hotplug is an ASCII interface, providing flexibility in the
>userspace agent implementations. Your character device (which is not
>appreciated) forces the userspace tools to use select(2), poll(2), or
>ioctl(2). No simple read(2)/write(2). Binary interfaces == Bad(tm). If you
>have doubts, please read the threads concerning binary vs. ASCII
>interfaces over the last two years before replying.
>
>
I believe it is highly desirable to be able to use select. Imagine the
case where the kernel generates an event, and some external agent, that
wants to enumerate devices, must also generate an event. This allows a C
program to select which fd to use for I/O. I don't understand the
problems with binary interfaces, as they are numerous in the kernel. For
an example, take a look at the syscall interface!
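
For instance, a daemon that has to watch both the kernel event source and an
external agent can multiplex the two with a single select(); a small sketch
(the pipe in main() only stands in for the agent's socket):

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Wait until either the kernel event source or the external agent's
 * descriptor becomes readable, and report which one it was.
 */
static int wait_for_event(int kernel_fd, int agent_fd)
{
    fd_set rfds;
    int maxfd = (kernel_fd > agent_fd ? kernel_fd : agent_fd) + 1;

    FD_ZERO(&rfds);
    FD_SET(kernel_fd, &rfds);
    FD_SET(agent_fd, &rfds);

    if (select(maxfd, &rfds, NULL, NULL, NULL) < 0)
        return -1;
    if (FD_ISSET(kernel_fd, &rfds))
        return kernel_fd;       /* a kernel enumeration event is pending */
    return agent_fd;            /* the external agent has something for us */
}

int main(void)
{
    int p[2];

    if (pipe(p) < 0)            /* the pipe stands in for the agent socket */
        return 1;
    write(p[1], "x", 1);        /* make it readable right away */
    printf("readable fd: %d\n", wait_for_event(STDIN_FILENO, p[0]));
    return 0;
}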

>I'm not even going to go so far as to comment on the code, since conceptually
>it's FITH.
>
>Finally, I think you need a serious attitude readjustment. You've exceeded
>all expectations in the level of jackass that you've achieved. You
>completely disregarded Greg's comments in a private thread started from
>the SAME EXACT email. You were pompous and arrogant to the person whose
>code you're trying to replace. Then, you have the nerve to start over on
>this list. Did you think we wouldn't notice? If anything, you've
>guaranteed that I will never take your emails seriously again.
>
>
I did not disregard Greg's comments. The bottom line is the performance
issues and out-of-order issues have not been addressed. Perhaps the
out-of-order issue can be addressed, but performance of /sbin/hotplug
will _always_ be slower than the operation of one process on an event
queue. You can wait until enough people bitch about the performance of
/sbin/hotplug and evolve /sbin/hotplug into the scheme I propose, or you
can take this patch as is. I really don't care as I don't make patches
available for my benefit; they are for the benefit of the community.

Finally, I didn't realize the sandbox was full. If you don't want to
play with my toys, you certainly don't have to.

Thanks
-steve

>I really wish the device driver people at Montavista would pool their
>collective partial clues and figure WTF the rest of the world is doing.
>Do you guys listen? Do you read email? Hello? McFly?
>
>I'm frankly sick of you people wasting our time with these half-assed,
>hare-brained hacks that you expect us to love and integrate. You never try
>to evolve the current system; it's always a case of rewriting something
>that at least works with a completely new interface. You don't listen to
>our input, and you disappear after a few emails. Then, you try it all
>again a few months later.
>
>It's frustrating because you're obviously talented and have the
>time+energy to help, but you never seem to attempt to align yourselves
>with us or the rest of the kernel.
>
>Please, either a) re-read the email threads, the papers, the magazine
>articles, and/or the text file documentation and attempt to work with us;
>or b) stop trying to get us to listen.
>
>
> -pat
>
>
>
>
>
>
>

2003-06-13 16:26:07

by Patrick Mochel

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue


> I suppose it is possible that devfs could be faster, however, there are
> significant amounts of policies that can never be done in devfs which
> must be done in user space. For these types of applications, it makes
> sense to provide the fastest mechanism available, even when it may only
> improve boot performance by 1 second.

Eh? Why must you completely re-engineer a solution because you see the
current one as deficient? Not only is it completely over-engineered by
enforcing your fanatical ideas about requiring a new system daemon, but
it's total premature optimization.

On top of that, you don't have any accurate numbers to back up your
claims. Please perform and post the timings I suggested yesterday, and
then we'll talk.

> I understand what you mean by saying that 99.99% of users don't care
> about availability. While those particular users may spend significant
> amounts of time improving Linux, it is the remaining folks that care
> about availability that are interested in investing money into r&d to
> improve availability while also improving feature set. It is this set
> of folks, (the people that do care about availability) that this patch
> is targeted towards.

Then it is your responsibility to merge the continuing efforts and design
goals of the kernel with the requirements of your high-paying customers in
the smoothest possible way. Serving one while ignoring the other is a good
way to waste a lot of time and money.

I care about availability. But, I am not willing to integrate or support
features that come from some random person just because they claim to
improve availability, especially when a) I don't like the numbers and b)
there are no numbers to back it up.

> >As for the memory issues, if no one ever reads from the character node,
> >it will constantly fill up with events, right? Shouldn't they be
> >eventually flushed out at some point in time?
> >
> >
> This is a problem... I wasn't quite sure how to handle this. The two
> choices are to include timeouts in events (after a certain amount of
> time, events are timed out and freed) or to allow only a certain number
> of events, after which events at the front of the queue are flushed.
>
> The reality though, is that the user will be running the daemon to clear
> out the events. If they don't, then they get what they deserve :)

And this improves availability how?


-pat

2003-06-13 16:30:26

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Fri, Jun 13, 2003 at 09:40:37AM +0100, Alan Cox wrote:
> On Iau, 2003-06-12 at 23:03, Andrew Morton wrote:
> > This is a significantly crappy aspect of the /sbin/hotplug callout. I'd be
> > very interested in reading an outline of how you propose fixing it, without
> > waiting until OLS, thanks.
>
> Dave Miller posted a simple patch to allow netlink to be used for
> kernel->user messages - hotplug/disk error/logging/whatever. I'm
> surprised therefore that the whole thing is being regurgitated again.

For error logging stuff I think the netlink interface is fine. And I'm
pretty sure some of the IBM RAS people are looking into using it.

But as a hotplug interface, no, I do not want to change to using it.
The bash /sbin/hotplug plugin writers would hate me... :)

thanks,

greg k-h

2003-06-13 16:29:10

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Fri, Jun 13, 2003 at 08:51:01AM -0700, Steven Dake wrote:
> >I now respond:
> >Bootup time reduction would be a great thing to have. And I agree for
> >situations where 1 second is a considerable amount of time, it is
> >important. However, 99.99% of the people running Linux out there, do
> >not have those problems. And even then, for those 00.01% of the people
> >who do, they do whatever they possibly can to keep from having to reboot
> >at all. I still say the measurable amount of time to plug in a new
> >disk and have its device node created is not measurable, from using
> >/sbin/hotplug and udev, and using your character node. Also, the
> >ultimate solution for these kinds of people is the in-kernel devfs, or
> >the current static /dev. If they use either of them, it's guaranteed to
> >be faster than your or my implementation.
> >
> >
> I suppose it is possible that devfs could be faster, however, there are
> significant amounts of policies that can never be done in devfs which
> must be done in user space. For these types of applications, it makes
> sense to provide the fastest mechanism available, even when it may only
> improve boot performance by 1 second.

That's what devfsd can be used for. It provides you with a way to get
those events, in a binary fashion, with no out-of-order issues, in a
"quick" way. People have implemented persistant device naming schemes
by using this interface in the past. Don't write something new when
what you are looking for is already there.

> I understand what you mean by saying that 99.99% of users don't care
> about availability. While those particular users may spend significant
> amounts of time improving Linux, it is the remaining folks that care
> about availability that are interested in investing money into r&d to
> improve availability while also improving feature set. It is this set
> of folks, (the people that do care about availability) that this patch
> is targeted towards.

Like Pat said, until you prove the speed and uptime differences between
the following, I don't see how you can say this with a straight face:
- The default /sbin/hotplug for Redhat 9 systems
- The default /sbin/hotplug for Montavista systems.
- my mini-hotplug
- Your sdeq
- devfsd
After running those numbers, please let us know what you find out.

> >I now respond:
> >Doing a mknod() on a node that is already present doesn't harm anything.
> >And since you have to keep a database around of what devices are
> >present, and what you have called them, removing them after already
> >removing them again, is detectable. So handing the same event twice
> >isn't a problem.
> >
> >
> I suppose you could handle multiple events twice, but the ability to
> detect a replacement without the OS removing the initial kobject is now
> removed. This is a real case which should be handled.

Um, huh? How can the OS not remove a kobject, and not tell userspace?
And if a device is unplugged, and then plugged right back in again, in
the same place, with the same attributes (which is the only way you
would not be able to detect this easily), what is the big problem? You
would act upon that device in the same way, again, which would be just
fine.

> >As for the memory issues, if no one ever reads from the character node,
> >it will constantly fill up with events, right? Shouldn't they be
> >eventually flushed out at some point in time?
> >
> This is a problem... I wasn't quite sure how to handle this. The two
> choices are to include timeouts in events (after a certain amount of
> time, events are timed out and freed) or to allow only a certain number
> of events, after which events at the front of the queue are flushed.
>
> The reality though, is that the user will be running the daemon to clear
> out the events. If they don't, then they get what they deserve :)

Heh, then your kernel which _has_ to stay up (due to your previous
guarantees of uptime) keels over. I don't think that requiring that
a kernel _has_ to have a specific userspace program running in order for
it to stay healthy would be anything any "carrier grade" user would ever
agree to.

:)

> >Also for trying to remove events, why is userspace allowed to remove a
> >multiple of events at once? I would think that a valid read would
> >remove that event, you can have two applications grab the same event,
> >which is not what I think you want to have happen.
> >
> I probably should have explained why this decision was made. I believe
> it is desireable to be able to handle multiple events at once. The
> process is: user reads all events on event queue, user processes events,
> user deletes all events atomically. This solves the problem of what
> happens if the daemon crashes during deletion of events, or during
> processing of events. The daemon will be able to totally recover and
> process events properly.

But your implementation does not do this, so you added features and
complexity before they were needed :)

Again, you are relying on a userspace program to keep the kernel safe,
not a wise choice.

> >So, because of this, I still think that everything you are trying to do
> >can be successfully done from userspace today (or use devfsd today, it
> >will give you the same info!) And due to that, I do not think this code
> >is necessary to be in the kernel.
> >
> >
> Everything could possibly be done in userspace, but performance is poor,
> and I am still concerned about the out-of-order events issue.

We just solved the out-of-order events issue. I don't see any proof of
your performance claims. Hence, I don't think this patch should be
applied.

thanks,

greg k-h

2003-06-13 16:35:29

by Patrick Mochel

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue


> >+static int sdeq_open (struct inode *inode, struct file *file)
> >+{
> >+ MOD_INC_USE_COUNT;
> >+
> >+ return 0;
> >+}
> >+
> >+static int sdeq_release (struct inode *inode, struct file *file)
> >+{
> >+ MOD_DEC_USE_COUNT;
> >+
> >+ return (0);
> >+}
> >
> >Wrong. release does not map to close()
> >
> >
> >
> hmm not sure where I got that from i'll fix thanks.

You don't even need to modify the module refcount, since sdeq cannot even
be built as a module. Also, return is a statement, not a function; you
could at least be consistent.
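
With those two points applied, the handlers shrink to almost nothing; a
sketch of the corrected fragment of the driver (not the actual patch):

static int sdeq_open (struct inode *inode, struct file *file)
{
    return 0;   /* nothing to set up, and no module refcount to bump */
}

static int sdeq_release (struct inode *inode, struct file *file)
{
    return 0;   /* runs when the last reference to the file is dropped,
                   not on every close() */
}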

> For device enumeration, I see a daemon as necessary. The main goal of
> this work is to solve the out-of-order execution of sbin/hotplug and
> improve performance of the system during device enumeration with
> significant (200 disks, 4 partitions each) amounts of devices. Boot
> time with this scheme appears, in my rudimentary tests, to be faster on
> the order of 1-2 seconds for bootup for the case of just 12 disks. I
> would imagine 200 disks (which I don't have a good way to test, as I
> don't have 200 disks:) would provide better speed gains during bootup.
> This compares greg's original udev to this patched udev binary.

This part I really don't get about your argument.

For one, please back up your claim with the method that you used to test
and post some real numbers, showing real improvement. If you would like
more hardware to test on, we have an entire lab full of large systems
specifically for purposes like this. I don't know if we have a 200 disk
library, but you can see what you can get at:

http://osdl.org/stp/


For another, you're preaching about availability, from, I presume, the
carrier-grade camp, since Montavista has a CG distro, and you're involved
in our own CGL working group. But, I have never heard of a carrier-grade
system that used 12 disks, let alone 200. I fail to see how you're backing
up your argument about the 1 second that you're saving with a realistic
example.

But, I also don't think that the speed hit would increase linearly.

Besides, even if the system had access to 200+ disks, it's probably in a
separate library, accessible to the entire cluster. It's likely also a
commercial storage library, and not a linux system.

Please, educate me and prove me wrong; I'm not opposed to new ways of
thinking.

> devfs is not appropriate as it does not allow for complex policy with
> external attributes that the kernel is unaware of. For an example, lets
> take the situation where a policy must access a cluster-wide manager to
> determine some information before it can make a policy decision. For
> that to occur, there must be sockets, and hopefully libc, which puts the
> entire thing in user space. Who would want to write policies in the
> kernel? uck.

Cluster-wide manager of policy? Fine, you can do that with /sbin/hotplug,
too. But, I don't think network latency is going to help your boot time..


-pat

2003-06-13 16:51:27

by Patrick Mochel

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue


> >Please define:
> >
> >- 'lots'
> >
> >
> I tested 12 devices.
>
> >- 'poor'
> >
> 80% slower for enumeration.

Prove it. :)

> >>2) lots of processes can be forked with no policy to control how many
> >>are forked at once potentially leading to OOM
> >>
> >>
> >
> >Have you seen this happen? Sure, it's theoretically possible, but I highly
> >doubt it will ever happen on a real system. Typically, if you have an
> >astronomical number of devices, then you'll have an incredible amount of
> >memory, too. Please post a bug report when you see this.
> >
> >
> I have not seen this happen, however, it seems clear that it would
> happen if /sbin/hotplug were called a sufficient number of times.

You have the same problem, theoretically. If the daemon dies, the event
queue will fill and the system will crash. With the current scheme, user
processes will be killed and we will lose hotplug events. With your
scheme, the entire system will crash eventually.

> >Thirdly, /sbin/hotplug is an ASCII interface, providing flexibility in the
> >userspace agent implementations. Your character device (which is not
> >appreciated) forces the userspace tools to use select(2), poll(2), or
> >ioctl(2). No simple read(2)/write(2). Binary interfaces == Bad(tm). If you
> >have doubts, please read the threads concerning binary vs. ASCII
> >interfaces over the last two years before replying.
> >
> >
> I believe it is highly desirable to be able to use select. Imagine the
> case where the kernel generates an event, and some external agent, that
> wants to enumerate devices, must also generate an event. This allows a C
> program to select which fd to use for I/O. I don't understand the
> problems with binary interfaces, as they are numerous in the kernel. For
> an example, take a look at the syscall interface!

That doesn't mean they're the right thing to do in all cases. Binary
interfaces may offer better performance, but they're harder to maintain,
harder to debug, and harder to write agents for. You may be so
smart and arrogant as to disregard those things as trite, but they are
important, from a long-term development standpoint.

Again, please read the archives concerning this topic on linux-kernel
before commenting on this subject again.

> I did not disregard Greg's comments. The bottom line is the performance
> issues and out-of-order issues have not been addressed. Perhaps the
> out-of-order issue can be addressed, but performance of /sbin/hotplug
> will _always_ be slower than the operation of one process on an event
> queue. You can wait until enough people bitch about the performance of
> /sbin/hotplug and evolve /sbin/hotplug into the scheme I propose, or you
> can take this patch as is. I really don't care as I don't make patches
> available for my benefit; they are for the benefit of the community.

Ok, I'll wait. I would rather see slow, marked improvements than blindly
applying and believing the whims of some random kernel hacker under the
farce of altruism and increased availability.

> Finally, I didn't realize the sandbox was full. If you don't want to
> play with my toys, you certainly don't have to.

It isn't full, and I will not touch your toys, thank you very much. But,
one of the first social rules that people usually learn in the sandbox is
that in order to play with the other kids, you have to play their games
with their rules.


-pat

2003-06-13 16:58:38

by Steven Dake

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue



Patrick Mochel wrote:

>>>+static int sdeq_open (struct inode *inode, struct file *file)
>>>+{
>>>+ MOD_INC_USE_COUNT;
>>>+
>>>+ return 0;
>>>+}
>>>+
>>>+static int sdeq_release (struct inode *inode, struct file *file)
>>>+{
>>>+ MOD_DEC_USE_COUNT;
>>>+
>>>+ return (0);
>>>+}
>>>
>>>Wrong. release does not map to close()
>>>
>>>
>>>
>>>
>>>
>>hmm not sure where I got that from i'll fix thanks.
>>
>>
>
>You don't even need to modify the module refcount, since sdeq cannot even
>be built as a module. Also, return is a statement, not a function; you
>could at least be consistent.
>
>
Yes, this area needs some work. I copied from a previous driver, which
was obviously also not consistent. :)

>
>
>>For device enumeration, I see a daemon as necessary. The main goal of
>>this work is to solve the out-of-order execution of sbin/hotplug and
>>improve performance of the system during device enumeration with
>>significant (200 disks, 4 partitions each) amounts of devices. Boot
>>time with this scheme appears, in my rudimentary tests, to be faster on
>>the order of 1-2 seconds for bootup for the case of just 12 disks. I
>>would imagine 200 disks (which I don't have a good way to test, as I
>>don't have 200 disks:) would provide better speed gains during bootup.
>>This compares greg's original udev to this patched udev binary.
>>
>>
>
>This part I really don't get about your argument.
>
>For one, please back up your claim with the method that you used to test
>and post some real numbers, showing real improvement. If you would like
>more hardware to test on, we have an entire lab full of large systems
>specifically for purposes like this. I don't know if we have a 200 disk
>library, but you can see what you can get at:
>
> http://osdl.org/stp/
>
>
Brian Jackson (open gfs project) has offered to run a test case. I'll
work with him to determine a test methodology and test results.

>
>For another, you're preaching about availability, from I presume the
>carrier-grade camp, since Montavista has a CG distro, and you're involved
>in our own CGL working group. But, I have never heard of a carrier-grade
>system that used 12 disks, let alone 200. I fail to see how you're backing
>up your argument about the 1 second that you saving with a realistic
>example.
>
>
There are two major types of carrier grade applications. Data plane and
Control plane. Data plane applications generally don't use disks at all
and this is probably what you're most familiar with. Control plane
applications (such as a home location register), billing systems, etc.,
often use multitudes of disks for databases. Currently Linux is well
suited for control plane applications on x86 with storage, and data
plane applications on ppc without storage.

>But, I also don't think that the speed hit would increase linearly.
>
>Besides, even if the system had access to 200+ disks, it's probably in a
>separate library, accessible to an/the entire cluster. It's likely also a
>commercial storage library, and not a linux system.
>
>Please, educate me and prove me wrong; I'm not opposed to new ways of
>thinking.
>
>
Future designs (such as ATCA), which I am mostly concerned with,
include storage in the chassis itself. A chassis may have 12 slots, of
which 10 could be disks or cpu slots. The remaining two slots are
fibrechannel hubs or ethernet switches. The fibrechannel hubs can be (and
often are) tied together through front-panel connectors, to form a rack
or multiple racks in a cluster. Fibrechannel allows for 126 devices in a
FCAL. There are two fibrechannel hubs per chassis, which allows for each
cpu to see a total of 252 devices. Now some of these devices could
certainly be CPUs, but the ratio is at least 1 cpu : 2 disks since each cpu
blade should use RAID 1 to protect against disk failure and provide easy
hotswap. Then there are of course partitions, of which there could be
many, each of which is a separate device separately enumerated
in the system. So a realistic system, currently shipping today for the
telecom market, could require the enumeration of 16 (partitions) * 168
(disks) = 2688 device nodes. More realistically, each system may have 3
partitions, which is 3*168 device nodes, or even more realistically, each
system may have only 1/4 as many disks, which is 3*42, or 126 device nodes
(times two channels, 252 device enumerations).

Brian Jackson has said there are 50 disks in the OSDL cluster, on which
he could put 3 partitions each, which would test 150 device enumerations.

I propose the following test:
Create 3 partitions on all 50 disks. Create a system to boot kernel.org
2.5.70 with an initial ramdisk. Try Greg's original udev without the kernel
patch, try the modified udev with the kernel patch, and time each system's
startup.

Do you agree with the test methodology? If so, then we can proceed with
gathering real data.

Thanks
-steve

>
>
>>devfs is not appropriate as it does not allow for complex policy with
>>external attributes that the kernel is unaware of. For an example, lets
>>take the situation where a policy must access a cluster-wide manager to
>>determine some information before it can make a policy decision. For
>>that to occur, there must be sockets, and hopefully libc, which puts the
>>entire thing in user space. Who would want to write policies in the
>>kernel? uck.
>>
>>
>
>Cluster-wide manager of policy? Fine, you can do that with /sbin/hotplug,
>too. But, I don't think network latency is going to help your boot time..
>
>
>
Yes, this can be done with /sbin/hotplug, but not with devfs in the
kernel. This was just an example, as there may be other policies which
can only be done in the C language and not accessible from the kernel.
/sbin/hotplug, of course, could have the same access as any other policy.

> -pat
>
>
>
>
>

2003-06-13 17:12:10

by Chris Friesen

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

Greg KH wrote:
> On Fri, Jun 13, 2003 at 08:51:01AM -0700, Steven Dake wrote:

>>The reality though, is that the user will be running the daemon to clear
>>out the events. If they don't, then they get what they deserve :)
>>
>
> Heh, then your kernel which _has_ to stay up (due to your previous
> guarantees of uptime) keels over. I don't think that requiring that
> a kernel _has_ to have a specific userspace program running in order for
> it to stay healthy would be anything any "carrier grade" user would ever
> agree to.
>
> :)

I don't think I totally buy that argument. Often a "carrier-grade" box has a
whole bunch of things that must be running for the box to stay healthy. If one
of them dies, the system logs the problem along with diagnostic information and
tries to restart it. If it can't be restarted, then we failover to the warm
standby and go for a reboot to try and clean things up.

I agree that there are other issues, but needing to have a particular binary
running isn't really a problem. (init, for example...)


Chris


--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]

2003-06-13 17:20:04

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue


> >Aside from that, what exactly are you trying to do?
> >You are not solving the fundamental device node reuse race,
> >yet you are making necessary a further demon.
>
> For device enumeration, I see a daemon as necessary. The main goal of
> this work is to solve the out-of-order execution of sbin/hotplug and
> improve performance of the system during device enumeration with
> significant (200 disks, 4 partitions each) amounts of devices. Boot
> time with this scheme appears, in my rudimentary tests, to be faster on
> the order of 1-2 seconds for bootup for the case of just 12 disks. I
> would imagine 200 disks (which I don't have a good way to test, as I
> don't have 200 disks:) would provide better speed gains during bootup.
> This compares greg's original udev to this patched udev binary.

You will not beat a pure devfs solution in terms of speed.
Enumeration is not a real problem. The real target of the hotplug
system is configuration, not enumeration.

> >You are not addressing queue limits. The current hotplug
> >scheme does so, admittedly crudely by failing to spawn
> >a task, but considering the small numbers of events in
> >question here, for the time being we can live with that.
> >
> >You can just as well add load control and error detection
> >to the current scheme. You fail to do so in your scheme.
> >You cannot queue events forever in unlimited numbers.
>
> I agree there should be some way of limiting events. I'll add this set
> of code.

And you need to report loss of events. Doing this all by ioctls is
very questionable.

> >As for ordering, this is a real problem, but not fundamental.
> >You can make user space locking work. IMHO it will not be
> >pretty if done with shell scripts, but it can work.
> >There _is_ a basic problem with the kernel 'overtaking'
> >user space in its view of the device tree, but you cannot solve
> >that _at_ _all_ in user space.
> >
> >In short, if you feel that the hotplug scheme is inadequate
> >for your needs, then write industry strength devfs2.
>
> devfs is not appropriate as it does not allow for complex policy with
> external attributes that the kernel is unaware of. For an example, lets
> take the situation where a policy must access a cluster-wide manager to
> determine some information before it can make a policy decision. For
> that to occur, there must be sockets, and hopefully libc, which puts the
> entire thing in user space. Who would want to write policies in the
> kernel? uck.

Then the method of invocation of your configurator will not matter with
regards to speed. Out-of-order issues matter, but can be solved more easily.

It seems to me that you should take a close look at devfs and devfsd.

Regards
Oliver

2003-06-13 17:54:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

In article <[email protected]>,
Steven Dake <[email protected]> wrote:
>
>I have been looking at the udev idea that Greg KH has developed.
>Userland device enumeration definately is the way to go, however, there
>are some problems with using /sbin/hotplug to transmit device
>enumeration events:

No.

WE ARE NOT GOING TO AN EVENT DAEMON!

Centralized event daemons are crap, unmaintainable, and unreadable.
They are a maintenance nightmare, both from a development standpoint and
from a MIS standpoint. They encourage doing everything in one program,
keeping state in private memory, depending on ordering, and just
generally do bad things.

/sbin/hotplug, on the other hand, makes events clearly independent
things, and encourages writing simple scripts that can be combined and
localized. In other words, it's the UNIX way.

I've seen the madness of event daemons, and it's called "cardmgr" and
"acpid". We don't want it.

Linus

2003-06-13 18:13:18

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Fri, Jun 13, 2003 at 09:03:19AM -0700, Steven Dake wrote:
> >Aside from that, what exactly are you trying to do?
> >You are not solving the fundamental device node reuse race,
> >yet you are making necessary a further demon.
> >
> >
> For device enumeration, I see a daemon as necessary. The main goal of
> this work is to solve the out-of-order execution of sbin/hotplug and
> improve performance of the system during device enumeration with
> significant (200 disks, 4 partitions each) amounts of devices. Boot
> time with this scheme appears, in my rudimentary tests, to be faster on
> the order of 1-2 seconds for bootup for the case of just 12 disks. I
> would imagine 200 disks (which I don't have a good way to test, as I
> don't have 200 disks:) would provide better speed gains during bootup.
> This compares greg's original udev to this patched udev binary.

You're also using a 500k dynamically linked binary. Please don't do
that for testing/real life. Use the 6k udev binary for your tests...

thanks,

greg k-h

2003-06-13 18:13:23

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Fri, Jun 13, 2003 at 09:03:19AM -0700, Steven Dake wrote:
> devfs is not appropriate as it does not allow for complex policy with
> external attributes that the kernel is unaware of. For an example, lets
> take the situation where a policy must access a cluster-wide manager to
> determine some information before it can make a policy decision. For
> that to occur, there must be sockets, and hopefully libc, which puts the
> entire thing in user space. Who would want to write policies in the
> kernel? uck.

Look at devfsd, it is in userspace.

greg k-h

2003-06-13 18:15:46

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Fri, Jun 13, 2003 at 10:10:15AM -0700, Steven Dake wrote:
> Brian Jackson has said there are 50 disks in the OSDL cluster, of which
> he could put 3 partitions on, which would test 150 device enumerations.

You can also create 2000+ virtual disks today on your desktop using
scsi-debug to eliminate any hardware spin up times :)

thanks,

greg k-h

2003-06-13 18:52:41

by Patrick Mansfield

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Fri, Jun 13, 2003 at 10:10:15AM -0700, Steven Dake wrote:

> I propose the following test:
> Create 3 partitions on all 50 disks. Create system to boot kernel.org
> 2.5.70 with initramdisk. Try greg's original udev without kernel aptch,
> try modified udev with kernel patch, time each system's startup time.
>
> Do you agree with the test methodology? If so, then we can proceed with
> gathering real data.

I've seen a large variance on detect/probe/scan for fibre disks depending
on whether the feral or qlogic driver is used.

This is with switch attached fibre storage.

The qlogic takes more time to probe, but has shorter timeouts when trying
to detect disabled or disconnected ports. On my system (where I have a few
unconnected fibre cards), the insmod of the qlogic and feral seem to be
about equal (I haven't timed the actual differences, I normally don't
build a kernel or modules with both feral and qlogic available, they are
both available only outside of the mainline kernel). If I removed or
connected the unconnected cards, the feral driver would be faster.

You could probably improve boot time (or modprobe time) more by speeding
up the qlogic driver or using the feral driver rather than trying to speed
up the hotplug/whatever.

Note that the scsi_debug driver has a default IO response time of 1 tick
(scsi_debug_delay); you can set it to zero, but then it has to wake up the
softirqd to complete the IO. So you either rely on a timeout to
complete an IO, or a wakeup of the ksoftirqd. I was trying to measure IO
latencies at one point using scsi_debug, but I seemed to end up measuring
context switch times instead.

-- Patrick Mansfield

2003-06-13 19:45:16

by Joe Korty

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Fri, Jun 13, 2003 at 02:31:40PM +0100, Alan Cox wrote:
> On Gwe, 2003-06-13 at 12:00, Paul Mackerras wrote:
> > Alan Cox writes:
> >
> > > lock xaddl $1, [foo]
> >
> > On 386? I stand corrected. :)
>
> Turns out its 486 and higher, so you win.


Perhaps it's time to remove 386 support from 2.5? Users
of very old machines can stick with 2.0, 2.2 or 2.4.

Joe

2003-06-13 19:50:08

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue


> +
> + spin_lock(&sequence_lock);
> + envp [i++] = scratch;
> + scratch += sprintf(scratch, "SEQNUM=%ld", sequence_num) + 1;
> + ++sequence_num;
> + spin_unlock(&sequence_lock);
>
> kobj_path_length = get_kobj_path_length (kset, kobj);
> kobj_path = kmalloc (kobj_path_length, GFP_KERNEL);

If this kmalloc fails, you'll have a hole in the numbers and
user space will be very confused. You need to report dropped
events if you do this.

Regards
Oliver

2003-06-13 20:59:18

by Alan

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Gwe, 2003-06-13 at 20:57, Joe Korty wrote:
> > Turns out its 486 and higher, so you win.
>
>
> Perhaps it's time to remove 386 support from 2.5? Users
> of very old machines can stick with 2.0, 2.2 or 2.4.

You have to solve these problems generally anyway, and lots of
other platforms have limited locking functions. What it means
is that all sane PC hardware can do it efficiently

2003-06-14 06:15:21

by John Bradford

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

> > > Turns out its 486 and higher, so you win.
> > Perhaps it's time to remove 386 support from 2.5? Users
> > of very old machines can stick with 2.0, 2.2 or 2.4.

I strongly disagree, 386 support is important for the embedded
market.

> You have to solve these problems generally anyway, and lots of
> other platforms have limited locking functions.

Including future Linux platforms.

(possibly).

John.

2003-06-18 23:16:51

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Fri, Jun 13, 2003 at 10:01:47PM +0200, Oliver Neukum wrote:
>
> > +
> > + spin_lock(&sequence_lock);
> > + envp [i++] = scratch;
> > + scratch += sprintf(scratch, "SEQNUM=%ld", sequence_num) + 1;
> > + ++sequence_num;
> > + spin_unlock(&sequence_lock);
> >
> > kobj_path_length = get_kobj_path_length (kset, kobj);
> > kobj_path = kmalloc (kobj_path_length, GFP_KERNEL);
>
> If this kmalloc fails, you'll have a hole in the numbers and
> user space will be very confused. You need to report dropped
> events if you do this.

Yes, we should add the sequence number last.

Good catch.
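
In other words, move everything that can fail ahead of the sequence number
assignment; roughly (reusing the identifiers from the quoted hunk, not the
actual fix that was merged):

        /* Do the work that can fail first... */
        kobj_path_length = get_kobj_path_length (kset, kobj);
        kobj_path = kmalloc (kobj_path_length, GFP_KERNEL);
        if (!kobj_path)
                return;         /* bail out (or take the error path) before
                                   a sequence number has been consumed */

        /* ...and only then hand out a sequence number, so an aborted event
           never leaves a hole in the numbering seen by userspace. */
        spin_lock(&sequence_lock);
        envp [i++] = scratch;
        scratch += sprintf(scratch, "SEQNUM=%ld", sequence_num) + 1;
        ++sequence_num;
        spin_unlock(&sequence_lock);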

greg k-h

2003-06-18 23:58:53

by Kevin P. Fleming

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

Greg KH wrote:

>>If this kmalloc fails, you'll have a hole in the numbers and
>>user space will be very confused. You need to report dropped
>>events if you do this.
>
>
> Yes, we should add the sequence number last.
>

While this is not a bad idea, I don't think you want to make a promise
to userspace that there will never be gaps in the sequence numbers. When
this sequence number was proposed, in my mind it seemed perfect because
then userspace could _order_ multiple events for the same device to
ensure they got processed in the correct order. I don't know that any
hotplug userspace implementation is going to be large and complex enough
to warrant "holding" events until lower-numbered events have been
delivered. That just seems like a very difficult task with little
potential gain, but I could very well be mistaken :-)

2003-06-19 00:07:05

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Wed, Jun 18, 2003 at 05:12:50PM -0700, Kevin P. Fleming wrote:
> Greg KH wrote:
>
> >>If this kmalloc fails, you'll have a hole in the numbers and
> >>user space will be very confused. You need to report dropped
> >>events if you do this.
> >
> >
> >Yes, we should add the sequence number last.
> >
>
> While this is not a bad idea, I don't think you want to make a promise
> to userspace that there will never be gaps in the sequence numbers. When
> this sequence number was proposed, in my mind it seemed perfect because
> then userspace could _order_ multiple events for the same device to
> ensure they got processed in the correct order. I don't know that any
> hotplug userspace implementation is going to be large and complex enough
> to warrant "holding" events until lower-numbered events have been
> delivered. That just seems like a very difficult task with little
> potential gain, but I could very well be mistaken :-)

Yes you are :)

You will have to handle gaps properly, yes.

But you also have to "hold" events for a bit of time in order to
determine that things are out of order, or we have a gap. So a bit of
"complex" logic is in the works, but it's much less complex than if we
didn't have that sequence number.
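
A toy model of what that userspace logic could look like, with an invented
payload format and a made-up 3 second hold time (nothing here is the real
udev code):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define HOLD_SECONDS 3      /* how long a missing sequence number may stay missing */
#define MAX_HELD     64

static struct {
    long   seq;
    char   payload[64];
    time_t arrived;
    int    used;
} held[MAX_HELD];

static long expected = 1;   /* next sequence number to hand to the handler */

static void deliver(long seq, const char *payload)
{
    printf("handling event %ld: %s\n", seq, payload);
}

/* Hand out any held events that have become in-order. */
static void flush_in_order(void)
{
    int i, progress = 1;

    while (progress) {
        progress = 0;
        for (i = 0; i < MAX_HELD; i++)
            if (held[i].used && held[i].seq == expected) {
                deliver(held[i].seq, held[i].payload);
                held[i].used = 0;
                expected++;
                progress = 1;
            }
    }
}

/* Called for every event read from the kernel. */
static void handle(long seq, const char *payload)
{
    int i;

    if (seq == expected) {
        deliver(seq, payload);
        expected++;
        flush_in_order();
        return;
    }
    for (i = 0; i < MAX_HELD; i++)  /* out of order or a gap: hold it */
        if (!held[i].used) {
            held[i].used = 1;
            held[i].seq = seq;
            held[i].arrived = time(NULL);
            snprintf(held[i].payload, sizeof(held[i].payload), "%s", payload);
            return;
        }
}

/* Called periodically: give up on gaps older than HOLD_SECONDS. */
static void expire(void)
{
    time_t now = time(NULL);
    long oldest = -1;
    int i;

    for (i = 0; i < MAX_HELD; i++)
        if (held[i].used && now - held[i].arrived >= HOLD_SECONDS &&
            (oldest < 0 || held[i].seq < oldest))
            oldest = held[i].seq;
    if (oldest > expected) {
        printf("giving up on events %ld..%ld\n", expected, oldest - 1);
        expected = oldest;
        flush_in_order();
    }
}

int main(void)
{
    handle(2, "add sdb");   /* arrives before event 1 */
    handle(1, "add sda");
    handle(4, "add sdd");   /* event 3 never shows up */
    sleep(HOLD_SECONDS);
    expire();               /* skips the gap and delivers event 4 */
    return 0;
}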

thanks,

greg k-h

2003-06-19 00:11:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

"Kevin P. Fleming" <[email protected]> wrote:
>
> > Yes, we should add the sequence number last.
> >
>
> While this is not a bad idea, I don't think you want to make a promise
> to userspace that there will never be gaps in the sequence numbers. When
> this sequence number was proposed, in my mind it seemed perfect because
> then userspace could _order_ multiple events for the same device to
> ensure they got processed in the correct order. I don't know that any
> hotplug userspace implementation is going to be large and complex enough
> to warrant "holding" events until lower-numbered events have been
> delivered. That just seems like a very difficult task with little
> potential gain, but I could very well be mistaken :-)

The userspace support tools need to be able to handle gaps
in any case, because call_usermodehelper() may fail.

(In practice it won't fail, because the memory allocator is immortal.
But the capability should be there.

(Well actually, it could fail because the VM overcommit code might
refuse the mmap.

(But probably not, because root gets an extra margin.)))


2003-06-19 00:28:49

by Kevin P. Fleming

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

Greg KH wrote:

> But you also have to "hold" events for a bit of time in order to
> determine that things are out of order, or we have a gap. So a bit of
> "complex" logic is in the works, but it's much less complex than if we
> didn't have that sequence number.
>

OK, so the important point here is that while you probably delay events
for a small amount of time (1-3 seconds maybe) to ensure that you aren't
processing steps out of order, it's not likely that userspace is going
to hold event #1321 for an indefinite period of time just because it has
never seen event #1320.

I can't wait to see this implementation; it's going to be interesting,
to say the least. However, without the sequence numbers, it would very
likely be impossible.

2003-06-19 03:56:50

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Wed, Jun 18, 2003 at 05:42:45PM -0700, Kevin P. Fleming wrote:
> Greg KH wrote:
>
> >But you also have to "hold" events for a bit of time in order to
> >determine that things are out of order, or we have a gap. So a bit of
> >"complex" logic is in the works, but it's much less complex than if we
> >didn't have that sequence number.
> >
>
> OK, so the important point here is that while you probably delay events
> for a small amount of time (1-3 seconds maybe) to ensure that you aren't
> processing steps out of order, it's not likely that userspace is going
> hold event #1321 for an indefinite period of time just because it has
> never seen event #1320.

Exactly.

> I can't wait to see this implementation; it's going to be interesting,
> to say the least. However, without the sequence numbers, it would very
> likely be impossible.

Hey, I still think my originally proposed version would have worked,
eventually... :)

greg k-h

2003-06-19 06:13:57

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thursday, 19 June 2003 02:12, Kevin P. Fleming wrote:
> Greg KH wrote:
> >>If this kmalloc fails, you'll have a hole in the numbers and
> >>user space will be very confused. You need to report dropped
> >>events if you do this.
> >
> > Yes, we should add the sequence number last.
>
> While this is not a bad idea, I don't think you want to make a promise
> to userspace that there will never be gaps in the sequence numbers. When
> this sequence number was proposed, in my mind it seemed perfect because
> then userspace could _order_ multiple events for the same device to
> ensure they got processed in the correct order. I don't know that any
> hotplug userspace implementation is going to be large and complex enough
> to warrant "holding" events until lower-numbered events have been
> delivered. That just seems like a very difficult task with little
> potential gain, but I could very well be mistaken :-)

You cannot order events unless you hold such events. One event always
arrives first. If it's the lower numbered, the point is moot. If it's the
higher numbered, you'll need to hold it or there's no ordering.

For the paranoid even that is not enough. A hotplug script may die in user
space due to OOM or EIO.

Regards
Oliver

2003-06-19 19:40:11

by Werner Almesberger

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

Greg KH wrote:
> On Thu, Jun 12, 2003 at 03:03:35PM -0700, Andrew Morton wrote:
> > This is a significantly crappy aspect of the /sbin/hotplug callout. I'd be
> > very interested in reading an outline of how you propose fixing it, without
> > waiting until OLS, thanks.
>
> Sure, I knew someone would probably want to :)

An interesting problem, with a lot of potential for wrong
solutions ;-) A few comments from an interested bystander:

1) Reordering:

First of all, it doesn't seem to be clear whether the
ordering mechanism as a whole is supposed to be reliable.
E.g. your timeout-based approach would fix reordering with a
certain probability, but occasionally, reordering would still
occur. (This is similar to IP networks: a little bit of
reordering is okay, but if you do it a lot, TCP performance
will suffer.)

Also, you don't try to correct losses, right ? I.e. if we
get events A and B, A is lost, then B is handled without any
attempt to retry A first. Right ?

If you want a reliable reordering detection, /sbin/hotplug
needs to know what happens with concurrent events, e.g. if A
and B occur in parallel, B cannot be executed unless we know
that A has either succeeded or failed. Sequence numbers alone
don't help (you could, of course, combine them with timeouts,
and end up with a more efficient version of your original
timeout approach).

You can track concurrent events in a number of ways, e.g. by
going back to the kernel, and asking how many "older"
instances of /sbin/hotplug are running, or they could
communicate directly among each other.

E.g. the kernel could open a pipe for each /sbin/hotplug, and
give each /sbin/hotplug a duplicate of the reader end of the
pipe of each concurrently running /sbin/hotplug.
/sbin/hotplug could then poll them, and wait until all those
fds are closed.
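
A sketch of the waiting side of that idea, assuming the new instance has
already been handed the read ends somehow (the main() below fakes one older
instance with a fork and a pipe):

#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Wait until every inherited pipe read-end reports hangup, i.e. every
 * earlier instance has exited and its write end has been closed.
 */
static void wait_for_older_instances(int *fds, int nfds)
{
    struct pollfd *pfd = calloc(nfds, sizeof(*pfd));
    int i, open_count = nfds;

    for (i = 0; i < nfds; i++) {
        pfd[i].fd = fds[i];
        pfd[i].events = 0;          /* we only care about POLLHUP */
    }
    while (open_count > 0) {
        if (poll(pfd, nfds, -1) < 0)
            break;
        for (i = 0; i < nfds; i++)
            if (pfd[i].fd >= 0 && (pfd[i].revents & POLLHUP)) {
                close(pfd[i].fd);
                pfd[i].fd = -1;     /* ignore this entry from now on */
                open_count--;
            }
    }
    free(pfd);
}

int main(void)
{
    int p[2];

    if (pipe(p) < 0)
        return 1;
    if (fork() == 0) {              /* stand-in for an older instance */
        close(p[0]);
        sleep(1);
        return 0;                   /* exiting closes the write end */
    }
    close(p[1]);
    wait_for_older_instances(&p[0], 1);
    printf("all older instances finished\n");
    return 0;
}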

What I don't quite understand is why you wouldn't want to
serialize in the kernel. The overall resource footprint of
sleeping /sbin/hotplugs, and such, is certainly much larger
than a few queued event messages.

Furthermore, if you serialize in the kernel, you can easily
and reliably indicate how many events have been lost since
the last one.


2) Communication mechanism:

Given all these complications, I'm not so sure if just
sending messages (ASCII, EBCDIC, binary, Haiku, whatever ;-)
wouldn't be more convenient in the end. You could dispatch
them later to scripts easily enough.

But since time doesn't seem to be an issue (more about that
below), you could of course also use the same concept we use
for interrupts: make /sbin/hotplug a "fast" interface script,
which then delegates the real work to some other process.

Given that it's a bit of an uphill battle to write user-space
programs that absolutely never fail, it may also be good to
have some completion signal that can be used to keep track of
dropped events, and that can then be used to trigger a
recovery procedure.

A central dispatcher would also have the advantage of
possessing full information on the events being handled. That
way, events without interdependencies could still be
processed in parallel.


3) We're in no hurry at all:

Sorry, I don't buy this. You're of course right that bus
systems that need some slow, timeout-based scan won't
initialize quickly anyway. But it wouldn't be all that hard
to make them faster, and then the hotplug interface would
become the bottleneck.

Example: put 1000 SCSI drives in a cabinet with a door
switch. When the door is open, do things in the usual, slow
way. When the door is closed, no drives can be added or
removed, so the system could cache bus scan results and
synthesize NACKs for absent devices.

So I think it makes sense to avoid obvious bottlenecks in
this design. Therefore, as long as any kind of serialization
is required, and unless the kernel itself knows what can be
parallelized, and what not, a dispatcher demon looks like the
most light-weight solution.

If you're worried about having yet another rarely used demon
in the system, just make /sbin/hotplug persistent. If it is
idle for too long, it can exit, and when there are new
events, the kernel can launch it again.


4) Losses:

Actually, I'm not so sure what really ought to happen with
losses. If we have serialization requirements elsewhere,
proceeding after unrecovered losses would probably mean that
they're being violated. So if they can be violated then,
maybe there is some leeway in other circumstances too?

On the other hand, if any loss means that major surgery is
needed, the interface should probably have an "in loss" state,
in which it just shuts down until someone cleans up the mess.
Also a partial shutdown may be interesting (e.g. implemented
by the dispatcher), where events with no interdependencies
with other events would still be processed.

The kernel could even provide some hints of what has been
lost. (I.e. aggregate any information in the lost events.)
That way, the partial shutdown would be even more efficient.
But I think I'm overdesigning now :-)


Anyway, just my 2 centavos :-)

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-06-19 20:43:22

by Werner Almesberger

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

Olivier Galibert wrote:
> "Netlink is not a reliable protocol." sez the manpage.
>
> That sounds a little ridiculous for a local kernel<->user message
> protocol.

There are many ways of being unreliable. If it means "bad things
happen and nobody is told about it", then yes, this would be
strange in this context. However, "loss due to congestion" is a
perfectly reasonable behaviour (*), and netlink indicates when
this has occurred.

If you get more events than you can queue or compress (e.g. if
you have on/off events but are only interested in the current
status, you could aggregate any number of such events in just
one bit), congestion is possible.
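
A listener can cope with that without any acks. Sketching it with a
made-up protocol number (no device-event netlink family exists, so take
the constant as a placeholder), the relevant part is that netlink
reports an overrun by failing the next read with ENOBUFS, which the
daemon can treat as "events were lost, go rescan sysfs":

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#define NETLINK_DEVEVENT 19		/* placeholder protocol number */

static void rescan_sysfs(void)
{
	/* re-walk /sys and rebuild the device nodes, as at cold start */
}

int listen_for_events(void)
{
	struct sockaddr_nl addr;
	char buf[4096];
	int fd;

	fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_DEVEVENT);
	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = 1;		/* whatever group the kernel broadcasts to */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}

	for (;;) {
		ssize_t len = recv(fd, buf, sizeof(buf), 0);

		if (len < 0 && errno == ENOBUFS) {
			rescan_sysfs();		/* queue overflowed: resync */
			continue;
		}
		if (len <= 0)
			break;
		printf("event: %.*s\n", (int)len, buf);
	}
	close(fd);
	return 0;
}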

> Having to manage acks on that kind of communication is both
> painful and quasi-untestable.

<ObPlug>
Ha! That's exactly the kind of stuff I'm writing umlsim for :-)
http://umlsim.sourceforge.net/
</ObPlug>

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-06-19 23:19:05

by Werner Almesberger

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

Linus Torvalds wrote:
> [ Event demons ] encourage doing everything in one program,
> keeping state in private memory, depending on ordering, and just
> generally do bad things.

Well, the ordering bit is the hairy part. As long as it doesn't
matter if an event gets lost every once in a while, and in which
order they get processed, things are fine as they are.

But then it scares me to see people start to try to design some
general serialization mechanism on top of /sbin/hotplug.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-06-19 23:32:32

by Steven Dake

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue



Werner Almesberger wrote:

>Linus Torvalds wrote:
>
>>[ Event demons ] encourage doing everything in one program,
>>keeping state in private memory, depending on ordering, and just
>>generally do bad things.
>
>Well, the ordering bit is the hairy part. As long as it doesn't
>matter if an event gets lost every once in a while, and in which
>order they get processed, things are fine as they are.
>
>But then it scares me to see people start to try to design some
>general serialization mechanism on top of /sbin/hotplug.
>
A serialization methodology can be built on /sbin/hotplug, but it has
all of the problems that Linus previously talked about for a kernel
event queue. The difference is that the problem is moved to userland.

>- Werner

-steve



2003-06-19 23:52:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue


On Thu, 19 Jun 2003, Steven Dake wrote:
>
> A serialization methodology can be built on /sbin/hotplug, but it has
> all of the problems that Linus previously talked about for a kernel
> event queue. The difference is that the problem is moved to userland.

Having event ordering is a trivial matter, and I'm not against adding a
sequence number to /sbin/hotplug as part of the environment, for example.

What I worry about is the situation where one big daemon handles
everything, which makes it impossible to "hook in" to the thing without
understanding the one big thing.

The thing that makes /sbin/hotplug so wonderful is that it's stateless,
and if you want to hook into it, it's absolutely _trivial_. Look at the
default script there in redhat-9 for example, and it's obvious how to hook
up to certain events etc.

And why do people care about serialization anyway? Really? The whole
notion is ludicrous. /sbin/hotplug _shouldn't_ be serialized.
Serialization is bad. You should look at whatever problems you have with
it now, and ask yourself whether maybe it should be done some other way
entirely.

Linus

2003-06-19 23:56:13

by Steven Dake

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue



Linus Torvalds wrote:

>On Thu, 19 Jun 2003, Steven Dake wrote:
>
>>A serialization methodology can be built on /sbin/hotplug, but it has
>>all of the problems that Linus previously talked about for a kernel
>>event queue. The difference is that the problem is moved to userland.
>
>Having event ordering is a trivial matter, and I'm not against adding a
>sequence number to /sbin/hotplug as part of the environment, for example.
>
>What I worry about is the situation where one big daemon handles
>everything, which makes it impossible to "hook in" to the thing without
>understanding the one big thing.
>
>The thing that makes /sbin/hotplug so wonderful is that it's stateless,
>and if you want to hook into it, it's absolutely _trivial_. Look at the
>default script there in redhat-9 for example, and it's obvious how to hook
>up to certain events etc.
>
>And why do people care about serialization anyway? Really? The whole
>notion is ludicrous. /sbin/hotplug _shouldn't_ be serialized.
>Serialization is bad. You should look at whatever problems you have with
>it now, and ask yourself whether maybe it should be done some other way
>entirely.
>
Serialization is important for the case of a device state change
occurring rapidly. An example is a device add and then remove, which
results in (if out of order) the add occurring after the remove, leaving
a dead device node in the filesystem. This is tolerable. What isn't as
tolerable is a remove and then an add, where the add occurs out of order
first. In this case, a device node should exist on the filesystem, but
instead doesn't because the remove has come along and removed it.

This occurrence happens much more often than you might think. When
changing disk partitions, for example, items are removed and then added
several times before the devices are finally instantiated.

Definitely to be avoided.

Thanks
-steve

> Linus

2003-06-20 00:29:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue


On Thu, 19 Jun 2003, Steven Dake wrote:
>
> Serialization is important for the case of a device state change
> occurring rapidly. An example is a device add and then remove, which
> results in (if out of order) the add occurring after the remove, leaving
> a dead device node in the filesystem. This is tolerable. What isn't as
> tolerable is a remove and then an add, where the add occurs out of order
> first. In this case, a device node should exist on the filesystem, but
> instead doesn't because the remove has come along and removed it.

This still isn't a global serialization thing, though. For example,
there's no reason _another_ device should be serialized with the disk
discovery.

And there are lots and lots of reasons why they should not EVER be
serialized, especially since full serialization (like through a shared
event daemon) results in _serious_ problems if one of the events ends up
being a slow thing that needs to query and/or spin up hardware.

In other words, once you start talking about device nodes, you really want
to handle the serialization on a per-node basis. Having some kind of
kernel support for that is quite possibly a good idea, but care should be
taken that non-dependent events shouldn't be serialized.

An easy way to do partial serialization is something like this:

- each "call_usermodehelper()" gets associated with two numbers: the
"class number" and the "sequence number within a class". We keep track
of outstanding usermodehelpers.

Add the class and sequence number to the environment of the hotplug
execution so that the user-mode hotplug code can query them.

- we add a simple interface to allow a user mode helper to ask the kernel
to "please wait until class X has no pending sequence numbers smaller
than Y"

See what I'm aiming at? The class numbers shouldn't have any particular
meaning, they're just a random number (it might be the bus number of the
device that hotplugged, for example). But this allows a hotplug thing to
say "if there are other outstanding hotplug things in my class that were
started before me, wait here".
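
On the user-mode side that boils down to something tiny. The names here
(HOTPLUG_CLASS, HOTPLUG_SEQNUM, /proc/hotplug_wait) are only meant to
illustrate the shape of such an interface; none of them exist:

#include <stdio.h>
#include <stdlib.h>

/* Block until no older events are pending in our class. */
static void wait_for_my_turn(void)
{
	const char *cls = getenv("HOTPLUG_CLASS");	/* assumed env var */
	const char *seq = getenv("HOTPLUG_SEQNUM");	/* assumed env var */
	FILE *f;

	if (!cls || !seq)
		return;			/* old kernel: nothing to wait on */

	f = fopen("/proc/hotplug_wait", "w");		/* made-up interface */
	if (!f)
		return;
	/* "wait until class <cls> has no pending seqnums below <seq>" */
	fprintf(f, "%s %s\n", cls, seq);
	fclose(f);	/* the flush on close is where the kernel blocks us */
}

int main(void)
{
	wait_for_my_turn();
	/* ... then run the normal, stateless agent handling ... */
	return 0;
}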

Linus

2003-06-20 02:13:19

by Werner Almesberger

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

Linus Torvalds wrote:
> This still isn't a global serialization thing, though. For example,
> there's no reason _another_ device should be serialized with the disk
> discovery.

Yes, that sounds a lot better. Let's hope nobody discovers such
a reason ...

> An easy way to do partial serialization is something like this:

That would work. It's in fact a per-class message queue, in which
sleeping processes play the role of messages ;-)

I thought a bit about what else one could do with this interface:

- if you need loss/error handling, and are willing to keep one
bit of information for each such class in the kernel, then it
could register whether processing the previous event has
succeeded, i.e. whether the kernel was able to launch a helper,
and whether that helper has returned with exit status 0. This
bit could be returned by the synchronization call.

- in principle it would now be easy, and in many ways cleaner and
more efficient, to turn this into a per-class event demon, with
a "real" event queue. The problem is that you'd probably still
want classes that are not serialized, and the kernel wouldn't
know whether a given class needs serialization before
/sbin/hotplug has run for a while. Neiter configuring class
behaviour nor adding an "asychronous class" look particularly
attractive to me.

Too bad. A synchronous /sbin/hotplug as follows would certainly
be nice, e.g. assuming that in each class only one event type
can occur:

#!/bin/sh
agent=$1
read event
set $event
<... normal /sbin/hotplug code ...>

or, the high-performance variant:

#!/bin/sh
exec /etc/hotplug/$1.agent

#!/bin/sh
<... do really expensive initialization ...>
while read event; do
	set $event
	<... normal /etc/hotplug/*.agent code ...>
done

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-06-20 16:51:41

by Steven Dake

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue



Linus Torvalds wrote:

>On Thu, 19 Jun 2003, Steven Dake wrote:
>
>>Serialization is important for the case of a device state change
>>occurring rapidly. An example is a device add and then remove, which
>>results in (if out of order) the add occurring after the remove, leaving
>>a dead device node in the filesystem. This is tolerable. What isn't as
>>tolerable is a remove and then an add, where the add occurs out of order
>>first. In this case, a device node should exist on the filesystem, but
>>instead doesn't because the remove has come along and removed it.
>
>This still isn't a global serialization thing, though. For example,
>there's no reason _another_ device should be serialized with the disk
>discovery.
>
>And there are lots and lots of reasons why they should not EVER be
>serialized, especially since full serialization (like through a shared
>event daemon) results in _serious_ problems if one of the events ends up
>being a slow thing that needs to query and/or spin up hardware.
>
>In other words, once you start talking about device nodes, you really want
>to handle the serialization on a per-node basis. Having some kind of
>kernel support for that is quite possibly a good idea, but care should be
>taken that non-dependent events shouldn't be serialized.
>
>An easy way to do partial serialization is something like this:
>
> - each "call_usermodehelper()" gets associated with two numbers: the
> "class number" and the "sequence number within a class". We keep track
> of outstanding usermodehelpers.
>
> Add the class and sequence number to the environment of the hotplug
> execution so that the user-mode hotplug code can query them.
>
> - we add a simple interface to allow a user mode helper to ask the kernel
> to "please wait until class X has no pending sequence numbers smaller
> than Y"
>
>See what I'm aiming at? The class numbers shouldn't have any particular
>meaning, they're just a random number (it might be the bus number of the
>device that hotplugged, for example). But this allows a hotplug thing to
>say "if there are other outstanding hotplug things in my class that were
>started before me, wait here".
>
A class number is an excellent idea, and could have potential
performance gains. I thought of making such a patch, but expected it
would be rejected since it isn't strictly necessary and would require
changes to significant sections of the kernel (to add the class to each
object type).

Have any better ideas for an implementation that doesn't touch so many
sections of the kernel?

> Linus

2003-06-20 17:04:37

by Patrick Mochel

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue


> A class number is an excellent idea, and could have potential
> performance gains. I thought of making such a patch, but expected it
> would be rejected since it is necessarily required and would require
> changes to significant sections of the kernel (to add the class to each
> object type).
>
> Have any better ideas for an implementation that doesn't touch so many
> sections of the kernel?

It should be trivial - we can use the subsystem name that the kobject
belongs to for the class ID. And, struct subsystem can gain a sequence
number, instead of having a global one in lib/kobject.c.
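
Roughly like this (the field and function names are invented, and
locking and the real struct subsystem layout are glossed over; the
point is just "class = subsystem name, sequence = per-subsystem
counter"):

#include <stdio.h>

struct subsystem_hotplug {
	const char	*name;	/* e.g. "block", "pci", "usb" */
	unsigned long	seq;	/* bumped once per event in this subsystem */
};

/* Build the two extra strings handed to call_usermodehelper()'s envp. */
static void fill_hotplug_seq_env(struct subsystem_hotplug *s,
				 char *class_env, char *seq_env, int len)
{
	unsigned long n = ++s->seq;	/* caller serializes per subsystem */

	snprintf(class_env, len, "HOTPLUG_CLASS=%s", s->name);
	snprintf(seq_env, len, "HOTPLUG_SEQNUM=%lu", n);
}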


-pat

2003-06-26 13:24:58

by Tommi Virtanen

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

On Thu, Jun 19, 2003 at 04:51:35PM -0300, Werner Almesberger wrote:
> 4) Losses:
>
> Actually, I'm not so sure what really ought to happen with
> losses. If we have serialization requirements elsewhere,
> proceeding after unrecovered losses would probably mean that
> they're being violated. So if they can be violated then,
> maybe there is some leeway in other circumstances too ?
>
> On the other hand, if any loss means that major surgery is
> needed, the interface should probably have a "in loss" state,
> in which it just shuts down until someone cleans up the mess.
> Also a partial shutdown may be interesting (e.g. implemented
> by the dispatcher), where events with no interdependencies
> with other events would still be processed.

One thing came to my mind:

If you have a sysfs-scanning method for startup, couldn't you
just make the sequence-number-checking daemon reset its state
and redo the sysfs scan on loss of events? (Or even make it
just exec itself and use the exact same code as at startup.)

That way the system recovers from event loss (or a reordering
that gets the earlier event too late and is believed to be a
loss) in a way that needs to work anyway, and isn't a magic
special case.
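
A minimal sketch of that recovery path in C, assuming the daemon gets a
sequence number with each event (the helper names are made up):

#include <stdlib.h>
#include <unistd.h>

static unsigned long last_seq;

/* saved_argv is the argv the daemon was originally started with */
void handle_event(unsigned long seq, const char *event,
		  char *const saved_argv[])
{
	if (last_seq && seq != last_seq + 1) {
		/* gap: assume events were lost and start over, so the
		 * normal cold-start sysfs scan doubles as recovery */
		execv("/proc/self/exe", saved_argv);
		_exit(1);		/* only reached if execv() failed */
	}
	last_seq = seq;

	/* ... normal handling of this one event ... */
	(void)event;
}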

--
:(){ :|:&};:

2003-06-26 14:21:40

by Werner Almesberger

[permalink] [raw]
Subject: Re: [PATCH] udev enhancements to use kernel event queue

Tommi Virtanen wrote:
> If you have a sysfs-scanning method for startup, couldn't you
> just make the sequence-number-checking daemon reset its state
> and redo the sysfs scan on loss of events?

Yes, that's the easier approach if you don't have any detection
of error in the kernel itself. If the kernel already does all
the work for figuring out that something has gone wrong, it may
as well use this to reduce the noise.

> That way the system recovers from event loss (or a reordering
> that gets the earlier event too late and is believed to be a
> loss) in a way that needs to work anyway, and isn't a magic
> special case.

Let's just hope reordering stays dead :-)

There's still a bit of magic in loss recovery, because you need
something that triggers loss recovery, after the loss has
happened. That can of course just be whatever else happens to
come along (or the user getting impatient), or a periodic scan.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/