Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261956AbTFMPh1 (ORCPT ); Fri, 13 Jun 2003 11:37:27 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S265415AbTFMPh1 (ORCPT ); Fri, 13 Jun 2003 11:37:27 -0400 Received: from fw-az.mvista.com ([65.200.49.158]:36337 "EHLO zipcode.az.mvista.com") by vger.kernel.org with ESMTP id S261956AbTFMPhU (ORCPT ); Fri, 13 Jun 2003 11:37:20 -0400 Message-ID: <3EE9F2E5.1050407@mvista.com> Date: Fri, 13 Jun 2003 08:51:01 -0700 From: Steven Dake User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030529 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Greg KH CC: linux-kernel@vger.kernel.org Subject: Re: [PATCH] udev enhancements to use kernel event queue References: <3EE8D038.7090600@mvista.com> <20030612214753.GA1087@kroah.com> In-Reply-To: <20030612214753.GA1087@kroah.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 13958 Lines: 336 Greg KH wrote: >Nice, you posted the same message and tarball to lkml that we have been >talking about privately for a few days. > >Just to save time, and to bring everyone else up to speed, I'll just >include my responses and then yours to this thread and then finally >respond to your last message to me about this. > >Let me know if I get any of the attribution wrong, this is going to be a >fun cut and paste... > > Thanks for brining everyone up to speed. Everything looks correct from a quote point. > >On Thu, Jun 12, 2003 at 12:10:48PM -0700, Steven Dake wrote: > > >>Folks, >> >>I have been looking at the udev idea that Greg KH has developed. >>Userland device enumeration definately is the way to go, however, there >>are some problems with using /sbin/hotplug to transmit device >>enumeration events: >> >>1) performance of fork/exec with lots of devices is very poor >> >> > >You later said your binary of udev was 500k. I said: > 500k binary? Who's running anything that big? udev weighs in > around 6k right now! You have to add a _lot_ of functionality > to move up to 500k. > >You responded: > Hmm perhaps my binary was not built optmized when I looked at > it, I'll take another look. > > > >>2) lots of processes can be forked with no policy to control how many >>are forked at once potentially leading to OOM >> >> > >I responded: > No, this is pure speculation. Have you actually tried this out? > I have. I've hotplugged pci drawers connected to huge numbers > of SCSI disks and never even seen a blip on system load. It > turns out that pretty much everything happens consecutively, with > a reasonable amount of time just spent probing the SCSI busses. > USB is pretty slow in enumerating their devices too, so I don't > see a real world application that has this problem. Do you have > any proof of this? > >You responded: > I don't agree. I've timed bootup with 12 fibrechannel disks > with 4 partitions each with udev using the SDEF patches. I've > then timed bootup and added in the time required to add those 12 > fibrechannel disks (via a script) using /sbin/hotplug. The > script method (executing /sbin/hotplug 12*4 times with the > appropriate data hardcoded) takes rougly 1.25-1.75 seconds > longer. Since I'm using a chronograph, its not very accurate, > but it is definately slower to enumerate with /sbin/hotplug > using this methodlogy. That is only 12 disks. What about the > case of 1000 disks? > > fork and exec are very expensive system calls, to be avoided at > all costs... > >To which I responded: > Well, if you don't dynamically link a 500K file, perhaps it > would be a lot faster to start up :) > > Anyway, who really cares? This is _not_ a time critical thing. > Who really cares that it takes 1 second longer to see the device > node for the newly plugged in device? And if you plug in 5000 > devices all at once, you better be willing to wait the extra > minute or so for the system to calm down and make all of the > devices available to userspace. > >And again you came back with: > System bootup time and time between failures are the two factors > used to determine availability. Reducing bootup time and > increasing time between failures both improve availability. Even > 1 second is considerable when 5 9s is 30seconds of downtime per > year. 1 more second is a considerable change in availability > when you run the equations. > >I now respond: >Bootup time reduction would be a great thing to have. And I agree for >situations where 1 second is a considerable amount of time, it is >important. However, 99.99% of the people running Linux out there, do >not have those problems. And even then, for those 00.01% of the people >who do, they do whatever they possibly can to keep from having to reboot >at all. I still say the measurable ammount of time to plug in a new >disk and have it's device node created is not measurable, from using >/sbin/hotplug and udev, and using your character node. Also, the >ultimate solution for these kinds of people is the in-kernel devfs, or >the current static /dev. If they use either of them, it's guaranteed to >be faster then yours or mine implementation. > > I suppose it is possible that devfs could be faster, however, there are significant amounts of policies that can never be done in devfs which must be done in user space. For these types of applications, it makes sense to provide the fastest mechanism available, even when it may only improve boot performance by 1 second. I understand what you mean by saying that 99.99% of users don't care about availability. While those particular users may spend significant amounts of time improving Linux, it is the remaining folks that care about availability that are interested in investing money into r&d to improve availability while also improving feature set. It is this set of folks, (the people that do care about availability) that this patch is targeted towards. > > >>3) /sbin/hotplug events can occur out of order, eg: remove event occurs, >>/sbin/hotplug sleeps waiting for something, insert event occurs and >>completes immediately. Then the remove event completes, after the >>insert, resulting in out-of-order completion and a broken /dev. I have >>seen this several times with udev. >> >> > >I responded: > Yes this happens. I have a fix for this for udev itself. No > kernel changes are needed. I'll show it at OLS in July if you > want to see it :) > >You responded: > There is a problem in udev that fork doesn't wait for the parent > to exit (mknod) which is solved by using the mknod system call. > But there is still another problem that /sbin/hotplug can > execute out of order. I'll be at OLS so I'll take a look at > your solution then. > >So we agree to talk about this at OLS, fine. > > > >>4) early device enumeration (character devices, etc) are missed because >>/sbin/hotplug is not available during early device init. >> >> > >I responded: > Then use early initramfs to catch this. I've done this and seen > it work already. If you don't want to do early init, walk the > sysfs tree yourself from userspace, it works just as well, and > we don't have to do any buffering in kernelspace at all. > >You responded: > Initramfs still misses early events. That leaves walking sysfs. > > I have considered and rejected using a user space program to > walk sysfs for this information. There are several problems > 1) kobject device name is not present in sysfs (how do you know > if its a character or block, then without some external mapping > db, yuck) > 2) scan of sysfs is not atomic, resulting in potential lost > events and lost device enumerations > >I responded: > You don't know this from hotplug either. You have to figure it > out from the kobject name either way. Actually it's quite easy, > only block devices show up in /sys/block, everything else is a > character device. > > No, this is quite simple to fix: > - set up /sbin/hotplug udev handler after init has > started. > - start walking sysfs to "cold plug" devices. > - any new events that happen are caught by udev and > handled there, while you are walking the tree. > - no events are lost. > >You responded: > Events could be handled in this situation twice, which is > definately not what is desireable. > >I now respond: >Doing a mknod() on a node that is already present doesn't harm anything. >And since you have to keep a database around of what devices are >present, and what you have called them, removing them after already >removing them again, is detectable. So handing the same event twice >isn't a problem. > > I suppose you could handle multiple events twice, but the ability to detect a replacement without the OS removing the initial kobject is now removed. This is a real case which should be handled. > > >>To solve these problems, I've developed a driver and some minor >>modifications to the lib/kobject.c to queue these events and allow a >>user space program to read the events at its leisure. The driver >>supports poll/select for effecient use of the processor. >> >>The feature is called the System Device Enumeration Event Queue. I've >>attached a tarball which includes udev-0.1 modified to use the SDEQ, a >>kernel patch against linux 2.5.70 that implements the SDEQ, and a >>test/library that tests the SDEQ functionality. >> >> > >I responded: > Ok, I looked at your code. To put it kindly, it will never work > properly. See the long thread on lkml by Inaky after I > announced udev. He tried to create something like what you have > done, but got a big better implementation (yours has a number of > memory leaks, and permission problems...) > > See that thread for why doing this in the kernel is not the way > to go. Please don't waste your time on this anymore, it's a > loosing battle... > >You responded: > It obviously works properly when using the udev application so I > am at a loss as to how you can make a claim that "it will never > work". Please explain your comments about memory leaks. Data > structures are allocated and then freed at appropriate times. > Even in the case of a crash during a transaciton (get event, > clera events), there are no leaks. Please explain permission > problems. Permissions are controlled through policy in the > filesystem (/dev/sdeq) which fully controls access to the system > device enumeration event queue. > > I did look at Inaky's code, and frankly it was too complicated > to solve the simple problem that needed to be solved. We also > have a full-blown event mechanism (for redundant system slot > compact pci support), but feel its too heavy-weight to be of > general purpose use to the kernel. This particular > implementation is lightweight and solves a specific need with > real applications that can really use the functionality. > > I have read the thread and atleast 70% of the comments where "we > should not use /sbin/hotplug, but instead events should be > queued in the kernel". The thread raised all of the issues that > are solved with this kernel patch. Several core kernel > maintainers were the authors of the comments, so I'm not alone > in this belief. > > Your obviously opposed to using character devices to queue > kobject creation/deletion events for some reason which I don't > understand. /sbin/hotplug is an inferior solution to queueing > events in the kernel. Your not the maintainer of the code this > feature is going into (which is the linux driver model). I'd > really like to hear Pat's thoughts on this issue as he is the > one that has to live with the changes. If you want to continue > using /sbin/hotplug, with all of its numerous problems for your > work, your more then welcome to do so. No one is forcing you to > use sdeq, however, for those other vendors that may want to make > persistent device naming solutions, this is a core feature that > can not easily be solved by poorly thought out hacks. > >I will respond now: > >Ok, permission problems can be solved by relying on the permission of >the device node, you are correct. > >As for the memory issues, if no one ever reads from the character node, >it will constantly fill up with events, right? Shouldn't they be >eventually flushed out at some point in time? > > This is a problem... I wasn't quite sure how to handle this. The two choices are to include timeouts in events (after a certain amount of time, events are timed out and freed) or to allow only a certain number of events, after which events at the front of the queue are flushed. The reality though, is that the user will be running the daemon to clear out the events. If they don't, then they get what they deserve :) >Also for trying to remove events, why is userspace allowed to remove a >multiple of events at once? I would think that a valid read would >remove that event, you can have two applications grab the same event, >which is not what I think you want to have happen. > I probably should have explained why this decision was made. I believe it is desireable to be able to handle multiple events at once. The process is: user reads all events on event queue, user processes events, user deletes all events atomically. This solves the problem of what happens if the daemon crashes during deletion of events, or during processing of events. The daemon will be able to totally recover and process events properly. > >And finally... > >Yes, I know I'm not the maintainer of the code that this feature is >going into. But my position is that this code is not needed in the main >kernel tree, and offer that opinion to the maintainer of this code. > >I agree, most of the arguments in the previous udev thread talked about >out of order events, but I think I've gotten that solved now (I can wave >my hands about how it will be done, or I can show you working code at >OLS. I think I'll wait until OLS, as code matters, wavy hands don't.) > >So, because of this, I still think that everything you are trying to do >can be successfully done from userspace today (or use devfsd today, it >will give you the same info!) And due to that, I do not think this code >is necessary to be in the kernel. > > Everything could possibly be done in userspace, but performance is poor, and I am still concerned about the out-of-order events issue. >Thanks for reading this far, and my apologies if I got any of the above >quotes out of order or misspoken, I just wanted to cut to the chase, and >not have to go through 4 more rounds of email to make it to this point >in the conversation again. > > Thanks greg -steve >greg k-h > > > > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/