2004-01-07 10:16:08

by Olaf Hering

Subject: Re: udev and devfs - The final word

On Wed, Dec 31, [email protected] wrote:

> On Wed, Dec 31, 2003 at 05:20:18PM -0500, Rob Love wrote:
> > On Wed, 2003-12-31 at 17:01, Nathan Conrad wrote:

> > Uh, Unix systems (Linux included) do not use the filename of the device
> > node at all. Those are just names for you, the user.
> >
> > The kernel uses the device number to understand what device user-space
> > is trying to access. The kernel associates the device with a device
> > number. Normally that number is static, and known a priori, so we just
> > create a huge /dev directory with all possible devices and their
> > assigned numbers (you can see these numbers with ls -la).
> >
> > But if the kernel _tells_ user-space what the device number is, for each
> > device as it is created, we do not need a static /dev directory. We can
> > assemble the directory on the fly and device numbers really no longer
> > matter. This is what udev does.
>
> I think you've missed a point here. There are several places where kernel
> deals with device identification.
> a) when normal pathname lookup results in a device node on filesystem.
> That's the regular way.
> b) when we create a new device node; device number is passed to
> ->mknod() and new device node is created. Also a normal codepath.
> c) when late-boot code mounts the final root. It used to be black
> magic, but these days it's done by regular syscalls. Namely, we parse the
> "device name" (most of the work is done by lookups in sysfs), do mknod(2)
> and mount(2). It's still done from the kernel mode, but it could be moved
> to userland. Should be, actually.
> d) when kernel deals with resume/suspend stuff. Currently - black
> magic. Should be moved to early userland (same parser as for final root
> name + mknod on rootfs + open() to get the device in question).
> e) in several pathological syscalls we pass device number to
> identify a device. ustat(2) and its ilk - bad API that can't die.
> f) /dev/raw passes device number to bind raw device to block device.
> Bad API; we probably ought to replace it with saner one at some point.
> g) RAID setup - mix of both pathologies; should be done in userland
> and interfaces are in bad need of cleanup.
> h) nfsd uses device number as a substitute for export ID if said
> ID is not given explicitly. That, BTW, is a big problem for crackpipe
> dreams about random device numbers - export ID _must_ be stable across
> reboots.
> i) mtdblk parses "device name" on boot; should be taken to early
> userland, same as RAID et al.

This is about the /proc/self/mounts format:

Why does it contain stuff like "/dev/root" or "/dev/sda3" or
"/dev/myblockdevice"? Does anyone __really__ care about it? I doubt
that. What I have here (with 2.4) is:

olh@melon:~> cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw 0 0
proc /proc proc rw 0 0
devpts /dev/pts devpts rw 0 0
/dev/vg_melon/abuild /abuild ext3 rw 0 0
/dev/vg_melon/data1 /data1 ext3 rw 0 0
/dev/vg_melon/data2 /data2 ext3 rw 0 0
shmfs /dev/shm shm rw 0 0
automount(pid937) /suse autofs rw 0 0
wotan:/real-home/jplack /suse/jplack nfs rw,nosuid,v3,rsize=8192,wsize=8192,hard,intr,tcp,nolock,addr=wotan 0 0
wotan:/real-home/olh /suse/olh nfs rw,nosuid,v3,rsize=8192,wsize=8192,hard,intr,tcp,nolock,addr=wotan 0 0

Now, that's just fine and it has always been that way.
What if I chroot into /foo, proc is mounted on /foo/proc,
and run fsck /dev/sda3 in that chroot?
That silly app looks at /etc/mtab (oh my...) and starts its work.
Fine. Now, /dev/root is in reality /dev/sda3. Bad for me.

The whole thing would work as expected if /proc/self/mounts had
a sane format:
olh@melon:~> cat /proc/mounts
0:0 / rootfs rw 0 0
8:3 / ext3 rw 0 0
proc /proc proc rw 0 0
devpts /dev/pts devpts rw 0 0
58:0 /abuild ext3 rw 0 0
58:1 /data1 ext3 rw 0 0
58:2 /data2 ext3 rw 0 0
shmfs /dev/shm shm rw 0 0
automount(pid937) /suse autofs rw 0 0
wotan:/real-home/jplack /suse/jplack nfs rw,nosuid,v3,rsize=8192,wsize=8192,hard,intr,tcp,nolock,addr=wotan 0 0
wotan:/real-home/olh /suse/olh nfs rw,nosuid,v3,rsize=8192,wsize=8192,hard,intr,tcp,nolock,addr=wotan 0 0

Now fsck could look for /dev/sda3, realize that it is a block
device node and look for that in the kernel mount table.
If it is mounted, abort with a nice and meaningful error message.
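
A minimal sketch of that check, assuming the hypothetical major:minor format shown above (the real /proc/self/mounts does not print device numbers this way). The function names block_device_number and mount_line_matches are made up for illustration; only stat(2), sscanf(3) and the major()/minor() macros are real interfaces.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Store the device number that a block device node refers to in *out.
 * Returns 0 on success, -1 if the path cannot be stat()ed or is not a
 * block device node.
 */
int block_device_number(const char *path, dev_t *out)
{
    struct stat st;

    if (stat(path, &st) != 0 || !S_ISBLK(st.st_mode))
        return -1;
    *out = st.st_rdev;   /* st_rdev holds the node's major:minor */
    return 0;
}

/*
 * Return 1 if a mount table line in the PROPOSED format starts with a
 * "major:minor" pair equal to dev, 0 otherwise. Lines such as
 * "proc /proc proc rw 0 0" carry no device number and never match.
 */
int mount_line_matches(const char *line, dev_t dev)
{
    unsigned int maj, min;

    if (sscanf(line, "%u:%u", &maj, &min) != 2)
        return 0;
    return maj == major(dev) && min == minor(dev);
}
```

A tool would call block_device_number("/dev/sda3", &dev) and then test every line of the mount table with mount_line_matches(line, dev), refusing to continue on the first match.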

So my question is: why was this strange format invented in the first place?
And: will 2.7 get a sane /proc/self/mounts format for block devices?

--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG


2004-01-07 11:18:37

by Al Viro

Subject: Re: udev and devfs - The final word

On Wed, Jan 07, 2004 at 11:15:59AM +0100, Olaf Hering wrote:
> Now, that's just fine and it has always been that way.
> What if I chroot into /foo, proc is mounted on /foo/proc,
> and run fsck /dev/sda3 in that chroot?
> That silly app looks at /etc/mtab (oh my...) and starts its work.
> Fine. Now, /dev/root is in reality /dev/sda3. Bad for me.

Huh?

> The whole thing would work as expected if /proc/self/mounts had
> a sane format:
> olh@melon:~> cat /proc/mounts
> 0:0 / rootfs rw 0 0
> 8:3 / ext3 rw 0 0
> proc /proc proc rw 0 0
> devpts /dev/pts devpts rw 0 0
> 58:0 /abuild ext3 rw 0 0
> 58:1 /data1 ext3 rw 0 0
> 58:2 /data2 ext3 rw 0 0
> shmfs /dev/shm shm rw 0 0
> automount(pid937) /suse autofs rw 0 0
> wotan:/real-home/jplack /suse/jplack nfs rw,nosuid,v3,rsize=8192,wsize=8192,hard,intr,tcp,nolock,addr=wotan 0 0
> wotan:/real-home/olh /suse/olh nfs rw,nosuid,v3,rsize=8192,wsize=8192,hard,intr,tcp,nolock,addr=wotan 0 0

It's *NOT* a sane format.

> Now fsck could look for /dev/sda3, realize that it is a block
> device node and look for that in the kernel mount table.
> If it is mounted, abort with a nice and meaningful error message.
>
> So my question is: why was this strange format invented in the first place?
> And: will 2.7 get a sane /proc/self/mounts format for block devices?

Yes. It already has one.

Note that you're not only adding ad-hackery (which filesystems get that
major:minor printed and which do not?), you *STILL* haven't solved your
problem. Why? Because you still won't catch e.g. ext3 on /dev/sda5 with
external journal on /dev/sda3. And if you hack parsing ext3 lines in
/proc/mounts, there's always reiserfs, jfs, etc., etc. _And_ there's
RAID with the same problems wrt. access to components. Real funny
when you have raid0 on md0, have md0 mounted and try to fsck one of
components.

Scanning /etc/mtab or /proc/mounts in such situations is wrong. If fsck
is doing that, it's broken. The right way to fix it depends on what you
really want and whatever the hell it is, putting new and new fs-specific
code that would parse /proc/mounts lines into fsck(8) is not an answer.

2004-01-07 13:00:41

by Olaf Hering

Subject: Re: udev and devfs - The final word

On Wed, Jan 07, [email protected] wrote:

> On Wed, Jan 07, 2004 at 11:15:59AM +0100, Olaf Hering wrote:
> > Now, that's just fine and it has always been that way.
> > What if I chroot into /foo, proc is mounted on /foo/proc,
> > and run fsck /dev/sda3 in that chroot?
> > That silly app looks at /etc/mtab (oh my...) and starts its work.
> > Fine. Now, /dev/root is in reality /dev/sda3. Bad for me.
>
> Huh?

In short: no one knows that /dev/sda3 is busy/used.

> Note that you're not only adding ad-hackery (which filesystems get that
> major:minor printed and which do not?), you *STILL* haven't solved your
> problem. Why? Because you still won't catch e.g. ext3 on /dev/sda5 with
> external journal on /dev/sda3. And if you hack parsing ext3 lines in
> /proc/mounts, there's always reiserfs, jfs, etc., etc. _And_ there's
> RAID with the same problems wrt. access to components. Real funny
> when you have raid0 on md0, have md0 mounted and try to fsck one of
> components.

That makes sense. Is there a sane way to inform userland apps that some
stuff is used (mounted, part of a volume group or raid)? Sure, the raid
or lvm specific tools will tell you...

> Scanning /etc/mtab or /proc/mounts in such situations is wrong. If fsck
> is doing that, it's broken. The right way to fix it depends on what you
> really want and whatever the hell it is, putting new and new fs-specific
> code that would parse /proc/mounts lines into fsck(8) is not an answer.

Ok, it was mkfs.minix and an older distro. But still, is '/dev/root' or
'/dev/fred' really correct?

--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG

2004-01-07 13:27:01

by Al Viro

Subject: Re: udev and devfs - The final word

On Wed, Jan 07, 2004 at 02:00:36PM +0100, Olaf Hering wrote:

> Ok, it was mkfs.minix and an older distro.

mkfs should simply pass O_EXCL to open(). Which is what you really want
and yes, it should work on 2.6 (not sure if it got backported to 2.4).
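
A rough sketch of what that could look like in a mkfs-style tool. The helper name open_block_device_exclusive is hypothetical; the behaviour relied on is the one documented in open(2) for Linux 2.6 and later, where O_EXCL without O_CREAT on a block device makes open() fail with EBUSY if the device is in use by the system (for example, mounted).

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open a block device for exclusive use.
 * Returns a file descriptor (>= 0) on success, or a negative errno value:
 *   -ENOTBLK  the path is not a block device node
 *   -EBUSY    the kernel reports the device as in use (Linux 2.6+)
 *   other     whatever stat() or open() reported
 */
int open_block_device_exclusive(const char *path)
{
    struct stat st;
    int fd;

    if (stat(path, &st) != 0)
        return -errno;
    /* O_EXCL without O_CREAT is only defined for block devices. */
    if (!S_ISBLK(st.st_mode))
        return -ENOTBLK;

    fd = open(path, O_RDWR | O_EXCL);
    if (fd < 0)
        return -errno;
    return fd;   /* caller keeps it open while writing, then close()s it */
}
```

The tool would keep the returned descriptor open for as long as it writes to the device, so the exclusive claim is held for the whole operation, and no parsing of /etc/mtab or /proc/mounts is needed.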

2004-01-07 13:28:04

by Olaf Hering

Subject: Re: udev and devfs - The final word

On Wed, Jan 07, [email protected] wrote:

> On Wed, Jan 07, 2004 at 02:00:36PM +0100, Olaf Hering wrote:
>
> > Ok, it was mkfs.minix and an older distro.
>
> mkfs should simply pass O_EXCL to open(). Which is what you really want
> and yes, it should work on 2.6 (not sure if it got backported to 2.4).

Thanks! I will play with it and see what current tools use.

--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG