2001-03-24 14:27:17

by Andries E. Brouwer

[permalink] [raw]
Subject: Larger dev_t

Dear Linus and all,

One of these days we must change dev_t.

There are several aspects to this, but this letter touches
only the kernel-*libc interface.

We need a size, and I am strongly in favor of sizeof(dev_t) = 8;
this is already true in glibc.

The two main uses of dev_t are in struct stat and as parameter
of the mknod system call. There are a few other occurrences,
such as ustat() and the use of dev_t as a field in struct loopinfo
returned by some ioctl.
- For stat all is fine already since we got stat64.
- For mknod a little work is required.
- The state of affairs with loopinfo is sad today (the fact that
kernel and glibc use dev_t of different size causes problems)
but all will be well with 64-bit dev_t.
- With ustat the converse is true. Of course it is obsolete,
but with 64-bit dev_t we are forced to throw it out - libc5
has xustat just like xstat and xmknod, but glibc hasn't so it
is not easy to save it. There are still some programs that use it:
in CD players to test (before eject) whether a CD is mounted;
in various programs such as sendmail and uucp to test how much
free space we have.
So, glibc will have to return EINVAL (or EOVERFLOW) here
for device numbers that actually use more than 16 bits.

The transformation from 64 bits to pair (major,minor) is
if ((major = (dev >> 32)) != 0)
minor = (dev & 0xffffffff);
else if ((major = (dev >> 16)) != 0)
minor = (dev & 0xffff);
else {
major = (dev >> 8);
minor = (dev & 0xff);
}
This means that old device numbers remain valid.

The stuff below describes a working interface where 64-bit values
are passed to and from the kernel, and to and from the filesystem.
That is, it is dev_t stuff. Some other time on kernel-internal matters,
that is, kdev_t stuff.

Details of my setup of today (with 64-bit dev_t):
(i) ext2:

diff -r linux-2.4.2/linux/fs/ext2/inode.c linux-2.4.2kdevt/linux/fs/ext2/inode.c
1076,1078c1076,1081
< } else
< init_special_inode(inode, inode->i_mode,
< le32_to_cpu(raw_inode->i_block[0]));
---
> } else {
> unsigned int lo = le32_to_cpu(raw_inode->i_block[0]);
> unsigned int hi = le32_to_cpu(raw_inode->i_block[1]);
> dev_t devno = ((unsigned long long) hi << 32) | lo;
> init_special_inode(inode, inode->i_mode, devno);
> }
1211,1213c1214,1221
< if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
< raw_inode->i_block[0] = cpu_to_le32(kdev_t_to_nr(inode->i_rdev))
;
< else for (block = 0; block < EXT2_N_BLOCKS; block++)
---
> if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
> /* we use that EXT2_N_BLOCKS > 1 */
> dev_t devno = kdev_t_to_nr(inode->i_rdev);
> unsigned int hi = (devno >> 32);
> unsigned int lo = (devno & 0xffffffff);
> raw_inode->i_block[0] = cpu_to_le32(lo);
> raw_inode->i_block[1] = cpu_to_le32(hi);
> } else for (block = 0; block < EXT2_N_BLOCKS; block++)

Ted, please complain if I am mistaken in thinking that
raw_inode->i_block[1] can be used.
There is a minor conversion problem here: there is no guarantee
that raw_inode->i_block[1] will be zero in old systems.


(ii) vfs:

diff -r linux-2.4.2/linux/fs/devices.c linux-2.4.2kdevt/linux/fs/devices.c
200c200
< void init_special_inode(struct inode *inode, umode_t mode, int rdev)
---
> void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)

(iii) mknod:
Then there is the prototype of mknod.
I changed it for all filesystems to

diff -r linux-2.4.2/linux/fs/ext2/namei.c linux-2.4.2kdevt/linux/fs/ext2/namei.c
387c387,388
< static int ext2_mknod (struct inode * dir, struct dentry *dentry, int mode, int rdev)
---
> static int ext2_mknod (struct inode * dir, struct dentry *dentry, int mode,
> dev_t rdev)

The system call itself cannot easily be changed to take a larger dev_t,
mostly because under old glibc the high order part would be random.
So, mknod64, with

diff linux-2.4.2/linux/fs/namei.c linux-2.4.2kdevt/linux/fs/namei.c
1205c1208
< asmlinkage long sys_mknod(const char * filename, int mode, dev_t dev)
---
> static long mknod_common(const char * filename, int mode, dev_t dev)
1245a1249,1259
> }
>
> asmlinkage long sys_mknod64(const char * filename, int mode,
> unsigned int ma, unsigned int mi)
> {
> return mknod_common(filename, mode, ((dev_t) ma << 32) | mi);
> }
>
> asmlinkage long sys_mknod(const char * filename, int mode, unsigned short dev)
> {
> return mknod_common(filename, mode, dev);

and __NR_mknod64 in unistd.h and .long SYMBOL_NAME(sys_mknod64) in entry.S.

Changes to glibc2:

--- glibc-2.2.1/sysdeps/unix/sysv/linux/xmknod.c Fri Jul 7 19:57:38 2000
+++ glibc-2.2.1mk/sysdeps/unix/sysv/linux/xmknod.c Sat Mar 24 13:53:50 2001
@@ -29,6 +29,13 @@
extern int __syscall_mknod (const char *__unbounded, unsigned short int,
unsigned short int);

+extern int __syscall_mknod64 (const char *__unbounded, unsigned short int,
+ dev_t);
+
+#ifndef __NR_mknod64
+#define __NR_mknod64 223
+#endif
+
/* Create a device file named PATH, with permission and special bits MODE
and device number DEV (which can be constructed from major and minor
device numbers with the `makedev' macro above). */
@@ -36,6 +43,7 @@
__xmknod (int vers, const char *path, mode_t mode, dev_t *dev)
{
unsigned short int k_dev;
+ unsigned int ma, mi;

if (vers != _MKNOD_VER)
{
@@ -43,9 +51,17 @@
return -1;
}

+ if ((*dev >> 16) != 0)
+ {
+ /* need mknod64 */
+ /* pass the 64-bit arg as two 32-bit integers (le) */
+ ma = ((*dev) >> 32);
+ mi = ((*dev) & 0xffffffff);
+ return INLINE_SYSCALL (mknod64, 4, CHECK_STRING (path), mode, ma, mi);
+ }
+
/* We must convert the value to dev_t type used by the kernel. */
k_dev = ((major (*dev) & 0xff) << 8) | (minor (*dev) & 0xff);
-
return INLINE_SYSCALL (mknod, 3, CHECK_STRING (path), mode, k_dev);
}

I almost submitted this as an actual patch, but it changes vfs
prototypes, and probably that is against the rules during a stable series.
If something in this style is OK I'll make a patch as soon as 2.5
is started.

Andries

[the above is running on the machine I send this mail from]


2001-03-24 14:41:26

by Jeff Garzik

[permalink] [raw]
Subject: Re: Larger dev_t

Also for 2.5, kdev_t needs to go away, along with all those arrays based
on major number, and be replaced with either "struct char_device" or
"struct block_device" depending on the device.

I actually went through the kernel in 2.4.0-test days and did this.
Most kdev_t usages should really be changed to "struct block_device".
The only annoyance in the conversion was ROOT_DEV and similar things
that are tied into the boot process. I didn't want to change that and
potentially break the boot protocol...

--
Jeff Garzik | May you have warm words on a cold evening,
Building 1024 | a full moon on a dark night,
MandrakeSoft | and a smooth road all the way to your door.

2001-03-24 15:01:19

by Alexander Viro

[permalink] [raw]
Subject: Re: Larger dev_t



On Sat, 24 Mar 2001, Jeff Garzik wrote:

> Also for 2.5, kdev_t needs to go away, along with all those arrays based
> on major number, and be replaced with either "struct char_device" or
> "struct block_device" depending on the device.
>
> I actually went through the kernel in 2.4.0-test days and did this.
> Most kdev_t usages should really be changed to "struct block_device".
> The only annoyance in the conversion was ROOT_DEV and similar things
> that are tied into the boot process. I didn't want to change that and
> potentially break the boot protocol...

Jeff, check the namespace patch - it simplifies this area big way.
Since it's independent from the namespace code I can pull these
parts into separate patch.

Basic idea: _always_ use ramfs (initially empty) for absolute root.
"real" root is overmounted atop of it. The thing being, by the time
when we deal with initrd, unpacking, mounting real root, etc. we
have full-blown fs context. I.e. we can call sys_mknod(), sys_mount(),
etc. I've pulled the late stages of boot into init/do_mounts.c and
had rewritten them - essentially into sequence of syscalls. The thing
became seriosuly simpler (not to mention the fact that it's not scattered
anymore).

If anyone is interested in seeing it as a separate patch - tell and I'll
do it; it's pretty straightforward.
Al

2001-03-24 16:16:17

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Larger dev_t

> Also for 2.5, kdev_t needs to go away, along with all those arrays

Yes, it has been said many times, and I get the impression
that many people actually did it.

Maybe everybody with code or at least a detailed setup
should demonstrate what was done so that we can compare merits
of several approaches.

The stuff I sent earlier today was the dev_t part.
The next part I hope to send one of these days is the
interface between dev_t and kdev_t.
(Most people think that kdev_t is an integer, I think that
it is a pointer. Since dev_t now can be large and arrays
cannot be used, we need some hash lookup to find the
structure corresponding to the number. And the code is
roughly speaking identical to Al's bdev code, only now used
both for bdev and cdev.)

(Funny enough Al's code does not solve the only small problem
I had six years ago: a mknod with funny numbers does not mean
that some such device actually exists. In reality we only
want to convert number into device pointer when the device is
opened, but the current kernel code does
init_special_inode(inode, mode, rdev);
for a mknod, and if it was a block device
inode->i_bdev = bdget(rdev);
so that it does allocate a struct to this nonsense device.)

Andries

2001-03-25 03:25:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: Larger dev_t



On Sat, 24 Mar 2001 [email protected] wrote:
>
> We need a size, and I am strongly in favor of sizeof(dev_t) = 8;
> this is already true in glibc.

The fact that glibc is a quivering mass of bloat, and total and utter crap
makes you suggest that the Linux kernel should try to be as similar as
possible?

Not a very strong argument.

There is no way in HELL I will ever accept a 64-bit dev_t.

I _will_ accept a 32-bit dev_t, with 12 bits for major numbers, and 20
bits for minor numbers.

If people cannot fit their data in that size, they have some serious
problems. And for people who think that you should have meaningful minor
numbers where the bit patterns get split up some way, I can only say "get
a frigging clue". That's what you have filesystem namespaces for. Don't
try to make binary name-spaces.

And I don't care one _whit_ about the fact that Ulrich Drepper thinks that
it's a good idea to make things too large.

Linus

2001-03-25 12:32:52

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Larger dev_t

From [email protected] Sun Mar 25 05:26:51 2001

On Sat, 24 Mar 2001 [email protected] wrote:
>
> We need a size, and I am strongly in favor of sizeof(dev_t) = 8;
> this is already true in glibc.

The fact that glibc is a quivering mass of bloat, and total and utter crap
makes you suggest that the Linux kernel should try to be as similar as
possible?

Not a very strong argument.

There is no way in HELL I will ever accept a 64-bit dev_t.

I don't care one _whit_ about the fact that Ulrich Drepper thinks that
it's a good idea to make things too large.

Funny.

Now what I wrote is that *I* am strongly in favor of sizeof(dev_t) = 8.
You think that I want bloat - in reality sizeof(dev_t) = 8 makes life
simpler.

My system here has for example in super.c:

static dev_t next_unnamed_device = 0x10000000000ULL;

kdev_t get_unnamed_dev(void) {
return to_kdev_t(next_unnamed_device++);
}

void put_unnamed_dev(kdev_t dev) {
}

a large name space allows one to omit checking what part can be
reused - reuse is unnecessary. That is also why I use a 64-bit pid:
upon a fork one does not have to search for pids, pgrps, sessions
with a given pid, and getpid() can be

static int get_pid(unsigned long flags) {
if (flags & CLONE_PID)
return current->pid;
spin_lock(&lastpid_lock);
++last_pid;
spin_unlock(&lastpid_lock);
return last_pid;
}

fast, simple, avoiding obscure security problems.
Yes, a large name space makes life simpler.

Now concerning this dev_t:
Outside the kernel we have glibc and it is 64 bits.
Inside the kernel we have a pointer to a device struct.
The kernel idea of the size of dev_t only plays a role
on the system call interface.

Really, I see no advantages at all restricting the interface
to something smaller than what user space and kernel use.
And saying "12 bits is enough for a major" somehow sounds funny.

Andries

2001-03-25 14:36:17

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

Jeff Garzik wrote:
>
> Also for 2.5, kdev_t needs to go away, along with all those arrays based
> on major number, and be replaced with either "struct char_device" or
> "struct block_device" depending on the device.
>
> I actually went through the kernel in 2.4.0-test days and did this.
> Most kdev_t usages should really be changed to "struct block_device".
> The only annoyance in the conversion was ROOT_DEV and similar things
> that are tied into the boot process. I didn't want to change that and
> potentially break the boot protocol...

Please se the patches I have send roughly a year to the list as well.
It's actually NOT easy. In esp the SCSI and IDE-CD usage of minor arrays
is
a huge obstacle.

--
- phone: +49 214 8656 283
- job: eVision-Ventures AG, LEV .de (MY OPINIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R

2001-03-25 14:48:56

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

Linus Torvalds wrote:
>
> On Sat, 24 Mar 2001 [email protected] wrote:
> >
> > We need a size, and I am strongly in favor of sizeof(dev_t) = 8;
> > this is already true in glibc.
>
> The fact that glibc is a quivering mass of bloat, and total and utter crap
> makes you suggest that the Linux kernel should try to be as similar as
> possible?
>
> Not a very strong argument.
>
> There is no way in HELL I will ever accept a 64-bit dev_t.
>
> I _will_ accept a 32-bit dev_t, with 12 bits for major numbers, and 20
> bits for minor numbers.
>
> If people cannot fit their data in that size, they have some serious
> problems. And for people who think that you should have meaningful minor
> numbers where the bit patterns get split up some way, I can only say "get
> a frigging clue". That's what you have filesystem namespaces for. Don't
> try to make binary name-spaces.
>
> And I don't care one _whit_ about the fact that Ulrich Drepper thinks that
> it's a good idea to make things too large.

Amen. It's entierly sufficent to take a size similiar to the one
on systems which don't have the problems linux has in this area.
Our daily motto should be: "Maybe we don't know a shit about
OS design - but we known very well up to the ground how Solaris works."

Please forgive me If I stressed your sense of humour a bit too much :-)

2001-03-25 15:36:19

by Wichert Akkerman

[permalink] [raw]
Subject: Re: Larger dev_t

In article <[email protected]>,
<[email protected]> wrote:
>a large name space allows one to omit checking what part can be
>reused - reuse is unnecessary.

You are just delaying the problem then, at some point your uptime will
be large enough that you have run through all 64bit pids for example.

Wichert.


--
________________________________________________________________
/ Generally uninteresting signature - ignore at your convenience \
| [email protected] http://www.liacs.nl/~wichert/ |
| 1024D/2FA3BC2D 576E 100B 518D 2F16 36B0 2805 3CB8 9250 2FA3 BC2D |

2001-03-25 16:16:29

by Mitchell Blank Jr

[permalink] [raw]
Subject: Re: Larger dev_t

Wichert Akkerman wrote:
> You are just delaying the problem then, at some point your uptime will
> be large enough that you have run through all 64bit pids for example.

64 bits is enough to fork 1 million processes per second for over
500,000 years. I think that's putting the problem off far enough.

-Mitch

2001-03-25 16:56:00

by Michel Wilson

[permalink] [raw]
Subject: RE: Larger dev_t

> Wichert Akkerman wrote:
> > You are just delaying the problem then, at some point your uptime will
> > be large enough that you have run through all 64bit pids for example.
>
> 64 bits is enough to fork 1 million processes per second for over
> 500,000 years. I think that's putting the problem off far enough.
>
> -Mitch
> -
Ever thought about how you would kill a process: kill -9 127892752 doesn't
sound very appealing to me.
So you'd also need to implement a mechanism that allows for 'easy' selection
of processes to kill, for example giving every process with the same name
a unique identifier (like httpd_0, httpd_1, httpd_2 and so on).

2001-03-25 17:02:10

by Jamie Lokier

[permalink] [raw]
Subject: Re: Larger dev_t

Mitchell Blank Jr wrote:
> Wichert Akkerman wrote:
> > You are just delaying the problem then, at some point your uptime will
> > be large enough that you have run through all 64bit pids for example.
>
> 64 bits is enough to fork 1 million processes per second for over
> 500,000 years. I think that's putting the problem off far enough.

The year is 2006. IBM's latest supercluster has 1000 boxes, each with 4
x 8-way SMT processors running at 1THz. Dense optical interconnect
provides NUMA-style cache coherency, and the entire system runs like a
giant SMP box (using kernel data structure replication). Each active
thread is able to clone() 500,000,000 threads per second, in a pid space
shared throughout the cluster.

A virus arrives containing while(1){clone();}

Engineers observe pid wraparound approximately 2 weeks later :-)

-- Jamie

2001-03-25 17:08:50

by Anton Altaparmakov

[permalink] [raw]
Subject: RE: Larger dev_t

At 17:54 25/03/2001, Michel Wilson wrote:
> > Wichert Akkerman wrote:
> > > You are just delaying the problem then, at some point your uptime will
> > > be large enough that you have run through all 64bit pids for example.
> >
> > 64 bits is enough to fork 1 million processes per second for over
> > 500,000 years. I think that's putting the problem off far enough.
> >
> > -Mitch
> > -
>Ever thought about how you would kill a process: kill -9 127892752 doesn't
>sound very appealing to me.
>So you'd also need to implement a mechanism that allows for 'easy' selection
>of processes to kill, for example giving every process with the same name
>a unique identifier (like httpd_0, httpd_1, httpd_2 and so on).

Ever heard of cut-and-paste? Surely you can afford a mouse... And for when
you you are not inputting manually but running a script/whatever, who cares
what the numbers are...

Cheers,

Anton


--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2001-03-25 17:13:50

by Jeff Garzik

[permalink] [raw]
Subject: Re: Larger dev_t

Michel Wilson wrote:
> Ever thought about how you would kill a process: kill -9 127892752 doesn't
> sound very appealing to me.

man killall(1). Kill processes by name.

> So you'd also need to implement a mechanism that allows for 'easy' selection
> of processes to kill, for example giving every process with the same name
> a unique identifier (like httpd_0, httpd_1, httpd_2 and so on).

huh?

--
Jeff Garzik | May you have warm words on a cold evening,
Building 1024 | a full moon on a dark night,
MandrakeSoft | and a smooth road all the way to your door.

2001-03-25 17:38:42

by Michel Wilson

[permalink] [raw]
Subject: RE: Larger dev_t

> Ever heard of cut-and-paste? Surely you can afford a mouse... And
> for when
> you you are not inputting manually but running a script/whatever,
> who cares
> what the numbers are...
>
> Cheers,
>
> Anton
Oops. Okay, you're right.

2001-03-25 17:56:23

by Gerry

[permalink] [raw]
Subject: Re: Larger dev_t

Ok, i don't really know much about the kernel at all, but here's my opinion
anyway..

To use 64bit pids when 32bit is enough just to "make things easier" doesn't
sound like a good idea to me. Eventually it might wrap around (fx. as on that
supercomputer Jamie Lokier talked about) to overwrite running processes, and
cause death and destruction. Bye bye stability.

Even if it doesn't wrap, using double the space necessarry for something
every single process has is a waste of space. Linux is supposed to be able to
run on a large range of systems, and some of them don't have that kind of
luxury. Sure, the kernel can be modified for those (rare) cases, but still,
using something that's not necessary just sounds like bad practice to me..

Never assume luxury..

Gerry

2001-03-25 18:22:14

by Guest section DW

[permalink] [raw]
Subject: Re: Larger dev_t

On Sun, Mar 25, 2001 at 05:35:01PM +0200, Wichert Akkerman wrote:
> In article <[email protected]>,
> <[email protected]> wrote:
> >a large name space allows one to omit checking what part can be
> >reused - reuse is unnecessary.
>
> You are just delaying the problem then, at some point your uptime will
> be large enough that you have run through all 64bit pids for example.
>
> Wichert.

Yes indeed. If my box, after continually spawning 1000000000 processes
per second for 500 years crashes because pid_t overflows, I'll think
about whether I should put the test back in, or should upgrade to a
128-bit machine.

Andries

2001-03-25 20:30:43

by diego

[permalink] [raw]
Subject: Re: Larger dev_t

On Sun, 25 Mar 2001, Guest section DW wrote:

> On Sun, Mar 25, 2001 at 05:35:01PM +0200, Wichert Akkerman wrote:
> > In article <[email protected]>,
> > <[email protected]> wrote:
> > >a large name space allows one to omit checking what part can be
> > >reused - reuse is unnecessary.
> >
> > You are just delaying the problem then, at some point your uptime will
> > be large enough that you have run through all 64bit pids for example.
> >
> > Wichert.
>
> Yes indeed. If my box, after continually spawning 1000000000 processes
> per second for 500 years crashes because pid_t overflows, I'll think
> about whether I should put the test back in, or should upgrade to a
> 128-bit machine.

this is a no point thread, we are not going to live 500 years? a 64 bits
space is more that we are going to need anyway...

Juan Diego

2001-03-26 21:28:37

by John Byrne

[permalink] [raw]
Subject: Re: Larger dev_t

> Re: Larger dev_t
>
On Sat Mar 24 2001 Linus Torvalds ([email protected]) wrote:
> There is no way in HELL I will ever accept a 64-bit dev_t.
>
> I _will_ accept a 32-bit dev_t, with 12 bits for major numbers, and 20
> bits for minor numbers.
>

Do you have any interest in doing away with the concept of major and
minor numbers altogether; turning the dev_t into an opaque unique id?

At the application level, the kinds of information that is derived from
the major/minor number should probably be derived in some other manner
such as a library or system call. Code that determines device type by
comparing with the major/minor numbers should probably be discouraged in
the long run and this could be a good time to start.

John Byrne

2001-03-26 22:15:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: Larger dev_t



On Mon, 26 Mar 2001, John Byrne wrote:

> > Re: Larger dev_t
> >
> On Sat Mar 24 2001 Linus Torvalds ([email protected]) wrote:
> > There is no way in HELL I will ever accept a 64-bit dev_t.
> >
> > I _will_ accept a 32-bit dev_t, with 12 bits for major numbers, and 20
> > bits for minor numbers.
>
> Do you have any interest in doing away with the concept of major and
> minor numbers altogether; turning the dev_t into an opaque unique id?

Inside the kernel we'll eventually do that.

However, outside the kernel you still need the notion of device numbers if
for no other reasons than legacy /dev space (other applications like 'tar'
care too, but they only care about uniqueness, not about much else).

> At the application level, the kinds of information that is derived from
> the major/minor number should probably be derived in some other manner
> such as a library or system call.

It is. It's called "stat()", and a lot of people do depend on a
device number being available. Few people care what that number actually
_is_, though.

So device numbers aren't going away, they are very much part of the UNIX
legacy. We don't need to care about them too much inside the kernel,
though. What most drivers really want to know is "sub-unit number", and
not much else.

Linus

2001-03-26 23:42:48

by Guest section DW

[permalink] [raw]
Subject: Re: Larger dev_t

On Mon, Mar 26, 2001 at 01:18:06PM -0800, John Byrne wrote:

> Do you have any interest in doing away with the concept of major and
> minor numbers altogether; turning the dev_t into an opaque unique id?
>
> At the application level, the kinds of information that is derived from
> the major/minor number should probably be derived in some other manner
> such as a library or system call. Code that determines device type by
> comparing with the major/minor numbers should probably be discouraged in
> the long run and this could be a good time to start.

Programs that use explicit major/minor information are probably broken
or at least very nonportable.
On the other hand, unfortunately the Unix API has a few explicit
occurrences of major/minor. For example, one has ls(1) and mknod(1).

2001-03-27 06:05:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: Larger dev_t



On Sun, 25 Mar 2001 [email protected] wrote:
>
> Now what I wrote is that *I* am strongly in favor of sizeof(dev_t) = 8.
> You think that I want bloat - in reality sizeof(dev_t) = 8 makes life
> simpler.
>
> My system here has for example in super.c:
>
> static dev_t next_unnamed_device = 0x10000000000ULL;
>
> kdev_t get_unnamed_dev(void) {
> return to_kdev_t(next_unnamed_device++);
> }

Fine.

You'e now forced every piece of code that needs a dev_t to carry along the
overhead of having a 64-bit field, for the advantage of making
"get_unnamed_dev()" smaller and faster.

The thing is, I have _never_ EVER seen "get_unnamed_dev()" on any kernel
profile.

And I don't remember when (if ever) we had a bug in it.

So the advantage of making it smaller/faster/simpler seems to be a purely
theoretical one.

> a large name space allows one to omit checking what part can be
> reused - reuse is unnecessary. That is also why I use a 64-bit pid:
> upon a fork one does not have to search for pids, pgrps, sessions
> with a given pid, and getpid() can be

Hey, 5 years ago we could have said the same for a 32-bit pid.

The fact is, that there are programs out there that use "int" for pids.

It's equally true that changing "pid_t" will require that you recompile
every single app that might have a kernel interface to the current 32-bit
pid_t.

AND you just created tons of problems for things like the non-obvious
stuff like

ioctl(fd, FASETOWN, arg);

because "arg" is defined to be a single word.

In short, you've just broken existing binaries in ways that will be _damn_
hard to debug (they magically start breaking only after the pid-space has
wrapped the first 32 bits).

And that's a DOCUMENTED interface. Never mind all the undocumented stuff
that assumes (for all the reasonable historical reasons) that "pid" fits
in an "int". Tell me there aren't applications like that, and I'll laugh
in your face.

In short, both your arguments are totally bogus. Your "simpler" function
is in fact a horrible rats nest and a source of subtle bugs that you
apparently never even thought about.

And that's without ever actually mentioning the word "bloat" and "data
cache usage".

Linus

2001-03-27 09:31:01

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Larger dev_t

> On Sun, 25 Mar 2001 [email protected] wrote:

Aha - after your previous explosion I had concluded that working
on *dev_t was useless, but it seems we are still talking.

>> [a large name space is useful since it allows new types of usage]

> Fine.
> You'e now forced every piece of code that needs a dev_t
> to carry along the overhead of having a 64-bit field

Let me repeat: there is no such code. In user space dev_t already is
64 bits, whether you like it or not. We cannot go back to libc5.

In kernel space we all want to use pointers to a device struct,
and major and minor are fields in that struct. There is no advantage
in making those fields narrow. And what is carried around is the
pointer, a 32-bit object.

In other words, inside the kernel the normal obvious coding will
give us ints major, minor. Outside the kernel we have a 64-bit dev_t.
And there is only the interface of system calls that uses this
narrow 16-bit straw. Of course, making it 32-bit is an improvement
but I can really not see any reason to make things difficult for
ourselves by widening this straw to 32-bits only.
Changing kernel and filesystems and glibc is a bit of a hassle -
not very difficult, but we started six years ago and still have
not finished - so it is better to do things right at once.


> It's equally true that changing "pid_t" will require that you recompile
> every single app that might have a kernel interface to the current 32-bit
> pid_t.

It was an example showing the advantage of having a 64-bit object.
Code gets simpler and faster.
But while dev_t already is 64-bits in user space, the same does not
hold for pid_t. I think that I once sent you a patch that would make
the pid use 2^31 instead of 2^15 values. Changing the size of pid_t
is not really possible, it would require a new version of the glibc
library.

In short, both your arguments are totally bogus. Your "simpler" function
is in fact a horrible rats nest and a source of subtle bugs that you
apparently never even thought about.

And that's without ever actually mentioning the word "bloat" and "data
cache usage".

Not so pessimistic. Think and reconsider.

Andries



2001-03-27 18:50:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: Larger dev_t



On Tue, 27 Mar 2001 [email protected] wrote:
>
> Let me repeat: there is no such code. In user space dev_t already is
> 64 bits, whether you like it or not. We cannot go back to libc5.

Now you're back to the argument that "glibc is bloated, so we might as
well be too".

The fact is, that I don't like that argument. I don't buy into that kind
of philosophy. If somebody else made a mistake, that doesn't force me to
do the same mistake.

> In kernel space we all want to use pointers to a device struct,
> and major and minor are fields in that struct. There is no advantage
> in making those fields narrow. And what is carried around is the
> pointer, a 32-bit object.

Agreed. I'm not worried about that part.

But I have a holy crusade. I dislike waste. I dislike over-engineering. I
absolutely detest the "because we can" mentality. I think small is
beautiful, and the guildeline should always be that performance and size
are more important than features.

> Of course, making it 32-bit is an improvement
> but I can really not see any reason to make things difficult for
> ourselves by widening this straw to 32-bits only.

I don't see the "difficulty".

The fact is, that there are two uses for dev_t's:
- the "unnamed" ones. Where we've had 8 bits of namespace for ten years,
and even now it's only mildly painful (ie most people never notice). We
probably don't need more than 10-11 bits in practice (even with
automounting on large sites, I doubt you'll find a site that needs to
have mroe than a few thousand mounts active at the same time). 20 bits
is _plenty_.

- /dev.

And let's take a look at /dev. Do a "ls -l /dev" and think about it. Every
device needs a unique number. Do you ever envision seeing that "ls -l"
taking about 500 billion years to complete? I don't. I don't think you do.
But that's how ludicrous a 64-bit device number is.

So in /dev, there are two problems: we are getting painfully close to
major numbers with 8 bits, and we've run out of minors several times. In
fact, a lot of the reason for the dearthness of major numbers is the fact
that we use multiple majors for some stuff that really wants many minors.

So 8 bits for major is actually fairly close to perfectly livable - or
at least would be if we had more minors. And there is no question about
it: you need a lookup table for major numbers. Which means that 32 bits of
major numbers is ridiculous. As is 20. Which is why I suggested 12. A nice
size, that is reasonable in real life, and that can easily be used for
table lookups. It's also sixteen times larger than what we have today,
which would probably be acceptable in itself.

For minors, we have the problem of "dynamic" devices. The main one
probably being pty's, in fact. It's easily conceivable to have thousands
of pty's - I suspect that for various other reasons most system
administrators would prefer to farm things out so that it isn't _needed_,
but clearly we want at _least_ 16 bits here. 20 bits is reasonable.

And remember: for the future, what we want to move towards is _name_
lookup, not device number lookup. Stupid SCSI people have wanted to
partition the minor number for a long time, and that's always been
idiotic. If you have a sparse name-space, you should use names, not
numbers.

So people who want to see /dev/scsi/bus0/dev12/lun4/part0, use devfs or
something, don't try to make the number space be sparse. Sparse numbers
are a stupid idea for _anything_ but maybe CPU design (I'm willing to
concede that a 256-bit address space might be useful on a CPU level,
because a CPU really cannot afford to do name lookups when looking up
addresses, even if it has been tried).

In short, a 64-bit dev_t is unnecessary. And according to the maxim of
"don't go overboard just because you _can_", I don't want to see it.

Also, I have looked at your argument for "simplicity", and I dismiss it. I
do not believe that the cases you claim are "simpler" are really simpler.
I showed that your pid_t example was completely unrealistic, and as far as
I can tell your "dev_t" example absolutely _hinges_ on the fact that it
makes anonymous dev_t allocation simpler.

And that falls flat on its face simply because it's _already_ so simple
that it doesn't matter.

Linus

2001-03-27 19:31:10

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Larger dev_t

This is my opinion on the issue. Short summary: "I'm sick of the
administrative burden associated with keeping dev_t dense."

Linus Torvalds wrote:
>
> And let's take a look at /dev. Do a "ls -l /dev" and think about it. Every
> device needs a unique number. Do you ever envision seeing that "ls -l"
> taking about 500 billion years to complete? I don't. I don't think you do.
> But that's how ludicrous a 64-bit device number is.
>

That's how ludicrous a *dense* 64-bit device number is. I have to say I
disagree with you that sparse number spaces are a bad idea. The
IPv4->IPv6 transition people have looked at the issues of number spaces
and how much harder they get to keep dense when the size of the
numberspace grows, because your lookup operation becomes so much more
painful. Any time you have to take a larger number space and squeeze it
into a smaller number space, you get some serious pain.

Part of the reason we haven't -- quite -- run out of 8-bit majors yet is
because I have been an absolute *bastard* with registrants lately. It
would cut down on my workload if I could assign majors without worrying
too much about whether or not that particular driver is really going to
be made public.

64 bits is obviously excessive, but I really don't feel comfortable
saying that only 12 bits of major is sufficient. 16 I would buy, but I
don't think 16 bits of minor is sufficient. Given that, it seems to me
-- especially since dev_t isn't exactly the most accessed data type in
the universe -- that the conceptual simplicity of keeping the major and
minor separate in individual 32-bit words really is just as well. YES,
it's overengineering, but the cost is very small; the cost of
underengineering is having to go through yet another painful transition.
Unfortunately, the Linux community seems to have some serious problems
with getting system-wide transitions to happen, especially the ones that
involve ABI changes. This needs to be taken into account.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-03-27 19:28:41

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: Larger dev_t

[email protected] writes:
> [Linus Torvalds]

>> You'e now forced every piece of code that needs a dev_t
>> to carry along the overhead of having a 64-bit field
>
> Let me repeat: there is no such code. In user space dev_t already is
> 64 bits, whether you like it or not. We cannot go back to libc5.
...
> In other words, inside the kernel the normal obvious coding will
> give us ints major, minor. Outside the kernel we have a 64-bit dev_t.
...
> But while dev_t already is 64-bits in user space, the same does not

In your dreams!!!!

int c_has_loose_type_checking(char *name){
struct stat sbuf;
/* ... */
return sbuf.st_rdev;
}

Then we have NFSv2, archive file formats, and zillions of
little tools.

I enjoy truncating dev_t to a reasonable size. Sometimes I check
my input arguments for illogically huge values, and other times I
just relish the opportunity to inflict data loss on you personally.

2001-03-27 19:53:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: Larger dev_t



On Tue, 27 Mar 2001, H. Peter Anvin wrote:
>
> Part of the reason we haven't -- quite -- run out of 8-bit majors yet is
> because I have been an absolute *bastard* with registrants lately. It
> would cut down on my workload if I could assign majors without worrying
> too much about whether or not that particular driver is really going to
> be made public.

No.

The real problem, in my opinion, is needing device numbers in the first
place, for stuff that really shouldn't use them.

I don't want to make allocation easy. In fact, I want to make it _harder_.
I like it being painful, because it should not be done.

I've seen _way_ too many instances of "let's create a special device" for
no good reason. For example, all the crap about mice was (and is) a
mistake. And that's the least of the problems. Some devices on the device
list are there mainly as just a way to hook in an ioctl or something. It's
sad, and it's wrong.

And I'm sorry, but I do NOT want to envision a future where you can say
"ok, majors in the range 512-576 are PPC-specific, and you can go wild".
Yes, it would make your job easier. But it would make for a BAD SYSTEM,
which is what _I_ care about.

We should encourage people to not need major numbers. It's easy. The
driver exports a /proc entry in /proc/driver/xxx or similar . Or the
driver writer says "if you want to use this device, use devfs", and
exports the name there.

Don't get the issue of "it would make my life easier" override the issue
of "it's the wrong thing to do".

Another example: all the stupid pseudo-SCSI drivers that got their own
major numbers, and wanted their very own names in /dev. They are BAD for
the user. Install-scripts etc used to be able to just test /dev/hd[a-d]
and /dev/sd[0-x] and they'd get all the disks. Deficiencies in the SCSI
layer made it impossible for a driver writer to be nice to the user, so
instead they got their own major numbers.

But again, you're arguing for _more_ badness. While I'm of the opinion
that we _already_ have too many major numbers, and we should realize that,
and not make it worse.

A 64-bit dev_t only makes it _easier_ to continue to be stupid about
things.

And that, btw, is the hallmark of "bloat". Bloat is not about being big.
Bloat is about being slow and stupid and not realizing that it's because
of design mistakes.

Linus

2001-03-27 21:20:48

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> Another example: all the stupid pseudo-SCSI drivers that got their own
> major numbers, and wanted their very own names in /dev. They are BAD for
> the user. Install-scripts etc used to be able to just test /dev/hd[a-d]
> and /dev/sd[0-x] and they'd get all the disks. Deficiencies in the SCSI

Sorry here I have to disagree. This is _policy_ and does not belong in the
kernel. I can call them all /dev/hdfoo or /dev/disc/blah regardless of
major/minor encoding. If you dont mind kernel based policy then devfs
with /dev/disc already sorts this out nicely.

IMHO more procfs crud is also not the answer. procfs is already poorly
managed with arbitary and semi-random namespace. Its a beautiful example of
why adhoc naming is as broken as random dev_t allocations. Maybe Al Viro's
per device file systems solve that.

> layer made it impossible for a driver writer to be nice to the user, so
> instead they got their own major numbers.

Not deficiencies in the SCSI layer, there is no way the scsi layer can
handle high end raid controllers. In fact one of the reasons we can beat
NT with some of these controllers is because NT does exactly what you
suggest with scsi miniport driver hacks and it _sucks_. Its an ugly hack.

A 20bit minor space actually solves most of this anyway. All the drivers
taking 8 majors suddenely need only one. We go back to 1 major per raid
controller class worst case. and we just about have enough minor numbers for the
extreme S/390 configuration of 65536 DASD volumes with 16 partitions per
volume.

12:20 sounds good to me. If need be we can have extend the small allocations
space as we have with 10,* now.

Alan

2001-03-27 21:37:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: Larger dev_t



On Tue, 27 Mar 2001, Alan Cox wrote:
>
> > layer made it impossible for a driver writer to be nice to the user, so
> > instead they got their own major numbers.
>
> Not deficiencies in the SCSI layer, there is no way the scsi layer can
> handle high end raid controllers. In fact one of the reasons we can beat
> NT with some of these controllers is because NT does exactly what you
> suggest with scsi miniport driver hacks and it _sucks_. Its an ugly hack.

We could do this fairly _trivially_ today.

With absolutely no performance degradation.

With a simple "queue" mapping for the SCSI majors. Just look up which
queue to use for requests to which major, and you're done. The actual
IO may by-pass the SCSI layer altogether.

So I'm absolutely not advocating using the SCSI layer for the
high-end-disks. Rather the reverse. I'm advocating the SCSI layer not
hogging a major number, but letting low-level drivers get at _their_
requests directly.

Linus

2001-03-27 22:03:37

by Andre Hedrick

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, 27 Mar 2001, Linus Torvalds wrote:

>
>
> On Tue, 27 Mar 2001, Alan Cox wrote:
> >
> > > layer made it impossible for a driver writer to be nice to the user, so
> > > instead they got their own major numbers.
> >
> > Not deficiencies in the SCSI layer, there is no way the scsi layer can
> > handle high end raid controllers. In fact one of the reasons we can beat
> > NT with some of these controllers is because NT does exactly what you
> > suggest with scsi miniport driver hacks and it _sucks_. Its an ugly hack.
>
> We could do this fairly _trivially_ today.
>
> With absolutely no performance degradation.
>
> With a simple "queue" mapping for the SCSI majors. Just look up which
> queue to use for requests to which major, and you're done. The actual
> IO may by-pass the SCSI layer altogether.
>
> So I'm absolutely not advocating using the SCSI layer for the
> high-end-disks. Rather the reverse. I'm advocating the SCSI layer not
> hogging a major number, but letting low-level drivers get at _their_
> requests directly.

Am I hearing you state you want dynamic device points and dynamic majors?
Thus would be nice because the ridge structure now prevents a lot if
things from developing.

Andre Hedrick
Linux ATA Development
ASL Kernel Development
-----------------------------------------------------------------------------
ASL, Inc. Toll free: 1-877-ASL-3535
1757 Houret Court Fax: 1-408-941-2071
Milpitas, CA 95035 Web: http://www.aslab.com

2001-03-27 22:18:27

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Larger dev_t

Alan Cox wrote:
>
> > high-end-disks. Rather the reverse. I'm advocating the SCSI layer not
> > hogging a major number, but letting low-level drivers get at _their_
> > requests directly.
>
> A major for 'disk' generically makes total sense. Classing raid controllers
> as 'scsi' isnt neccessarily accurate. A major for 'serial ports' would also
> solve a lot of misery
>

But it might also cause just as much misery, specifically because things
move around too much.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-03-27 22:16:17

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> high-end-disks. Rather the reverse. I'm advocating the SCSI layer not
> hogging a major number, but letting low-level drivers get at _their_
> requests directly.

A major for 'disk' generically makes total sense. Classing raid controllers
as 'scsi' isnt neccessarily accurate. A major for 'serial ports' would also
solve a lot of misery

2001-03-27 22:15:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Larger dev_t

Alan Cox wrote:
>
> > Another example: all the stupid pseudo-SCSI drivers that got their own
> > major numbers, and wanted their very own names in /dev. They are BAD for
> > the user. Install-scripts etc used to be able to just test /dev/hd[a-d]
> > and /dev/sd[0-x] and they'd get all the disks. Deficiencies in the SCSI
>
> Sorry here I have to disagree. This is _policy_ and does not belong in the
> kernel. I can call them all /dev/hdfoo or /dev/disc/blah regardless of
> major/minor encoding. If you dont mind kernel based policy then devfs
> with /dev/disc already sorts this out nicely.
>
> IMHO more procfs crud is also not the answer. procfs is already poorly
> managed with arbitary and semi-random namespace. Its a beautiful example of
> why adhoc naming is as broken as random dev_t allocations. Maybe Al Viro's
> per device file systems solve that.
>

In some ways, they make matters worse -- now you have to effectively keep
a device list in /etc/fstab. Not exactly user friendly.

devfs -- in the abstract -- really isn't that bad of an idea; after all,
device names really do specify an interface. Something I suggested also,
at some point, was to be able to pass strings onto character device
drivers (so that if /dev/foo is a char device, /dev/foo/bar would access
the same device with the string "bar" passed on to the device driver --
this would help deal with "same device, different options" such as
/dev/ttyS0 versus /dev/cua0 -- having flags to open() is really ugly
since there tends to be no easy way to pass them down through multiple
layers of user-space code.)

The problems with devfs (other than kernel memory bloat, which is pretty
much guaranteed to be much worse than the bloat a larger dev_t would
entail) is that it needs complex auxilliary mechanisms to make
"chmod /dev/foo" work as expected (the change to /dev/foo is to be
permanent, without having to edit some silly config file) -- this is
where the policy comes in, much more so than namespace -- and the fact
that it tries to impose a namespace on character devices which is utterly
different from the currently established interface. It may very well be
"better" (although /dev/misc/ is much too ugly to live -- if you have to
separate things up, do so on functional lines!!!), but it is still
*different*, which means it breaks anything that accesses char devices.
Block devices, obviously, is not a problem -- that's what /etc/fstab is
for.

At OLS, I discussed the following issues with Richard and Alan. We
didn't really reach an agreement -- I hope we can discuss it again at the
kernel summit -- but I wouldn't object to devfs if it resolved these
issues:

a) A way to allocate device nodes without automatically instantiating
them in kernel space. For devices where each minor doesn't require
kernel memory until used, the devfs overhead easily becomes unacceptable.

b) Use the established namespace, or put forward a comprehensive plan to
alter the namespace -- and do the necessary legwork to obtain buy-in from
everyone concerned. In the case of tty's, this means modifying the
locking protocol; this in itself isn't a bad thing (the locking protocol
has some serious flaws), but it needs to be explicit, written down, and
widely publicized, well ahead of time. A flag day of this magnitude will
*HURT*.

c) Make sure chown/chmod/link/symlink/rename/rm etc does the right thing,
without the need for "tar hacks" or anything equivalently gross.


Richard indicated being willing to fix (a) and (c). (b) is the main
sticking point at this stage.

That being said, I will be perfectly happy to acknowledge that using a
device filesystem has some nice features, especially in conjunction with
hotplugging devices. It is definitely far better than /proc hackery, and
does permit putting the object-oriented aspects of the VFS to good
advantage. The bloat is an issue, but with the memory sizes available
today it's less than it has been in the past.

Modulo the issues I have listed above, I would at this stage be in favour
to move to a devfs-based system, especially after Al Viro's "one
filesystem" (filesystem always exists in exactly one copy, regardless of
if it is mounted or not) changes. I know this is probably a bit of a
shock to lots of people, but times change; hotplugging is a major issue
these days, big memories are available without requiring a matching big
budget, and there seems to be a bigger willingness to work out the
remaining issues. What I would like to see is working out the issues
listed above, and then rather quickly move to a devfs-*BASED* system
(devfs is the only way to do devices), so that we can take advantage of
the VFS.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-03-27 22:39:47

by Jesse Pollard

[permalink] [raw]
Subject: Re: Larger dev_t

--------- Received message begins Here ---------

>
> Alan Cox wrote:
> >
> > > high-end-disks. Rather the reverse. I'm advocating the SCSI layer not
> > > hogging a major number, but letting low-level drivers get at _their_
> > > requests directly.
> >
> > A major for 'disk' generically makes total sense. Classing raid controllers
> > as 'scsi' isnt neccessarily accurate. A major for 'serial ports' would also
> > solve a lot of misery
> >
>
> But it might also cause just as much misery, specifically because things
> move around too much.

That can be handled. It calls for using a volume name or UUID on file
systems and allowing mount to accept the volume name.

One way would be to add the volume identifier (whatever it ends up being)
to the /proc/partitions file. Then mount could search that table for
the volume name and use the associated device definitions to accomplish
the mount.

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2001-03-27 22:44:37

by Russell King

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, Mar 27, 2001 at 02:16:37PM -0800, H. Peter Anvin wrote:
> Alan Cox wrote:
> > A major for 'disk' generically makes total sense. Classing raid controllers
> > as 'scsi' isnt neccessarily accurate. A major for 'serial ports' would also
> > solve a lot of misery
>
> But it might also cause just as much misery, specifically because things
> move around too much.

Actually, it probably won't. As has already been said in the past, the
names are effectively a user space issue, but major numbers aren't.

I for one would like to see a major number for all 'serial ports' whether
they be embedded ARM serial ports _or_ standard 16550 ports, but at the
moment its not easily acheivable without introducing more mess.

Ted indicated to me a while ago (just after I wrote serial_core.c for
yet-another-type-of-ARM-serial-port) his visions of the direction serial
stuff should take in 2.5; this is obviously one of the things that I'm
keen to discuss and solve in 2.5.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2001-03-27 22:46:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Larger dev_t

Jesse Pollard wrote:
> > >
> > > > high-end-disks. Rather the reverse. I'm advocating the SCSI layer not
> > > > hogging a major number, but letting low-level drivers get at _their_
> > > > requests directly.
> > >
> > > A major for 'disk' generically makes total sense. Classing raid controllers
> > > as 'scsi' isnt neccessarily accurate. A major for 'serial ports' would also
> > > solve a lot of misery
> > >
> >
> > But it might also cause just as much misery, specifically because things
> > move around too much.
>
> That can be handled. It calls for using a volume name or UUID on file
> systems and allowing mount to accept the volume name.
>
> One way would be to add the volume identifier (whatever it ends up being)
> to the /proc/partitions file. Then mount could search that table for
> the volume name and use the associated device definitions to accomplish
> the mount.
>

Since when have serial ports had a UUID or volume name?

Seriously, folks, don't look too much at block devices, especially not
block devices that are mounted. That's the easy -- nay, trivial --
case. Char devices is where the rubber hits the road.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-03-27 22:57:37

by Dan Hollis

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, 27 Mar 2001, H. Peter Anvin wrote:
> c) Make sure chown/chmod/link/symlink/rename/rm etc does the right thing,
> without the need for "tar hacks" or anything equivalently gross.

write-through filesystem, like overlaying a r/w ext2 on top of an iso9660
fs.

-Dan

2001-03-27 23:00:07

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Larger dev_t

Dan Hollis wrote:
>
> On Tue, 27 Mar 2001, H. Peter Anvin wrote:
> > c) Make sure chown/chmod/link/symlink/rename/rm etc does the right thing,
> > without the need for "tar hacks" or anything equivalently gross.
>
> write-through filesystem, like overlaying a r/w ext2 on top of an iso9660
> fs.
>

This is not necessarily the right way to do it, since it may not carry
with it the appropriate information. Richard, I belive, was planning to
implement this using devfsd.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-03-27 23:43:44

by Richard Gooch

[permalink] [raw]
Subject: Re: Larger dev_t

H. Peter Anvin writes:
> Dan Hollis wrote:
> >
> > On Tue, 27 Mar 2001, H. Peter Anvin wrote:
> > > c) Make sure chown/chmod/link/symlink/rename/rm etc does the right thing,
> > > without the need for "tar hacks" or anything equivalently gross.
> >
> > write-through filesystem, like overlaying a r/w ext2 on top of an iso9660
> > fs.
>
> This is not necessarily the right way to do it, since it may not
> carry with it the appropriate information. Richard, I belive, was
> planning to implement this using devfsd.

I did, back in April 2000. I'm fairly sure I told you at OLS :-)

Create and change events can be passed to devfsd and this may be
recorded in a filesystem.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-03-27 23:48:14

by Andrew Pimlott

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, Mar 27, 2001 at 02:13:47PM -0800, H. Peter Anvin wrote:
> The problems with devfs (other than kernel memory bloat, which is pretty
> much guaranteed to be much worse than the bloat a larger dev_t would
> entail) is that it needs complex auxilliary mechanisms to make
> "chmod /dev/foo" work as expected (the change to /dev/foo is to be
> permanent, without having to edit some silly config file)

The elegant solution seems obvious to me. What we have today is two
namespaces--device major/minor, and filesystem--that are bridged by
special files. Special files live in the filesystem namespace and
point into the major/minor namespace. Objects in the major/minor
namespace are directly accessible only by root (ie, only root can
mknod(2)); but when accessed through special files, access control
comes from the special file.

The concept that makes this work is that the special file is a
"pointer with permissions". To make devfs work, you want the same
thing--except a pointer into filesystem space, not major/minor
space. Unix doesn't have this, but it would be a simple cross of
symlinks (pointer living in the filesystem and pointing into the
filesystem) and special files (pointers with permissions).

To be concrete: You'd have a root-only (or perhaps the directories
could be a+rx--but minimal policy) hierarchy under /devices, and the
admin would populate /dev with "special symlinks" that point into
/devices, and give the appropriate permissions to users.

I have no idea whether this is feasible, but it is much more
attractive to me than devfsd, or layered mounts, or tar at
shutdown, or anything else I've heard.

Andrew

2001-03-27 23:59:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: Larger dev_t



On Tue, 27 Mar 2001, Andre Hedrick wrote:
>
> Am I hearing you state you want dynamic device points and dynamic majors?

Yes and no.

We need static structures for user space - from a user perspective it
makes a ton more sense to say "I want to see all disks" than it does to
know that you have to do /dev/hd*, /dev/sd* plus all the extra magic
combinations that can happen (USB etc).

So in a sense what I'm arguing for is for _stricter_ device numbers to the
outside world.

But internally, it would be reasonably easy to make a mapping from those
user-visible numbers to a much looser version.

One example of this is going to happen very early in 2.5.x: the whole
"partitioning" stuff is going to go away from the driver, and into the
ll_rw_block layer as just another disk re-mapping thing. We already do
those kinds of re-mappings for LVM reasons anyway, and partitioning is not
something a disk driver should know about, really.

And that kind of partitioning mapping automatically means that we'd need
to remap minor numbers, and do it on a per-major basis (because the
partitioning mapping right now is not actually the same between SCSI and
IDE: IDE uses six bits of partitioning, while SCSI uses just four bits).
And once you do that, you might as well start "remapping" major numbers
too.

So let's say that you have two separate SCSI controllers - they would both
show up on major #8, and different minor numbers. Right now, for example,
controller 1 might have one disk, with minors 0-15 (for the whole disk and
15 partitions), and controller 2 might have two disks using minors 16-47.

As it stands now, the SCSI layer needs to do the remapping, and because
the SCSI layer does the remapping, nothing but SCSI layer devices can use
major #8.

But once you start doing partition mapping in ll_rw_block.c, you might as
well get rid of the notion that "SCSI is major 8". You could easily have
many different drivers, with many different queues, and remap them all to
have major 8 (and different minors) so that it looks simple for a user
that just wants to see SCSI disks.

Which is not to say that the same disk might not show up somewhere else
too, if anybody wants it to. The _driver_ should just know "unit x on
queue y", and then the driver might do whatever it wants (it might be, for
example, that the driver actually wants to show multiple controllers as
one queue, if the driver really wants to for some reason). And it should
be possible to have two drivers that really have no idea at ALL about each
other to just share the same major numbers.

Linus

2001-03-29 03:54:55

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> Why do you worry about installers? New distro - new kernel - new
> installer

Because the same code tends to be shared with post install configuration
tools too.


2001-03-29 11:16:22

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

Alan Cox wrote:
>
> > Why do you worry about installers? New distro - new kernel - new
> > installer
>
> Because the same code tends to be shared with post install configuration
> tools too.

So change them as well for a new distribution. What's there problem.
There isn't anything out there you can't do by hand.
Fortunately so!

2001-03-30 07:12:10

by kaih

[permalink] [raw]
Subject: Re: Larger dev_t

[email protected] (Martin Dalecki) wrote on 28.03.01 in <[email protected]>:

> Alan Cox wrote:
> >
> > > Exactly. It's just that for historical reasons, I think the major for
> > > "disk" should be either the old IDE or SCSI one, which just can show
> > > more devices. That way old installers etc work without having to
> > > suddenly start knowing about /dev/disk0.
> >
> > They will mostly break. Installers tend to parse /proc/scsi and have
> > fairly complex ioctl based relationships based on knowing ide v scsi.
> >
> > /dev/disc/ is a little un-unix but its clean
>
> Why do you worry about installers? New distro - new kernel - new
> installer
> that's they job to worry about it. They will change the installer anyway
> and this kind of change actually is going to simplyfy the code there, I
> think,
> a bit.
>
> Just kill the old device major suddenly and place it in the changelog
> of the new kernel that the user should mknod and add it to /dev/fstab
> before rebooting into the new kernel. Hey that's developement anyway :-)
> If the developer boots back into the old kernel just other mounts
> in /dev/fstab will fail no problem for transition here in sight...

Make them finally use UUIDs and /proc/partitions. Except for the root fs,
problem solved. (Well ok, someone needs to create the right device nodes.)

As for the root fs, even that part isn't hard with an initrd.

MfG Kai

2001-03-28 00:09:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: Larger dev_t



On Tue, 27 Mar 2001, Alan Cox wrote:
>
> A major for 'disk' generically makes total sense. Classing raid controllers
> as 'scsi' isnt neccessarily accurate. A major for 'serial ports' would also
> solve a lot of misery

Exactly. It's just that for historical reasons, I think the major for
"disk" should be either the old IDE or SCSI one, which just can show more
devices. That way old installers etc work without having to suddenly start
knowing about /dev/disk0.

But hey, maybe I'm wrong.

Linus

2001-03-28 00:26:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: Larger dev_t



On Tue, 27 Mar 2001, H. Peter Anvin wrote:
>
> They would still have to change, since now we'd have to worry about
> /dev/hd* having changed meanings;

This is why I'd select the SCSI major, which has always had more of a
"random disk" connotation, with fewer people being aware of the fact that
it's specifically IDE.

> also, you now cannot create a
> backward-compatible /dev since /dev/hdc is (22,0), etc, in the current
> scheme. The SCSI scheme is also not acceptable; it has been a
> long-standing problem that it doesn't allow enough partitions per disk.

Note that neither of these are really problematic, for the simple reason
that once you do mapping, the m:n mapping pretty much automatically falls
out of this. It's actually hard to think of a mapping that wouldn't allow
multiple major numbers to be mapped to the same devices (and in different
ways).

For example, it is not hard at all to have a IDE disk show up in three
places: the traditional /dev/hdx place, as /dev/sdx (the SCSI CD-ROM
emulation already ends up doing this, I think) _and_ potentially as a "new
and improved" non-backwards-compatibility place which would be /dev/diskx
and would take advantage of the larger minor number space.

For example, it would probably not be a bad idea to have something
explicitly in "high" major number space that would be something like

/dev/disk<n>p<m>

where <n> would be the disk number, and <m> would be the partition number,
and just map it to <major=256>, <minor=(n<<8)+m>. Old installers would
still see the device, but couldn't access more than 15/63 partitions (for
SCSI/IDE numbering respectively).

And the thing is, this would not complicate the mapping. The only worry
would be one of virtual aliases, but kdev_t should pretty much take care
of that.

So while we probably eventually want to switch everything over to a new
"disk n" numbering scheme, but for backwards compatibility reasons the
old numbers won't go away (and knowing how some people work and administer
their sites, they'll stay with us for a _loong_ time and will need to be
supported even with new drivers that don't actually share any code with
"IDE" or "SCSI").

So the IDE/SCSI numbers will have to stay. And they'll have to be
considered "supported", not just "old compatibility stuff that new drivers
don't have to care about". But done right, none of this will be _visible_
to drivers, so it should not add any ugliness. It shouldn't even be
visible to the mapping layer, except as yet another mapping (that just
happens to alias with other mappings).

Linus

2001-03-28 00:13:54

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Larger dev_t

Linus Torvalds wrote:
>
> On Tue, 27 Mar 2001, Alan Cox wrote:
> >
> > A major for 'disk' generically makes total sense. Classing raid controllers
> > as 'scsi' isnt neccessarily accurate. A major for 'serial ports' would also
> > solve a lot of misery
>
> Exactly. It's just that for historical reasons, I think the major for
> "disk" should be either the old IDE or SCSI one, which just can show more
> devices. That way old installers etc work without having to suddenly start
> knowing about /dev/disk0.
>
> But hey, maybe I'm wrong.
>

They would still have to change, since now we'd have to worry about
/dev/hd* having changed meanings; also, you now cannot create a
backward-compatible /dev since /dev/hdc is (22,0), etc, in the current
scheme. The SCSI scheme is also not acceptable; it has been a
long-standing problem that it doesn't allow enough partitions per disk.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-03-28 00:29:54

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: Larger dev_t

Andrew Pimlott writes:
> On Tue, Mar 27, 2001 at 02:13:47PM -0800, H. Peter Anvin wrote:

>> The problems with devfs (other than kernel memory bloat, which is pretty
>> much guaranteed to be much worse than the bloat a larger dev_t would
>> entail) is that it needs complex auxilliary mechanisms to make
>> "chmod /dev/foo" work as expected (the change to /dev/foo is to be
>> permanent, without having to edit some silly config file)
>
> The elegant solution seems obvious to me. What we have today is two
> namespaces--device major/minor, and filesystem--that are bridged by
> special files. Special files live in the filesystem namespace and
> point into the major/minor namespace. Objects in the major/minor
> namespace are directly accessible only by root (ie, only root can
> mknod(2)); but when accessed through special files, access control
> comes from the special file.
>
> The concept that makes this work is that the special file is a
> "pointer with permissions". To make devfs work, you want the same
> thing--except a pointer into filesystem space, not major/minor
> space. Unix doesn't have this, but it would be a simple cross of
> symlinks (pointer living in the filesystem and pointing into the
> filesystem) and special files (pointers with permissions).
>
> To be concrete: You'd have a root-only (or perhaps the directories
> could be a+rx--but minimal policy) hierarchy under /devices, and the
> admin would populate /dev with "special symlinks" that point into
> /devices, and give the appropriate permissions to users.

This can be done with an lchmod() and support for setuid symlinks.

Read can see where the link points
Write ignored, or XOR the on-disk data with 0222 and...?
Execute can follow the link
Setuid link followed as for the owner
Setgid link followed as for the owner's group
Sticky reserved for future use

Then you get:

lr-sr-xr-x 1 root root 17 Mar 21 2000 /dev/null -> /devices/mem/null

2001-03-28 01:03:26

by Paul Jakma

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, 27 Mar 2001, Dan Hollis wrote:

> On Tue, 27 Mar 2001, H. Peter Anvin wrote:
> > c) Make sure chown/chmod/link/symlink/rename/rm etc does the right thing,
> > without the need for "tar hacks" or anything equivalently gross.
>
> write-through filesystem, like overlaying a r/w ext2 on top of an iso9660
> fs.

functionality to do this is in devfs and devfsd already.

there are 2 ways:

1. devfs on /dev, maintain state in /dev-state.
2. regular ext2 (devfs naming style) /dev and devfs mounted on, eg,
/devfs for hotplug and module load/unload updates

1:

/dev-state -> regular ext2
/dev -> devfs

use the CREATE, CHANGE and REGISTER hooks that devfs has to
allow devfsd to transparently copy changes in /dev to the ext2
/dev-state. See the example devfsd.conf for appropriate entries, eg:

REGISTER .* COPY /dev-state/$devname $devpath
CHANGE .* COPY $devpath /dev-state/$devname
CREATE .* COPY $devpath /dev-state/$devname

2:

/dev -> regular ext2
/devfs -> devfs

use the devfs hooks for REGISTER and UNREGISTER to have devfsd update
the static /dev whenever hotplug events occur. Eg:

REGISTER .* COPY ${mntpnt}/$devname /dev/$devname
UNREGISTER .* CFUNCTION GLOBAL unlink /dev/$devname

seems to work for me:

[root@fogarty /devfs]# ls -l /dev{,fs}/misc/nvram
ls: /dev/misc/nvram: No such file or directory
ls: /devfs/misc/nvram: No such file or directory
[root@fogarty /devfs]# modprobe nvram
[root@fogarty /devfs]# ls -l /dev{,fs}/misc/nvram
crw-r----- 1 root root 10, 144 Jan 1 1970 /devfs/misc/nvram
crw-r----- 1 root root 10, 144 Mar 28 01:56 /dev/misc/nvram
[root@fogarty /devfs]# rmmod nvram
[root@fogarty /devfs]# ls -l /dev{,fs}/misc/nvram
ls: /dev/misc/nvram: No such file or directory
ls: /devfs/misc/nvram: No such file or directory

> -Dan

i prefer option 2 as /dev state is then not dependent on devfsd being
there and it just sidesteps the whole permissions issue. if devfsd
doesn't start then i still have a fully functional /dev.

but anyway... there seems to be loads of scope to do lots of
different things with devfsd, plus NIS support. :)

regards,
--
Paul Jakma [email protected] [email protected]
PGP5 key: http://www.clubi.ie/jakma/publickey.txt
-------------------------------------------
Fortune:
Premature optimization is the root of all evil.
-- D.E. Knuth

2001-03-28 01:36:56

by Alexander Viro

[permalink] [raw]
Subject: Re: Larger dev_t



On Wed, 28 Mar 2001, Paul Jakma wrote:

> On Tue, 27 Mar 2001, Dan Hollis wrote:
>
> > On Tue, 27 Mar 2001, H. Peter Anvin wrote:
> > > c) Make sure chown/chmod/link/symlink/rename/rm etc does the right thing,
> > > without the need for "tar hacks" or anything equivalently gross.
> >
> > write-through filesystem, like overlaying a r/w ext2 on top of an iso9660
> > fs.
>
> functionality to do this is in devfs and devfsd already.

Guys, before you get all hot and excited about devfsd - had _anyone_
audit the protocol implementation for races? Or test it for heavy
(un)loading drivers, for that matter. We had a long history of autofs
races and devfsd is not simpler.

Coll toys are cool toys, but I wouldn't bet a dime on devfsd ability to
deal with adding/removing entries 100% correct in all cases. And /dev
has slightly larger user base than autofs, so it _is_ a sensitive area.

In its current form devfs itself _still_ contains known races. Known
since last Summer. Adding devfsd into the mix doesn't make the picture
prettier. Unless some devfs proponent is willing to do such analysis
all references to devfsd are nothing but wishful thinking.

2001-03-28 02:18:23

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> Exactly. It's just that for historical reasons, I think the major for
> "disk" should be either the old IDE or SCSI one, which just can show more
> devices. That way old installers etc work without having to suddenly start
> knowing about /dev/disk0.

They will mostly break. Installers tend to parse /proc/scsi and have fairly
complex ioctl based relationships based on knowing ide v scsi.

/dev/disc/ is a little un-unix but its clean



2001-03-28 04:00:08

by Johan Kullstam

[permalink] [raw]
Subject: Re: Larger dev_t

"H. Peter Anvin" <[email protected]> writes:

> Alan Cox wrote:
> >
> > > Another example: all the stupid pseudo-SCSI drivers that got their own
> > > major numbers, and wanted their very own names in /dev. They are BAD for
> > > the user. Install-scripts etc used to be able to just test /dev/hd[a-d]
> > > and /dev/sd[0-x] and they'd get all the disks. Deficiencies in the SCSI
> >
> > Sorry here I have to disagree. This is _policy_ and does not belong in the
> > kernel. I can call them all /dev/hdfoo or /dev/disc/blah regardless of
> > major/minor encoding. If you dont mind kernel based policy then devfs
> > with /dev/disc already sorts this out nicely.
> >
> > IMHO more procfs crud is also not the answer. procfs is already poorly
> > managed with arbitary and semi-random namespace. Its a beautiful example of
> > why adhoc naming is as broken as random dev_t allocations. Maybe Al Viro's
> > per device file systems solve that.
> >
>
> In some ways, they make matters worse -- now you have to effectively keep
> a device list in /etc/fstab. Not exactly user friendly.
>
> devfs -- in the abstract -- really isn't that bad of an idea; after all,
> device names really do specify an interface. Something I suggested also,
> at some point, was to be able to pass strings onto character device
> drivers (so that if /dev/foo is a char device, /dev/foo/bar would access
> the same device with the string "bar" passed on to the device driver --
> this would help deal with "same device, different options" such as
> /dev/ttyS0 versus /dev/cua0 -- having flags to open() is really ugly
> since there tends to be no easy way to pass them down through multiple
> layers of user-space code.)
>
> The problems with devfs (other than kernel memory bloat, which is pretty
> much guaranteed to be much worse than the bloat a larger dev_t would
> entail) is that it needs complex auxilliary mechanisms to make
> "chmod /dev/foo" work as expected (the change to /dev/foo is to be
> permanent, without having to edit some silly config file)

the permanent storage for a PC computer is naturally the hard disk.
you could always make a device partition to store persistant state. i
think a few megabytes should be enough. it could be substatially less
if you had good defaults and disk storage was only used to override
the default.

of course, using disk brings us full circle back to device nodes on
filesystem. the impetus behind devfs was never (afaict) saving disk
space or getting around slow disk access. people want device nodes to
appear automatically and go away again when drivers are removed.

i think what all this means is that between kernel and collection of the
user space programs the filesystem semantics just doesn't have enough
going for it in order to do all that you want with devices.

it might be a mostly userspace solvable problem. a device daemon
could create new devices on the fly, only they'd be ordinary
filesystem devices. for example it might be better to hack ls to not
show dormant devices. a cronjob could call a grim device reaper to
cull nodes not used for a long time...

what do other vaguely unix-like systems do? does, say, plan9 have a
better way of dealing with all this?

--
J o h a n K u l l s t a m
[[email protected]]

2001-03-28 04:24:38

by Alexander Viro

[permalink] [raw]
Subject: Re: Larger dev_t



On 27 Mar 2001, Johan Kullstam wrote:

> it might be a mostly userspace solvable problem. a device daemon
> could create new devices on the fly, only they'd be ordinary
> filesystem devices. for example it might be better to hack ls to not
> show dormant devices. a cronjob could call a grim device reaper to
> cull nodes not used for a long time...

Why the hell do we have to reinvent every wheel out there? Especially
when we _already_ have that wheel... You don't need to hack ls in order
to deal with "dormant" entries. That's what autofs is for. Doing that
quite fine, thank you very much. Furrfu...

> what do other vaguely unix-like systems do? does, say, plan9 have a
> better way of dealing with all this?

Plan9 doesn't have devfs. You can say
bind -a #f /dev
and that gives you /dev/fd[0-3]{disk,ctl}. I.e. /dev is a union of
device filesystems you've mounted (bound, actually) there. Combined
with autofs (and replacing bind #f with mount -t devfloppy) it's
pretty much what we could use.

Think of devpts - it's the same kind of beast. As for putting stuff into
/etc/fstab - not funny. /etc/auto_dev would do just fine, especially
if /etc/fstab would contain the minimal (==permanent) subset. Nothing stops
you from examining /lib/modules/ and updating /etc/auto_dev from rc scripts
upon boot.

2001-03-28 07:10:52

by Andre Hedrick

[permalink] [raw]
Subject: Re: Larger dev_t

On Wed, 28 Mar 2001, Alan Cox wrote:

> They will mostly break. Installers tend to parse /proc/scsi and have fairly
> complex ioctl based relationships based on knowing ide v scsi.
>
> /dev/disc/ is a little un-unix but its clean

Then make a '/proc/block/{ide|scsi|raid|wtf|ram|net}' which has a string
name for a real mknod thingy that includes the major|minor of the animal.

Is this simpler than the problem?


Andre Hedrick
Linux ATA Development
ASL Kernel Development
-----------------------------------------------------------------------------
ASL, Inc. Toll free: 1-877-ASL-3535
1757 Houret Court Fax: 1-408-941-2071
Milpitas, CA 95035 Web: http://www.aslab.com

2001-03-28 11:53:12

by Pjotr Kourzanoff

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, 27 Mar 2001, Linus Torvalds wrote:

> ... lots of stuff removed ...
>
> So in /dev, there are two problems: we are getting painfully close to
> major numbers with 8 bits, and we've run out of minors several times. In
> fact, a lot of the reason for the dearthness of major numbers is the fact
> that we use multiple majors for some stuff that really wants many minors.
>

Well, one solution to this on the long term comes quite naturally:
make major/minor separation a policy, i.e., as in network/host
IP numbers. That is, in addition to 32-bit dev_t there would
be a 32-bit devmask_t that when &-ed with dev_t would yield the
major. The default, for compatibility reasons will be devmask of 255
(or -1+1<<12). For example, all floppies would fall into 2/8 major
(using IP address notation). Obviously, all this is not advantageous
on the short term, but in the absence of better ideas, this can give
some flexibility when allocating/shifting to new majors/minors...Of
course, namespaces approach is even better, but when will all drivers
be converted to support them?

Cheers,

Pjotr


2001-03-28 12:04:02

by Jesse Pollard

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, 27 Mar 2001, Johan Kullstam wrote:
>"H. Peter Anvin" <[email protected]> writes:
>
>> Alan Cox wrote:
>> >
>> > > Another example: all the stupid pseudo-SCSI drivers that got their own
>> > > major numbers, and wanted their very own names in /dev. They are BAD for
>> > > the user. Install-scripts etc used to be able to just test /dev/hd[a-d]
>> > > and /dev/sd[0-x] and they'd get all the disks. Deficiencies in the SCSI
>> >
>> > Sorry here I have to disagree. This is _policy_ and does not belong in the
>> > kernel. I can call them all /dev/hdfoo or /dev/disc/blah regardless of
>> > major/minor encoding. If you dont mind kernel based policy then devfs
>> > with /dev/disc already sorts this out nicely.
>> >
>> > IMHO more procfs crud is also not the answer. procfs is already poorly
>> > managed with arbitary and semi-random namespace. Its a beautiful example of
>> > why adhoc naming is as broken as random dev_t allocations. Maybe Al Viro's
>> > per device file systems solve that.
>> >
>>
>> In some ways, they make matters worse -- now you have to effectively keep
>> a device list in /etc/fstab. Not exactly user friendly.
>>
>> devfs -- in the abstract -- really isn't that bad of an idea; after all,
>> device names really do specify an interface. Something I suggested also,
>> at some point, was to be able to pass strings onto character device
>> drivers (so that if /dev/foo is a char device, /dev/foo/bar would access
>> the same device with the string "bar" passed on to the device driver --
>> this would help deal with "same device, different options" such as
>> /dev/ttyS0 versus /dev/cua0 -- having flags to open() is really ugly
>> since there tends to be no easy way to pass them down through multiple
>> layers of user-space code.)
>>
>> The problems with devfs (other than kernel memory bloat, which is pretty
>> much guaranteed to be much worse than the bloat a larger dev_t would
>> entail) is that it needs complex auxilliary mechanisms to make
>> "chmod /dev/foo" work as expected (the change to /dev/foo is to be
>> permanent, without having to edit some silly config file)
>
>the permanent storage for a PC computer is naturally the hard disk.
>you could always make a device partition to store persistant state. i
>think a few megabytes should be enough. it could be substatially less
>if you had good defaults and disk storage was only used to override
>the default.
>
>of course, using disk brings us full circle back to device nodes on
>filesystem. the impetus behind devfs was never (afaict) saving disk
>space or getting around slow disk access. people want device nodes to
>appear automatically and go away again when drivers are removed.
>
>i think what all this means is that between kernel and collection of the
>user space programs the filesystem semantics just doesn't have enough
>going for it in order to do all that you want with devices.
>
>it might be a mostly userspace solvable problem. a device daemon
>could create new devices on the fly, only they'd be ordinary
>filesystem devices. for example it might be better to hack ls to not
>show dormant devices. a cronjob could call a grim device reaper to
>cull nodes not used for a long time...
>
>what do other vaguely unix-like systems do? does, say, plan9 have a
>better way of dealing with all this?

IRIX allows the partition (logical volume identification) to list
every volumn name as a device. Then you can mount via the name. Disks
can be relocated while the system is shutdown and the mounts still
work after boot.

My suggestion would be to add a filesystem label (optional) to the
homeblock of all filesystmes, then load that identifier into the
/proc/partitions file. This would allow a search to locate the
device parameters for any filesystem being mounted. If the label
is unavailable, then it must be mounted manually or via the current
structure. This would work for floppy/CD/DVD (although SCSI versions
would have a relocation problem for these devices).

--
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2001-03-28 12:09:32

by Tim Jansen

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, Mar 27, 2001 at 10:48:13AM -0800, Linus Torvalds wrote:
> So in /dev, there are two problems: we are getting painfully close to
> major numbers with 8 bits, and we've run out of minors several times. In
> fact, a lot of the reason for the dearthness of major numbers is the fact
> that we use multiple majors for some stuff that really wants many minors.

Are the major/minor numbers neccessary at all? They were a kludge to
represent devices in the file system because there wasnt something like
DevFS in the dark ages of Unix. But if you take DevFS as given you can get
rid of all these problems that major/minor numbers cause, just register your
file_operation structure directly instead of your maj/min pair (plus, maybe,
some private data that replaces any information that was encoded in the
minor number).

While DevFS may not be very popular this could actually reduce the amount of
code in the kernel, which was one of the main arguments against DevFS AFAIK.
And if you dont like the fact that DevFS assignes the file names you could
mount DevFS in a different directory and use symlinks from /dev to the
device tree (like Solaris does?).


bye...

2001-03-28 17:01:42

by Jeff Randall

[permalink] [raw]
Subject: Re: Larger dev_t

Russell King wrote:
> I for one would like to see a major number for all 'serial ports' whether
> they be embedded ARM serial ports _or_ standard 16550 ports, but at the
> moment its not easily acheivable without introducing more mess.
>
> Ted indicated to me a while ago (just after I wrote serial_core.c for
> yet-another-type-of-ARM-serial-port) his visions of the direction serial
> stuff should take in 2.5; this is obviously one of the things that I'm
> keen to discuss and solve in 2.5.

A change to a 12:20 major:minor dev_t would be a great help for the various
serial drivers that I write and help maintain. We currently as a company
maintain 4 different serial device drivers for linux and all of them
currently use between 4 and 10 majors in order to have enough raw minors
available to identify the maximum port count supported. We had to do the
same think on SunOS (which also has 8:8) in order to support reasonable port
counts there. I'd absoultely love the ability to get back on a single major
per driver.

I'd like to see all of the serial drivers shipped in the kernel tree be
configured by default to use the same major.. but I wouldn't want to have
external drivers forced onto that major as well.

--
Jeff Randall - [email protected] "A paranoid person is never alone,
he knows he's always the center
of attention..."

2001-03-28 18:14:47

by Oliver Neukum

[permalink] [raw]
Subject: Re: Larger dev_t

> My suggestion would be to add a filesystem label (optional) to the
> homeblock of all filesystmes, then load that identifier into the
> /proc/partitions file. This would allow a search to locate the
> device parameters for any filesystem being mounted. If the label
> is unavailable, then it must be mounted manually or via the current
> structure. This would work for floppy/CD/DVD (although SCSI versions
> would have a relocation problem for these devices).

And what would you do if the names collide ?
This might work for drives with unique identifiers in hardware, but for
anything else it is a nice addition, but I wouldn't identify an essential
partition that way. Furthermore you need to address removable media. There a
way to specify a drive opposed to a filesystem or medium is needed.

Regards
Oliver

2001-03-28 19:06:29

by Jesse Pollard

[permalink] [raw]
Subject: Re: Larger dev_t

Oliver Neukum <[email protected]>:
>
> > My suggestion would be to add a filesystem label (optional) to the
> > homeblock of all filesystmes, then load that identifier into the
> > /proc/partitions file. This would allow a search to locate the
> > device parameters for any filesystem being mounted. If the label
> > is unavailable, then it must be mounted manually or via the current
> > structure. This would work for floppy/CD/DVD (although SCSI versions
> > would have a relocation problem for these devices).
>
> And what would you do if the names collide ?

refuse to mount - give the admin time to fix them in single user mode
changing a volumn name only should not be prevented. How to fix... let
the admin look in the /proc/partitions, take one (I'd pick the second
one seen) and change its name. Mount the first using the devfs associated
name and verify that the contents are what is expected. Mount the second
and see what it should be. This situation should only occur via a dd copy
of an entire volumn; the procedure on copying should include changing the
copied volumn name... This is almost equivalent to having multiple mirror
partitions, in which case a "mount the first seen" would be reasonable.

> This might work for drives with unique identifiers in hardware, but for
> anything else it is a nice addition, but I wouldn't identify an essential
> partition that way. Furthermore you need to address removable media. There a
> way to specify a drive opposed to a filesystem or medium is needed.

I didn't mean to say that there should be NO way to reach a specific drive.
There should be a devfs entry that corresponds to the entries in the
/proc/partitions list. This is what I think mount should do anyway.
First search the /proc/partitions list for the volumn; then use the
associated entry in devfs to actually do the mount. It's just a way
to allow the reorganization of volume to device name mapping.

I'm still thinking about how the root filesystem could be mounted during
boot where devfs and /proc are not yet mounted.

There should be a similar way to map removable media devices (even if it
takes using device serial numbers) to fixed device names. That way a
symbolic link could be created to point to the correct physical device:

ie: I want my SCSI tape drive (serial number 06408-XXX) to be called "tape"

locate the serial number in /proc/scsi/scsi. use devfs name that
corresponds to this device (scsi2/target 6/lun/00 or similar) and
create a symbolic link for it. This does assume that the serial number or
equivalent is available to be searched for. It also assumes that the
devfs name can be derived from the entry in /proc/scsi/scsi (or where ever
the specification ends up).

Is this reasonable? Perhaps not for small systems, but when lots of dynamic
devices are available it is needed

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2001-03-28 19:52:12

by Oliver Neukum

[permalink] [raw]
Subject: Re: Larger dev_t

> > And what would you do if the names collide ?
>
> refuse to mount - give the admin time to fix them in single user mode

That means that it could only be used for optional filesystems otherwise
booting unattended is put into question.
A user set for a practical joke could prevent booting by leaving a medium in
the drive. You could add options for not considering removable media, etc,
but you get to a stage where you design workarounds. That'd be bad for core
filesystems. Thus the need for a second improved system remains.

Aside from that adding the name to /proc/partions is a good idea but not
universally usable.

> I'm still thinking about how the root filesystem could be mounted during
> boot where devfs and /proc are not yet mounted.

Enable the kernel command line to understand devfs names.

> locate the serial number in /proc/scsi/scsi. use devfs name that
> corresponds to this device (scsi2/target 6/lun/00 or similar) and
> create a symbolic link for it. This does assume that the serial number or
> equivalent is available to be searched for. It also assumes that the

This is the problem. I wouldn't trust it.

> Is this reasonable? Perhaps not for small systems, but when lots of dynamic
> devices are available it is needed

It is reasonable. GUIs could use a unified way to learn volume names.

Regards
Oliver

2001-03-28 21:08:44

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

"H. Peter Anvin" wrote:
>
> This is my opinion on the issue. Short summary: "I'm sick of the
> administrative burden associated with keeping dev_t dense."
>
> Linus Torvalds wrote:
> >
> > And let's take a look at /dev. Do a "ls -l /dev" and think about it. Every
> > device needs a unique number. Do you ever envision seeing that "ls -l"
> > taking about 500 billion years to complete? I don't. I don't think you do.
> > But that's how ludicrous a 64-bit device number is.
> >
>
> That's how ludicrous a *dense* 64-bit device number is. I have to say I
> disagree with you that sparse number spaces are a bad idea. The
> IPv4->IPv6 transition people have looked at the issues of number spaces
> and how much harder they get to keep dense when the size of the
> numberspace grows, because your lookup operation becomes so much more
> painful. Any time you have to take a larger number space and squeeze it
> into a smaller number space, you get some serious pain.
>
> Part of the reason we haven't -- quite -- run out of 8-bit majors yet is
> because I have been an absolute *bastard* with registrants lately. It
> would cut down on my workload if I could assign majors without worrying
> too much about whether or not that particular driver is really going to
> be made public.
>
> 64 bits is obviously excessive, but I really don't feel comfortable
> saying that only 12 bits of major is sufficient. 16 I would buy, but I
> don't think 16 bits of minor is sufficient. Given that, it seems to me
> -- especially since dev_t isn't exactly the most accessed data type in
> the universe -- that the conceptual simplicity of keeping the major and
> minor separate in individual 32-bit words really is just as well. YES,
> it's overengineering, but the cost is very small; the cost of
> underengineering is having to go through yet another painful transition.
> Unfortunately, the Linux community seems to have some serious problems
> with getting system-wide transitions to happen, especially the ones that
> involve ABI changes. This needs to be taken into account.
>
> -hpa

Then just tell me please why the PCI name space is just 32 bit?

Majros are for drivers Minors are for device driver instances
(yes linux does split minors in a stiupid way by forexample
using the same major for IDE disks and ide CD-ROM, which are in
fact compleatly different devices just sharing driver code...
(Dirrerent block sizes, different interface protokoll and so on....)


Those are the reaons solaris is using a split 24/12 (Major/Minor)
and they don't have our problems here.

>
> --
> <[email protected]> at work, <[email protected]> in private!
> "Unix gives you enough rope to shoot yourself in the foot."
> http://www.zytor.com/~hpa/puzzle.txt
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
- phone: +49 214 8656 283
- job: eVision-Ventures AG, LEV .de (MY OPINIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R

2001-03-28 21:26:44

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Larger dev_t

Martin Dalecki wrote:
> >
> > devfs -- in the abstract -- really isn't that bad of an idea; after all,
>
> Devfs is from a desing point of view the duplication for the bad /proc
> design for devices. If you need a good design for general device
> handling with names - network interfaces are the thing too look at.
> mount() should be more like a select()... accept()!
>

And what on earth makes this better? I have always thought the socket
interface to be hideously ugly and full of ad-hockery. Its abstractions
for handle multiple address families by and large don't work, and it
introduces new system calls left, right and center -- sometimes for good
reasons, but please do tell me why I can't open() an AF_UNIX socket, but
have to use a special system call called connect() instead.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-03-28 21:25:54

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

"H. Peter Anvin" wrote:
>
> Alan Cox wrote:
> >
> > > Another example: all the stupid pseudo-SCSI drivers that got their own
> > > major numbers, and wanted their very own names in /dev. They are BAD for
> > > the user. Install-scripts etc used to be able to just test /dev/hd[a-d]
> > > and /dev/sd[0-x] and they'd get all the disks. Deficiencies in the SCSI
> >
> > Sorry here I have to disagree. This is _policy_ and does not belong in the
> > kernel. I can call them all /dev/hdfoo or /dev/disc/blah regardless of
> > major/minor encoding. If you dont mind kernel based policy then devfs
> > with /dev/disc already sorts this out nicely.
> >
> > IMHO more procfs crud is also not the answer. procfs is already poorly
> > managed with arbitary and semi-random namespace. Its a beautiful example of
> > why adhoc naming is as broken as random dev_t allocations. Maybe Al Viro's
> > per device file systems solve that.
> >
>
> In some ways, they make matters worse -- now you have to effectively keep
> a device list in /etc/fstab. Not exactly user friendly.
>
> devfs -- in the abstract -- really isn't that bad of an idea; after all,

Devfs is from a desing point of view the duplication for the bad /proc
design for devices. If you need a good design for general device
handling with names - network interfaces are the thing too look at.
mount() should be more like a select()... accept()!

2001-03-28 21:33:55

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

Alan Cox wrote:
>
> > high-end-disks. Rather the reverse. I'm advocating the SCSI layer not
> > hogging a major number, but letting low-level drivers get at _their_
> > requests directly.
>
> A major for 'disk' generically makes total sense. Classing raid controllers
> as 'scsi' isnt neccessarily accurate. A major for 'serial ports' would also
> solve a lot of misery

And IDE disk ver CD-ROM f and block vers. raw devices
and so so at perpetuum. Those are the reaons why the
density of majros ver. minors is exactly
revers in solaris with respect to the proposal of Linus..

And then we have all those VERY SPARSE static arrays of
major versus minor devices information (if you look at which cells
from those arrays are used on a running system which maybe about
6-8 devices actually attached!)

The main sheer practical problem to changing kdev_t is
the HUGE number of in fact entierly differnt drivers sharing the same
major
and splitting up the minor number space and then hooking
devices with differnt block sizes and such on the same major.
Many things in the block device layer handling could
be simplefied significalty if one could assume for
example that all the devices on one single major
have the same block size and so on...

2001-03-28 21:43:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Larger dev_t

Martin Dalecki wrote:
>
> Then please please please demangle other cases as well!
> IDE is the one which is badging my head most. SCSI as well...
>
> Granted I wouldn't mind a rebot with new /dev/* once!
>

This seems to me to really be the kind of thing devfs does better than
trying to play number games. devfs (and I'm talking in the abstract, not
necessarily the existing implementation) can present things in multiple
views, using hard links. This is a Good Thing, because it lets you ask
different questions and get appropriate answers (one question is "what
are my disks", another is "what are my SCSI devices".)

As far as IDE is concerned, I repeat my call for "generic ATAPI" to go
along with "generic SCSI"...

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-03-28 21:38:34

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

Linus Torvalds wrote:
>
> On Tue, 27 Mar 2001, Andre Hedrick wrote:
> >
> > Am I hearing you state you want dynamic device points and dynamic majors?
>
> Yes and no.
>
> We need static structures for user space - from a user perspective it
> makes a ton more sense to say "I want to see all disks" than it does to
> know that you have to do /dev/hd*, /dev/sd* plus all the extra magic
> combinations that can happen (USB etc).
>
> So in a sense what I'm arguing for is for _stricter_ device numbers to the
> outside world.
>
> But internally, it would be reasonably easy to make a mapping from those
> user-visible numbers to a much looser version.
>
> One example of this is going to happen very early in 2.5.x: the whole
> "partitioning" stuff is going to go away from the driver, and into the
> ll_rw_block layer as just another disk re-mapping thing. We already do
> those kinds of re-mappings for LVM reasons anyway, and partitioning is not
> something a disk driver should know about, really.
>
> And that kind of partitioning mapping automatically means that we'd need
> to remap minor numbers, and do it on a per-major basis (because the
> partitioning mapping right now is not actually the same between SCSI and
> IDE: IDE uses six bits of partitioning, while SCSI uses just four bits).
> And once you do that, you might as well start "remapping" major numbers
> too.
>
> So let's say that you have two separate SCSI controllers - they would both
> show up on major #8, and different minor numbers. Right now, for example,
> controller 1 might have one disk, with minors 0-15 (for the whole disk and
> 15 partitions), and controller 2 might have two disks using minors 16-47.
>
> As it stands now, the SCSI layer needs to do the remapping, and because
> the SCSI layer does the remapping, nothing but SCSI layer devices can use
> major #8.
>
> But once you start doing partition mapping in ll_rw_block.c, you might as
> well get rid of the notion that "SCSI is major 8". You could easily have
> many different drivers, with many different queues, and remap them all to
> have major 8 (and different minors) so that it looks simple for a user
> that just wants to see SCSI disks.
>
> Which is not to say that the same disk might not show up somewhere else
> too, if anybody wants it to. The _driver_ should just know "unit x on
> queue y", and then the driver might do whatever it wants (it might be, for
> example, that the driver actually wants to show multiple controllers as
> one queue, if the driver really wants to for some reason). And it should
> be possible to have two drivers that really have no idea at ALL about each
> other to just share the same major numbers.

Then please please please demangle other cases as well!
IDE is the one which is badging my head most. SCSI as well...

Granted I wouldn't mind a rebot with new /dev/* once!

2001-03-28 21:47:45

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

Alan Cox wrote:
>
> > Exactly. It's just that for historical reasons, I think the major for
> > "disk" should be either the old IDE or SCSI one, which just can show more
> > devices. That way old installers etc work without having to suddenly start
> > knowing about /dev/disk0.
>
> They will mostly break. Installers tend to parse /proc/scsi and have fairly
> complex ioctl based relationships based on knowing ide v scsi.
>
> /dev/disc/ is a little un-unix but its clean

Why do you worry about installers? New distro - new kernel - new
installer
that's they job to worry about it. They will change the installer anyway
and this kind of change actually is going to simplyfy the code there, I
think,
a bit.

Just kill the old device major suddenly and place it in the changelog
of the new kernel that the user should mknod and add it to /dev/fstab
before rebooting into the new kernel. Hey that's developement anyway :-)
If the developer boots back into the old kernel just other mounts
in /dev/fstab will fail no problem for transition here in sight...

2001-03-28 21:47:45

by Andre Hedrick

[permalink] [raw]
Subject: Re: Larger dev_t

On Wed, 28 Mar 2001, Martin Dalecki wrote:

> Then please please please demangle other cases as well!
> IDE is the one which is badging my head most. SCSI as well...
>
> Granted I wouldn't mind a rebot with new /dev/* once!

diff -urN linux-2.4.3-p8-pristine/include/linux/major.h linux-2.4.3-p8/include/linux/major.h
--- linux-2.4.3-p8-pristine/include/linux/major.h Sat Dec 30
11:23:14 2000+++ linux-2.4.3-p8/include/linux/major.h Sun Mar 25
22:16:42 2001
@@ -171,4 +171,18 @@
return SCSI_BLK_MAJOR(m);
}

+/*
+ * Tests for IDE devices
+ */
+#define IDE_DISK_MAJOR(M) ((M) == IDE0_MAJOR || (M) == IDE1_MAJOR || \
+ (M) == IDE2_MAJOR || (M) == IDE3_MAJOR || \
+ (M) == IDE4_MAJOR || (M) == IDE5_MAJOR || \
+ (M) == IDE6_MAJOR || (M) == IDE7_MAJOR || \
+ (M) == IDE8_MAJOR || (M) == IDE9_MAJOR)
+
+static __inline__ int ide_blk_major(int m)
+{
+ return IDE_DISK_MAJOR(m);
+}
+
#endif

Well I banged my head and learned a scsi-trick....

Andre Hedrick
Linux ATA Development
ASL Kernel Development
-----------------------------------------------------------------------------
ASL, Inc. Toll free: 1-877-ASL-3535
1757 Houret Court Fax: 1-408-941-2071
Milpitas, CA 95035 Web: http://www.aslab.com

2001-03-28 21:50:55

by Alexander Viro

[permalink] [raw]
Subject: Re: Larger dev_t



On Wed, 28 Mar 2001, H. Peter Anvin wrote:

> Martin Dalecki wrote:
> > >
> > > devfs -- in the abstract -- really isn't that bad of an idea; after all,
> >
> > Devfs is from a desing point of view the duplication for the bad /proc
> > design for devices. If you need a good design for general device
> > handling with names - network interfaces are the thing too look at.
> > mount() should be more like a select()... accept()!
> >
>
> And what on earth makes this better? I have always thought the socket
> interface to be hideously ugly and full of ad-hockery. Its abstractions
> for handle multiple address families by and large don't work, and it
> introduces new system calls left, right and center -- sometimes for good
> reasons, but please do tell me why I can't open() an AF_UNIX socket, but
> have to use a special system call called connect() instead.

Aye. The real problem with mount is that it always had been pretty
heavy-weight. Especially mount(8). I've done some (very rough) testing
on my tree - for ramfs-style filesystem latency of mount(2) is about
20% worse than latency of open(2). And it definitely can be improved -
right now I'm interested in getting the code cleaned.

mount(8) is a problem, but in nosuid namespace we can seriously cut
down on checks in the thing. And I'm very interested in designs that
would allow killing /etc/mtab - dropping it would allow very easy
mounting.

2001-03-28 21:51:45

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

> what do other vaguely unix-like systems do? does, say, plan9 have a
> better way of dealing with all this?

Yes.

Normal UNIX has as well. For reffernece see: block ver raw
devices on docs.sun.com :-).

2001-04-02 20:02:36

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> So change them as well for a new distribution. What's there problem.
> There isn't anything out there you can't do by hand.
> Fortunately so!

So users cannot go back and forward between new and old kernels. Very good.
Try explaining that to serious production -users- of a system and see how
it goes down

2001-04-02 20:18:56

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Larger dev_t

OK - everybody back from San Jose - pity I couldnt come -
and it is no longer April 1st, so we can continue quarreling
a little.

Interesting that where I had divided stuff in the trivial part,
the interesting part and the lot-of-work part we already start
fighting on the trivial part. Maybe it is not very important -
still I'd prefer to do things right from the start.

Yes. We need a larger dev_t as everybody agrees.
How large?

What is dev_t used for? It is a communication channel from
filesystem to user space (via the stat() system call)
and from user space to filesystem (via the mknod() system call).

So, it seems the kernel interface must allow passing the values
that actually occur, in present or future file systems.
Making the interface narrow is only asking for problems later.
Are there already any filesystems that use 64-bits?
I would say that that is irrelevant - what we don't have today
may come tomorrow - but in fact the NFSv3 interface uses
a 64-bit device number.

So glibc comes with 64 bits, the kernel has to hand these bits
over to NFS but is unwilling to - you are not going to get
more than 32. Why?

> I have a holy crusade.

I fail to see the connection. There is no bloat here, the kernel
is hardly involved. Some values are passed. If the values are
larger than the filesystem likes it will probably return EINVAL.
But the kernel has no business getting in the way.

There is no matter of efficiency either - mknod is not precisely
the most frequently used system call, and our stat interface, which
really is important, is 64 bits today.

Not using 64 also gives interesting small problems with Solaris or
FreeBSD NFS mounts. One uses 14+18, the other 8+24, so with 12+20
we cannot handle Solaris' majors and we cannot handle FreeBSD's minors.

[Then there were discussions about naming.
These are interesting, but independent.
The current discussion is almost entirely about mknod.]

Andries

2001-04-02 21:44:50

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> Not using 64 also gives interesting small problems with Solaris or
> FreeBSD NFS mounts. One uses 14+18, the other 8+24, so with 12+20
> we cannot handle Solaris' majors and we cannot handle FreeBSD's minors.

Mount NFS device areas with NFSv2. Thats the standard workaround for the
fact the NFSv3 designers got a good idea slightly wrong. There are other
approaches too that also do not need 64bits.

2001-04-02 22:01:10

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Larger dev_t

> Mount NFS device areas with NFSv2. Thats the standard workaround

Oh, sure. We survived with 16 bits and we'll survive with 32.
Nevertheless it is a bad sign that you have to start talking
about workarounds even before the new system has been implemented.

(And NFSv2 has its quirks as well.
Solaris will split the 32-bit number (the size given in a CREATE
request) into 14+18 when it is not a 16-bit value, while it will
split it into 8+8 if it is. FreeBSD will regard it as a 8+24 dev_t.
So, in general, different systems will parse the same dev_t in
different ways, and hence see different (major,minor) for the
same device.)

Andries

2001-04-03 07:39:24

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

Alan Cox wrote:
>
> > So change them as well for a new distribution. What's there problem.
> > There isn't anything out there you can't do by hand.
> > Fortunately so!
>
> So users cannot go back and forward between new and old kernels. Very good.
> Try explaining that to serious production -users- of a system and see how
> it goes down

If anything I'm a *SERIOUS* production user. And I wouldn't allow
*ANYBODY* here to run am explicitly tagged as developement kernel
here anyway in an production enviornment. That's what releases are for
damn.
Or do you think that Linux should still preserve DOS compatibility
in to the eternity as other "popular" systems do?

2001-04-03 07:41:44

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

[email protected] wrote:
>
> OK - everybody back from San Jose - pity I couldnt come -
> and it is no longer April 1st, so we can continue quarreling
> a little.
>
> Interesting that where I had divided stuff in the trivial part,
> the interesting part and the lot-of-work part we already start
> fighting on the trivial part. Maybe it is not very important -
> still I'd prefer to do things right from the start.
>
> Yes. We need a larger dev_t as everybody agrees.
> How large?
>
> What is dev_t used for? It is a communication channel from
> filesystem to user space (via the stat() system call)
> and from user space to filesystem (via the mknod() system call).
>
> So, it seems the kernel interface must allow passing the values
> that actually occur, in present or future file systems.
> Making the interface narrow is only asking for problems later.
> Are there already any filesystems that use 64-bits?
> I would say that that is irrelevant - what we don't have today
> may come tomorrow - but in fact the NFSv3 interface uses
> a 64-bit device number.
>
> So glibc comes with 64 bits, the kernel has to hand these bits
> over to NFS but is unwilling to - you are not going to get
> more than 32. Why?
>
> > I have a holy crusade.
>
> I fail to see the connection. There is no bloat here, the kernel
> is hardly involved. Some values are passed. If the values are
> larger than the filesystem likes it will probably return EINVAL.
> But the kernel has no business getting in the way.
>
> There is no matter of efficiency either - mknod is not precisely
> the most frequently used system call, and our stat interface, which
> really is important, is 64 bits today.

I think the only reason for Linux to take 12 bit major is the
fact that then he only has to increas the lenght of the static
major device pointers in the kernel and it will be there...
However the problem is mostly that the aforementioned array
of pointers shouldn't me there in first place.

>
> Not using 64 also gives interesting small problems with Solaris or
> FreeBSD NFS mounts. One uses 14+18, the other 8+24, so with 12+20
> we cannot handle Solaris' majors and we cannot handle FreeBSD's minors.
>
> [Then there were discussions about naming.
> These are interesting, but independent.
> The current discussion is almost entirely about mknod.]
>
> Andries
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
- phone: +49 214 8656 283
- job: eVision-Ventures AG, LEV .de (MY OPINIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R

2001-04-03 10:10:13

by Ingo Oeser

[permalink] [raw]
Subject: Re: Larger dev_t

On Mon, Apr 02, 2001 at 10:17:02PM +0200, [email protected] wrote:
> What is dev_t used for? It is a communication channel from
> filesystem to user space (via the stat() system call)
> and from user space to filesystem (via the mknod() system call).

The question is WHAT do we communicate (and don't answer "major
minor" here, since this is only numbers) and WHY do we need this
communication.

Devfs aims to associate device names with dynamic, flat device
numbers. So we have a scalable solution for the kernel -> user
space communication. What we DON't have, is a similar simple way
to tell it the other way around.

The reasons, why we need to know where a file is located on are:
- to only include files from one media
- to run certain optimizations like fsck does with disk
spindles
- ...

So instead of just shifting the problems into the future and
making the same mistake again, we should better think of
interfaces, that give us the information we need and let this
error prone (ever had a typo on mknod?) and never large enough
static interface die.

Maybe there should be a way to translate a dynamic associated
device number into a real device name, like the devfs name of it.
May be a reverse mapping in devfs (/dev/by_dev_no/[0-9]+) would
work. If these are symlinks, a readlink() would suffice. Very
simple solution.

For comparing inode1.media == inode2.media (one of the most
important uses for device numbers) we don't need to change
anything.

For getting the device number of the spindle, the block devices
which support partitions or are remapping a (set of) block
device(s) could get IOCTLs (where this information belongs into
and is as reliable as the driver).

For all these things, we can have a flat and dynamic device
number namespace.

Device numbers have to be uniqe only during one power on -> run ->
power off cycle. For the rest applications should store device
names instead anyway. The applications, that don't are buggy by
defintion.

Note: I certainly overlooked sth., so please flame me ;-)

> The current discussion is almost entirely about mknod.]

Yes: Let "mknod /dev/foo [bc] x y" die!

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-03 12:07:20

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> Device numbers have to be uniqe only during one power on -> run ->
> power off cycle. For the rest applications should store device
> names instead anyway. The applications, that don't are buggy by
> defintion.

Device numbers/names have to be constant in order to detect disk layout changes
across boots.

Alan

2001-04-03 12:19:30

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> If anything I'm a *SERIOUS* production user. And I wouldn't allow
> *ANYBODY* here to run am explicitly tagged as developement kernel
> here anyway in an production enviornment. That's what releases are for
> damn.
> Or do you think that Linux should still preserve DOS compatibility
> in to the eternity as other "popular" systems do?

You still break 2.4-2.6. Thats a production release jump. Right now I can
and do run 2.0->2.4 on the same box. If you dont understand why to many
people that is a requirement please talk to folks who run real business on
Linux

2001-04-03 12:21:30

by Ingo Oeser

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, Apr 03, 2001 at 01:06:33PM +0100, Alan Cox wrote:
> Device numbers/names have to be constant in order to detect
> disk layout changes across boots.

Names stay constant, but why the NUMBERS? The names should stay
constant and represent the actual layout on each busses (say:
sane hierachic enumeration) of course.

But /dev/ide/host0/bus0/target0/lun0/part1 could get a new device
number on every reboot, right?

I'm sure, I'm missing some important usage of device of device
numers here (not counting the ones listed already), but I don't
know what ;-)

Otherwise it would be too easy to remove static major/minors and
all the fun allocating them. And LANANA would have one thing less
to worry about ;-)

One thing I certainly miss: DevFS is not mandatory (yet).

Thanks & Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-03 12:28:20

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

Alan Cox wrote:
>
> > If anything I'm a *SERIOUS* production user. And I wouldn't allow
> > *ANYBODY* here to run am explicitly tagged as developement kernel
> > here anyway in an production enviornment. That's what releases are for
> > damn.
> > Or do you think that Linux should still preserve DOS compatibility
> > in to the eternity as other "popular" systems do?
>
> You still break 2.4-2.6. Thats a production release jump. Right now I can
> and do run 2.0->2.4 on the same box. If you dont understand why to many
> people that is a requirement please talk to folks who run real business on
> Linux

You have possible no imagination about how real the business is I do
:-).
What's worth it to be able running 2.0 and 2.4 on the same box?
I just intendid to tell you that there are actually people in the
REAL BUSINESS out there who know about and are willing to sacifier
compatibility until perpetuum for contignouus developement.

BTW we don't run much of Cyrix486 hardware anymore here.. More like
boxes with few gigs of ram 4 CPU's RAID and so on...
The single biggest memmory hog here is currently the Oracle 9i AS.

2001-04-03 12:32:10

by Martin Dalecki

[permalink] [raw]
Subject: Re: Larger dev_t

> One thing I certainly miss: DevFS is not mandatory (yet).

That's "only" due to the fact that DevFS is an insanely racy and
instable
piece of CRAP. I'm unhappy it's there anyway...

2001-04-03 12:38:30

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> What's worth it to be able running 2.0 and 2.4 on the same box?
> I just intendid to tell you that there are actually people in the
> REAL BUSINESS out there who know about and are willing to sacifier
> compatibility until perpetuum for contignouus developement.

And many people who require the ability to drop back one or two versions (major
versions) on a problem. Every upgrade requires a getout path

2001-04-03 12:42:20

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> Names stay constant, but why the NUMBERS? The names should stay
> constant and represent the actual layout on each busses (say:
> sane hierachic enumeration) of course.

You can do it that way too

> But /dev/ide/host0/bus0/target0/lun0/part1 could get a new device
> number on every reboot, right?

It could be a different device each boot too. Who is doing the bus
enumeration in a constant manner.

> Otherwise it would be too easy to remove static major/minors and
> all the fun allocating them. And LANANA would have one thing less
> to worry about ;-)

There are a very large number of reasons you need them and things that depend
on constant numbering for block devices such as backup tools. They can Im sure
be taught constant naming, but there is no provision for names not device ids
in them.

Things like tar will no longer work on Linux for example because tar does not
know how to archive the name of a device node.

> One thing I certainly miss: DevFS is not mandatory (yet).

devfs solves a different problem - enumeration of dynamically configured
resources. Its unrelated to the fundamental problem.

2001-04-03 12:54:02

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> > One thing I certainly miss: DevFS is not mandatory (yet).
>
> That's "only" due to the fact that DevFS is an insanely racy and
> instable
> piece of CRAP. I'm unhappy it's there anyway...

It certainly seems to have some race conditions but other than that and the
slight problem it puts policy in the kernel it seems ok. I'd prefer it was
userspace and implemented via /sbin/hotplug - but that isnt possible yet and
opens a whole other set of interesting races to ponder

2001-04-03 14:51:21

by Wayne.Brown

[permalink] [raw]
Subject: Re: Larger dev_t



Ingo Oeser <[email protected]> wrote:

>Yes: Let "mknod /dev/foo [bc] x y" die!

I hope this never happens. Improving the major/minor device scheme is
reasonable; abandoning it would be a sad occurrence. It would make Linux too
"un-UNIXish" (how's THAT for an an ugly neologism!) for my tastes.

Wayne


2001-04-03 15:37:46

by Bart Trojanowski

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, 3 Apr 2001, [email protected] wrote:

> Ingo Oeser <[email protected]> wrote:
>
> >Yes: Let "mknod /dev/foo [bc] x y" die!
>
> I hope this never happens. Improving the major/minor device scheme is
> reasonable; abandoning it would be a sad occurrence. It would make Linux too
> "un-UNIXish" (how's THAT for an an ugly neologism!) for my tastes.

I don't know... the command 'mknod' should probably remain for
compatibility reasons. But the way that it does create the node can be
completely different. For example the call could just be a wrapper to a
syscall or a write to a proc file.

I think Ingo had qualms with the process of creating of a device file
which is totally detached of the kernel's ability to service that device.

But I am with you. The compatibility between *NIX should not be severed
so fast.

B.

--
WebSig: http://www.jukie.net/~bart/sig/



2001-04-03 16:07:22

by Richard Gooch

[permalink] [raw]
Subject: Re: Larger dev_t

Alan Cox writes:
> > > One thing I certainly miss: DevFS is not mandatory (yet).
> >
> > That's "only" due to the fact that DevFS is an insanely racy and
> > instable
> > piece of CRAP. I'm unhappy it's there anyway...
>
> It certainly seems to have some race conditions but other than that
> and the slight problem it puts policy in the kernel it seems ok. I'd
> prefer it was userspace and implemented via /sbin/hotplug - but that
> isnt possible yet and opens a whole other set of interesting races
> to ponder

Yes, devfs has some races. They are in the process of getting
fixed. Yes, it's taken a long time (moving house twice in 6 months and
several travel trips take their toll on productivity).

However, a large number of people run devfs on small to large systems,
and these "races" aren't causing problems. People tell me it's quite
stable. I run devfs on my systems, and not once have I had a problem
due to devfs "races". So I feel it's quite unfair to paint such a dire
picture (I'm referring to Martin's comments here, not Alan's).

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-03 16:35:26

by Alexander Viro

[permalink] [raw]
Subject: Re: Larger dev_t



On Tue, 3 Apr 2001, Richard Gooch wrote:

> However, a large number of people run devfs on small to large systems,
> and these "races" aren't causing problems. People tell me it's quite
> stable. I run devfs on my systems, and not once have I had a problem
> due to devfs "races". So I feel it's quite unfair to paint such a dire
> picture (I'm referring to Martin's comments here, not Alan's).

And _that_ approach is the reason why I absolutely refuse to run your code
on any of my boxen. Sorry. If devfs (without serious cleanup) will become
mandatory I'll fork the tree - better backporting patches to Linus' one than
depending on current devfs. You've been sitting on known (and easily fixable)
bugs and asking to leave fixing them to you for what, 10 months already?
Furrfu... You are maintainer of that code. You keep insisting on having
everything and a kitchen sink in the devfs and refuse to split the
functionality into reasonable pieces. Essentially you are saying that it's
all or nothing deal. Fine with me - out of these options I certainly
prefer the latter.
Al

2001-04-03 16:55:00

by Alan

[permalink] [raw]
Subject: Re: Larger dev_t

> However, a large number of people run devfs on small to large systems,
> and these "races" aren't causing problems. People tell me it's quite

They dont have users actively trying to exploit them. I don't consider it a
big problem for development trees though. devfs has a maintainer at least

Alan

2001-04-03 17:00:10

by Richard Gooch

[permalink] [raw]
Subject: Re: Larger dev_t

Alexander Viro writes:
>
>
> On Tue, 3 Apr 2001, Richard Gooch wrote:
>
> > However, a large number of people run devfs on small to large systems,
> > and these "races" aren't causing problems. People tell me it's quite
> > stable. I run devfs on my systems, and not once have I had a problem
> > due to devfs "races". So I feel it's quite unfair to paint such a dire
> > picture (I'm referring to Martin's comments here, not Alan's).
>
> And _that_ approach is the reason why I absolutely refuse to run
> your code on any of my boxen. Sorry. If devfs (without serious
> cleanup) will become mandatory I'll fork the tree - better
> backporting patches to Linus' one than depending on current devfs.

Al, I've told you that the races will be fixed. Calm down. I know you
take a very theoretical and hard-line approach. All I said was that
the races aren't causing problems for people in real life. That's why
some vendors are using it. I never disagreed with you about the
existence of the races.
Peace, OK?

> You've been sitting on known (and easily fixable) bugs and asking to
> leave fixing them to you for what, 10 months already? Furrfu...

Yeah, 10 months during which I've gone to 7 conferences/workshops,
written 2 papers, moved house twice, took two holidays (sorry, I have
a life), moved/split/unsplit our lab network twice, caught the flu at
least once, and sundry other distractions. Pardon me for being busy.

> You are maintainer of that code. You keep insisting on having
> everything and a kitchen sink in the devfs and refuse to split the
> functionality into reasonable pieces. Essentially you are saying
> that it's all or nothing deal. Fine with me - out of these options
> I certainly prefer the latter.

The claim that splitting it into pieces will be an improvement is just
hand-waving. I've not seen a solid argument that shows how it will
help. Especially not after I remove the FS database code in devfs and
just use the dcache to store my tree. That will trim the code by 50%
or more. I'm going to wait and see how my next versions of devfs turn
out before I make any hard claims.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-03 17:05:00

by Richard Gooch

[permalink] [raw]
Subject: Re: Larger dev_t

Alan Cox writes:
> > However, a large number of people run devfs on small to large systems,
> > and these "races" aren't causing problems. People tell me it's quite
>
> They dont have users actively trying to exploit them. I don't
> consider it a big problem for development trees though. devfs has a
> maintainer at least

Agreed. If I were a sysadmin where I had users I didn't trust, then
I'd be worried. Actually, I'd simply not enable module autoloading.
In fact, I don't run autoloading because I don't like it personally.
And I'm lucky that I have users on my network that I feel I can
trust. Besides, I know where they live, or at least where they store
their data/theses :-)

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-03 23:30:08

by Tim Wright

[permalink] [raw]
Subject: Re: Larger dev_t

On Tue, Apr 03, 2001 at 02:20:24PM +0200, Ingo Oeser wrote:
> On Tue, Apr 03, 2001 at 01:06:33PM +0100, Alan Cox wrote:
> > Device numbers/names have to be constant in order to detect
> > disk layout changes across boots.
>
> Names stay constant, but why the NUMBERS? The names should stay
> constant and represent the actual layout on each busses (say:
> sane hierachic enumeration) of course.
>

This ignores the issue that in some cases you cannot give a physical location.
Take the case of fibre-channel connected disks, potentially using multi-path
I/O. There is no "actual layout" since you don't have a fixed physical path.
At that point you have to have a more sophisticated naming scheme than the
physical location of the disk, since physical location loses its meaning.

You absolutely must avoid device name slippage. Whether this involves major
and minor numbers is pretty much orthogonal. Major and minor numbers provided
a nice and simple way for the kernel to map a device open into a driver and an
argument to said driver. There are obviously other (more complex ways) of
achieving the same thing. An obvious answer for hard disks is some form of
labelling. Equally obviously, this does not solve the problem of e.g.
fibre-channel connected tape drives.

Regards,

Tim

--
Tim Wright - [email protected] or [email protected] or [email protected]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

2001-04-04 08:10:26

by Rogier Wolff

[permalink] [raw]
Subject: Re: Larger dev_t

Alan Cox wrote:
> > What's worth it to be able running 2.0 and 2.4 on the same box?
> > I just intendid to tell you that there are actually people in the
> > REAL BUSINESS out there who know about and are willing to sacifier
> > compatibility until perpetuum for contignouus developement.

> And many people who require the ability to drop back one or two
> versions (major versions) on a problem. Every upgrade requires a
> getout path

Right. So if we go to 64 bits NOW (in 2.4), then when after 3.2 we
actually start needing > 16 bits of dev_t everyone can downgrade to
2.0, except those people who use drivers that require those extra
bits.

The further away from "the deadline" that we switch, the easier it
becomes to provide a smooth upgrade path. When we have 65536 devices
in use, when we finally switch, you can bet your ass we'll be using
the "new" device number space right away. However, if we're still
comfortable with the 16 bits, we can upgrade the infrastructure ASAP,
and make the "no return" switch later. Much later.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.