2003-03-27 00:58:12

by Andries E. Brouwer

Subject: 64-bit kdev_t - just for playing

>> Maybe I should send another patch tonight, just for playing.

> Please, I'd like that.

Below a random version of kdev_t.h.
(The file is smaller than the patch.)

kdev_t is the kernel-internal representation
dev_t is the kernel idea of the user space representation
(of course glibc uses 64 bits, split up as 8+8 :-)

kdev_t can be equal to dev_t.

The file below completely randomly makes kdev_t
64 bits, split up 32+32, and dev_t 32 bits, split up 12+20.

Andries

------------------------------------------------------------
#ifndef _LINUX_KDEV_T_H
#define _LINUX_KDEV_T_H
#ifdef __KERNEL__

typedef struct {
	unsigned long long value;
} kdev_t;

#define KDEV_MINOR_BITS 32
#define KDEV_MAJOR_BITS 32
#define KDEV_MINOR_MASK ((1ULL << KDEV_MINOR_BITS) - 1)

#define __mkdev(major, minor) (((unsigned long long)(major) << KDEV_MINOR_BITS) + (minor))

#define mk_kdev(major, minor) ((kdev_t) { __mkdev(major,minor) } )

/*
* The "values" are just _cookies_, usable for
* internal equality comparisons and for things
* like NFS filehandle conversion.
*/
static inline unsigned long long kdev_val(kdev_t dev)
{
	return dev.value;
}

static inline kdev_t val_to_kdev(unsigned long long val)
{
	kdev_t dev;
	dev.value = val;
	return dev;
}

#define HASHDEV(dev) (kdev_val(dev))
#define NODEV (mk_kdev(0,0))

extern const char * kdevname(kdev_t); /* note: returns pointer to static data! */

static inline int kdev_same(kdev_t dev1, kdev_t dev2)
{
	return dev1.value == dev2.value;
}

#define kdev_none(d1) (!kdev_val(d1))

#define minor(dev) ((unsigned int)((dev).value & KDEV_MINOR_MASK))
#define major(dev) ((unsigned int)((dev).value >> KDEV_MINOR_BITS))

/* These are for user-level "dev_t" */
#define MINORBITS 8
#define MINORMASK ((1U << MINORBITS) - 1)
#define DEV_MINOR_BITS 20
#define DEV_MAJOR_BITS 12
#define DEV_MINOR_MASK ((1U << DEV_MINOR_BITS) - 1)
#define DEV_MAJOR_MASK ((1U << DEV_MAJOR_BITS) - 1)

#include <linux/types.h> /* dev_t */

#define MAJOR(dev) ((unsigned int)(((dev) & 0xffff0000) ? ((dev) >> DEV_MINOR_BITS) & DEV_MAJOR_MASK : ((dev) >> 8) & 0xff))
#define MINOR(dev) ((unsigned int)(((dev) & 0xffff0000) ? ((dev) & DEV_MINOR_MASK) : ((dev) & 0xff)))
#define MKDEV(ma,mi) ((dev_t)((((ma) & ~0xff) == 0 && ((mi) & ~0xff) == 0) ? (((ma) << 8) | (mi)) : (((ma) << DEV_MINOR_BITS) | (mi))))

/*
* Conversion functions
*/

static inline int kdev_t_to_nr(kdev_t dev)
{
	unsigned int ma = major(dev);
	unsigned int mi = minor(dev);
	return MKDEV(ma, mi);
}

static inline kdev_t to_kdev_t(dev_t dev)
{
	unsigned int ma = MAJOR(dev);
	unsigned int mi = MINOR(dev);
	return mk_kdev(ma, mi);
}

#else /* __KERNEL__ */

/*
Some programs want their definitions of MAJOR and MINOR and MKDEV
from the kernel sources. These must be the externally visible ones.
Of course such programs should be updated.
*/
#define MAJOR(dev) ((dev)>>8)
#define MINOR(dev) ((dev) & 0xff)
#define MKDEV(ma,mi) ((ma)<<8 | (mi))
#endif /* __KERNEL__ */
#endif
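
As a rough illustration of the 32+32 split described above, the following userspace sketch packs a major/minor pair into a 64-bit cookie and unpacks it again. The `sketch_` names and the `stdint.h` types are assumptions made for this example; they are not part of the posted header.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel-internal kdev_t. */
typedef struct {
    uint64_t value;
} sketch_kdev_t;

#define SKETCH_KDEV_MINOR_BITS 32
#define SKETCH_KDEV_MINOR_MASK ((UINT64_C(1) << SKETCH_KDEV_MINOR_BITS) - 1)

/* Pack: major in the upper 32 bits, minor in the lower 32 bits. */
static inline sketch_kdev_t sketch_mk_kdev(uint32_t major, uint32_t minor)
{
    sketch_kdev_t dev = { ((uint64_t)major << SKETCH_KDEV_MINOR_BITS) | minor };
    return dev;
}

/* Unpack the upper half. */
static inline uint32_t sketch_major(sketch_kdev_t dev)
{
    return (uint32_t)(dev.value >> SKETCH_KDEV_MINOR_BITS);
}

/* Unpack the lower half. */
static inline uint32_t sketch_minor(sketch_kdev_t dev)
{
    return (uint32_t)(dev.value & SKETCH_KDEV_MINOR_MASK);
}
```

Wrapping the value in a struct, as the header does, keeps the compiler from silently mixing the cookie with plain integers.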


2003-03-27 19:12:32

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Thu, 27 Mar 2003 [email protected] wrote:

> kdev_t is the kernel-internal representation
> dev_t is the kernel idea of the user space representation
> (of course glibc uses 64 bits, split up as 8+8 :-)
>
> kdev_t can be equal to dev_t.
>
> The file below completely randomly makes kdev_t
> 64 bits, split up 32+32, and dev_t 32 bits, split up 12+20.

It would really help, if you would explain how a larger dev_t will work
during 2.6.
Let's take an example, the first user is likely the scsi layer, so how
can I manage thousands of disks under 2.6?
How are these disks registered and how will the dev_t number look like?
How will the user know about these numbers?
Who creates these device entries (user or daemon)?
How have user space utilities to be changed, which know about dev_t (e.g.
ls or fdisk)?
How is backward compatibility done, so that I can still boot a 2.4 kernel?

bye, Roman

2003-03-27 20:16:24

by Andries E. Brouwer

Subject: Re: 64-bit kdev_t - just for playing

> It would really help, if you would explain how a larger dev_t
> will work during 2.6.

> How is backward compatibility done, so that I can still boot a 2.4 kernel?

Old device numbers remain valid, so all changes are completely
transparent.
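
One way to see why old numbers stay valid is to look at how the MKDEV/MAJOR/MINOR macros in the posted header choose between the two encodings. The sketch below restates that logic as plain functions. The `sketch_` names are hypothetical, and this is only a reading of the posted macros, not code from any kernel tree.

```c
#include <stdint.h>

#define SKETCH_DEV_MINOR_BITS 20
#define SKETCH_DEV_MINOR_MASK ((UINT32_C(1) << SKETCH_DEV_MINOR_BITS) - 1)
#define SKETCH_DEV_MAJOR_MASK ((UINT32_C(1) << 12) - 1)

/* Encode: pairs that fit in 8+8 keep their old 16-bit value. */
static inline uint32_t sketch_mkdev(uint32_t ma, uint32_t mi)
{
    if (ma <= 0xff && mi <= 0xff)
        return (ma << 8) | mi;                  /* legacy 8+8 layout */
    return (ma << SKETCH_DEV_MINOR_BITS) | mi;  /* new 12+20 layout */
}

/* Decode: any bit set above bit 15 marks the 12+20 layout. */
static inline uint32_t sketch_dev_major(uint32_t dev)
{
    if (dev & 0xffff0000u)
        return (dev >> SKETCH_DEV_MINOR_BITS) & SKETCH_DEV_MAJOR_MASK;
    return (dev >> 8) & 0xff;
}

static inline uint32_t sketch_dev_minor(uint32_t dev)
{
    if (dev & 0xffff0000u)
        return dev & SKETCH_DEV_MINOR_MASK;
    return dev & 0xff;
}
```

Under this reading, major 8 with minor 1 still encodes as 0x0801, exactly as before. One corner case it suggests: major 0 with a minor between 256 and 65535 yields a value below 0x10000 and would decode as a legacy number.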

> How have user space utilities to be changed, which know about
> dev_t (e.g. ls or fdisk)?

If you do not use mknod to create device nodes with large device numbers,
then no new utilities are needed. If you really want to use large
device numbers, you need a new glibc; some utilities will require
recompilation because of the use of sysmacros.h.

Andries

2003-03-27 22:01:08

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Thu, 27 Mar 2003 [email protected] wrote:

You must have overlooked some of my questions:

How are these disks registered and how will the dev_t number look like?
How will the user know about these numbers?
Who creates these device entries (user or daemon)?

> > How is backward compatibility done, so that I can still boot a 2.4 kernel?
>
> Old device numbers remain valid, so all changes are completely
> transparent.

SCSI has multiple majors, disks 0-15 are at major 8, disks 16-31 are at
65, ...., disks 112-127 are at major 71. Will this stay the same? Where
are the disk 128-xxx?
Can I have now more than 15 partitions?

bye, Roman

2003-03-27 22:26:14

by Andries E. Brouwer

Subject: Re: 64-bit kdev_t - just for playing

> Can I have now more than 15 partitions?

Two years ago I amused myself creating a few hundred partitions
on a SCSI disk. Yes, no doubt the availability of numbers will
remove the current limits on the number of partitions of a disk.

But, as I answered you several times already,
right now my topic is dev_t, not devices or partitions.
Just the number.

Andries

2003-03-27 22:43:05

by Alan

Subject: Re: 64-bit kdev_t - just for playing

On Thu, 2003-03-27 at 22:12, Roman Zippel wrote:
> How are these disks registered and how will the dev_t number look like?

Al Viro's work so far makes those issues you can defer nicely.

> How will the user know about these numbers?

Devices.txt or dynamic assignment

> Who creates these device entries (user or daemon)?

Who cares 8) That's just the devfs argument all over again 8)

> SCSI has multiple majors, disks 0-15 are at major 8, disks 16-31 are at
> 65, ...., disks 112-127 are at major 71. Will this stay the same? Where
> are the disk 128-xxx?
> Can I have now more than 15 partitions?

It becomes possible, more importantly we can begin to support
partitioned CD-ROM both for multisession and for real partition
tables on CD (eg Macintrash)

2003-03-27 22:44:45

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Thu, 27 Mar 2003 [email protected] wrote:

> But, as I answered you several times already,
> right now my topic is dev_t, not devices or partitions.
> Just the number.

Well, what can I do with that number? Your patches must provide some sort
of benefit when we have that number. I'm currently trying to find out,
what happens after we have this number, so I would be really grateful, if
you would answer my questions:

How are these disks registered and how will the dev_t number look like?
How will the user know about these numbers?
Who creates these device entries (user or daemon)?
SCSI has multiple majors, disks 0-15 are at major 8, disks 16-31 are at
65, ...., disks 112-127 are at major 71. Will this stay the same? Where
are the disk 128-xxx?
Can I have now more than 15 partitions?

bye, Roman

2003-03-27 23:08:14

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On 27 Mar 2003, Alan Cox wrote:

> > How are these disks registered and how will the dev_t number look like?
>
> Al Viro's work so far makes those issues you can defer nicely.

I know his work, I'm just trying to find out, whether Andries understands
it too, or if he maybe knows something I don't.

> > How will the user know about these numbers?
>
> Devices.txt or dynamic assignment

The first case means a /dev directory with millions of dev entries.
How does the user find out about the number of partitions in the second
case?

> > Who creates these device entries (user or daemon)?
>
> Who cares 8) That's just the devfs argument all over again 8)

Why? I specifically didn't mention the kernel.
Anyone has to care, somehow this large number space must be managed.

> > SCSI has multiple majors, disks 0-15 are at major 8, disks 16-31 are at
> > 65, ...., disks 112-127 are at major 71. Will this stay the same? Where
> > are the disk 128-xxx?
> > Can I have now more than 15 partitions?
>
> It becomes possible, more importantly we can begin to support
> partitioned CD-ROM both for multisession and for real partition
> tables on CD (eg Macintrash)

How exactly does this become possible?

bye, Roman

2003-03-27 23:38:11

by Greg KH

Subject: Re: 64-bit kdev_t - just for playing

On Fri, Mar 28, 2003 at 12:19:25AM +0100, Roman Zippel wrote:
> > > How will the user know about these numbers?
> >
> > Devices.txt or dynamic assignment
>
> The first case means a /dev directory with millions of dev entries.
> How does the user find out about the number of partitions in the second
> case?

They point and guess, just like they do today :)

> > > Who creates these device entries (user or daemon)?
> >
> > Who cares 8) That's just the devfs argument all over again 8)
>
> Why? I specifically didn't mention the kernel.
> Anyone has to care, somehow this large number space must be managed.

Yes, some of us are working on this. But this has nothing to do with
the kernel, or Andries's patches. It's a userspace issue.

thanks,

greg k-h

2003-03-28 09:36:05

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Thu, 27 Mar 2003, Greg KH wrote:

> > > Devices.txt or dynamic assignment
> >
> > The first case means a /dev directory with millions of dev entries.
> > How does the user find out about the number of partitions in the second
> > case?
>
> They point and guess, just like they do today :)

I think the users which need this most won't be particularly happy.

> > > > Who creates these device entries (user or daemon)?
> > >
> > > Who cares 8) That's just the devfs argument all over again 8)
> >
> > Why? I specifically didn't mention the kernel.
> > Anyone has to care, somehow this large number space must be managed.
>
> Yes, some of us are working on this. But this has nothing to do with
> the kernel, or Andries's patches. It's a userspace issue.

Somehow the kernel and the userspace have to work together. What I want to
know is whether we just create another crutch, barely usable for the
desperate or if we create a solution which has a small chance to still
work in the future.

bye, Roman

2003-03-28 10:58:47

by Andries E. Brouwer

Subject: Re: 64-bit kdev_t - just for playing

Roman, Your questions are misguided.
A larger dev_t is infrastructure.
A sand road that is turned into an asphalt road.

Nobody has to use this improved infrastructure.
But many uses are conceivable.

Long ago I reserved 2^40 values for dynamically
assigned anonymous devices. Convenient, a very
small fraction of the available space.

I can imagine that there will be people wanting
to take part of the available space for a universal
hash of disk serial number or partition label or
I don't know what, so that devices are addressable
by content instead of path.

A lot of room can be put to many uses.

Andries

2003-03-28 11:25:10

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Fri, 28 Mar 2003 [email protected] wrote:

> Roman, Your questions are misguided.

Thanks for your trust. :-(

> A larger dev_t is infrastructure.
> A sand road that is turned into an asphalt road.
>
> Nobody has to use this improved infrastructure.
> But many uses are conceivable.

The size of dev_t doesn't matter at all, what matters is how this number
is managed and used. The kernel has somehow to generate a number for a
device and tell the user about it, so that he can use it to access the
device. This requires infrastructure and the actual size of this number is
only a small detail in the whole picture. I want to know how the whole
picture looks like, so could you please stop talking bullshit and answer
my questions?

> I can imagine that there will be people wanting
> to take part of the available space for a universal
> hash of disk serial number or partition label or
> I don't know what, so that devices are addressable
> by content instead of path.

This won't happen, dev_t is the wrong place to encode such information.

bye, Roman

2003-03-28 11:35:20

by Andries E. Brouwer

Subject: Re: 64-bit kdev_t - just for playing

> the actual size of this number is only a small detail

Yes. It is that detail I am concerned with.

2003-03-28 11:46:04

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Fri, 28 Mar 2003 [email protected] wrote:

> > the actual size of this number is only a small detail
>
> Yes. It is that detail I am concerned with.

If you don't want to or can't answer my question, it means I can revert
your character device changes?

bye, Roman

2003-03-28 15:22:11

by Andries E. Brouwer

Subject: Re: 64-bit kdev_t - just for playing

On Fri, 28 Mar 2003 [email protected] wrote:

> > > the actual size of this number is only a small detail
> >
> > Yes. It is that detail I am concerned with.
>
> If you don't want to or can't answer my question, it means I can revert
> your character device changes?

I am not Linus. You can send him whatever you think best
and see whether he applies it.

I would prefer if you waited a bit. This little detail,
changing the size of dev_t, requires an audit of the
kernel source. That takes some time.
I would much prefer postponing discussion about device
handling until after number handling is in good shape.

Generally it is a bad idea when two people simultaneously
change the same code.

Andries


2003-03-28 15:38:47

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Fri, 28 Mar 2003 [email protected] wrote:

> I would prefer if you waited a bit. This little detail,
> changing the size of dev_t, requires an audit of the
> kernel source. That takes some time.
> I would much prefer postponing discussion about device
> handling until after number handling is in good shape.

You already changed the device handling and you don't want to discuss it?
My patch does the same without the questionable device handling changes.

> Generally it is a bad idea when two people simultaneously
> change the same code.

Ignoring other people's comments and questions is generally a bad idea as
well.

bye, Roman

2003-03-28 17:57:21

by Joel Becker

Subject: Re: 64-bit kdev_t - just for playing

On Fri, Mar 28, 2003 at 10:47:14AM +0100, Roman Zippel wrote:
> On Thu, 27 Mar 2003, Greg KH wrote:
> > They point and guess, just like they do today :)
>
> I think the users which need this most won't be particularly happy.

I represent the users which need this most, and I can tell you we
will be 100x happier pointing and guessing at enough dev_t space. If we
were to have to stick with the ancient, seriously outdated limits of the
current space, we will be terribly unhappy.
Not having the perfect solution all at once doesn't mean you do
nothing. The size of dev_t is orthogonal to device naming. Once this
is in, the current device naming (however poor it is) can handle the
number of devices we need. Future device naming strategies (like the
one Greg is working on) will work with a large dev_t just fine.

Joel

--

"Vote early and vote often."
- Al Capone

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127

2003-03-28 18:37:04

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Fri, 28 Mar 2003, Joel Becker wrote:

> I represent the users which need this most, and I can tell you we
> will be 100x happier pointing and guessing at enough dev_t space. If we
> were to have to stick with the ancient, seriously outdated limits of the
> current space, we will be terribly unhappy.
> Not having the perfect solution all at once doesn't mean you do
> nothing. The size of dev_t is orthogonal to device naming. Once this
> is in, the current device naming (however poor it is) can handle the
> number of devices we need. Future device naming strategies (like the
> one Greg is working on) will work with a large dev_t just fine.

I don't mind temporary solutions, but I prefer the ones, which don't have
to be thrown away completely (e.g. like Andries' char device changes).
If Andries would actually explain, what he wants to do with the larger
dev_t, it would be a lot easier to help him, so that we can at least avoid
the biggest mistakes.

bye, Roman

2003-03-30 19:53:55

by H. Peter Anvin

Subject: Re: 64-bit kdev_t - just for playing

Followup to: <Pine.LNX.4.44.0303281219350.5042-100000@serv>
By author: Roman Zippel <[email protected]>
In newsgroup: linux.dev.kernel
>
> The size of dev_t doesn't matter at all,
>

Bullshit.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64

2003-03-30 20:01:49

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On 30 Mar 2003, H. Peter Anvin wrote:

> > The size of dev_t doesn't matter at all,
>
> Bullshit.

Huh?
(Do I love all these detailed explanations... *sigh*)

bye, Roman

2003-03-30 19:59:41

by H. Peter Anvin

Subject: Re: 64-bit kdev_t - just for playing

Followup to: <[email protected]>
By author: [email protected]
In newsgroup: linux.dev.kernel
>
> >> Maybe I should send another patch tonight, just for playing.
>
> > Please, I'd like that.
>
> Below a random version of kdev_t.h.
> (The file is smaller than the patch.)
>
> kdev_t is the kernel-internal representation
> dev_t is the kernel idea of the user space representation
> (of course glibc uses 64 bits, split up as 8+8 :-)
>
> kdev_t can be equal to dev_t.
>
> The file below completely randomly makes kdev_t
> 64 bits, split up 32+32, and dev_t 32 bits, split up 12+20.
>

I have a few brief questions:

a) Along all of these you have assumed that it's more efficient to
have a separate type (kdev_t) for kernel-internal "decoded" device number
handling, as opposed to "encoded" device number handling. At some
point, however, it ends up being a struct char_dev * or struct
block_dev *. How big is this gap and does it really make sense
introducing a special type for it?

b) If we do have such a type, would it make more sense to have cdev_t
and bdev_t, and have per-type distinction of block- versus charness?

c) If we do have such a type, any reason to have it be a "unsigned
long long" (really should be u64), instead of "u32 major; u32 minor;"?
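
To make question (c) concrete, here is a small sketch that puts the two layouts side by side and converts between them. The type and function names are invented for this illustration, and the field order in the split struct is an assumption; neither layout is claimed to be what any kernel tree uses.

```c
#include <stdint.h>

/* Layout of the posted header: one 64-bit value, major in the top half. */
typedef struct {
    uint64_t value;
} sketch_packed_kdev;

/* Hypothetical alternative from (c): two separate 32-bit fields. */
typedef struct {
    uint32_t major;
    uint32_t minor;
} sketch_split_kdev;

static inline sketch_split_kdev sketch_split_from_packed(sketch_packed_kdev p)
{
    sketch_split_kdev s = { (uint32_t)(p.value >> 32), (uint32_t)p.value };
    return s;
}

static inline sketch_packed_kdev sketch_packed_from_split(sketch_split_kdev s)
{
    sketch_packed_kdev p = { ((uint64_t)s.major << 32) | s.minor };
    return p;
}

/* Equality on the split form needs two comparisons instead of one. */
static inline int sketch_split_same(sketch_split_kdev a, sketch_split_kdev b)
{
    return a.major == b.major && a.minor == b.minor;
}
```

The split form makes major() and minor() plain field reads, while the packed form keeps equality tests and hashing to a single 64-bit operation.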

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64

2003-03-31 08:20:36

by bert hubert

Subject: Re: 64-bit kdev_t - just for playing

On Fri, Mar 28, 2003 at 07:48:13PM +0100, Roman Zippel wrote:

> If Andries would actually explain, what he wants to do with the larger
> dev_t, it would be a lot easier to help him, so that we can at least avoid
> the biggest mistakes.

Can you envision solutions based on 16 bit kdev_t infrastructure?

Regards,

bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2003-03-31 08:41:41

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Mon, 31 Mar 2003, bert hubert wrote:

> > If Andries would actually explain, what he wants to do with the larger
> > dev_t, it would be a lot easier to help him, so that we can at least avoid
> > the biggest mistakes.
>
> Can you envision solutions based on 16 bit kdev_t infrastructure?

I know that 16bit is getting small (but with dynamic assignment even
that is still enough for most people), but OTOH I don't understand the
obsession for 64bit. 32bit is more than enough on a 32bit system.
Somehow it sounds that we just have to introduce a huge dev_t and all our
problems are magically solved and I doubt that. If people want to encode
random information into dev_t, then even 64bit will be soon not enough
anymore, so I want to know how people actually want to use that large
dev_t number. Is that really too much to ask?
(Slowly I'm getting the feeling that there is some sort of 64bit dev_t
conspiracy going on here, with the amount of answers I'm (not) getting
here.)

bye, Roman

2003-03-31 17:14:50

by Joel Becker

Subject: Re: 64-bit kdev_t - just for playing

On Mon, Mar 31, 2003 at 10:52:53AM +0200, Roman Zippel wrote:
> > Can you envision solutions based on 16 bit kdev_t infrastructure?
>
> I know that 16bit is getting small (but with dynamic assignment even
> that is still enough for most people), but OTOH I don't understand the
> obsession for 64bit. 32bit is more than enough on a 32bit system.

There are big companies out there that require thousands of
disks. Many don't want to use hardware raid, they just want JBOD.
There are installations today with 2k-4k disks covering the gamut from
massive databases to HPC to research facilities. Today.
A 32bit dev_t with the 12/20 Linus split provides 64k minors.
That's 16k disks with our current 15-partitions-per limit. But if
someone wants 35 partitions (I've seen that somewhere) you're down to
8k. And the places using 2-4k disks today will be getting to 8k disks
soon after 2.6 becomes usable, if not before. They will likely hit 16k
disks before 2.6 becomes an afterthought.

> Somehow it sounds that we just have to introduce a huge dev_t and all our
> problems are magically solved and I doubt that. If people want to encode

64bit dev_t is not a panacea. However, if we have to do this
change, we want to do it once. This only solves the space issue. All
of the other issues, such as naming, are orthogonal to this change.
Holding one change until everything else has been written is silly.

> (Slowly I'm getting the feeling that there is some sort of 64bit dev_t
> conspiracy going on here, with the amount of answers I'm (not) getting
> here.)

There is no conspiracy. Everyone agrees we need more dev_t
space. Peter, Andries, and others see it like I do; we only want to do
this once, and we already can see a day when 20bits of minors isn't
enough.

Joel

--

Life's Little Instruction Book #207

"Swing for the fence."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127

2003-03-31 21:21:41

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Mon, 31 Mar 2003, Joel Becker wrote:

> > > Can you envision solutions based on 16 bit kdev_t infrastructure?
> >
> > I know that 16bit is getting small (but with dynamic assignment even
> > that is still enough for most people), but OTOH I don't understand the
> > obsession for 64bit. 32bit is more than enough on a 32bit system.
>
> There are big companies out there that require thousands of
> disks. Many don't want to use hardware raid, they just want JBOD.
> There are installations today with 2k-4k disks covering the gamut from
> massive databases to HPC to research facilities. Today.
> A 32bit dev_t with the 12/20 Linus split provides 64k minors.
> That's 16k disks with our current 15-partitions-per limit. But if
> someone wants 35 partitions (I've seen that somewhere) you're down to
> 8k. And the places using 2-4k disks today will be getting to 8k disks
> soon after 2.6 becomes usable, if not before. They will likely hit 16k
> disks before 2.6 becomes an afterthought.

Umm, I must be missing something, I get here 1024k minors, 64k disks with
15 partitions and 16k disks with 63 partitions.
Anyway, the sooner userspace accepts that dev_t number is just a number
the better. The major/minor split is useless information in userspace and
it becomes less and less important in the kernel.
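
The figures in the paragraph above follow from a 20-bit minor field under a single major. A small sketch of that arithmetic, assuming each disk reserves one minor for the whole device plus one per partition (the function names are made up for illustration):

```c
#include <stdint.h>

/* Number of distinct minors that fit in `minor_bits` bits. */
static inline uint64_t sketch_minors_available(unsigned int minor_bits)
{
    return UINT64_C(1) << minor_bits;
}

/*
 * Disks addressable under one major when every disk takes
 * one minor for the whole device plus `partitions` minors.
 */
static inline uint64_t sketch_disks_addressable(unsigned int minor_bits,
                                                unsigned int partitions)
{
    return sketch_minors_available(minor_bits) / (partitions + 1u);
}
```

With 20 minor bits this gives 1048576 minors (1024k), 65536 disks at 15 partitions each, and 16384 disks at 63 partitions each.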

> > Somehow it sounds that we just have to introduce a huge dev_t and all our
> > problems are magically solved and I doubt that. If people want to encode
>
> 64bit dev_t is not a panacea. However, if we have to do this
> change, we want to do it once. This only solves the space issue. All
> of the other issues, such as naming, are orthogonal to this change.
> Holding one change until everything else has been written is silly.

SCSI already enumerates the devices sequentially. Without userspace tools
it is already very painful to manage thousands of disks, but all the
kernel can do is to give you all the information about the disks it has
and the number you can use to access the disk. What you call that thing is
a userspace issue, not a kernel issue.
A lot of information is already exported via sysfs; missing information
can be added. Changing the kernel to sequentially assign dev_t numbers
starting at 0x10000, when they are registered via add_disk, is a rather
simple change. Then you can have 16M disks with 256 partitions; that
should be enough for a while?
What is mostly missing now is the userspace code. I haven't seen a
request where someone has a simple daemon and now needs this and that
information from the kernel to map a disk reliably to a name.
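
The sequential scheme suggested above can be sketched as a simple mapping from registration order to a 32-bit device number. This is only an illustration of the idea: the constants mirror the 0x10000 start and 256 numbers per disk mentioned in the text, while the function names and the error convention are hypothetical and unrelated to the real add_disk interface.

```c
#include <stdint.h>

#define SKETCH_FIRST_DYNAMIC_DEV 0x10000u /* numbers below stay legacy 8+8 */
#define SKETCH_NUMBERS_PER_DISK  256u     /* whole disk plus its partitions */

/*
 * Device number for slot `part` (0 = whole disk) of the disk that was
 * registered as number `disk_index`. Returns 0 on success, -1 if the
 * slot is out of range or the result does not fit in 32 bits.
 */
static inline int sketch_disk_dev(uint32_t disk_index, uint32_t part,
                                  uint32_t *dev)
{
    uint64_t n;

    if (part >= SKETCH_NUMBERS_PER_DISK)
        return -1;
    n = (uint64_t)SKETCH_FIRST_DYNAMIC_DEV
        + (uint64_t)disk_index * SKETCH_NUMBERS_PER_DISK + part;
    if (n > UINT32_MAX)
        return -1;
    *dev = (uint32_t)n;
    return 0;
}

/* How many disks fit between the start value and the end of 32 bits. */
static inline uint32_t sketch_max_disks(void)
{
    return (uint32_t)(((UINT64_C(1) << 32) - SKETCH_FIRST_DYNAMIC_DEV)
                      / SKETCH_NUMBERS_PER_DISK);
}
```

Under these assumptions the space holds 16776960 disks, just short of 2^24, which matches the "16M disks" estimate.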

> There is no conspiracy. Everyone agrees we need more dev_t
> space. Peter, Andries, and others see it like I do; we only want to do
> this once, and we already can see a day when 20bits of minors isn't
> enough.

If you only want to do it once, then do it right. There are 2 basic
questions, which have to be answered:
1. How do we want to manage dev_t numbers in the future?
2. What compromises can we make for 2.6?
Without answering these questions now, we risk to pay heavily for it
later. The ones who ask now for a larger dev_t the loudest are likely the
first to demand later not change anything for "compatibility", because they
hardcoded certain assumptions about dev_t into their applications.

bye, Roman

2003-03-31 22:59:10

by Joel Becker

Subject: Re: 64-bit kdev_t - just for playing

On Mon, Mar 31, 2003 at 11:32:55PM +0200, Roman Zippel wrote:
> later. The ones who ask now for a larger dev_t the loudest are likely the
> first to demand later not change anything for "compatibility", because they
> hardcoded certain assumptions about dev_t into their applications.

I'm right here campaigning loudly for a larger dev_t. I intend
to never, ever make assumptions about dev_t. In fact, I'd rather not
deal with dev_t. But I do need a way to map 4k or 8k or 16k disks.
now.

Joel

--

"Up and down that road in our worn out shoes,
Talking bout good things and singing the blues."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127

2003-03-31 23:06:38

by Alan

Subject: Re: 64-bit kdev_t - just for playing

On Mon, 2003-03-31 at 22:32, Roman Zippel wrote:
> 1. How do we want to manage dev_t numbers in the future?

Mostly dynamically it seems

> 2. What compromises can we make for 2.6?

Defaulting char devices to 256 minors and a lot of space so stuff doesn't
break. Viro has done the block stuff and we have the scope to do sane
stuff like /dev/disk/.. for all disks now.

> Without answering these questions now, we risk to pay heavily for it
> later. The ones who ask now for a larger dev_t the loudest are likely the
> first to demand later not change anything for "compatibility", because they
> hardcoded certain assumptions about dev_t into their applications.

Glibc already has a bigger dev_t

2003-03-31 23:24:42

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On Mon, 31 Mar 2003, Joel Becker wrote:

> I'm right here campaigning loudly for a larger dev_t. I intend
> to never, ever make assumptions about dev_t. In fact, I'd rather not
> deal with dev_t. But I do need a way to map 4k or 8k or 16k disks.
> now.

Fine, so write the software to do this, but what exactly is there still
for the kernel to do?

bye, Roman

2003-03-31 23:31:31

by Roman Zippel

Subject: Re: 64-bit kdev_t - just for playing

Hi,

On 31 Mar 2003, Alan Cox wrote:

> > 2. What compromises can we make for 2.6?
>
> Defaulting char devices to 256 minors and a lot of space so stuff doesn't
> break. Viro has done the block stuff and we have the scope to do sane
> stuff like /dev/disk/.. for all disks now.

What do you mean with "a lot of space so stuff doesn't break"?

> > Without answering these questions now, we risk to pay heavily for it
> > later. The ones who ask now for a larger dev_t the loudest are likely the
> > first to demand later not change anything for "compatibility", because they
> > hardcoded certain assumptions about dev_t into their applications.
>
> Glibc already has a bigger dev_t

and a broken mknod implementation...

bye, Roman

2003-03-31 23:33:26

by Badari Pulavarty

Subject: Re: 64-bit kdev_t - just for playing


> I'm right here campaigning loudly for a larger dev_t. I intend
> to never, ever make assumptions about dev_t. In fact, I'd rather not
> deal with dev_t. But I do need a way to map 4k or 8k or 16k disks.
> now.
>
> Joel

Hi Joel,

I have been playing with supporting 4000 disks on IA32 machines.
There are a bunch of issues we need to resolve before we could
do that.

I am using scsi_debug to simulate 4000 disks. (Of course, I had
to hack "sd" to support more than 256 disks). Anyway, I noticed
that I lost almost 350MB of my lowmem, when I simulated 4000 disks.
We are working on most of these. But there are userlevel issues
to be resolved. Here is the list ...

1) deadline_drq, blkdev_request consume 80 MB of low memory.
- Jens is looking at it. He is working on a patch to allocate
requests dynamically.

2) sysfs inodes use up 50 MB of low memory
- 4000 disks without partitions create (4000 * 35) = 140,000 inodes in
/sysfs. So, it uses 50 MB of lowmem.

3) dcache is eating up 25 MB of low memory.

4) kmalloc() slabs are consuming 55 MB. We are in the process
of identifying the heavy consumers and fixing them.
- Jens is fixing hash table size issues with io-schedulers.
- I have a patch to allocate "hd_struct" dynamically. So, if your disks
do not have any partitions, you don't use any more memory.

5) glibc and utility issues - lots of stuff are broken
(Need a new libc)
- mknod
- ls
- raw command
- etc..
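
Taking the list above at face value, the itemized low-memory figures can be tallied with a few lines of C. The struct and function names are invented for this tally, and the numbers are simply the ones quoted in items 1 to 4.

```c
#include <stddef.h>

struct sketch_lowmem_item {
    const char *what;  /* consumer named in the list above */
    unsigned int mb;   /* low memory reported for it, in MB */
};

static const struct sketch_lowmem_item sketch_lowmem[] = {
    { "deadline_drq, blkdev_request", 80 },
    { "sysfs inodes",                 50 },
    { "dcache",                       25 },
    { "kmalloc() slabs",              55 },
};

/* Sum of the itemized consumers, in MB. */
static inline unsigned int sketch_lowmem_total_mb(void)
{
    unsigned int total = 0;
    size_t i;

    for (i = 0; i < sizeof(sketch_lowmem) / sizeof(sketch_lowmem[0]); i++)
        total += sketch_lowmem[i].mb;
    return total;
}

/* Inode count from item 2: disks times sysfs inodes per disk. */
static inline unsigned long sketch_sysfs_inodes(unsigned long disks,
                                                unsigned long inodes_per_disk)
{
    return disks * inodes_per_disk;
}
```

The four items add up to 210 MB, so roughly 140 MB of the reported 350 MB is not attributed in the list.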

I have not done any IO on these yet. When I mount all of these and do
IO on them, we might see new issues. So with all these, I will be doubtful
if we can ever reach 16k disks on IA32.

Thanks,
Badari


2003-03-31 23:43:33

by William Lee Irwin III

Subject: Re: 64-bit kdev_t - just for playing

On Mon, Mar 31, 2003 at 03:41:50PM -0800, Badari Pulavarty wrote:
> 2) sysfs inodes use up 50 MB of low memory
> - 4000 disks without partitions create (4000 * 35) = 140,000 inodes in
> /sysfs. So, it uses 50 MB of lowmem.
> 3) dcache is eating up 25 MB of low memory.

These are actually the same thing (the inode pinning references are
actually held by the dentries IIRC). shaggy and I are brewing up
something to show to gregkh and mochel in the near future.


-- wli

2003-03-31 23:35:41

by Joel Becker

Subject: Re: 64-bit kdev_t - just for playing

On Mon, Mar 31, 2003 at 11:18:54PM +0100, Alan Cox wrote:
> Glibc already has a bigger dev_t

Yes, but they hand-map 8:8 in functions like xmknod(). They
should really be using macros.
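
A minimal sketch of why a hand-written 8:8 mapping is a problem once numbers grow. It does not reproduce glibc's code; the function names are hypothetical and only show what any fixed 8+8 packing does to a minor that needs more than 8 bits.

```c
#include <stdint.h>

/*
 * Hand-mapped legacy packing: only the low 8 bits of each field
 * survive, so larger values are silently truncated.
 */
static inline uint16_t sketch_pack_8_8(uint32_t major, uint32_t minor)
{
    return (uint16_t)(((major & 0xff) << 8) | (minor & 0xff));
}

/* Check a caller could make before falling back to 8+8. */
static inline int sketch_fits_8_8(uint32_t major, uint32_t minor)
{
    return major <= 0xff && minor <= 0xff;
}
```

Here minor 300 on major 8 packs to the same value as minor 44, so two different devices collide. Going through one shared set of macros would let the encoding change in a single place.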

Joel

--

Life's Little Instruction Book #314

"Never underestimate the power of forgiveness."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127

2003-03-31 23:47:37

by Joel Becker

Subject: Re: 64-bit kdev_t - just for playing

On Mon, Mar 31, 2003 at 03:41:50PM -0800, Badari Pulavarty wrote:
> I have been playing with supporting 4000 disks on IA32 machines.
> There are a bunch of issues we need to resolve before we could
> do that.

That's the conversation I'm trying to kickstart.

> I am using scsi_debug to simulate 4000 disks. (Of course, I had
> to hack "sd" to support more than 256 disks). Anyway, I noticed
> that I lost almost 350MB of my lowmem, when I simulated 4000 disks.
> We are working on most of these. But there are userlevel issues
> to be resolved. Here is the list ...

Wow, this is cool. Thanks for telling me about this.

> I have not done any IO on these yet. When I mount all of these and do
> IO on them, we might see new issues. So with all these, I will be doubtful
> if we can ever reach 16k disks on IA32.

We're going to have to find a way. IA32 is going to be around
for long enough, I think. Easily 8k disks, as soon as the folks who are
doing 4k disks today want to multipath.

Joel

--

Life's Little Instruction Book #226

"When someone hugs you, let them be the first to let go."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: [email protected]
Phone: (650) 506-8127