2001-12-04 00:21:04

by Donald Becker

[permalink] [raw]
Subject: Linux/Pro -- clusters

Davide Libenzi ([email protected]) wrote

>And if you're the prophet and you think that the future of multiprocessing
>is UP on clusters, why instead of spreading your word between us poor
>kernel fans don't you pull out money from your pocket ( or investors ) and
>start a new Co. that will have that solution has primary and unique goal ?

I believe that the future of multiprocessing is clusters of small scale
SMP machines, 2-8 processors each. And the most important part of
clustering them together isn't single system image from the programmer's
point of view, it's transparent administration for the end user. Thus
our system has a unified process space and a single point of control,
while imposing no overhead on processes.

You are right that there is no reason to convince people here -- I tried
to do that a few years ago. Instead I've put lots of my own time and
money, as well as investor money, into a company that does only cluster
system software.

Anyway, my real point is that while I'm a big proponent of designing
consistent interfaces rather than the haphazard, incompatible changes
that have been occurring, this is far from predict-the-future design.

The goal of designing the kernel to support 128 way SMP systems is a
perfect example of the difference. A few days or weeks of using a
proposed interface change will show if the advantages are worth the cost
of the change. We won't know for years if redesigning the kernel for
large scale SMP systems is useful
 - does it actually work?
 - will big SMP machines be common, or even exist?
 - will big SMP machines have the characteristics we predict?
let alone whether it is worth costs such as
- UP performance hit
- complexity increase slows other improvements
- difficult performance tuning


Donald Becker [email protected]
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993


2001-12-04 01:52:04

by Davide Libenzi

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

On Mon, 3 Dec 2001, Donald Becker wrote:

> Davide Libenzi ([email protected]) wrote
>
> >And if you're the prophet and you think that the future of multiprocessing
> >is UP on clusters, why instead of spreading your word between us poor
> >kernel fans don't you pull out money from your pocket ( or investors ) and
> >start a new Co. that will have that solution has primary and unique goal ?
>
> I believe that the future of multiprocessing is clusters of small scale
> SMP machines, 2-8 processors each. And the most important part of
> clustering them together isn't single system image from the programmers
> point of view, it's transparent administration for the end user. Thus
> our system has a unified process space and a single point of control,
> while imposing no overhead on processes.
>
> You are right that there is no reason to convince people here -- I tried
> to do that a few years ago. Instead I've put lots of my own time and
> money, as well as investor money, into a company that does only cluster
> system software.
>
> Anyway, my real point is that while I'm a big proponent of designing
> consistent interfaces rather than the haphazard, incompatible changes
> that have been occurring, this is far from predict-the-future design.
>
> The goal of designing the kernel to support 128 way SMP systems is a
> perfect example of the difference. A few days or weeks of using a
> proposed interface change will show if the advantages are worth the cost
> of the change. We won't know for years if redesigning the kernel for
> large scale SMP system is useful
> - does it actually work,
> - will big SMP machines be common, or even exist?
> - will big SMP machines have the characteristics we predict
> let alone worth the costs such as
> - UP performance hit
> - complexity increase slows other improvements
> - difficult performance tuning

Don't get me wrong, Donald, I like clusters; for a certain type of
application they fit perfectly and they have quasi-proportional
scalability over the number of computational units.
So, I like clusters.
Quite sadly the real world has applications that cannot easily be made to
fit a cluster environment, like, for example, heavily threaded applications
( please do not start a threading debate from here ) or, more generally, the
share-everything-over-computational-units kind.
Now this kind of design, whether or not we call it crappy, exists in the
real world and is currently working for a _lot_ of companies' application
servers.
Now, since we're all smart guys, we know that "work" is GOOD and nobody,
especially highly paid managers, is willing to change/rearchitect
something that is currently working.
So these guys are happily running their own application server until one
day they suddenly realize that they need more power.
Now they're going to face a dilemma, that is 1) use a standard
architecture big SMP machine ( high price ) or 2) embrace a clustered
solution ( lower price ), having in this way to redesign their application.
I feel quite comfortable in excluding SSI for the kind of applications I'm
talking about.
Since I've seen this scenario many, many times I'm telling you that,
following the 1st theorem of "work", which states that "work is GOOD", our
poor high-salary senior manager is going to snatch solution 1 without even
thinking about it.
No, I do not believe in 128 single-CPU SMP machines but, if I have to look
inside my pretty dirty crystal ball, I see multi-core CPUs as the technology
response to the demand for SMP.
Yes, because after the 1st theorem of "work" there's the 1st lemma of
technology, which states that "technology will always follow
market demand".




- Davide


2001-12-04 02:04:35

by Donald Becker

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

On Mon, 3 Dec 2001, Davide Libenzi wrote:

> On Mon, 3 Dec 2001, Donald Becker wrote:
> > of the change. We won't know for years if redesigning the kernel for
> > large scale SMP system is useful
> > - does it actually work,
> > - will big SMP machines be common, or even exist?
> > - will big SMP machines have the characteristics we predict
> > let alone worth the costs such as
> > - UP performance hit
> > - complexity increase slows other improvements
> > - difficult performance tuning
...
> No, I do not believe in 128 single CPU SMP machines but, if I've to watch
> inside my pretty dirty crystal ball, I see multi-core CPUs as a technology
> response to SMP request.
> Yes, because after the 1st theorem of "work" there's the 1st lemma of
> technology that states that "technology will always follow the
> market request".

You haven't addressed the points above.
We haven't established that the market will request substantial numbers
of 128-way SMPs. Even if it does request single-address-space
multiprocessors, it's very likely that the result will be some form of
cc-NUMA where the structure will strongly influence the OS to treat the
machine as something other than an SMP.

To bring this branch back on point: we should distinguish between
design for an arbitrary and unpredictable goal (e.g. 128 way SMP)
vs. putting some design into things that we are supposed to already
understand:
 - a SCSI device layer that isn't three half-finished clean-ups
 - a VFS layer that doesn't require the kernel to know a priori all of
   the filesystem types that might be loaded


Donald Becker [email protected]
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993

2001-12-04 02:15:12

by Davide Libenzi

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

On Mon, 3 Dec 2001, Donald Becker wrote:

> On Mon, 3 Dec 2001, Davide Libenzi wrote:
>
> > On Mon, 3 Dec 2001, Donald Becker wrote:
> > > of the change. We won't know for years if redesigning the kernel for
> > > large scale SMP system is useful
> > > - does it actually work,
> > > - will big SMP machines be common, or even exist?
> > > - will big SMP machines have the characteristics we predict
> > > let alone worth the costs such as
> > > - UP performance hit
> > > - complexity increase slows other improvements
> > > - difficult performance tuning
> ...
> > No, I do not believe in 128 single CPU SMP machines but, if I've to watch
> > inside my pretty dirty crystal ball, I see multi-core CPUs as a technology
> > response to SMP request.
> > Yes, because after the 1st theorem of "work" there's the 1st lemma of
> > technology that states that "technology will always follow the
> > market request".
>
> You haven't addressed the points above.
> We haven't established that the market will request substantial numbers
> of 128-way SMPs. Even if they do request single-address-space
> multiprocessors, it's very likely that the result will be some form of cc-numa
> where the structure will strongly influence the OS to treat the machine
> as something besides a SMP.
>
> To bring this branch back on point: we should distinguish between
> design for an arbitrary and unpredictable goal (e.g. 128 way SMP)
> vs. putting some design into things that we are supposed to already
> understand
> a SCSI device layer that isn't three half-finished clean-ups
> a VFS layer that doesn't require the kernel to know a priori all of
> the filesystem types that might be loaded

Donald, I'm not even thinking about planning for 128-CPU scalability in
Linux.
The whole point of this discussion ( if any ) is that we have to design for
what we have, or for what we expect to have in the very near future.
We cannot play with technology on long-term plans because, no matter how
well we plan, it'll screw us up.




- Davide


2001-12-04 02:35:24

by Alexander Viro

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters



[apologies for over-the-head reply]

> On Mon, 3 Dec 2001, Donald Becker wrote:

> > a SCSI device layer that isn't three half-finished clean-ups
<nod>

> > a VFS layer that doesn't require the kernel to know a priori all of
> > the filesystem types that might be loaded

WTF? The only interpretation I can think of is about unions in struct
inode and struct superblock. _If_ you add a filesystem that
a) doesn't do separate allocation of fs-private parts of
inode/superblock (i.e. doesn't use ->u.generic_ip and ->u.generic_sbp) and
b) hadn't been known at kernel compile time and
c) has one of these fields (its member in inode->u or sb->u) bigger than
the corresponding fields of all filesystems known at compile time

- yes, you've got a problem.

Solution: use ->u.generic_<...>. Works fine.

Not to mention the fact that the VFS per se doesn't give a damn about fs
types. All it needs is sizeof(struct inode) and sizeof(struct superblock).
And any fs using ->generic_<...> (i.e. a pointer to separately allocated private
objects) is OK, whether it was known at build time or not.
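
To make that concrete, here is a minimal sketch of the ->u.generic_ip
pattern in 2.4-style code; the "myfs" names and fields are purely
illustrative, not an existing filesystem:

#include <linux/fs.h>
#include <linux/slab.h>

/* Illustrative private per-inode state for a hypothetical "myfs". */
struct myfs_inode_info {
        unsigned long   flags;
        unsigned int    block_map[12];
};

static void myfs_read_inode(struct inode *inode)
{
        struct myfs_inode_info *mi = kmalloc(sizeof(*mi), GFP_KERNEL);

        if (!mi) {
                make_bad_inode(inode);
                return;
        }
        /* ... fill *mi and the generic fields from the on-disk inode ... */
        inode->u.generic_ip = mi;       /* private data via pointer, not via the union */
}

static void myfs_clear_inode(struct inode *inode)
{
        kfree(inode->u.generic_ip);     /* release it when the VFS drops the inode */
        inode->u.generic_ip = NULL;
}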

2001-12-04 09:02:07

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> a SCSI device layer that isn't three half-finished clean-ups

Beginning (at last)

> a VFS layer that doesn't require the kernel to know a priori all of
> the filesystem types that might be loaded

That was done a while ago. Filesystems are being moved, one by one, from
using the union of stuff to the fs-specific pointer. New filesystems don't
have to go hack struct inode etc.

Alan

2001-12-04 09:31:14

by Thomas Langås

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

Alan Cox:
> > a SCSI device layer that isn't three half-finished clean-ups
> Beginning (at last)

So there's someone fixing the SCSI-layer code now? (It's marked as
unmaintained in the MAINTAINERS-file for 2.4-kernels, at least)

--
Thomas

2001-12-04 09:37:04

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> Alan Cox:
> > > a SCSI device layer that isn't three half-finished clean-ups
> > Beginning (at last)
>
> So there's someone fixing the SCSI-layer code now? (It's marked as
> unmaintained in the MAINTAINERS-file for 2.4-kernels, at least)

Take a look at the 2.5 code and you'll notice various bits of old cruft
are vanishing rapidly (old eh support, clustering gloop etc)

2001-12-04 11:35:13

by Thomas Langås

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

Alan Cox:
> > So there's someone fixing the SCSI-layer code now? (It's marked as
> > unmaintained in the MAINTAINERS-file for 2.4-kernels, at least)
> Take a look at the 2.5 code and you'll notice various bits of old cruft
> are vanishing rapidly (old eh support, clustering gloop etc)

Ok, I see... I'm trying to read up on the design of the SCSI-internals (from
http://www.andante.org/scsi.html), and it seems like there's much to do
yet... Is the listing in scsi_todo.html still valid, or is much of what's
listed there already done?

--
Thomas

2001-12-04 14:44:31

by Daniel Phillips

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

On December 4, 2001 03:09 am, Donald Becker wrote:
> To bring this branch back on point: we should distinguish between
> design for an arbitrary and unpredictable goal (e.g. 128 way SMP)
> vs. putting some design into things that we are supposed to already
> understan
> [...]
> a VFS layer that doesn't require the kernel to know a priori all of
> the filesystem types that might be loaded

Right, there's a consensus that the fs includes have to be fixed and that it
should happen in 2.5.lownum. The precise plan isn't fully evolved yet ;)

See fsdevel for the thread, 3-4 months ago. IIRC, the favored idea (Linus's)
was to make the generic struct inode part of the fs-specific inode instead of
the other way around, which resolves the question of how the compiler
calculates the size/layout of an inode.
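
A rough sketch of that layout, with illustrative names (the MYFS_I()
helper is the kind of thing each filesystem would provide; nothing here
is from an actual patch):

#include <linux/fs.h>
#include <linux/stddef.h>       /* offsetof */

/* The fs-specific inode embeds the generic one, instead of the generic
 * inode carrying a union of every filesystem's private part. */
struct myfs_inode {
        unsigned int    block_map[12];  /* fs-private fields ... */
        struct inode    vfs_inode;      /* ... plus the embedded VFS inode */
};

/* Recover the wrapper from the generic inode the VFS hands back to us. */
static inline struct myfs_inode *MYFS_I(struct inode *inode)
{
        return (struct myfs_inode *)
                ((char *)inode - offsetof(struct myfs_inode, vfs_inode));
}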

This is going to be a pervasive change that someone has to do all in one
day, so it remains to be seen when/if that is actually going to happen.

It's also going to break every out-of-tree filesystem.

--
Daniel

2001-12-04 15:21:57

by Jeff Garzik

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

Daniel Phillips wrote:
>
> On December 4, 2001 03:09 am, Donald Becker wrote:
> > To bring this branch back on point: we should distinguish between
> > design for an arbitrary and unpredictable goal (e.g. 128 way SMP)
> > vs. putting some design into things that we are supposed to already
> > understan
> > [...]
> > a VFS layer that doesn't require the kernel to know a priori all of
> > the filesystem types that might be loaded
>
> Right, there's a consensus that the fs includes have to fixed and that it
> should be in 2.5.lownum. The precise plan isn't fully evolved yet ;)
>
> See fsdevel for the thread, 3-4 months ago. IIRC, the favored idea (Linus's)
> was to make the generic struct inode part of the fs-specific inode instead of
> the other way around, which resolves the question of how the compiler
> calculates the size/layout of an inode.
>
> This is going to be a pervasive change that someone has to do all in one
> day, so it remains to be seen when/if that is actually going to happen.
>
> It's also going to break every out-of-tree filesystem.

Ugh. What's wrong with a single additional alloc for generic_ip? [if a
filesystem needs to do multiple allocs post-conversion, somebody's doing
something wrong]

Using generic_ip in its current form has the advantage of being able to
create a nicely-aligned kmem cache for your private inode data.
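
Something along these lines, in 2.4-style code (the names are illustrative;
struct myfs_inode_info stands for whatever the filesystem hangs off
generic_ip):

#include <linux/slab.h>
#include <linux/errno.h>

static kmem_cache_t *myfs_inode_cachep;

static int init_inodecache(void)
{
        /* A dedicated, cache-line aligned slab for the private inode data. */
        myfs_inode_cachep = kmem_cache_create("myfs_inode_cache",
                                              sizeof(struct myfs_inode_info),
                                              0, SLAB_HWCACHE_ALIGN,
                                              NULL, NULL);
        return myfs_inode_cachep ? 0 : -ENOMEM;
}

/* read_inode:  inode->u.generic_ip = kmem_cache_alloc(myfs_inode_cachep, GFP_KERNEL); */
/* clear_inode: kmem_cache_free(myfs_inode_cachep, inode->u.generic_ip);               */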

Jeff


--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-12-04 17:16:05

by Daniel Phillips

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

On December 4, 2001 04:19 pm, Jeff Garzik wrote:
> Daniel Phillips wrote:
> >
> > On December 4, 2001 03:09 am, Donald Becker wrote:
> > > To bring this branch back on point: we should distinguish between
> > > design for an arbitrary and unpredictable goal (e.g. 128 way SMP)
> > > vs. putting some design into things that we are supposed to already
> > > understan
> > > [...]
> > > a VFS layer that doesn't require the kernel to know a priori all of
> > > the filesystem types that might be loaded
> >
> > Right, there's a consensus that the fs includes have to fixed and that it
> > should be in 2.5.lownum. The precise plan isn't fully evolved yet ;)
> >
> > See fsdevel for the thread, 3-4 months ago. IIRC, the favored idea (Linus's)
> > was to make the generic struct inode part of the fs-specific inode instead of
> > the other way around, which resolves the question of how the compiler
> > calculates the size/layout of an inode.
> >
> > This is going to be a pervasive change that someone has to do all in one
> > day, so it remains to be seen when/if that is actually going to happen.
> >
> > It's also going to break every out-of-tree filesystem.
>
> ug. what's wrong with a single additional alloc for generic_ip? [if a
> filesystem needs to do multiple allocs post-conversion, somebody's doing
> something wrong]

Single additional alloc -> twice as many allocs, two slabs, more cachelines
dirty. This was hashed out on fsdevel, though apparently not to everyone's
satisfaction.

> Using generic_ip in its current form has the advantage of being able to
> create a nicely-aligned kmem cache for your private inode data.

I don't see why that's hard with the combined struct.

--
Daniel

2001-12-04 17:22:06

by Jeff Garzik

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

Daniel Phillips wrote:
> On December 4, 2001 04:19 pm, Jeff Garzik wrote:
> > ug. what's wrong with a single additional alloc for generic_ip? [if a
> > filesystem needs to do multiple allocs post-conversion, somebody's doing
> > something wrong]
>
> Single additional alloc -> twice as many allocs, two slabs, more cachelines
> dirty. This was hashed out on fsdevel, though apparently not to everyone's
> satisfaction.
>
> > Using generic_ip in its current form has the advantage of being able to
> > create a nicely-aligned kmem cache for your private inode data.
>
> I don't see why that's hard with the combined struct.

The advantage of having two structs is that both struct inode and the
private info can be aligned nicely. Yes, it potentially wastes a tiny
bit more memory, but I challenge you to find an architecture where doing
this isn't a win. In a couple of cases I looked at, additional slabs are
not even necessary, as kmalloc's standard ones do the job quite well.
'cat /proc/slabinfo' for a list of the sizes.

Note this only applies to inodes. There aren't enough superblocks in a
running system to worry about doing anything but a simple kmalloc for the
superblock private info (before assigning it to generic_sbp).
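
A sketch of the superblock side, with illustrative names (read_super is
the 2.4-style entry point; the private struct is made up for the example):

#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/string.h>

struct myfs_sb_info {                   /* illustrative private sb state */
        unsigned long   mount_opts;
};

static struct super_block *myfs_read_super(struct super_block *sb,
                                           void *data, int silent)
{
        struct myfs_sb_info *sbi;

        /* There are few superblocks, so a plain kmalloc is plenty. */
        sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
        if (!sbi)
                return NULL;
        memset(sbi, 0, sizeof(*sbi));
        sb->u.generic_sbp = sbi;

        /* ... read the on-disk superblock, set sb->s_op, hook up the root inode ... */
        return sb;
}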

Jeff


--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-12-04 18:02:02

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> Single additional alloc -> twice as many allocs, two slabs, more cachelines
> dirty. This was hashed out on fsdevel, though apparently not to everyone's
> satisfaction.

Al Viro's NFS in generic_ip saved me something like 130K of memory.

> > Using generic_ip in its current form has the advantage of being able to
> > create a nicely-aligned kmem cache for your private inode data.
>
> I don't see why that's hard with the combined struct.

Provided you end up with fs->alloc_inode() and the fs allocates a suitably
sized inode + private data, I see no problem.
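
In other words, something like the following sketch. Note that
->alloc_inode()/->destroy_inode() do not exist as super_operations in 2.4;
this is only the shape of the proposal, reusing the embedded-inode layout
sketched earlier in the thread:

static struct inode *myfs_alloc_inode(struct super_block *sb)
{
        /* One allocation covers the private part and the embedded VFS inode;
         * myfs_inode_cachep would be sized for the whole struct myfs_inode here. */
        struct myfs_inode *mi = kmem_cache_alloc(myfs_inode_cachep, GFP_KERNEL);

        if (!mi)
                return NULL;
        return &mi->vfs_inode;
}

static void myfs_destroy_inode(struct inode *inode)
{
        kmem_cache_free(myfs_inode_cachep, MYFS_I(inode));
}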

2001-12-04 18:35:34

by Daniel Phillips

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

On December 4, 2001 07:04 pm, Alan Cox wrote:
> > Single additional alloc -> twice as many allocs, two slabs, more cachelines
> > dirty. This was hashed out on fsdevel, though apparently not to everyone's
> > satisfaction.
>
> Al Viro's NFS in generic_ip saved me something like 130K of memory.

Yes, all of these proposals would do that, by getting away from all inodes
being the same size (basically the size of the ext2 inode).

--
Daniel

2001-12-04 20:22:13

by Andrew Morton

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

Daniel Phillips wrote:
>
> On December 4, 2001 07:04 pm, Alan Cox wrote:
> > > Single additional alloc -> twice as many allocs, two slabs, more cachelines
> > > dirty. This was hashed out on fsdevel, though apparently not to everyone's
> > > satisfaction.
> >
> > Al Viro's NFS in generic_ip saved me something like 130K of memory.
>
> Yes, all of these proposals would do that, by getting away from all inodes
> being the same size (basically the size of the ext2 inode).
>

ext3 is the pig at present. I think Andreas has half-a-patch
to move it to generic_ip.

2001-12-05 13:21:44

by Martin Dalecki

[permalink] [raw]
Subject: Deep look into VFS

Yesterday I had a look into the VFS. Out of this
the following question arises for me.

Inside fs/inode.c we have a generic clear_inode(). All fine
and well; at one place the usage of this function goes as follows
(the function in question is iput() from the same file):

        if (op && op->delete_inode) {
                void (*delete)(struct inode *) = op->delete_inode;
                if (!is_bad_inode(inode))
                        DQUOT_INIT(inode);
                /* s_op->delete_inode internally recalls clear_inode() */
                delete(inode);
        } else
                clear_inode(inode);

Well, my thought was that it would be nice to avoid
the explicit call back into clear_inode() from filesystem code in the middle
of nowhere, which would allow us to change the above code sequence into
the much cleaner:

        if (op && op->delete_inode) {
                void (*delete)(struct inode *) = op->delete_inode;
                if (!is_bad_inode(inode))
                        DQUOT_INIT(inode);
                delete(inode);
        }
        clear_inode(inode);

Therefore I have looked at all the places where clear_inode()
is actually called inside the FS implementation code.

shmem told me that the above change would be entirely fine with it.

We have, however, the following in ext2/ialloc.c:


/*
 * NOTE! When we get the inode, we're the only people
 * that have access to it, and as such there are no
 * race conditions we have to worry about. The inode
 * is not on the hash-lists, and it cannot be reached
 * through the filesystem because the directory entry
 * has been deleted earlier.
 *
 * HOWEVER: we must make sure that we get no aliases,
 * which means that we have to call "clear_inode()"
 * _before_ we mark the inode not in use in the inode
 * bitmaps. Otherwise a newly created file might use
 * the same inode number (not actually the same pointer
 * though), and then we'd have two inodes sharing the
 * same inode number and space on the harddisk.
 */
void ext2_free_inode (struct inode * inode)
{
        ...

        lock_super (sb);
        ...
        /* Do this BEFORE marking the inode not in use or returning an error */
        clear_inode (inode);

        ...
        unlock_super (sb);
}

Unless I'm completely misguided, the lock on the superblock
should entirely prevent the race described in the header comment
and we should be able to delete clear_inode() from this function.

Question is: can someone with more knowledge of the intimate
inner workings of the VFS tell me whether my suspicion is
right or not?

Thanks in advance...

PS. Deleting clear_inode() would help to simplify the
delete_inode() parameters quite a bit, as well as help with
deleting the tail union in struct inode - that's the goal.

2001-12-05 15:20:01

by Alexander Viro

[permalink] [raw]
Subject: Re: Deep look into VFS



On Wed, 5 Dec 2001, Martin Dalecki wrote:

> Unless I'm compleatly misguided the lock on the superblock
> should entierly prevent the race described inside the header comment
> and we should be able to delete clear_inode from this function.

Huh? We drop that lock before the return from this function. So if you
move clear_inode() after the return, you lose that protection.

What's more, you can't move that lock_super()/unlock_super() into iput()
itself - you need it _not_ taken at the beginning of ext2_delete_inode()
and you don't want it at all for quite a few filesystems.

Nothing VFS-specific here, just the bog-standard "you lose the protection of
a semaphore once you call up()"...

> PS. Deleting clear_inode() would help to simplify the
> delete_inode parameters quite a significant bit, as
> well as deleting the tail union in struct inode - that's the goal.

Again, huh?

2001-12-05 15:41:11

by Martin Dalecki

[permalink] [raw]
Subject: Re: Deep look into VFS

Alexander Viro wrote:
>
> On Wed, 5 Dec 2001, Martin Dalecki wrote:
>
> > Unless I'm compleatly misguided the lock on the superblock
> > should entierly prevent the race described inside the header comment
> > and we should be able to delete clear_inode from this function.
>
> Huh? We drop that lock before the return from this function. So if you
> move clear_inode() after the return, you lose that protections.
>
> What's more, you can't more that lock_super()/unlock_super() into iput()
> itself - you need it _not_ taken in the beginning of ext2_delete_inode()
> and you don't want it for quite a few filesystems.
>
> Nothing VFS-specific here, just a bog-standard "you lose protection of
> semaphore once you call up()"...

Ummmmm... that is, well, trivially true... of course (I'm
slapping a hand on my forehead).

Thank you for answering!

2001-12-05 22:05:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

In article <[email protected]>,
Thomas Langås <[email protected]> wrote:
>Alan Cox:
>> > a SCSI device layer that isn't three half-finished clean-ups
>> Beginning (at last)
>
>So there's someone fixing the SCSI-layer code now? (It's marked as
>unmaintained in the MAINTAINERS-file for 2.4-kernels, at least)

The old SCSI code won't be fixed. It will just be made totally obsolete
by the better generic block layer code. I personally hope that a year
from now, if somebody wants to do a new SCSI driver, he won't even
_think_ about using the SCSI code, the driver will just take the
(generic SCSI) requests directly off the block queue.

Death to middle-men that can't do a good job anyway.

Linus

2001-12-05 23:09:38

by Andre Hedrick

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

On Wed, 5 Dec 2001, Linus Torvalds wrote:

> The old SCSI code won't be fixed. It will just be made totally obsolete
> by the better generic block layer code. I personally hope that a year
> from now, if somebody wants to do a new SCSI driver, he won't even
> _think_ about using the SCSI code, the driver will just take the
> (generic SCSI) requests directly off the block queue.
>
> Death to middle-men that can't do a good job anyway.

Linus,

Would a three-part model be to your liking? The parts are there for
isolation and for testing the integrity of the driver, to have confidence it
can be trusted to do its tasks properly.

BLOCK IO
----------------------
1) Mainloop
2) Personality Drivers (DEVICE TYPES, but expanded)
3) HOST/Controller-DEVICE

The significance of part three of the driver layer is that it satisfies a new
requirement in SCSI4 for (buzzword) "Domain Boundaries". It means
providing a diagnostic verification of the transport/data-phase layers.

It would require non-serialized block access to perform something like a
direct pattern-block write-read-verify-checksum. This is not
trivial for SCSI, but it can be created. The strength of this model is that
Linux could then isolate hardware problems and make corrections on a
per-controller basis without polluting the purity of the "Domain Boundaries".

This is a model I have been working on for a while for ATA in 2.5;
however, it is no longer possible at this time because of changes in
the block IO model that are not documented as to what new
request_struct items are added, why, and what their usage rules are.

Regards,

Andre Hedrick
Linux ATA Development

2001-12-05 23:41:58

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> by the better generic block layer code. I personally hope that a year
> from now, if somebody wants to do a new SCSI driver, he won't even
> _think_ about using the SCSI code, the driver will just take the
> (generic SCSI) requests directly off the block queue.

You still need the scsi code. There is a whole sequence of common, quite
complex and generic functions that the scsi layer handles (in particular
error handling).

Turning it the right way up I definitely agree with. It should be the driver
calling the scsi code to do the bio->scsi request conversion, and to do scsi
error recovery, not vice versa.

There are also some tricky relationships:
        queues are per logical unit number
        locking is mostly per controller
        resources are often per controller

Alan


2001-12-05 23:55:08

by Andre Hedrick

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

On Wed, 5 Dec 2001, Alan Cox wrote:

> > by the better generic block layer code. I personally hope that a year
> > from now, if somebody wants to do a new SCSI driver, he won't even
> > _think_ about using the SCSI code, the driver will just take the
> > (generic SCSI) requests directly off the block queue.
>
> You still need the scsi code. There are a whole sequence of common, quite
> complex and generic functions that the scsi layer handles (in paticular
> error handling).
>
> Turning it the right way I up definitely agree with. It should be the driver
> calling the scsi code to do bio->scsi request, and to do scsi error
> recovery, not vice versa.
>
> There are also some tricky relationships
> queues are per logical unit number
> locking is mostly per controller
> resources are often per controller

Alan,

Nothing that cannot be handled in the core model described earlier;
however, I am positive that the subtle issues are stickier than you are
revealing now.

Regards,


Andre Hedrick
Linux Disk Certification Project Linux ATA Development

2001-12-06 04:30:02

by Daniel Phillips

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

On December 6, 2001 12:05 am, Andre Hedrick wrote:
> This is a model, I have been working on for a while for ATA for 2.5,
> however it is no longer possible at this time because of the changes in
> the Block IO model that are not documented describing what and why new
> request_struct items are added and their usage ruleset.

There's a rather nice document by Suparna, have you seen it?

--
Daniel

2001-12-06 17:05:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>
>You still need the scsi code. There are a whole sequence of common, quite
>complex and generic functions that the scsi layer handles (in paticular
>error handling).

Well, the preliminary patches already handle _some_ common things, like
building the proper command request for reads and writes etc, and that
will probably continue. We'll probably have to have all the old helpers
for things like "this target only wants to be probed on lun 0" etc.

I disagree about the error handling, though.

Traditionally, the timeouts and the reset handling were handled in the
SCSI mid-layer, and it was a complete and utter disaster. Different
hosts simply wanted such different behaviour that it's not even funny.
Timeouts for different commands were so different that people ended up
making most timeouts so long that they no longer made sense for other
commands etc.

Other device drivers have been able to handle timeouts and errors on
their own before, and have _not_ had the kinds of horrendous problems
that the SCSI layer has had.

We'll see what the details will end up being, but I personally think
that it is a major mistake to try to have generic error handling. The
only true generic thing is "this request finished successfully / with an
error", and _no_ high-level retries etc. It's up to the driver to decide
if retries make sense.

(Often retrying _doesn't_ make sense, because the firmware on the
high-end card or disk itself may already have done retries on its own,
and high-level error handling is nothing but a waste of time and causes
the error notification to be even more delayed).

Linus

2001-12-06 17:53:33

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> Timeouts for different commands were so different that people ended up
> making most timeouts so long that they no longer made sense for other
> commands etc.

That's per _target_, not host. Which needs to be common code.

> Other device drivers have been able to handle timeouts and errors on
> their own before, and have _not_ had the kinds of horrendous problems
> that the SCSI layer has had.

Every IDE driver uses the same IDE error handling code, because otherwise
every IDE driver would have to make its own copy of it - ditto scsi.

> that it is a major mistake to try to have generic error handling. The
> only true generic thing is "this request finished successfully / with an
> error", and _no_ high-level retries etc. It's up to the driver to decide
> if retries make sense.

Retries and retry handling are target-specific, not host-specific (think
about the ton of logic you need every time your CD-ROM decides to error
a read). A read can turn into a sequence of operations while you
go and work out why it failed: ask it if it's ready, tell it to lock the
door, spin up the media, wait for it to be ready, reissue the I/O.

This processing has to be robust because scsi CD-ROMs, for example, are
rarely robust themselves.

So it's very much:

        request -> controller
                libscsi -> make me a command block
                issue command

        interrupt -> controller
                error ?
                        libscsi, recommend an action please
                        add suggested recovery to queue head
                        kick request handling

> (Often retrying _doesn't_ make sense, because the firmware on the
> high-end card or disk itself may already have done retries on its own,
> and high-level error handling is nothing but a waste of time and causes
> the error notification to be even more delayed).

Those devices aren't SCSI controllers, and they don't want to appear as one.
That's a horrible Windows NT habit that harms performance badly. Of course
everyone is now doing it with Linux because someone wouldn't provide more
major numbers.

Which is another thing - can you make the internal dev_t 32 or 64 bits now?
You can have 65536 volumes on an S/390, so even with perfectly distributed
devfs-allocated device identifiers we don't have enough.

Alan

2001-12-06 18:13:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters


On Thu, 6 Dec 2001, Alan Cox wrote:
>
> > Timeouts for different commands were so different that people ended up
> > making most timeouts so long that they no longer made sense for other
> > commands etc.
>
> Thats per _target_ not host. Which needs to be common code.

It hasn't traditionally been "common code". The old SCSI layer has various
fixed timeouts, many of them on the order of 2-5 minutes, and none of them
target-specific.

Some of them are effectively turned off - the format timeout was increased
to 2 hours to make sure that it basically never triggers.

But never fear, we'll have some common routines for error handling. But
they will be library routines, NOT the current crap.

> Those devices aren't SCSI controllers, and they don't want to appear as one.

Ehh.. IDE disks take SCSI commands, and do most error recovery entirely in
disk firmware. There is very little you can do about most errors there.

Don't think "SCSI" as in SCSI controllers. Think SCSI as in "fairly
generic packet protocol that somehow infiltrated most things".

> Which is another thing - can you make the internal dev_t 32 or 64bits now.
> You can have 65536 volumes on an S/390 so even with perfectly distributed
> devfs allocated device identifiers - we don't have enough.

It's called "struct block_device" and "struct genhd". The pointers will
have as many bits as pointers have on the architecture. Low-level drivers
will not even see anything else eventually, there will be no "numbers".

Linus

2001-12-06 18:25:55

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> Some of them are effectively turned off - the format timeout was increased
> to 2 hours to make sure that it basically never triggers.

That's scsi_generic, which thankfully puts most of the logic in user space.

> > Those devices aren't SCSI controllers, and they don't want to appear as one.
>
> Don't think "SCSI" as in SCSI controllers. Think SCSI as in "fairly
> generic packet protocol that somehow infiltrated most things".

The scsi controller is akin to a network driver. The stuff that matters is
stuff like the scsi disk, scsi cd and scsi tape drivers. Scsi disk and CD
need to do a lot of error recovery (especially CD-ROM). Disk has to as well,
because older scsi devices don't have the same kind of "the host is clueless
crap, I'll have to try error recovery myself before reporting" mentality.

It would be nice if a lot of the CD error/recovery logic could be in the
cdrom libraries because the logic (close the door, lock the door, try
half speed, ..) is the same in scsi and ide.

> It's called "struct block_device" and "struct genhd". The pointers will
> have as many bits as pointers have on the architecture. Low-level drivers
> will not even see anything else eventually, there will be no "numbers".

For those of us who want to run a standards-based operating system, can
you do the 32-bit dev_t? Otherwise some slightly fundamental things don't
work - you know, boring stuff like ls, find, df, and other standard unix
commands. Those export a dev_t cookie.

If you don't want to be able to run stuff like ls, just let me know and
I'll start another kernel tree 8)

Alan

2001-12-06 18:40:11

by Doug Ledford

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

Linus Torvalds wrote:

>In article <[email protected]>,
>Alan Cox <[email protected]> wrote:
>
>>You still need the scsi code. There are a whole sequence of common, quite
>>complex and generic functions that the scsi layer handles (in paticular
>>error handling).
>>
>
>Well, the preliminary patches already handle _some_ common things, like
>building the proper command request for reads and writes etc, and that
>will probably continue. We'll probably have to have all the old helpers
>for things like "this target only wants to be probed on lun 0" etc
>
Personally, I think it would be a mistake to have any of the low level
drivers know a thing about the ll_rw_blk.c request structs or bio
structures or similar stuff. They simply don't *need* to know those
things. Low level drivers need a command that can go to the drive and
they need an sg array. Smart hosts, aka RAID controllers and the like,
could theoretically use the higher level structs for reasonable things,
but I won't make any assumptions about which ones could or couldn't use
that information because I haven't looked into it. So, making the scsi
mid layer a helper library may be OK, but for the largest number of
drivers, it really means that the call chain would likely look something
like this:

blk_dev->request_fn()
    scsi_dispatch_request()             // map from the major/minor to driver
        driver_queue_routine(bio *)
            Scsi_Cmnd = scsi_make_cmnd(bio *);
            send_command(cmnd);         // put it in the hardware

driver_interrupt_routine()
    if (cmnd = get_errored_command()) {
        // Pick out the commands that had a transfer related error
        // and rework or error them out here
        if (retryable_error(cmnd))
            requeue_command(cmnd);
        else {
            cmnd->bio->completion_handler(cmnd->bio);
            scsi_free_command(cmnd);
        }
    }
    if (cmnd = get_completed_command()) {
        retryable = scsi_check_sense(cmnd);
        switch (retryable) {
        case TRANSIENT_ERROR:
            requeue_command(cmnd);
            break;
        case FATAL_ERROR:
        case NO_ERROR:
            cmnd->bio->completion_handler(cmnd->bio);
            scsi_free_command(cmnd);
            break;
        case BUSY:
            requeue_command_with_delay(cmnd);
            break;
        case ...
        }
    }

Personally, I don't like it. The queueing stuff is somewhat OK (but it
requires other things to be changed to accommodate it, and I don't think
those things should change), but I don't want to have to deal with
result parsing in every driver. Proper result parsing is huge and easy
to get wrong. The current code gets it wrong. There are a few ways in
which it could be fixed to do the right thing without disrupting the
world. Duplicating that in all the low level drivers will be a
nightmare though. I also don't like one aspect of putting the queueing
totally in the low level drivers. That way, all of the low level
drivers are going to have to maintain A) delayable queues with variable
delay times and undelay on completion semantics (my driver already has
this, but it's been a source of pain to maintain, it would be nice if it
didn't need it) and B) ordering requirements when multiple commands for
a device operating in untagged mode are in the queue (this one isn't so
hard, but getting it wrong means things like tape drives will store
garbage when you least expect it).

Anyway, I suspect the end result would be the same either way: drivers
would end up having control over their own flow of commands. The only
difference would be how that control would be achieved. A) by changing
the mid layer to accept more driver specific parameters (aka, even
though MegaRAID may take up to 60 seconds to complete a normal command,
the aic7xxx should never take more than 10 seconds, so the default
command timeout length would be driver specific) and to honor the low
level driver's idea of what to do on a timeout (let the low level driver
tell the mid layer what action to take, and that should be the limit of
the mid layers error handling "brains", performing those actions). Or
B) removing the mid layer from the picture all together (as much as
possible anyway, it (or a similar variant) will still likely be needed
on queueing just to map from major/minor to driver unless we change the
device allocation scheme or something) and then having the low level
driver call into helper routines.

>
>I disagree about the error handling, though.
>
>Traditionally, the timeouts and the reset handling was handled in the
>SCSI mid-layer, and it was a complete and utter disaster. Different
>hosts simply wanted so different behaviour that it's not even funny.
>
That's not really accurate. It was too opaque, sure. But for what it
was it did a decent job. The mid layer driver was trying to make
generic timeout decisions without the benefit of the low level driver's
knowledge of the current bus state. For example, 20 commands may time
out at once, but in reality, only one of those commands is probably
holding up the SCSI bus. The low level drivers can (usually) look at
their card to see which command is *really* the hold up. Then, they
could have all the other commands simply put back to sleep without doing
anything and take appropriate action against the holdup command. The
current mid layer (at least in the old error handling) couldn't do that.
The new_eh code attempts to allow drivers to tell them these things by
use of the strategy function. In reality, that's all you need. That's
the driver's ability to tell the mid layer *how* to proceed on any given
command.

>
>Timeouts for different commands were so different that people ended up
>making most timeouts so long that they no longer made sense for other
>commands etc.
>
Not accurate here either. For most commands across all controllers,
timeouts are pretty uniform (the timeout to read a CD-ROM for instance
is pretty constant). The timeouts *only* started to vary when you ran
across smart RAID controllers that were doing too much work rebuilding
things and didn't respond to your requests in a timely fashion. And
that only applies to their logical drives. It would be pretty easy to
just allow disk timeouts to be adjusted on a disk by disk basis to a
reasonable default for that controller and solve this problem.

>
>Other device drivers have been able to handle timeouts and errors on
>their own before, and have _not_ had the kinds of horrendous problems
>that the SCSI layer has had.
>
If you use the new eh strategy function (or so I understand it, if I'm
wrong here then I'll take the time to *make* myself right by changing
the code), then you are essentially allowing your driver to take control
of *what* happens, and the mid layer timeout code becomes nothing more
than a glorified timer creation, monitoring, and teardown framework that
happens to be plugged into the queueing and completion framework so it
can notice new commands and when old commands are done. Nothing more.
And that's all a SCSI driver that wants to do its own thing really
needs. Done properly, this could be a "good" thing. Done poorly,
everyone will hate it. But, the ability is there to make your driver do
what you want with the strategy call in.

>
>We'll see what the details will end up being, but I personally think
>that it is a major mistake to try to have generic error handling. The
>only true generic thing is "this request finished successfully / with an
>error", and _no_ high-level retries etc. It's up to the driver to decide
>if retries make sense.
>
Well, I disagree on this. I think asking all those drivers to deal with
sense data and keep up with changing standards and new sense returns on
new devices, etc. is just asking for a lot of out of date drivers that
do the wrong thing. Sense parsing and decision making is *device*
specific, not controller specific, and I don't think it has any place
being in the controller's responsibility list. You might as well start
asking eth drivers to not only check checksums on packets but also
protocol flags before passing the packet up to the IP or TCP layers, and
to perform all the TCP retransmit operations themselves instead of in
the TCP layer.

>
>(Often retrying _doesn't_ make sense, because the firmware on the
>high-end card or disk itself may already have done retries on its own,
>and high-level error handling is nothing but a waste of time and causes
>the error notification to be even more delayed).
>
> Linus



2001-12-06 19:04:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters


On Thu, 6 Dec 2001, Alan Cox wrote:
>
> The scsi controller is akin to a network driver. The stuff that matters is
> stuff like the scsi disk, scsi cd and scsi tape drivers. Scsi disk and CD
> need to do a lot of error recovery (especially CD-ROM).

Ok, we agree here.

The problem is that we've done things the "wrong way around". If you think
of the problem as a network controller, together with "packets" that have
SCSI commands in them, then it is clear how you should NOT have

        read/write ->
                driver IO request ->
                        SCSI layer ->
                                driver

because that is equivalent to doing TCP with

        read/write ->
                driver request ->
                        TCP layer ->
                                driver

which is bogus.

However, what's bogus about it is not that the old SCSI layer was above
the driver, but the fact that it was _below_ the "ll_rw_block" and request
queueing interface. That's the _packet_ interface. You don't do TCP or UDP
below the packet interface.

We should try to have some of the error recovery etc at a really _high_
level, preferably in user space. Especially the "complicated" cases are
hard to do any other way, as some IO errors require you to start sending
magic "unlock drive using this key" packets to the drive, and just
stupidly retrying simply will not work.

But that is not something that the SCSI layer should really care about.

> It would be nice if a lot of the CD error/recovery logic could be in the
> cdrom libraries because the logic (close the door, lock the door, try
> half speed, ..) is the same in scsi and ide.

Not CD-ROM library.

Instead, what I and Jens have been talking about, and what the next
pre-patch will actually have is to move some of the higher-level logic
_up_, to above the "packet interface".

Think of "struct request" as a packet, and think of a disk driver as
nothing but a specialized network driver.

So what do you get? Rip out all of drivers/scsi/scsi_ioctl.c, and replace
it with a much higher-level interface that parses the ioctl and passes
down the appropriate packets.

So "close door" is equivalent to an ICMP packet.

Normal read/write is TCP - we do merging, sorting, re-ordering etc, again
at a higher level. The packet that makes it to the low-level driver is
just a packet. This is the only layer that does retransmit etc.

And then you have the old "raw packet" interface, where user-level apps
can send commands down to the disk.

> For those of us who want to run a standards based operating system can
> you do the 32bit dev_t.

You asked for an _internal_ data structure. dev_t is the external
representation, and has _nothing_ to do with any drivers at all.

Linus

2001-12-06 19:10:43

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> Normal read/write is TCP - we do merging, sorting, re-ordering etc, again
> at a higher level. The packet that makes it to the low-level driver is
> just a packet. This is the only layer that does retransmit etc.

Makes sense, yes. I'm not sure how much we can push into user space before
we break backward compatibility or lose the needed info/security credentials
to take action, but it makes sense where possible.

> > For those of us who want to run a standards based operating system can
> > you do the 32bit dev_t.
>
> You asked for an _internal_ data structure. dev_t is the external
> representation, and has _nothing_ to do with any drivers at all.

The internal representation is kdev_t, which wants to turn into a pointer,
from what Aeb has been saying for a long time. A 32-bit "dev_t" is needed so
that we can label over 65536 filesystems for things like ls, regardless of how
"/dev/sdfoo" is mapped onto a driver.

I'm sure that dev_t (the cookie we feed to user space) going to 32 bits is
going to break something, and I'd rather it broke early.

Alan

2001-12-06 20:48:39

by kaih

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

[email protected] (Linus Torvalds) wrote on 06.12.01 in <[email protected]>:

> Some of them are effectively turned off - the format timeout was increased
> to 2 hours to make sure that it basically never triggers.

And I recently found out the hard way that this wasn't enough, and ended up
kludging together a utility to patch a running kernel (don't ask) to increase
that timeout. It turned out the drive needed a little over three hours to
tell me it couldn't format.

Frankly, format should really have NO timeout. Or possibly a
user-specified one.

MfG Kai

2001-12-06 20:46:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters


On Thu, 6 Dec 2001, Alan Cox wrote:
>
> The internal representation is kdev_t, which wants to turn into a pointer

No.

That kdev_t has been around for years, and is going away. In 2.6 there
will _be_ no kdev_t.

There is "struct block_device" for internal stuff, and "dev_t" for
external stuff. The first one is a real structure, the second one is just
a cookie.

Linus

2001-12-06 20:54:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters


On 6 Dec 2001, Kai Henningsen wrote:
>
> Frankly, format should really have NO timeout. Or possibly a user-
> specified one.

Well, frankly, the interface should be that the user code sends the
command it needs, and waits for it. With no policy in the kernel at all.

Now, for backwards compatibility reasons we cannot do that generically,
and some things (not format, though), may be common enough that the
"library code" to do the normal thing is in the kernel. But on the whole,
the question should not be "how long should the timeout be", but more
along the lines of "how can we make this easy to interface to existing and
new applications _without_ having policy decisions like timeouts and
number of retries".
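
The sg driver already allows roughly that model from user space today: the
application builds the CDB, picks its own timeout, and just waits for the
result. A minimal sketch using the v3 SG_IO ioctl (the device path, the
command and the timeout are arbitrary choices here, and error handling is
trimmed):

/* Issue a TEST UNIT READY from user space with a caller-chosen timeout. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
        unsigned char cdb[6] = { 0x00, 0, 0, 0, 0, 0 };  /* TEST UNIT READY */
        unsigned char sense[32];
        sg_io_hdr_t io;
        int fd = open("/dev/sg0", O_RDWR);              /* arbitrary example device */

        if (fd < 0)
                return 1;
        memset(&io, 0, sizeof(io));
        io.interface_id    = 'S';
        io.dxfer_direction = SG_DXFER_NONE;             /* no data phase */
        io.cmdp            = cdb;
        io.cmd_len         = sizeof(cdb);
        io.sbp             = sense;
        io.mx_sb_len       = sizeof(sense);
        io.timeout         = 30 * 1000;                 /* caller's policy, in milliseconds */

        if (ioctl(fd, SG_IO, &io) < 0)
                return 1;
        printf("status=0x%x host=0x%x driver=0x%x\n",
               io.status, io.host_status, io.driver_status);
        return 0;
}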

Linus

2001-12-06 22:26:42

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> On Thu, 6 Dec 2001, Alan Cox wrote:
> > The internal representation is kdev_t, which wants to turn into a pointer
>
> No.
>
> That kdev_t has been around for years, and is going away. In 2.6 there
> will _be_ no kdev_t.
>
> There is "struct block_device" for internal stuff, and "dev_t" for
> external stuff. The first one is a real structure, the second one is just
> a cookie.

Ok, so kdev_t will split into structs for char and block devices, which are
separate things?

2001-12-06 22:35:04

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> cludging a utility to patch a running kernel (don't ask) to increase that
> timeout. Turned out the drive needed a little over three hours to tell me
> it couldn't format.
>
> Frankly, format should really have NO timeout. Or possibly a user-
> specified one.

For generic packet interfaces, the "abort" operation becomes something you
can put in the hands of user space, as well as progress reporting.

2001-12-06 22:41:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters


On Thu, 6 Dec 2001, Alan Cox wrote:
>
> Ok so kdev_t will split into structs for char and block device which are
> seperate things ?

Yes. And the name will change to reflect that.

(Ie once char and block are separate, like they logically are in the
namespace anyway, there's no "dev_t" at all, it's all "struct char_device"
or "struct block_device" and they have nothing in common).

We already have pretty much all the infrastructure in place for this, it's
just that a lot of calling conventions have "kdev_t" still (which is
actually ambiguous as-is - you have to look at the function name etc to
figure out if it is a character device or a block device).

The main ones are things like "bread()" down all the way to the bottom of
the IO path. The sad thing is that along the whole path, we actually end
up needing the structure pointer in different places, so the IO code
(which is supposed to be timing-critical) ends up doing various lookups on
the kdev_t several times (both at a higher level and deep down in the IO
submit layer).

So now we have to do "bdfind()" for "kdev_t -> struct block_device", and
"get_gendisk()" for "kdev_t -> struct gendisk", and about 5 different
"index various arrays using the MAJOR number" lookups on the way to actually
doing the IO.

Even though the filesystems that want to _do_ the IO actually already have
the structure pointer available, and all the indexing off major would
actually fairly trivially just be about reading the fields off that
structure.

(Ugh, just _look_ at the code looking up block size, sector size,
"readonly" status, queue finding, statistics gathering etc. The
ro_bits thing seems to "know" that "long" is 32 bits etc. It's enough to
make you cry ;)

Oh, well. It _is_ going to be quite painful to switch things around.

Linus

2001-12-06 22:59:05

by Alexander Viro

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters



On Thu, 6 Dec 2001, Linus Torvalds wrote:

> The main ones are things like "bread()" down all the way to the bottom of
> the IO path. The sad thing is that along the whole path, we actually end
> up needing the structure pointer in different places, so the IO code
> (which is supposed to be timing-critical) ends up doing various lookups on
> the kdev_t several times (both at a higher level and deep down in the IO
> submit layer).

I have conversion patches for bread()/getblk()/get_hash_table(). Once
the bio stuff settles down I'll start feeding them to you - they are very
straightforward.

A nice side effect is the death of the buffer hash - once we have
block_device in all the places in question we can use the page hash just
fine. One level of spinlocks in buffer.c goes to hell...

If you are interested I can feed the preparation part tomorrow - it's
a matter of adding

struct buffer_head * sb_bread(struct super_block *sb, sector_t block)
{
        return bread(sb->s_dev, block, sb->s_blocksize);
}

and replacing instances of that in filesystems with this guy. That
alone reduces the number of places that call bread() by a factor of
80, IIRC. And it's an obvious cleanup that doesn't break anything and
can go into 2.4 as well as 2.5.
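
A typical call-site conversion would then look something like this
(illustrative ext2-style fragment, not a quote from the actual patch):

/* before: every filesystem digs the device and block size out of the sb */
bh = bread(inode->i_sb->s_dev, block, inode->i_sb->s_blocksize);

/* after: the wrapper takes the superblock and hides kdev_t entirely, so the
 * later switch to struct block_device * only has to touch sb_bread() itself */
bh = sb_bread(inode->i_sb, block);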

The same goes for getblk() and get_hash_table(). After that there is a payload
part of the patch - the switch to struct block_device *, which is now available
to all callers - the sb_...() callers have it in sb->s_bdev and the rest also
have a place to get it from. And that I'd rather postpone until bio is stable.

However, that part is several orders of magnitude smaller than the entire
patch - most is the conversion above. So if you want it - I can do it
as soon as I get some sleep; last version of that patch is against 2.4.12,
it's split into edible chunks and it's not hard to update. Comments?

2001-12-07 10:24:56

by Martin Dalecki

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> > It's called "struct block_device" and "struct genhd". The pointers will
> > have as many bits as pointers have on the architecture. Low-level drivers
> > will not even see anything else eventually, there will be no "numbers".
>
> For those of us who want to run a standards based operating system can
> you do the 32bit dev_t. Otherwise some slightly fundamental things don't
> work. You know boring stuff like ls, find, df, and other standard unix
> commands. Those export a dev_t cookie.

I don't think this is what Linus was talking about. The current problem
is that at many places the drivers (not the generic layer) know too
much about stuff which should be handled entirely on the generic device
type layer. And changing this is actually a *prerequisite* to changing
the type of dev_t.

For example please grep for the MINOR() macro in the scsi layer...
Most of the places where it's used should be replaced by a simple
driver instance enumerator. I did this once already, so this is for
sure.

> If you don't want to be able to run stuff like ls, just let me know and
> I'll start another kernel tree 8)
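
(A schematic of the MINOR()-to-enumerator change described above, with
invented names -- this is not the real scsi code, just the shape of it:)

struct example_disk {
        int instance;                   /* 0, 1, 2, ... in probe order */
        /* ... driver state ... */
};

static struct example_disk *example_disks[16];

/* old style: recover "which unit" from the minor on every call */
static struct example_disk *example_lookup(kdev_t dev)
{
        return example_disks[MINOR(dev) >> 4];  /* 16 partitions per unit */
}

/* new style: callers that already hold the per-device structure just read
 * disk->instance; the minor-number arithmetic disappears from the driver. */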

2001-12-07 10:28:56

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> > For those of us who want to run a standards based operating system can
> > you do the 32bit dev_t. Otherwise some slightly fundamental things don't
> > work. You know boring stuff like ls, find, df, and other standard unix
> > commands. Those export a dev_t cookie.
>
> I don't think this is what Linus was talking about. The current problem

Linus wasn't talking about what I was talking about. Problem the other way
around 8)

> For example please grep for the MINOR() macro in the scsi layer...
> Most of the places where it's used should be replaced by a simple
> driver instance enumerator. I did this once already, so this is for
> sure.

It becomes block_device->instance or ->minor.

major/minors for old stuff still end up leaking into user space and
mattering there. I'm not sure the best option for that

2001-12-07 11:07:14

by Martin Dalecki

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

Alan Cox wrote:

> > For example please grep for the MINOR() macro in the scsi layer...
> > Most of the places where it's used should be replaced by a simple
> > driver instance enumerator. I did this once already, so this is for
> > sure.
>
> it become block_device->instance or ->minor

Well, if all the information those functions need were already in place
in block_device, it could become as easy as just passing &block_device
there. However, please note that replacing kdev_t in the scsi layer
with just passing the minor can be done already *now* without
any pain. The same applies to the excessive MINOR lookups in the
v4l code. I did this already some time ago (the patch was posted here
about a year ago).

> major/minors for old stuff still end up leaking into user space and
> mattering there. I'm not sure the best option for that

That's no problem. But they should be used as hash values only at the
syscall implementation level and nowhere else.

--
- phone: +49 214 8656 283
- job: eVision-Ventures AG, LEV .de (MY OPINIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R

2001-12-07 12:00:02

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> > major/minors for old stuff still end up leaking into user space and
> > mattering there. I'm not sure the best option for that
>
> That's no problem. But they should be used as hash values only at the
> syscall implementation level and nowhere else.

We have apps that "know" about specific major/minors that need changing and
will take time - also some of them are closed source so unfixable.

For new stuff that bit isn't an issue, although ioctl overlaps mean we have
some other problems to worry about there.

2001-12-07 20:51:19

by Erik Andersen

[permalink] [raw]
Subject: On re-working the major/minor system

On Fri Dec 07, 2001 at 12:08:35PM +0000, Alan Cox wrote:
> > > major/minors for old stuff still end up leaking into user space and
> > > mattering there. I'm not sure the best option for that
> >
> > That's no problem. But they should be used as hash values only at the
> > syscall implementation level and nowhere else.
>
> We have apps that "know" about specific major/minors that need changing and
> will take time - also some of them are closed source so unfixable.

Right. Tons of apps have illicit insider knowledge of kernel
major/minor representation and NEED IT to do their job. Try
running 'ls -l' on a device node. Wow, it prints out major and
minor number. You can pack up a tarball containing all of /dev,
so tar has to have insider major/minor knowledge too -- as does
the structure of every existing tarball! Check out, for example,
Section 10.1.1 (page 210) of the IEEE Std. 1003.1b-1993 (POSIX)
and you will see every tarball in existence stores 8 chars for
the major, and 8 chars for the minor....

So we have POSIX, ls, tar, du, mknod, and mount and tons of other
apps all with illicit insider knowledge of what a dev_t looks
like. A couple of months ago I patched up mkfs.jffs2 so it
could create device nodes on the target filesystem that don't
really exist in the source directory (avoids the need to be root
when building filesystems).

Right now, you will find that a zillion user space apps currently
have little snippets of code looking like:

/* FIXME: MKDEV uses illicit insider knowledge of kernel
* major/minor representation... */
#define MINORBITS 8
#define MKDEV(ma,mi) (((ma) << MINORBITS) | (mi))

Currently, to do pretty much anything nifty related to devices
in userspace, userspace has to peek under the kernel's skirt to
know how to change a major and minor number into a dev_t and/or
to sanely populate a struct stat.

To change things, we 1) need some sort of sane interface by which
userspace can refer sensibly to devices without resorting to evil
illicit macros and 2) we certainly need some sort of a static
mapping such that existing devices end up mapping to the same
thing they always did or 3) we will need a flag day where we say
that all pre-2.5.x created tarballs and user space apps are
declared broken...

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2001-12-07 21:22:29

by H. Peter Anvin

[permalink] [raw]
Subject: Re: On re-working the major/minor system

Followup to: <[email protected]>
By author: Erik Andersen <[email protected]>
In newsgroup: linux.dev.kernel
>
> Right. Tons of apps have illicit insider knowledge of kernel
> major/minor representation and NEED IT to do their job. Try
> running 'ls -l' on a device node. Wow, it prints out major and
> minor number. You can pack up a tarball containing all of /dev,
> so tar has to have insider major/minor knowledge too -- as does
> the structure of every existing tarball! Check out, for example,
> Section 10.1.1 (page 210) of the IEEE Std. 1003.1b-1993 (POSIX)
> and you will see every tarball in existence stores 8 chars for
> the major, and 8 chars for the minor....
>

Actually, it's not "tons of apps", it's in the C library itself.

These things are defined in <sys/sysmacros.h> and anyone who uses
anything else should be taken out and shot.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>
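
(For the record, the portable userspace spelling -- plain C, using the
glibc-provided helpers from <sys/sysmacros.h>; the 8,3 value is purely
illustrative:)

#include <stdio.h>
#include <sys/types.h>
#include <sys/sysmacros.h>      /* major(), minor(), makedev() */

int main(void)
{
        dev_t dev = makedev(8, 3);      /* illustrative: traditionally sda3 */

        printf("major=%u minor=%u\n",
               (unsigned) major(dev), (unsigned) minor(dev));
        return 0;
}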

2001-12-07 21:55:58

by Erik Andersen

[permalink] [raw]
Subject: Re: On re-working the major/minor system

On Fri Dec 07, 2001 at 01:21:58PM -0800, H. Peter Anvin wrote:
> Followup to: <[email protected]>
> By author: Erik Andersen <[email protected]>
> In newsgroup: linux.dev.kernel
> >
> > Right. Tons of apps have illicit insider knowledge of kernel
> > major/minor representation and NEED IT to do their job. Try
> > running 'ls -l' on a device node. Wow, it prints out major and
> > minor number. You can pack up a tarball containing all of /dev,
> > so tar has to have insider major/minor knowledge too -- as does
> > the structure of every existing tarball! Check out, for example,
> > Section 10.1.1 (page 210) of the IEEE Std. 1003.1b-1993 (POSIX)
> > and you will see every tarball in existence stores 8 chars for
> > the major, and 8 chars for the minor....
> >
>
> Actually, it's not "tons of apps", it's in the C library itself.

The C library, and the POSIX standard, etc, etc.

> These things are defined in <sys/sysmacros.h> and anyone who uses
> anything else should be taken out and shot.

Ok, so we go through, change sys/sysmacros.h, tar.h, cpio.h, and
any other offending header file. And guess what? Not only has
nothing changed (since those are macros, not functions), but you
just broke every older .deb and .rpm in existence on your updated
system.

In sys/sysmacros.h it defines major() and minor() as macros, so
just dropping in an updated C library binary isn't going to do
squat until all of userspace gets recompiled. And tar.h and
cpio.h define long standing (well over 10 years now) binary
structures. We can't just go changing this stuff, since once
a dev_t is some magic cookie, if I go to install something from
my old Debian 1.2 CD or my old RedHat 4.0 CD, my system will puke
trying to install using cookies that in fact are old 8/8 split
device nodes and not cookies at all.

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2001-12-07 22:05:34

by H. Peter Anvin

[permalink] [raw]
Subject: Re: On re-working the major/minor system

Erik Andersen wrote:

>
> Ok, so we go through, change sys/sysmacros.h, tar.h, cpio.h, and
> any other offending header file. And guess what? Not only has
> nothing changed (since those are macros, not functions), but you
> just broke every older .deb and .rpm in existence on your updated
> system.
>
> In sys/sysmacros.h it defines major() and minor() as macros, so
> just dropping in an updated C library binary isn't going to do
> squat until all of userspace gets recompiled. And tar.h and
> cpio.h define long standing (well over 10 years now) binary
> structures. We can't just go changing this stuff, since once
> a dev_t is some magic cookie, if I go to install something from
> my old Debian 1.2 CD or my old RedHat 4.0 CD, my system will puke
> trying to install using cookies that in fact are old 8/8 split
> device nodes and not cookies at all.
>


It's clear a painful change is needed. **We don't have a choice.**
However, the fewer places we have to make source code changes the better.

What we agreed upon when this was discussed last year was the following:

dev_t is extended to a 12:20 split (32-bit size). I personally would rather
have seen a 64-bit size (32:32) but was outvoted :(

New major 0 is reserved, except that dev_t == 0 remains the code for "no
device". The unnamed device major becomes major 256.

If (dev_t & ~0xFFFF) == 0, the dev_t is interpreted as an old-format
dev_t, and is interpreted according to the following algorithm:

if ( dev && (dev & ~0xFFFF) == 0 ) {
        major = (dev >> 8) ? (dev >> 8) : 256;
        minor = dev & 0xFF;
} else {
        major = dev >> 20;
        minor = dev & 0xFFFFF;
}

-hpa
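
(The encoding direction would presumably be the mirror image of the decode
above -- a sketch only, assuming the 12:20 split and the rule that small
major/minor pairs keep their old 8:8 spelling so existing dev_t values
round-trip unchanged; this is not an agreed interface:)

static unsigned int mkdev_1220(unsigned int major, unsigned int minor)
{
        if (major == 256 && minor && minor <= 0xFF)
                return minor;                   /* old "unnamed" spelling */
        if (major && major <= 0xFF && minor <= 0xFF)
                return (major << 8) | minor;    /* old 8:8 spelling */
        return (major << 20) | minor;           /* new 12:20 spelling */
}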




2001-12-07 23:07:52

by Erik Andersen

[permalink] [raw]
Subject: Re: On re-working the major/minor system

On Fri Dec 07, 2001 at 02:04:42PM -0800, H. Peter Anvin wrote:
>
> It's clear a painful change is needed. **We don't have a choice.**
> However, the fewer places we have to make source code changes the better.

Sure. I'm not arguing against the change. Just making sure
everyone 100% understands that we have just thrown away any prayer of
binary compatibility with anything less than 2.5.x....

But let's look on the bright side though. Since we are going to
be having a flag day _anyways_ we may as well make the most of
it. I can think of 20 things off the top of my head that are
being retained in the name of binary compatibility that can easily
move to the trash bucket. :)

For example, I would _love_ for Linux to standardize syscall
numbers across all architectures, guarantee that userspace gets
the exact same stack setup for all arches, we might as well fixup
proc, etc, etc, etc.


> What we agreed upon when this was discussed last year was the following:
>
> dev_t is extended to a 12:20 split (32-bit size). I personally would rather
> have seen a 64-bit size (32:32) but was outvoted :(
>
> New major 0 is reserved, except that dev_t == 0 remains the code for "no
> device". The unnamed device major becomes major 256.
>
> If (dev_t & ~0xFFFF) == 0, the dev_t is interpreted as an old-format
> dev_t, and is interpreted according to the following algorithm:
>
> if ( dev && (dev & ~0xFFFF) == 0 ) {
> major = (dev >> 8) ? (dev >> 8) : 256;
> minor = dev & 0xFF;
> } else {
> major = dev >> 20;
> minor = dev & 0xFFFFF;
> }

That works, and should prevent most major problems. Hmm. At
least for cpio there are 6 chars worth of device info in there,
so we could easily go to 48 bits without RPM problems. Or redhat
could fix rpm to use tarballs like debs do, and then we could go
to 64 bit devices no problem.

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2001-12-08 00:04:55

by H. Peter Anvin

[permalink] [raw]
Subject: Re: On re-working the major/minor system

Erik Andersen wrote:

> On Fri Dec 07, 2001 at 02:04:42PM -0800, H. Peter Anvin wrote:
>
>>It's clear a painful change is needed. **We don't have a choice.**
>>However, the fewer places we have to make source code changes the better.
>>
>
> Sure. I'm not arguing against the change. Just making sure
> everyone 100% understands that we have just thrown away any prayer of
> binary compatibility with anything less than 2.5.x....
>
> But let's look on the bright side though. Since we are going to
> be having a flag day _anyways_ we may as well make the most of
> it. I can think of 20 things off the top of my head that are
> being retained in the name of binary compatibility that can easily
> move to the trash bucket. :)
>
> For example, I would _love_ for Linux to standardize syscall
> numbers across all architectures, guarantee that userspace gets
> the exact same stack setup for all arches, we might as well fixup
> proc, etc, etc, etc.
>


Not going to happen. Linux deliberately chose against that, because in
Linux, syscall numbers are generally (except x86) compatible with the
dominant vendor Unix on the platform.

>
> That works, and should prevent most major problems. Hmm. At
> least for cpio there are 6 chars worth of device info in there,
> so we could easily go to 48 bits without RPM problems. Or redhat
> could fix rpm to use tarballs like debs do, and then we could go
> to 64 bit devices no problem.
>


The big stumbling block seems to be NFSv2.

-hpa


2001-12-08 01:51:21

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

From: Alan Cox <[email protected]>

> > For those of us who want to run a standards based operating system can
> > you do the 32bit dev_t.
>
> You asked for an _internal_ data structure. dev_t is the external
> representation, and has _nothing_ to do with any drivers at all.

The internal representation is kdev_t, which wants to turn into a pointer
from what Aeb has been saying for a long time.

Yes and no. If I am not mistaken there are three details:

(i) Linus prefers to separate block and character devices.
I agree that that makes the code a bit cleaner, but dislike
the code duplication: the interface to user space, the allocation,
deallocation, registering is completely identical for the two.
But apparently Linus does not mind a little bloat if that avoids
an ugly cast in two or three places.

(ii) So, we split kdev_t into kbdev_t and kcdev_t.
Al (and/or Linus) baptizes the struct that a kbdev_t is pointing at
"struct block_device". I usually had a two-layer version, with
device_struct and driver_struct, while struct genhd disappeared.
Don't know whether Al has similar ideas.
The current struct block_device is an ordered pair (dev_t, ops *)
and does not seem to give easy access to the partitions, so maybe Al
still has to reshuffle things a bit, or add a pointer to a struct genhd.
We'll see.

(iii) Over the past months Al has been nibbling away a little at the road
that makes kdev_t (or kbdev_t or so) a pointer to a device_struct.
Instead it looks like he wants to construct a parallel and equivalent
road starting from the already present basis for a struct block_device.

So, yes, internally we'll have a pointer. No, it doesn't look like
the name of the pointer will be kdev_t.

No doubt Linus or Al or somebody will correct me if the above is all wrong.


A 32bit "dev_t" is needed so that we can label over 65536 file systems
to things like ls, regardless of how
"/dev/sdfoo" is mapped onto a driver

I'm sure that dev_t (the cookie we feed to user space) going to 32bits is
going to break something and I'd rather it broke early

Yes, that is an entirely independent matter.
User space uses a 64bit cookie today, and the kernel throws away
three quarters of that. Very little breaks if the kernel throws away less.

[As you know I like a large dev_t, and Linus hated it before he understood
the use of a large dev_t. (For example, he worried that an "ls" would take
many centuries.) Don't know about current opinions. Such a lot of nice
applications: use any device description you like, take a cryptographic
hash and have a device number. Or, generate a new anonymous device by
incrementing a counter. Or, support full NFS.
It would really be a pity to go only to 32 bits. Indeed, 32 bits is
large but not large enough to be collision-free for random assignments,
so one would need a registry of numbers. With a much larger device
number the registry is superfluous.]

Andries

2001-12-08 03:43:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

Followup to: <[email protected]>
By author: [email protected]
In newsgroup: linux.dev.kernel
>
> Yes and no. If I am not mistaken there are three details:
>
> (i) Linus prefers to separate block and character devices.
> I agree that that makes the code a bit cleaner, but dislike
> the code duplication: the interface to user space, the allocation,
> deallocation, registering is completely identical for the two.
> But apparently Linus does not mind a little bloat if that avoids
> an ugly cast in two or three places.
>

I don't understand why you can't share this code. The main reason for
having different types is so you don't mix them up -- they are
separate namespaces, and shouldn't be mixed up. Having them be
different types makes the compiler enforce this.

If we were using C++ we could make a base class which contained the
common stuff. As it is, perhaps a substructure would do it.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-12-08 11:34:13

by Alan

[permalink] [raw]
Subject: Re: On re-working the major/minor system

> > That works, and should prevent most major problems. Hmm. At
> > least for cpio there are 6 chars worth of device info in there,
> > so we coule easily go to 48 bits without RPM problems. Or redhat
> > could fix rpm to use tarballs like debs do, and then we could go

RPM can't easily use tarballs. Too much of a tarball isn't rigidly defined, so
you can't reliably sign one cryptographically.

> > to 64 bit devices no problem.
>
> The big stumbling block seems to be NFSv2.

Well 2.5 isn't going to be able to support NFS without a magic
daemon-maintained translation table - so that when the kernel randomly changes the
major/minor number of an exported file system (eg a USB reconnect or even plain
boring shutdown/reboot) it can keep consistent file handles.

If you have a file handle table surely you can remap every NFS file handle
through that down to 32bits. For device files the problem doesn't matter
because at the kernel meeting Linus said those were going to change in a way
that meant devices over NFS are a lost cause and clients would have to use
devfs

Alan


2001-12-08 17:27:46

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

From: Linus Torvalds <[email protected]>

The sad thing is that along the whole path, we actually end
up needing the structure pointer in different places, so the IO code
(which is supposed to be timing-critical) ends up doing various lookups on
the kdev_t several times (both at a higher level and deep down in the IO
submit layer).

So now we have to do "bdfind()" *kdev_t -> block_device", and
"get_gendisk()" for "kdev_t -> struct gendisk" and about 5 different
"index various arrays using the MAJOR number" on the way to actually doing
the IO.

Even though the filesystems that want to _do_ the IO actually already have
the structure pointer available, and all the indexing off major would
actually fairly trivially just be about reading off the fields off that
structure.

Oh, well. It _is_ going to be quite painful to switch things around.

I don't understand at all. It is not painful at all.
Things are completely straightforward.

A kdev_t is a pointer to all information needed, nowhere a lookup,
except at open time.

You make it kbdev_t, and then call it struct block_device *.
OK, the name doesn't matter as long as the struct it points to has all
information needed. In my version that is the case, and I would
be rather surprised if it were otherwise in Al's version.

The changes are only of the easy, provably correct, mechanical kind.
Boring work, and a bit slow - each step requires a grep over the
kernel source and there are about a hundred steps.

I am sure also Al will tell you that there is no problem.

Andries

2001-12-08 17:55:56

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: On re-working the major/minor system

From: Erik Andersen <[email protected]>

So we have POSIX, ls, tar, du, mknod, and mount and tons of other
apps all with illicit insider knowledge of what a dev_t looks
like.

Currently, to do pretty much anything nifty related to devices
in userspace, userspace has to peek under the kernel's skirt to
know how to change a major and minor number into a dev_t and/or
to sanely populate a struct stat.

To change things, we 1) need some sort of sane interface by which
userspace can refer sensibly to devices without resorting to evil
illicit macros and 2) we certainly need some sort of a static
mapping such that existing devices end up mapping to the same
thing they always did or 3) we will need a flag day where we say
that all pre-2.5.x created tarballs and user space apps are
declared broken...

No flag day required. These things have been discussed
many times already, and there are easy and good solutions.

Code like

dev_t dev;
u64 d = dev;
int major, minor;

if (d & ~0xffffffffULL) {               /* wider than 32 bits */
        major = (d >> 32);
        minor = (d & 0xffffffff);
} else if (d & ~0xffffULL) {            /* wider than 16 bits */
        major = (d >> 16);
        minor = (d & 0xffff);
} else {                                /* old 8:8 format */
        major = (d >> 8);
        minor = (d & 0xff);
}

will handle dev_t fine, regardless of whether it is 16, 32 or 64 bits.
You see that change of the size of dev_t does not change the values
of major and minor found in your tarballs.
We already use such code for isofs.

Andries
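
(A quick userspace check of that claim -- an old 8:8 value decodes to the
same major/minor no matter how wide dev_t becomes, because the wider
branches never trigger for small values; the 0x0341 example is illustrative:)

#include <stdio.h>
#include <stdint.h>

static void split(uint64_t d, unsigned *major, unsigned *minor)
{
        if (d & ~0xffffffffULL) {
                *major = d >> 32;  *minor = d & 0xffffffff;
        } else if (d & ~0xffffULL) {
                *major = d >> 16;  *minor = d & 0xffff;
        } else {
                *major = d >> 8;   *minor = d & 0xff;
        }
}

int main(void)
{
        unsigned major, minor;

        split(0x0341, &major, &minor);          /* traditionally hdb1 */
        printf("%u:%u\n", major, minor);        /* prints 3:65 */
        return 0;
}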

2001-12-08 20:38:29

by H. Peter Anvin

[permalink] [raw]
Subject: Re: On re-working the major/minor system

Alan Cox wrote:

>>>That works, and should prevent most major problems. Hmm. At
>>>least for cpio there are 6 chars worth of device info in there,
>>>so we could easily go to 48 bits without RPM problems. Or redhat
>>>could fix rpm to use tarballs like debs do, and then we could go
>>>
>
> RPM can't easily use tarballs. Too much of a tarball isn't rigidly defined, so
> you can't reliably sign one cryptographically.
>


Why does that matter? You're signing a *specific instance* of tar, not
the generic format.


>
>>>to 64 bit devices no problem.
>>>
>>The big stumbling block seems to be NFSv2.
>
> Well 2.5 isn't going to be able to support NFS without a magic
> daemon-maintained translation table - so that when the kernel randomly changes the
> major/minor number of an exported file system (eg a USB reconnect or even plain
> boring shutdown/reboot) it can keep consistent file handles.
>
> If you have a file handle table surely you can remap every NFS file handle
> through that down to 32bits. For device files the problem doesn't matter
> because at the kernel meeting Linus said those were going to change in a way
> that meant devices over NFS are a lost cause and clients would have to use
> devfs
>


Yeah, I know what Linus said at the kernel summit. As far as I could
tell he rejected anything that seemed like a sensible approach from here
to there, but that's just my $0.02...

-hpa


2001-12-09 04:29:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters


On Sat, 8 Dec 2001 [email protected] wrote:
>
> Oh, well. It _is_ going to be quite painful to switch things around.
>
> I don't understand at all. It is not painful at all.
> Things are completely straightforward.
>
> A kdev_t is a pointer to all information needed, nowhere a lookup,
> except at open time.

No.

I refuse to have the same structure for block devices and character
devices. We already know that they are different.

Which means that I _will_ rename the thing.

Which means that the patch will be straightforward, but painful.

> I am sure also Al will tell you that there is no problem.

To me, touching a few hundred files, even if it's almost a
search-and-replace operation is always painful. Much more painful than
touching just one subsystem..

Linus

2001-12-09 05:50:02

by Alexander Viro

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters



On Sat, 8 Dec 2001, Linus Torvalds wrote:

> > I am sure also Al will tell you that there is no problem.

<raised brows> What gave you such impression? IIRC, I've described
the problems several months ago. Three words: object freeing policy.

> To me, touching a few hundred files, even if it's almost a
> search-and-replace operation is always painful. Much more painful than
> touching just one subsystem..

Less painful, in that case.

kdev_t will have to stay what it is right now. We _still_ don't have
clear lifetime logic for char_device (BTW, the patches for a long-living
cache on the bdev side might be worth resurrecting).

We can switch to _use_ of pointers to block_device and char_device, but
we can't turn kdev_t into such a pointer and expect it to work correctly.

So this "touching just one subsystem" would involve a shitload of very
tricky audits of said few hundred files. I'll take straightforward
search-and-replace job over that any day, thank you very much.

As for kdev_t... Eventually it will die out. Arguments about code
duplication on maintaining caches are BS - especially since we will
end up with different allocation/freeing rules and thus almost definitely
different locking.

IOW, for what I care the strategy for 2.5 should be:
* reduce the number of places where we pass kdev_t around while there
is a better alternative. E.g. for bread() and friends in filesystems it's
definitely struct super_block.
* supply struct block_device * to the rest of places where we
are passing block devices around.
* switch i_rdev to dev_t.
* kill uses of kdev_t for block devices completely.

Character devices are both simpler and harder - we have fewer places using
them, but we have no clean release point due to the games played by
subsystems that replace ->f_op on a live struct file.

Andries, believe me, I understand the attractiveness of "let's use smart
macros to hide the complexity of change", but that will do us no good since
the real complexity will bite us anyway when we start chasing dangling pointers
and having fun with races.

The fundamental reason why we can't replace kdev_t with pointer and hope
to survive is that YOU DON'T FREE NUMBERS. Integer is an integer - it's
always valid. We will need to free the structures and _that_ is where the
problems will start.

Witness the fun with iput() changes around 2.4.15 - they were needed exactly
because we used to be sloppy and left inodes in cache for too long; after
the ->i_sb was garbage. That had been fixed - we finally have sane rules
for inode lifetime and these rules guarantee that we won't have junk floating
around.

For kdev_t the situation is much more complex than for superblocks. More
objects referring to them, different locking situation for these objects,
etc.

Blindly turning kdev_t into a pointer to a dynamically allocated structure
is a recipe for a huge fuckup. Sorry, but we'll have to do honest work.

2001-12-09 09:00:34

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

From: Alexander Viro <[email protected]>

> > I am sure also Al will tell you that there is no problem.

<raised brows> What gave you such impression? IIRC, I've described
the problems several months ago. Three words: object freeing policy.

The fundamental reason why we can't replace kdev_t with pointer and hope
to survive is that YOU DON'T FREE NUMBERS. Integer is an integer - it's
always valid. We will need to free the structures and _that_ is where the
problems will start.

Yes, you are quite right, this is a difficulty, more serious
than the bdev/cdev distinction Linus mentions.

But for me the difficulty is far away.

Let me once more sketch the mechanical change.

Part 1: Invent some random structures, to be changed when needed,
that contain all data we want to refer to via our pointer.
Since the procedure was supposed to be mechanical, take
the arrays indexed today by major or major,minor and make
their contents fields in these structs.

Work to do: global search and replace of
blk_size[MAJOR(dev)][MINOR(dev)]
by
dev->size
(possibly with a shift: I was going to bytes instead of blocks;
possibly with an inline function
get_size(dev)
so that changing the setup of these structs later is easier).

Part 2: These structures have to be allocated. Let the allocating
happen in the same place where the arrays like blk_size[][] are
initialized today.

Part 3: These structures have to be found, given a dev_t.
Use a hash table.
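
(In code, Parts 1-3 come down to something like the following; all the names
are invented for the sketch and locking is omitted:)

struct devinfo {
        unsigned int    dev;            /* the (possibly large) device number */
        loff_t          size;           /* was blk_size[MAJOR(dev)][MINOR(dev)] */
        int             read_only;      /* was the ro_bits bitmap */
        struct devinfo  *next;          /* hash chain */
};

#define DEVHASH_SIZE 256
static struct devinfo *devhash[DEVHASH_SIZE];

static inline unsigned int devhash_fn(unsigned int dev)
{
        return (dev ^ (dev >> 8)) % DEVHASH_SIZE;
}

/* Part 3: lookup, done once at open time */
static struct devinfo *devinfo_find(unsigned int dev)
{
        struct devinfo *p;

        for (p = devhash[devhash_fn(dev)]; p; p = p->next)
                if (p->dev == dev)
                        return p;
        return NULL;
}

/* Part 2: allocation, in the place where blk_size[] used to be set up;
 * note that nothing here is ever freed */
static struct devinfo *devinfo_get(unsigned int dev)
{
        struct devinfo *p = devinfo_find(dev);

        if (!p) {
                p = kmalloc(sizeof(*p), GFP_KERNEL);
                if (!p)
                        return NULL;
                memset(p, 0, sizeof(*p));
                p->dev = dev;
                p->next = devhash[devhash_fn(dev)];
                devhash[devhash_fn(dev)] = p;
        }
        return p;
}

/* Part 1: call sites become a field read behind a trivial inline */
static inline loff_t get_size(struct devinfo *dev)
{
        return dev->size;
}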

Now you see no refcounting, and no freeing.
But my point is that that does not matter.
At least not at first.

I have run for months with systems like this, and typically saw
2000 or so such structures allocated. But they are small structures,
a few dozen bytes, nobody cares - at first.

Result of the mechanical change: a system without arrays,
with large device numbers, so that people can have ten thousand
SCSI disk partitions, should they want to.

In other words, two problems are solved: the arrays are gone,
and the device numbers no longer live in this cramped space.


Yes, now you want, and I want, to go further.
As long as these structs are not located in memory that goes away,
and do not contain pointers that point to stuff that goes away
when a module is unloaded it does not matter much that they are
never freed. But in the long run we of course want to free all
that is allocated. So, later we must audit what happens to them.
I can say more about that, but our difference is that it is your
first worry and my last worry.

(Roughly speaking the situation is still as I ordained six years ago:
things of type kdev_t only live in ROOT_DEV, inode->i_dev, inode->i_rdev,
sb->s_dev, bh->b_dev, req->rq_dev, tty->device.
We change inode->i_rdev back to a dev_t.
One does not want to free the struct upon the last close; soon there
will be an open again. One only wants to free the struct when the module
is unloaded, or perhaps when it is certain that the device will never
be used again, like in my version with 40-bit anonymous device numbers
that are never reused. So, inode, sb, bh, req, tty belonging to a
module that is unloaded must be freed. But we wanted that already,
also without device structs.)

[There is more to say, but I have to go, and maybe you and Linus
can start telling me why this mechanical approach is silly.
Hope to be back twelve hours from now.]

Andries

2001-12-09 15:11:34

by kaih

[permalink] [raw]
Subject: Re: On re-working the major/minor system

[email protected] (Erik Andersen) wrote on 07.12.01 in <[email protected]>:

> The C library, and the POSIX standard, etc, etc.

I think you'll find that there is *NOTHING* in either the C standard,
POSIX, or the Austin future-{POSIX,UNIX} standard that knows about major
or minor numbers.

MfG Kai

2001-12-09 21:37:51

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: On re-working the major/minor system

From: [email protected] (Kai Henningsen)

> The C library, and the POSIX standard, etc, etc.

I think you'll find that there is *NOTHING* in either the C standard,
POSIX, or the Austin future-{POSIX,UNIX} standard that knows about major
or minor numbers.

The Austin draft turned into POSIX 1003.1-2001 yesterday or so.

There is not much, but a few traces can be found.
For example, the pax archive format uses 8-byte devmajor and devminor fields.

(But to reassure others: no, this standard does not specify
major and minor in ls output, but just says
"If the file is a character special or block special file, the size of
the file may be replaced with implementation-defined information
associated with the device in question.")

Andries


2001-12-09 21:58:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: On re-working the major/minor system

Followup to: <[email protected]>
By author: [email protected] (Kai Henningsen)
In newsgroup: linux.dev.kernel
>
> > The C library, and the POSIX standard, etc, etc.
>
> I think you'll find that there is *NOTHING* in either the C standard,
> POSIX, or the Austin future-{POSIX,UNIX} standard that knows about major
> or minor numbers.
>

It's not "future" anymore... Austin is now IEEE 1003.1-2001 and thus
the new POSIX standard.

Anyway, look for things like tar, cpio, ISO 9660 and that class of
standards.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-12-10 16:49:55

by Alexander Viro

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters



On Sun, 9 Dec 2001 [email protected] wrote:

> [There is more to say, but I have to go, and maybe you and Linus
> can start telling me why this mechanical approach is silly.

Basically you propose to take the current system, replace it with
something without clear memory management ("let it leak") and then
try to fix the resulting mess.

I would rather switch code that uses kdev_t to use of dynamically
allocated structures. Subsystem-by-subsystem. Keeping decent
memory management on every step.

It's _way_ easier than trying to fix leaks and dangling pointers in
the fuzzy code we'd get with your approach. Just look at the fun
Richard has with devfs right now.

It's easier to convert the nth piece when you have n-1 done right and nothing
else using the objects in question. Putting the whole thing together
first and then trying to fix it will be a living horror compared to that.

2001-12-10 17:01:47

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> It's _way_ easier than trying to fix leaks and dangling pointers in
> the fuzzy code we'd get with your approach. Just look at the fun
> Richard has with devfs right now.

And it means we can get proper refcounting. Which as the maintainer of
two block drivers that support dynamic volume create/destroy is remarkably
good news.

Alan

2001-12-10 19:36:46

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

From [email protected] Mon Dec 10 17:50:02 2001

Basically you propose to take the current system, replace it with
something without clear memory management ("let it leak") and then
try to fix the resulting mess.

Al - you are using debating tricks instead of logic, using
negative words ("unclear", "leak", "mess") instead of arguments.
Maybe you are unable to refute the soundness of the system I propose?

It is quite possible that I overlook some detail.
On the other hand, I have been running these systems.
You are not able to convince me that something is wrong
just by handwaving. Real arguments are required.


What I do is go from the present situation, in a series of steps,
to a new situation where the source looks different but the
system behaves provably the same. Consequently, no "fixing"
is required. "Mess" is a matter of taste, I'll not discuss that
except by saying that I vastly prefer the situation without arrays.
"Leak" is false. "Dangling pointers" is false.

Andries


[About "leak": What happens today is that a driver like sd.c
allocates arrays and fills them. In my version this driver
allocates structures and fills them. When the module is removed,
today the arrays are freed. In my version the structures are
freed at that point. So, no leakage occurs.
About "dangling pointers": The correctness condition for this
scheme is that no struct that contains kdev_t fields survives
removal of the module.
It seems to me that that is true already, and in any case will
be easy to ensure. If you have other opinions, please come
with explicit examples where fundamental problems would occur.]

[and, Linus, the name of the beast makes no difference; kdev_t
or kbdev_t or struct block_device *; it is the same amount of work]

2001-12-10 19:52:46

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

From [email protected] Mon Dec 10 18:01:03 2001

And it means we can get proper refcounting. Which as the maintainer of
two block drivers that support dynamic volume create/destroy is remarkably
good news.

You say this as if that would be a difference between the two
approaches. I don't think it is.

My goal was: allow large device numbers.
The subgoal: get rid of the arrays since these do not allow large indices.
The approach: make kdev_t a pointer to some random structure.

Now that I have achieved my goal, if you come along and want
refcounting, it seems to me that all I have to do is add a field
refcount to this struct, and have xget() and xput() routines
increase or decrease this number.

Maybe you are confused because usually one has a structure
that keeps track of all references to itself, so that the structure
can be freed when the number drops to zero. I do not need such refcounting
for a kdev_t, but it is very easy to keep track of the number of openers,
the number of inodes, or whatever you would like to count.
After all, anything you do with the device gets called with
a kdev_t argument, so nothing is easier than having open() increase
and close() decrease some field.

Andries
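
(A sketch of that counting -- the xget/xput names come from the paragraph
above, everything else is invented; note that the count reaching zero frees
nothing here, which is exactly what is being argued about:)

struct devinfo_counted {
        atomic_t        openers;        /* bumped in open(), dropped in release() */
        /* ... the per-device fields from the earlier sketch ... */
};

static inline void xget(struct devinfo_counted *dev)
{
        atomic_inc(&dev->openers);
}

static inline void xput(struct devinfo_counted *dev)
{
        atomic_dec(&dev->openers);
}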

2001-12-10 20:26:16

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> And it means we can get proper refcounting. Which as the maintainer of
> two block drivers that support dynamic volume create/destroy is remarkably
> good news.
>
> You say this as if that would be a difference between the two
> approaches. I don't think it is.

It's easier to make sure it's correct when we have a single structure, not
a pile of arrays. Object lifetime becomes explicit, and we don't have to
worry about re-use races, since a new instance of that major,minor will have
a different object attached to it than the one in use that is about to be
refcounted into oblivion by currently active requests.

2001-12-10 21:32:20

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

From: Alan Cox <[email protected]>

> And it means we can get proper refcounting. Which as the maintainer
> of two block drivers that support dynamic volume create/destroy is
> remarkably good news.
>
> You say this as if that would be a difference between the two
> approaches. I don't think it is.

It's easier to make sure it's correct when we have a single structure, not
a pile of arrays.

I don't understand your reference to arrays. Nobody uses arrays.
That is something of the past.

Object lifetime becomes explicit, and we don't have to
worry about re-use races, since a new instance of that major,minor
will have a different object attached to it than the one in use that is
about to be refcounted into oblivion by currently active requests.

As described, my setup certainly has no re-use races, since
I do not use refcounts as a way to terminate the lifespan of
a kdev_t. So, are you saying that you prefer my version?
I have problems reading your replies.

Andries

2001-12-10 21:37:43

by Alan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> Object lifetime becomes explicit, and we don't have to
> worry about re-use races, since a new instance of that major,minor
> will have a different object attached to it than the one in use that is
> about to be refcounted into oblivion by currently active requests.
>
> As described, my setup certainly has no re-use races, since
> I do not use refcounts as a way to terminate the lifespan of
> a kdev_t. So, are you saying that you prefer my version?
> I have problems reading your replies.

I have problems understanding your argument. Basically you seem to be saying
"void * is cool" (aka kdev_t is basically an opaque magic). I don't see
what it gains you over "struct block_device *".

Alan

2001-12-10 22:49:21

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

> Basically you seem to be saying
> "void * is cool" (aka kdev_t is basically an opaque magic).

Well, kdev_t is just as opaque as struct inode *.
One refers to what you want to know about a block device.
The other to what you want to know about an inode.

> I don't see what it gains you over "struct block_device *".

That is difficult to say, since the present struct block_device
still has a long way to go. At present it has no facilities
for storing data. Maybe the final results would be the same.
My main objective has always been to do a mechanical,
correctness preserving change (as the first and major step).

This means that very early on the road the objectives
"no arrays" and "large device numbers" are achieved.
Afterwards one can continue restructuring and polishing
as desired. Al's approach (as I understand it) will
achieve the same things, but later, and with more handwork.

Andries

2001-12-10 22:56:11

by Alexander Viro

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters



On Mon, 10 Dec 2001 [email protected] wrote:

> From [email protected] Mon Dec 10 17:50:02 2001
>
> Basically you propose to take the current system, replace it with
> something without clear memory management ("let it leak") and then
> try to fix the resulting mess.
>
> Al - you are using debating tricks instead of logic, using
> negative words ("unclear", "leak", "mess") instead of arguments.
> Maybe you are unable to refute the soundness of the system I propose?

<blink>

> It is quite possible that I overlook some detail.
> On the other hand, I have been running these systems.
> You are not able to convince me that something is wrong
> just by handwaving. Real arguments are required.

> What I do is go from the present situation, in a series of steps,
> to a new situation where the source looks different but the
> system behaves provably the same. Consequently, no "fixing"
> is required. "Mess" is a matter of taste, I'll not discuss that
> except by saying that I vastly prefer the situation without arrays.
> "Leak" is false. "Dangling pointers" is false.

What??? You've just said that on the first stage you are not going to
free these objects and then add freeing them and audit the whole thing
at that point.

The first is commonly known as a leak (objects are allocated but not freed).
Dangling pointers is what you will have to fight during that audit -
places where something retains kdev_t after your object had been freed.

Let me rephrase it: with your plan we will have a much more complex audit
needed at the moment when you introduce freeing your objects. Reason:
it will have to involve all subsystems using kdev_t at once. That's
my problem with your plan. Sigh...

2001-12-10 23:34:37

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

From: Alexander Viro <[email protected]>

What??? You've just said that on the first stage you are not going to
free these objects and then add freeing them and audit the whole thing
at that point.

The first is commonly known as a leak (objects are allocated but not freed).

You are mistaken. Allocation without freeing is not a leak.
A leak is the situation where an unbounded amount of memory is lost
over time because of repeated allocs without corresponding frees.

Allocation of a known, bounded amount of memory is no leak.

(But this has very little relevance except in a shouting match.
Your next remarks are more interesting.)

Dangling pointers is what you will have to fight during that audit -
places where something retains kdev_t after your object had been freed.

Let me rephrase it: with your plan we will have a much more complex audit
needed at the moment when you introduce freeing your objects. Reason:
it will have to involve all subsystems using kdev_t at once. That's
my problem with your plan. Sigh...

I am not as afraid as you are.
Something retains kdev_t after the module has been unloaded?
That would be a bug, sure, both in the present and in the future kernel.
I listed the places where a kdev_t is stored (inode, sb, ..) and for
each of those it is true that these structs should be released before
or at module unload time, so that after module unload time no instances
of corresponding kdev_t are left.

Moreover, the audit happens fully automatically during the boring,
mechanical work. Indeed, already the separation of kdev_t into
kbdev_t and kcdev_t will touch all places where kdev_t occurs,
so that as a side effect one has a list of all places where one
of these is stored.

Andries

2001-12-11 08:39:59

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: Linux/Pro -- clusters

Alexander Viro writes:

> Basically you propose to take the current system, replace it with
> something without clear memory management ("let it leak") and then
> try to fix the resulting mess.
>
> I would rather switch code that uses kdev_t to use of dynamically
> allocated structures. Subsystem-by-subsystem. Keeping decent
> memory management on every step.
>
> It's _way_ easier than trying to fix leaks and dangling pointers in
> the fuzzy code we'd get with your approach. Just look at the fun
> Richard has with devfs right now.

Leaks go away if you add a garbage collector. To get rid of the
dangling pointers, write this part of the kernel in Java or LISP.
There's an OS called emacs that was done this way, and even has
a LISP engine under the GPL. Grab that code maybe.






















>:-)

2001-12-11 20:58:08

by kaih

[permalink] [raw]
Subject: Re: On re-working the major/minor system

[email protected] (H. Peter Anvin) wrote on 09.12.01 in <[email protected]>:

> By author: [email protected] (Kai Henningsen)

> > > The C library, and the POSIX standard, etc, etc.
> >
> > I think you'll find that there is *NOTHING* in either the C standard,
> > POSIX, or the Austin future-{POSIX,UNIX} standard that knows about major
> > or minor numbers.
> >
>
> It's not "future" anymore... Austin is now IEEE 1003.1-2001 and thus
> the new POSIX standard.

As of this Friday, yes.

> Anyway, look for things like tar, cpio, ISO 9660 and that class of
> standards.

Well, at least in Austin there is neither tar, cpio, nor 9660.

You are, however, right insofar as there's pax, which for ustar format has
devmajor and devminor fields of 8 octets each, which contain unspecified
information. (cpio format just has the rdev field.)


MfG Kai