2002-01-04 00:04:41

by Alessandro Suardi

[permalink] [raw]
Subject: 2.5.2-pre7 still missing bits of kdev_t

Merging Andries' changes for these files gets me a full build:

./fs/reiserfs/inode.c
./fs/reiserfs/super.c
./fs/reiserfs/journal.c
./fs/ext3/super.c
./include/linux/reiserfs_fs.h

--alessandro

"this machine will, will not communicate
these thoughts and the strain I am under
be a world child, form a circle before we all go under"
(Radiohead, "Street Spirit [fade out]")


2002-01-04 00:13:52

by Jeff Garzik

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t

Alessandro Suardi wrote:
> ./fs/reiserfs/inode.c
> ./fs/reiserfs/super.c
> ./fs/reiserfs/journal.c
> ./include/linux/reiserfs_fs.h

reiserfs is blindly storing the kernel's kdev_t value raw to disk.

AFAICS this will need a policy decision not just cleanup, before it
works in 2.5.2 properly. If we switch the kernel to 12:20 major:minor
numbers, suddenly the reiserfs disk format changes based on kernel
version, and earlier kernels see corrupted major:minor numbers.

For many filesystems with just 16-bits of major/minor storage, they
store the raw kdev_t value as well, but have different problems when
12:20 comes around.

If reiserfs guys had planned ahead they would be decoding the
major/minor number before it hits disk, and already storing it in 12:20
fashion or somesuch.

Unless I am missing something...

Jeff


--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2002-01-04 01:48:24

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t

From [email protected] Fri Jan 4 01:13:34 2002

reiserfs is blindly storing the kernel's kdev_t value raw to disk.

AFAICS this will need a policy decision not just cleanup, before it
works in 2.5.2 properly. If we switch the kernel to 12:20 major:minor
numbers, suddenly the reiserfs disk format changes based on kernel
version, and earlier kernels see corrupted major:minor numbers.

No, not really. For how to do this, see a fragment of example code
that Linus removed from kdev_t.h in pre6, it went something like
(adapted for 12+20 instead of 16+16):

int major(dev_t dev) {
int ma;

ma = (dev >> 20);
if (!ma)
ma = (dev >> 8);
return ma;
}

int minor(dev_t dev) {
if (dev >> 20)
return (dev & 0xfffff);
else
return (dev & 0xff);
}

dev_t mkdev(int ma, int mi) {
if (mi & ~0xff)
return ((ma << 20) + mi);
else
return ((ma << 8) + mi);
}

(with the correctness conditions that ma is 12-bit,
mi is 20-bit, and major 0 has only 8-bit minors).

You see that the representation of old values does not change.
No disk corruption.

Andries


[I didnt spell it out, but you understand: the dev_t is the on-disk
format, the conversion finds the major and minor, and these are
combined again into a kdev_t for use by the kernel]

[Similar code occurs is isofs/rock.c, where a 64-bit dev
must be converted.]

[I don't know whether reiserfs is a Linux-only filesystem.
If it is not, and has a disk format that is OS-independent,
a third struct stat_data might be needed, since the 12+20
is not universal, so 32+32 would be the better choice.]

2002-01-04 02:33:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t


On Thu, 3 Jan 2002, Jeff Garzik wrote:
>
> reiserfs is blindly storing the kernel's kdev_t value raw to disk.

Well, it won't do that. You have to use "kdev_t_to_nr()", which (whenever
the format of kdev_t changes) will still be identical in the low 16 bits.

Now, if somebody actually has the raw "kdev_t" in their on-disk
structures, that's a real problem, but I don't think anybody does.
Certainly I didn't see reiserfs do it (but it may well be missing a few
"kdev_t_to_nr()" calls)

Linus

2002-01-04 06:53:21

by Jeff Garzik

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t

Linus Torvalds wrote:
>
> On Thu, 3 Jan 2002, Jeff Garzik wrote:
> >
> > reiserfs is blindly storing the kernel's kdev_t value raw to disk.
>
> Well, it won't do that. You have to use "kdev_t_to_nr()", which (whenever
> the format of kdev_t changes) will still be identical in the low 16 bits.
>
> Now, if somebody actually has the raw "kdev_t" in their on-disk
> structures, that's a real problem, but I don't think anybody does.
> Certainly I didn't see reiserfs do it (but it may well be missing a few
> "kdev_t_to_nr()" calls)

AFAICS it does:

include/linux/reiserfs.h:
#define sd_v1_rdev(sdp) (le32_to_cpu((sdp)->u.sd_rdev))
#define set_sd_v1_rdev(sdp,v) ((sdp)->u.sd_rdev = cpu_to_le32(v))

[jgarzik@rum reiserfs]$ grep v1_rdev *.c
inode.c: rdev = sd_v1_rdev(sd);
inode.c: set_sd_v1_rdev(sd_v1, inode->i_rdev );

In the first inode.c line shown here, it passes the value received
directly to init_special_inode.

Jeff


--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2002-01-04 13:21:51

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t

Jeff Garzik wrote [on reiserfs]:

> granted you can stick a kdev_to_nr in there but it's still an FS policy
> decision at that point, IMHO...

Yes, for today we stick a kdev_t_to_nr in there and preserve
old behaviour, that is, nothing changes and no policy decisions
have been made. It should have been there from the start.

For next week, when larger-than-16-bit device numbers are
introduced, the proper code everywhere (on all interfaces
with the outside world: stat, mknod with user space and
special device nodes on disk and network filesystems)
would unpack the kdev_t into major and minor, and pack
again to the dev_t required by this interface (and vice versa).
That is, in principle, there is no global, unique, kdev_t_to_nr.

This is done already in most places, but reiserfs is one
of the exceptions, and they'll need a policy decision
on how to pack. In fact ext2 needs precisely the same
policy decision.

The details are rather unimportant - device numbers are
nonportable, so if we transport an ext2 disk to some
other OS and it sees different major,minor pairs, there
is no big catastrophe. Still, I have heard many a complaint
from sysadmins who needed to do _mknod foo x ma mi_
on some NFS mounted filesystem and had to make some
computations to decide on the right ma' mi' to use.
Installation scripts fail over NFS.

That is, even though a device number must be regarded
as a cookie, the fact that the mknod command separates
that cookie into two parts means that the way the
on-disk dev_t is separated belongs to the definition
of the on-disk filesystem format.
Now that 8+24, 12+20, 14+18, 32+32 all occur, the easy
way to solve all problems for a filesystem is to use 32+32.
That is what NFSv3 does, and isofs, etc.
If it is possible, the right policy no doubt is to store 32+32.
If there is no room for that then one just has to live with
the fact that the filesystem image is somewhat less portable.

Andries

2002-01-04 16:47:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t


On Fri, 4 Jan 2002, Jeff Garzik wrote:
> >
> > Now, if somebody actually has the raw "kdev_t" in their on-disk
> > structures, that's a real problem, but I don't think anybody does.
> > Certainly I didn't see reiserfs do it (but it may well be missing a few
> > "kdev_t_to_nr()" calls)
>
> AFAICS it does:
>
> include/linux/reiserfs.h:
> #define sd_v1_rdev(sdp) (le32_to_cpu((sdp)->u.sd_rdev))
> #define set_sd_v1_rdev(sdp,v) ((sdp)->u.sd_rdev = cpu_to_le32(v))

Ok, just add the proper conversion functions, ie a "to_kdev_t()" and
"kdev_t_to_nr()".

Linus

2002-01-04 17:45:55

by Nikita Danilov

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t

Linus Torvalds writes:
>
> On Fri, 4 Jan 2002, Jeff Garzik wrote:
> > >
> > > Now, if somebody actually has the raw "kdev_t" in their on-disk
> > > structures, that's a real problem, but I don't think anybody does.
> > > Certainly I didn't see reiserfs do it (but it may well be missing a few
> > > "kdev_t_to_nr()" calls)
> >
> > AFAICS it does:
> >
> > include/linux/reiserfs.h:
> > #define sd_v1_rdev(sdp) (le32_to_cpu((sdp)->u.sd_rdev))
> > #define set_sd_v1_rdev(sdp,v) ((sdp)->u.sd_rdev = cpu_to_le32(v))
>
> Ok, just add the proper conversion functions, ie a "to_kdev_t()" and
> "kdev_t_to_nr()".

Actually, result of sd_v1_rdev() is only passed to init_special_inode()
which takes int rather than kdev_t.

Isn't this a bit strange, because file system backend has to convert
kdev_t to u32 on write, but not on read?

>
> Linus

Nikita.

>

2002-01-04 17:51:45

by Jeff Garzik

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t

Nikita Danilov wrote:
>
> Linus Torvalds writes:
> >
> > On Fri, 4 Jan 2002, Jeff Garzik wrote:
> > > >
> > > > Now, if somebody actually has the raw "kdev_t" in their on-disk
> > > > structures, that's a real problem, but I don't think anybody does.
> > > > Certainly I didn't see reiserfs do it (but it may well be missing a few
> > > > "kdev_t_to_nr()" calls)
> > >
> > > AFAICS it does:
> > >
> > > include/linux/reiserfs.h:
> > > #define sd_v1_rdev(sdp) (le32_to_cpu((sdp)->u.sd_rdev))
> > > #define set_sd_v1_rdev(sdp,v) ((sdp)->u.sd_rdev = cpu_to_le32(v))
> >
> > Ok, just add the proper conversion functions, ie a "to_kdev_t()" and
> > "kdev_t_to_nr()".
>
> Actually, result of sd_v1_rdev() is only passed to init_special_inode()
> which takes int rather than kdev_t.
>
> Isn't this a bit strange, because file system backend has to convert
> kdev_t to u32 on write, but not on read?

As mentioned to viro on IRC, I think init_special_inode should take
major and minor arguments, to nudge the filesystem implementors into
thinking that major and minor should be treated separately, and be given
additional thought as to how they are encoded on-disk.

(I suggested having init_special_inode taking a kdev_t argument as its
third arg, but viro yelled at me :))

Jeff


--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2002-01-04 17:54:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t


On Fri, 4 Jan 2002, Jeff Garzik wrote:
>
> As mentioned to viro on IRC, I think init_special_inode should take
> major and minor arguments, to nudge the filesystem implementors into
> thinking that major and minor should be treated separately, and be given
> additional thought as to how they are encoded on-disk.

Yes. If somebody sends me a patch, I'll apply it in a jiffy.

Linus

2002-01-04 18:11:36

by Alexander Viro

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t



On Fri, 4 Jan 2002, Linus Torvalds wrote:

>
> On Fri, 4 Jan 2002, Jeff Garzik wrote:
> >
> > As mentioned to viro on IRC, I think init_special_inode should take
> > major and minor arguments, to nudge the filesystem implementors into
> > thinking that major and minor should be treated separately, and be given
> > additional thought as to how they are encoded on-disk.
>
> Yes. If somebody sends me a patch, I'll apply it in a jiffy.

Guys, wait a minute with that. There is a related issue (->i_rdev
becoming dev_t) and I'd rather see it handled first.

2002-01-04 19:25:31

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t

From [email protected] Fri Jan 4 19:11:10 2002

On Fri, 4 Jan 2002, Linus Torvalds wrote:

> On Fri, 4 Jan 2002, Jeff Garzik wrote:
> >
> > As mentioned to viro on IRC, I think init_special_inode should take
> > major and minor arguments, to nudge the filesystem implementors into
> > thinking that major and minor should be treated separately, and be
> > given additional thought as to how they are encoded on-disk.
>
> Yes. If somebody sends me a patch, I'll apply it in a jiffy.

Guys, wait a minute with that. There is a related issue (->i_rdev
becoming dev_t) and I'd rather see it handled first.

Those are independent issues.

If init_special_inode() has major,minor arguments instead of
the present rdev, then the line

inode->i_rdev = to_kdev_t(rdev);

just becomes

inode->i_rdev = mk_kdev(major,minor);

I consider every occurrence of mk_kdev() and of to_kdev_t()
a flaw in the kernel, so this change does not make things
better or worse inside init_special_inode().

Andries

2002-01-04 19:33:21

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t

> (I suggested having init_special_inode taking a kdev_t argument as its
> third arg, but viro yelled at me :))

Yes. If you think that a kdev_t is a pointer to a struct with
device information, then having a kdev_t there is wrong,
because a special device node can have arbitrary major,minor
not necessarily belonging to any device, so rdev should just
have the numbers.

Andries

2002-01-04 21:10:44

by Alexander Viro

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t



On Fri, 4 Jan 2002 [email protected] wrote:

> If init_special_inode() has major,minor arguments instead of
> the present rdev, then the line
>
> inode->i_rdev = to_kdev_t(rdev);
>
> just becomes
>
> inode->i_rdev = mk_kdev(major,minor);
>
> I consider every occurrence of mk_kdev() and of to_kdev_t()
> a flaw in the kernel, so this change does not make things
> better or worse inside init_special_inode().

Well, to start with, any use of kdev_t for block devices is a flaw
(fortunately most of fs/*/*.c ones are gone by now and buffer cache
will follow).

As for the init_special_inode() I'd rather have le16_to_dev() et.al.
so that ext2_read_inode() and friends would do
init_special_inode(inode, mode, le16_to_dev(raw->i_rdev));

Reasons:
a) foo_mknod() - why the hell would we take dev_t, pass it into
->mknod() only to split it into major:minor there and immediately
rebuild dev_t from them? And that - for _all_ instances of ->mknod()
b) what happens in ->read_inode() is "here's on-disk encoding,
we want to know which dev_t value it is". Sure, helpers that do
mapping will need to know encoding. ->read_inode() itself, though...
Why do we care about major:minor split at that level?

2002-01-04 21:15:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t


On Fri, 4 Jan 2002, Alexander Viro wrote:
>
> Reasons:
> a) foo_mknod() - why the hell would we take dev_t, pass it into
> ->mknod() only to split it into major:minor there and immediately
> rebuild dev_t from them? And that - for _all_ instances of ->mknod()

No, if you make init_special_inode() take minor/major, then you have to
make mknod do the same.

At that point it becomes a ABI issue how the mknod _system_call_ argument
is split up into major/minor, and the rest of the kernel wouldn't really
care.

However, looking more at this, we clearly need to have the same ABI for
stat() as for mknod(), so it probably makes sense to just continue to have
"dev_t is a standard mix of minor/major" numbers approach, and just
continue passing "dev_t" around exactly as such a combination (and then
let people do the proper MKNOD/MAJOR/MINOR macros on it).

So you may be right - there are no real advantage from splitting the two,
as long as dev_t and kdev_t cannot be mixed up by mistake (which is
largely true already now).

Linus

2002-01-05 00:01:40

by Alan

[permalink] [raw]
Subject: Re: 2.5.2-pre7 still missing bits of kdev_t

> At that point it becomes a ABI issue how the mknod _system_call_ argument
> is split up into major/minor, and the rest of the kernel wouldn't really
> care.

You need a mknod2() system call assuming we want hotplug to be able to
create 32bit dev_t's. At the point glibc calls mknod its got an internal
32:32 representation so passing mknod2(char *,int,int) and making sure
mknod doesnt break if we expand further some years hence doesn't seem to
be daft.