2002-07-24 22:40:26

by Andries E. Brouwer

Subject: 2.5.28 and partitions

Just saw some new partition code in 2.5.28. Good!
I like almost all I see, except for one thing:

When I did precisely these same things, long ago, I used

struct blkpg_partition {
        long long start;                /* starting offset in bytes */
        long long length;               /* length in bytes */
        int pno;                        /* partition number */
        char devname[BLKPG_DEVNAMELTH]; /* partition name, like sda5 or c0d1p2,
                                           to be used in kernel messages */
        char volname[BLKPG_VOLNAMELTH]; /* volume label */
};

still visible in blkpg.h.

Now I read in 2.5.28:

+struct parsed_partitions {
+        char name[40];
+        struct {
+                unsigned long from;
+                unsigned long size;
+                int flags;
+        } parts[MAX_PART];
+        int next;
+        int limit;
+};

and I object to the long instead of u64 or so.

With 2^32 sectors one can handle up to 2^41 bytes, 2 TiB.
Already today people want RAIDs that are larger, and
a few years from now we'll have single disks that are larger.

The fields from and size really need more bits than 32.
And when they become u64, it is a good idea to measure bytes
instead of 512-byte sectors.
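
The 2 TiB figure above can be checked with a few lines of C. This is only an illustrative sketch, not kernel code: the helper name max_addressable_bytes is made up for this example, and 512-byte sectors are assumed as in the text.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the kernel: number of bytes that
 * can be addressed when sector numbers are `bits` wide and every sector
 * holds 2^sector_shift bytes. Assumes bits + sector_shift < 64.
 */
static uint64_t max_addressable_bytes(unsigned bits, unsigned sector_shift)
{
        return ((uint64_t)1 << bits) << sector_shift;
}
```

Called as max_addressable_bytes(32, 9), it gives 2^41 bytes, the 2 TiB ceiling for a 32-bit count of 512-byte sectors.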

(In the design where all partition reading code is removed
from the kernel, and user space tells the kernel what the
partitions on its disks are, it is also natural that user
space is able to provide names for the partitions.
Both names for the kernel to use in its messages, and names
to be used in mount-by-label. Of course I would like to
remove all mount-by-label code from mount(8).)

Andries



2002-07-24 23:39:31

by Alexander Viro

Subject: Re: 2.5.28 and partitions



On Thu, 25 Jul 2002 [email protected] wrote:

> Just saw some new partition code in 2.5.28. Good!
> I like almost all I see, except for one thing:
>
> +struct parsed_partitions {
> +        char name[40];
> +        struct {
> +                unsigned long from;
> +                unsigned long size;
> +                int flags;
> +        } parts[MAX_PART];
> +        int next;
> +        int limit;
> +};
>
> and I object to the long instead of u64 or so.

Separate set of patches. As it is, struct hd_struct is still there and
still not modified. And it has unsigned long. It will become sector_t.

Actually, I'm not all that sure that we want u64 here. The thing being,
start_sect shouldn't be bigger than sector_t (see how it's used). And
64bit arithmetics on 32bit boxen sucks big way. I'm not too concerned
about adding start_sect per se - it's done once per request and it's
noise compared to the rest of work. However, long long for sector_t
will hit in a lot of more interesting code paths.

That stuff becomes an issue for 2Tb disks. Do we actually have something
that large attached to 32bit boxen?

> With 2^32 sectors one can handle up to 2^41 bytes, 2 TiB.
> Already today people want RAIDs that are larger, and
> few years from now we'll have single disks that are larger.

... and still use i386 with these disks? ia64 is stillborn, but x86-64
promises to be more useful than Itanic.

u64 for sector_t doesn't change anything for 64bit boxen that might be
interested in really large disks and screws 32bit ones that shouldn't
have to pay for that...

2002-07-25 00:09:07

by Kwijibo

Subject: Re: 2.5.28 and partitions

Alexander Viro wrote:

>On Thu, 25 Jul 2002 [email protected] wrote:
>
>
>Separate set of patches. As it is, struct hd_struct is still there and
>still not modified. And it has unsigned long. It will become sector_t.
>
>Actually, I'm not all that sure that we want u64 here. The thing being,
>start_sect shouldn't be bigger than sector_t (see how it's used). And
>64bit arithmetics on 32bit boxen sucks big way. I'm not too concerned
>about adding start_sect per se - it's done once per request and it's
>noise compared to the rest of work. However, long long for sector_t
>will hit in a lot of more interesting code paths.
>
>That stuff becomes an issue for 2Tb disks. Do we actually have something
>that large attached to 32bit boxen?
>
I do. Two 3ware 7850's with 8 160GB hd's on each. I wanted
to stripe them in software, but I hit the 2TB limit and ended up
settling for a software mirror. This is on a dual Athlon box.

>
>... and still use i386 with these disks? ia64 is stillborn, but x86-64
>promises to be more useful than Itanic.
>
Will be nice when it arrives.

Steve



2002-07-25 02:07:38

by Anton Altaparmakov

Subject: Re: 2.5.28 and partitions

At 00:42 25/07/02, Alexander Viro wrote:
>On Thu, 25 Jul 2002 [email protected] wrote:
> > Just saw some new partition code in 2.5.28. Good!
> > I like almost all I see, except for one thing:
> >
> > +struct parsed_partitions {
> > +        char name[40];
> > +        struct {
> > +                unsigned long from;
> > +                unsigned long size;
> > +                int flags;
> > +        } parts[MAX_PART];
> > +        int next;
> > +        int limit;
> > +};
> >
> > and I object to the long instead of u64 or so.
>
>Separate set of patches. As it is, struct hd_struct is still there and
>still not modified. And it has unsigned long. It will become sector_t.
>
>Actually, I'm not all that sure that we want u64 here. The thing being,
>start_sect shouldn't be bigger than sector_t (see how it's used). And
>64bit arithmetics on 32bit boxen sucks big way. I'm not too concerned
>about adding start_sect per se - it's done once per request and it's
>noise compared to the rest of work. However, long long for sector_t
>will hit in a lot of more interesting code paths.
>
>That stuff becomes an issue for 2Tb disks. Do we actually have something
>that large attached to 32bit boxen?

Not right now perhaps, but we may well do in a year or two. E.g. in the
department, we just bought a Dual Athlon 2000+, 3G RAM, and attached it to
a new 1.4TiB RAID array. We only need HDs to double in size and that array
could easily become 2.8TiB... And the whole fun costs less than US$15,000
at present so it is quite affordable for smaller institutions/companies.

OTOH, we are going to be using 32-bit systems for quite a few years to come...

> > With 2^32 sectors one can handle up to 2^41 bytes, 2 TiB.
> > Already today people want RAIDs that are larger, and
> > few years from now we'll have single disks that are larger.
>
>... and still use i386 with these disks?

Yes, definitely. Why pay for some stupidly expensive 64-bit computer when
you only want large storage?

>ia64 is stillborn, but x86-64 promises to be more useful than Itanic.

We shall see once it comes on the market... And we will then see the price
tag it will bring with it, too...

>u64 for sector_t doesn't change anything for 64bit boxen that might be
>interested in really large disks and screws 32bit ones that shouldn't
>have to pay for that...

True. That's why sector_t should be a compile time option in the kernel:
"Enable large device support > 2TiB: Y/N". Then I am happy, and so are loads
of other people, because we can use large RAID arrays without having to buy
the latest expensive system, and other people are happy to have faster 32-bit
code... Surely we can write robust enough code which will work with either
sector_t size...
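
A compile-time choice of this kind might look roughly like the sketch below. The macro name LARGE_BLOCK_DEVICES is a placeholder invented for this example, not an existing kernel option, and max_sector is an illustrative helper only.

```c
#include <stdint.h>

/*
 * LARGE_BLOCK_DEVICES is a hypothetical build-time switch standing in
 * for the proposed "Enable large device support > 2TiB" option.
 */
#ifdef LARGE_BLOCK_DEVICES
typedef uint64_t sector_t;
#else
typedef uint32_t sector_t;      /* same width as unsigned long on a 32-bit box */
#endif

/* Largest sector number representable with the selected width. */
static uint64_t max_sector(void)
{
        return (uint64_t)(sector_t)-1;
}
```

Code written against sector_t then compiles for either width, and only builds with the switch enabled pay for 64-bit arithmetic.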

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-07-25 03:19:37

by Matt Domsch

Subject: RE: 2.5.28 and partitions

> That stuff becomes an issue for 2Tb disks. Do we actually have something
> that large attached to 32bit boxen?

Absolutely. A single external disk pod with 14 73GB SCSI disks is >1TB,
with 145GB disks expected in the very near future, and 120GB IDE disks
available today. You can put 4 disk pods on a single 4-channel RAID
controller. You can have multiple 4-channel RAID controllers and do
software RAID across the lot. You can attach your server to a multi-TB
Dell|EMC SAN. All of these configs are limited by the current 32-bit block
address.

> ... and still use i386 with these disks?

Yep. We're doing all of this today on our x86 server products, and don't
expect x86 to die any time soon. I'm on conference calls each week with
customers who have huge data storage requirements, who like the
price/performance of x86 servers and the ever-decreasing cost of storage.
Medical imaging. Render farms. CAD/CAM. Search engines. Mirror sites.
Scientific compute clusters (they want a real CFS too). Spam quarantine :-)
I'm excited by Peter Chubb's LBD patch for 2.5.x, but a product with a 2.6.x
kernel is still a long way away, and customers with money are asking for
this today. "Be patient" isn't something a salesperson likes to hear when
there's a commission on the line. :-)

Right now all of these solutions are being done with multiple ~1TB
partitions and file systems, which for most applications works. But some of
the above believe they would benefit from, say, a single 10TB shared
clustered file system (with another 10TB of disks to back the thing up).
That isn't possible today, even though one could build such.

> >u64 for sector_t doesn't change anything for 64bit boxen
> >that might be interested in really large disks and
> >screws 32bit ones that shouldn't have to pay for that...
>
> True. That's why sector_t should be a compile time option in
> the kernel

I'd be happy with an option too. Then the distros can choose to enable it
for some kernels "i686 bigmem-bigdisk", but not for i686 UP. There does
arise the proliferation of kernels problem, but I'm sure the distros will
have some ideas there.

The promise of 64-bit block addresses eventually was a huge part of why I
worked on the GPT code in the kernel, partx, parted, etc. I could really
use it today, and it'll be a solid requirement less than a year from now.


Thanks,
Matt
--
Matt Domsch
Sr. Software Engineer, Lead Engineer, Architect
Dell Linux Solutions http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
#1 US Linux Server provider for 2001 and Q1/2002! (IDC May 2002)

2002-07-25 03:57:08

by Jason L Tibbitts III

Subject: Re: 2.5.28 and partitions

>>>>> "AV" == Alexander Viro <[email protected]> writes:

AV> That stuff becomes an issue for 2Tb disks. Do we actually have
AV> something that large attached to 32bit boxen?

Yes, I just built a few machines with 2.2TB of disk apiece. And this
isn't really all that esoteric; each box only cost around $7K.

AV> ... and still use i386 with these disks? ia64 is stillborn, but
AV> x86-64 promises to be more useful than Itanic.

Well, these are just file servers. It would be a waste to stick a
64bit processor in there just for its 64bitness; if I could get
Hammers, I'd rather run compute jobs on them and leave the menial
tasks like file serving to the 32bit machines.

AV> u64 for sector_t doesn't change anything for 64bit boxen that
AV> might be interested in really large disks and screws 32bit ones
AV> that shouldn't have to pay for that...

Well, I'd happily run a custom kernel on these machines. I certainly
don't want my other hundred-plus machines to run slower just to let a
handful of file servers see all of their disk, but it would be nice to
have the choice.

- J<

2002-07-25 05:11:08

by Linus Torvalds

Subject: Re: 2.5.28 and partitions



On Thu, 25 Jul 2002, Anton Altaparmakov wrote:
>
> >u64 for sector_t doesn't change anything for 64bit boxen that might be
> >interested in really large disks and screws 32bit ones that shouldn't
> >have to pay for that...
>
> True. That's why sector_t should be a compile time option in the kernel
> "Enable large device support > 2TiB: Y/N". Then I am happy and loads of
> other people because we can use large raid arrays without having to buy the
> latest expensive system and other people are happy for having faster 32-bit
> code... Surely we can write robust enough code which will work with either
> sector_t size...

Careful. One issue is user-level interfaces to the kernel. I would suggest
any user level interface should use u64, not "sector_t". So that there is
zero confusion. Clearly 64-bit sector numbers will be/are really close to
being an issue for some people.
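
As a rough illustration of the point about user-level interfaces, a structure shared with user space could use fixed-width fields so that its layout does not depend on how sector_t was configured. The struct name and fields below are hypothetical and not an existing kernel ABI.

```c
#include <stdint.h>

/*
 * Hypothetical user-visible description of one partition. Both fields
 * are always 64 bits wide, so the structure has the same size and
 * layout whether the kernel's own sector_t is 32 or 64 bits.
 */
struct user_partition {
        uint64_t start;         /* first sector of the partition */
        uint64_t length;        /* number of sectors */
};
```

A kernel built with a 32-bit sector_t would still have to reject values that do not fit when it copies such a structure in.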

Linus

2002-07-25 05:23:10

by Linus Torvalds

Subject: RE: 2.5.28 and partitions



On Wed, 24 Jul 2002 [email protected] wrote:
>
> The promise of 64-bit block addresses eventually was a huge part of why I
> worked on the GPT code in the kernel, partx, parted, etc. I could really
> use it today, and it'll be a solid requirement less than a year from now.

Note that there is one place where 64 bits is simply _too_ expensive, and
that's the page cache. In particular, the "index" in "struct page". We
want to make "struct page" _smaller_, not larger.

Right now that means that 16TB really is a hard limit for at least some
device access on a 32-bit machine with a 4kB page-size (yes, you could
make a filesystem that is bigger, but you very fundamentally cannot make
individual files larger than 16TB).

The block device layer also cannot write to the 16TB+ region using the
page cache (but it should be possible to do it using raw device access
with a 64-bit sector_t, so you can initialize the filesystem).

Linus

2002-07-25 08:39:48

by Anton Altaparmakov

Subject: Re: 2.5.28 and partitions

At 06:15 25/07/02, Linus Torvalds wrote:
>On Thu, 25 Jul 2002, Anton Altaparmakov wrote:
> >
> > >u64 for sector_t doesn't change anything for 64bit boxen that might be
> > >interested in really large disks and screws 32bit ones that shouldn't
> > >have to pay for that...
> >
> > True. That's why sector_t should be a compile time option in the kernel
> > "Enable large device support > 2TiB: Y/N". Then I am happy and loads of
> > other people because we can use large raid arrays without having to buy the
> > latest expensive system and other people are happy for having faster 32-bit
> > code... Surely we can write robust enough code which will work with either
> > sector_t size...
>
>Careful. One issue is user-level interfaces to the kernel. I would suggest
>any user level interface should use u64, not "sector_t". So that there is
>zero confusion. Clearly 64-bit sector numbers will be/are really close to
>being an issue for some people.

Of course. We do need a consistent ABI... But I don't see that as a big
problem. There aren't that many places that take sectors as arguments that
we need to fix AFAICS.

Both there and for user supplied byte offsets/sizes, we just need to check
that user supplied values are not being overflowed on 32-bit sector_t
compiled kernels... something like

        if (sizeof(sector_t) == 4) {
                if (value & ~(((u64)1 << 32) - 1))
                        return -E2BIG;
        }

should compile out nicely for 64-bit sector_t and provide a simple, highly
optimized check for 32-bit sector_t... (If gcc optimizes it well I should
hope it will just do a simple 32-bit compare of the high 32-bits with zero...)
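
Wrapped into a self-contained function, the check could be sketched as follows. The function name sector_from_user is invented for this example, and sector_t is assumed here to be 32 bits wide so that the check is active.

```c
#include <errno.h>
#include <stdint.h>

typedef uint32_t sector_t;      /* assumption: kernel built with a 32-bit sector_t */

/*
 * Hypothetical helper: store a user-supplied 64-bit value in a sector_t,
 * refusing values that would be truncated. Returns 0 on success or
 * -E2BIG when any of the upper 32 bits are set.
 */
static int sector_from_user(uint64_t value, sector_t *out)
{
        if (sizeof(sector_t) == 4) {
                if (value & ~(((uint64_t)1 << 32) - 1))
                        return -E2BIG;
        }
        *out = (sector_t)value;
        return 0;
}
```

With a 64-bit sector_t the sizeof test is a compile-time constant that is false, so the compiler can drop the whole branch.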

I have to admit that if it was just up to me, I would make sector_t
unconditionally u64, so there don't need to be checks like the above all
over the place... But that's just me... (-;

Best regards,

Anton



2002-07-25 09:25:45

by Alan

Subject: RE: 2.5.28 and partitions

On Thu, 2002-07-25 at 04:22, [email protected] wrote:
> Absolutely. A single external disk pod with 14 73GB SCSI disks is >1TB,
> with 145GB disks expected in the very near future, and 120GB IDE disks
> available today. You can put 4 disk pods on a single 4-channel RAID

With 3ware cards I know multiple people using 2 8 port 3ware cards each
with 8 160Gb IDE disks on it. These are now extremely cheap systems to
build, especially if you buy the 3ware cards carefully and don't believe
the list prices.

> > ... and still use i386 with these disks?
>
> Yep. We're doing all of this today on our x86 server products, and don't
> expect x86 to die any time soon. I'm on conference calls each week with

I can point to multiple people doing this. Everything from video data
vaults to archives of scanned document images. Right now they have to
split the arrays up, but as the disks get bigger and cheaper that is
going to become a pain.

2002-07-25 11:41:28

by Alexander Viro

Subject: RE: 2.5.28 and partitions



On Wed, 24 Jul 2002, Linus Torvalds wrote:

> Note that there is one place where 64 bits is simply _too_ expensive, and
> that's the page cache. In particular, the "index" in "struct page". We
> want to make "struct page" _smaller_, not larger.
>
> Right now that means that 16TB really is a hard limit for at least some
> device access on a 32-bit machine with a 4kB page-size (yes, you could
> make a filesystem that is bigger, but you very fundamentally cannot make
> individual files larger than 16TB).

ITYM "8Tb" - indices are signed, IIRC. OTOH, it's not 2^31 * PAGE_SIZE -
it's 2^31 * PAGE_CACHE_SIZE, which can be bigger.
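
The limits being discussed follow directly from the width of the page index. A small sketch, assuming a 4kB cached page (shift of 12) and using a made-up helper name:

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes the page cache can cover when
 * the page index has `index_bits` usable bits and each cached page is
 * 2^page_shift bytes. Assumes index_bits + page_shift < 64.
 */
static uint64_t page_cache_limit(unsigned index_bits, unsigned page_shift)
{
        return ((uint64_t)1 << index_bits) << page_shift;
}
```

page_cache_limit(32, 12) is 16 TiB, the unsigned case; page_cache_limit(31, 12) is 8 TiB, the signed case; and doubling the cached page size to 8kB with a 31-bit index brings the limit back to 16 TiB.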

Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
device should seek professional help of the kind they don't give on l-k...

2002-07-25 12:42:03

by Petr Vandrovec

Subject: RE: 2.5.28 and partitions

On 25 Jul 02 at 7:44, Alexander Viro wrote:
> On Wed, 24 Jul 2002, Linus Torvalds wrote:
>
> > Note that there is one place where 64 bits is simply _too_ expensive, and
> > that's the page cache. In particular, the "index" in "struct page". We
> > want to make "struct page" _smaller_, not larger.
> >
> > Right now that means that 16TB really is a hard limit for at least some
> > device access on a 32-bit machine with a 4kB page-size (yes, you could
> > make a filesystem that is bigger, but you very fundamentally cannot make
> > individual files larger than 16TB).
>
> ITYM "8Tb" - indices are signed, IIRC. OTOH, it's not 2^31 * PAGE_SIZE -
> it's 2^31 * PAGE_CACHE_SIZE, which can be bigger.
>
> Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> device should seek professional help of the kind they don't give on l-k...

Don't worry. Netware (NW6) also uses 32 bits for page cache indices
and a 4KB page cache size, but compared with our implementation they
(1) do not verify that the file you created is smaller than 16TB, and
(2) have a signedness bug somewhere too. So if you create a file
larger than 8TB, the data you wrote is silently discarded, while
the file size is preserved.

I was really surprised when I updated ncpfs to access files > 4GB.
Written data were disappearing after server reboot :-(

Just my two cents.
Petr Vandrovec
[email protected]

2002-07-25 12:57:00

by Anton Altaparmakov

Subject: RE: 2.5.28 and partitions

At 12:44 25/07/02, Alexander Viro wrote:
>Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
>device should seek professional help of the kind they don't give on l-k...

Why? What is wrong with large devices/file systems? Why do we have to break
up everything into multiple devices? Just because the kernel is "too lazy"
to implement support for large devices? Nobody cares if 64bit code is
10-20% slower than 32bit code on a storage server. The storage devices are
physically way slower than the system, so the data throughput would not be
affected in the slightest. We would just see a higher CPU load on the
database server and we can live with that. At least our applications deal
with GiBs of data for each experiment, which is shifted over Gigabit
ethernet to/from a SQL database backend stored on a huge RAID array, so we
are completely i/o bound.

It's one database, and it's huge. And it's going to get bigger as people do
more experiments. We need mkfs.<whatever> on a huge device... We are just
lucky that our current RAID array is under 2TiB so we haven't hit the
"magic" barrier quite yet. But at 1.4TiB we are not far off...

Best regards,

Anton



2002-07-25 13:21:44

by Petr Vandrovec

Subject: RE: 2.5.28 and partitions

On 25 Jul 02 at 14:03, Anton Altaparmakov wrote:
> At 12:44 25/07/02, Alexander Viro wrote:
> >Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> >device should seek professional help of the kind they don't give on l-k...
>
> Why? What is wrong with large devices/file systems? Why do we have to break
> up everything into multiple devices? Just because the kernel is "too lazy"
> to implement support for large devices? Nobody cares if 64bit code is
> 10-20% slower than 32bit code on a storage server. The storage devices are

But I care whether gcc barfs on code or not, and whether generated code
is correct or not.

I do very trivial 64bit computations in TV-Out portion of matroxfb,
but I spent two days shifting code up/down, adding temporary variables
and splitting expressions to simple ones to make code compilable at all
with gcc-2.95.4 compiling module for PIII kernel (Debian bug #151196).
So I personally cannot recommend doing any 64bit math without setting
gcc-3.0 as minimal version for ia32 architecture.
Petr Vandrovec
[email protected]

2002-07-25 13:41:53

by Anton Altaparmakov

Subject: RE: 2.5.28 and partitions

At 14:24 25/07/02, Petr Vandrovec wrote:
>On 25 Jul 02 at 14:03, Anton Altaparmakov wrote:
> > At 12:44 25/07/02, Alexander Viro wrote:
> > >Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> > >device should seek professional help of the kind they don't give on l-k...
> >
> > Why? What is wrong with large devices/file systems? Why do we have to break
> > up everything into multiple devices? Just because the kernel is "too lazy"
> > to implement support for large devices? Nobody cares if 64bit code is
> > 10-20% slower than 32bit code on a storage server. The storage devices are
>
>But I care whether gcc barfs on code or not, and whether generated code
>is correct or not.

Everyone cares about that! That has nothing to do with performance. It's
simply a broken compiler which needs fixing.

>I do very trivial 64bit computations in TV-Out portion of matroxfb,
>but I spent two days shifting code up/down, adding temporary variables
>and splitting expressions to simple ones to make code compilable at all
>with gcc-2.95.4 compiling module for PIII kernel (Debian bug #151196).
>So I personally cannot recommend doing any 64bit math without setting
>gcc-3.0 as minimal version for ia32 architecture.

Thanks for the warning. I will keep an eye out for eventual "NTFS is broken
with gcc-2.95" reports... Although I would make that gcc-2.96 and not 3.0
as the minimum requirement. At least I haven't found anything wrong with the
current gcc-2.96...

(Please let's not start another flamewar about whether gcc-2.96 exists or not.)

Best regards,

Anton



2002-07-25 15:52:54

by Linus Torvalds

Subject: RE: 2.5.28 and partitions



On Thu, 25 Jul 2002, Alexander Viro wrote:
>
> On Wed, 24 Jul 2002, Linus Torvalds wrote:
> > Right now that means that 16TB really is a hard limit for at least some
> > device access on a 32-bit machine with a 4kB page-size (yes, you could
> > make a filesystem that is bigger, but you very fundamentally cannot make
> > individual files larger than 16TB).
>
> ITYM "8Tb" - indices are signed, IIRC. OTOH, it's not 2^31 * PAGE_SIZE -
> it's 2^31 * PAGE_CACHE_SIZE, which can be bigger.

Hmm. The index really should be unsigned, but obviously there could be
sign errors.

The stupid BSD approach of putting metadata in negative offsets is just
that - stupid. Under Linux the people who do that just have another
address space for metadata.

Your PAGE_CACHE_SIZE vs PAGE_SIZE thing is true, but separating the two
out is going to be less than pleasant in practice, methinks.

Linus

2002-07-25 16:47:38

by Alexander Viro

Subject: RE: 2.5.28 and partitions



On Thu, 25 Jul 2002, Anton Altaparmakov wrote:

> At 12:44 25/07/02, Alexander Viro wrote:
> >Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> >device should seek professional help of the kind they don't give on l-k...
>
> Why? What is wrong with large devices/file systems? Why do we have to break
> up everything into multiple devices? Just because the kernel is "too lazy"
> to implement support for large devices? Nobody cares if 64bit code is

Large filesystem => troubles with backups, even more troubles with restoring
after disk failure, yadda, yadda.

> database server and we can live with that. At least our applications deal
> with GiBs of data for each experiment, which is shifted over Gigabit
> ethernet to/from a SQL database backend stored on a huge RAID array, so we
> are completely i/o bound.
>
> It's one database, and it's huge. And it's going to get bigger as people do
> more experiments. We need mkfs.<whatever> on a huge device... We are just
> lucky that our current RAID array is under 2TiB so we haven't hit the
> "magic" barrier quite yet. But at 1.4TiB we are not far off...

... and backups of your database are done on...?

"RAID" doesn't mean that data is safe. It means that some class of
failures will not be immediately catastrophic, but that's it - both
hardware and software arrays _do_ go tits up. Just ask hpa for story
of the troubles on kernel.org.

2002-07-25 17:33:55

by Jason L Tibbitts III

Subject: Re: 2.5.28 and partitions

>>>>> "AV" == Alexander Viro <[email protected]> writes:

AV> ... and backups of your database are done on...?

An identically configured machine in another building. A 20-pack of
160GB disks is under $5K; you can even swap out a complete set of
disks and take them offsite since they're all in carriers.

The incredibly low costs of these things have forced a change in how
many of us think about data storage. Building a 2+TB filesystem is no
longer a sign of insanity or too much money to spend (or both), it's a
weekend project.

- J<

2002-07-25 17:47:34

by Andries E. Brouwer

Subject: Re: 2.5.28 and partitions

>> and I object to the long instead of u64 or so.

> Separate set of patches.

Good.
Although it is better to design the right data structures first.

> As it is, struct hd_struct is still there and still not modified.
> And it has unsigned long. It will become sector_t.

You need two things:

(i) A faithful representation of what the partition parser says.
Partition table parsers, in the kernel or in user space, find out
how this disk is partitioned and the information found is stored
in some "parsed partition table" struct. Here offset and length
must be u64 and use byte as a unit.

(ii) A representation of offset and length suitable to use for
block I/O. During block I/O a sector number is tested against
the max to test for errors, and the partition offset is added.
These two must of course use the units the sector number is in.
So a sector_t is reasonable here.
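
The split between (i) and (ii) can be sketched as two structures plus a conversion between them. All names below are hypothetical and chosen for this example; 512-byte sectors and a 32-bit sector_t are assumed.

```c
#include <stdint.h>

typedef uint32_t sector_t;      /* assumption: 32-bit sector_t build */

/* (i) What the partition parser found: bytes, always 64 bits. */
struct parsed_part {
        uint64_t start;         /* offset in bytes */
        uint64_t length;        /* length in bytes */
};

/* (ii) What block I/O uses: sectors, in the unit and width of sector_t. */
struct io_part {
        sector_t start_sect;
        sector_t nr_sects;
};

/*
 * Convert (i) to (ii). Returns 0 on success, or -1 if the partition is
 * not aligned to 512 bytes or does not fit in sector_t.
 */
static int parsed_to_io(const struct parsed_part *p, struct io_part *io)
{
        const uint64_t max = (sector_t)-1;
        uint64_t start = p->start >> 9;
        uint64_t len = p->length >> 9;

        if ((p->start | p->length) & 511)
                return -1;
        /* start and len are below 2^55 here, so the sum cannot wrap. */
        if (start > max || len > max || start + len > max)
                return -1;

        io->start_sect = (sector_t)start;
        io->nr_sects = (sector_t)len;
        return 0;
}
```

The parsed table stays faithful to what is on disk even on a kernel that cannot do I/O to all of it; the failure only shows up at the conversion step.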


> Actually, I'm not all that sure that we want u64 here. The thing being,
> start_sect shouldn't be bigger than sector_t (see how it's used). And
> 64bit arithmetics on 32bit boxen sucks big way. I'm not too concerned
> about adding start_sect per se - it's done once per request and it's
> noise compared to the rest of work. However, long long for sector_t
> will hit in a lot of more interesting code paths.

It will be unavoidable soon. For many applications it is needed today.

> Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> device should seek professional help of the kind they don't give on l-k...

I don't see how this can be relevant. If the device is large and you
make it one big partition then the size of the partition will need more
than 32 bits. If you split it up into lots of tiny 2 TB partitions
then the offsets will need more than 32 bits.
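
To make the offset argument concrete, here is a small sketch with a made-up helper that computes where each partition starts when a large disk is cut into equal pieces (512-byte sectors assumed):

```c
#include <stdint.h>

/*
 * Hypothetical helper: starting sector of partition number `n`
 * (counting from 0) when a disk is divided into equal partitions of
 * `part_bytes` bytes each, with 512-byte sectors.
 */
static uint64_t part_start_sector(unsigned n, uint64_t part_bytes)
{
        return (uint64_t)n * (part_bytes >> 9);
}
```

With 2 TiB partitions, part_start_sector(1, 1ULL << 41) is already 2^32, which no longer fits in a 32-bit start_sect.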

I did my partition stuff seven years ago, and at that time discussion
was possible: is it really necessary to use 64 bits?
Today no discussion is possible. Yes, u64 is needed.

Andries


[As a separate discussion:
I used a sparse setup, that is why the struct describing a partition also
had the partition number. Your version with 256 structs looks a bit clumsy.
In most setups 256 is a waste. In some it is not enough.
Sparseness is useful for user space. But of course I had a 64-bit dev_t.]

2002-07-25 17:45:46

by Anton Altaparmakov

Subject: RE: 2.5.28 and partitions

At 17:50 25/07/02, Alexander Viro wrote:
>On Thu, 25 Jul 2002, Anton Altaparmakov wrote:
>
> > At 12:44 25/07/02, Alexander Viro wrote:
> > >Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> > >device should seek professional help of the kind they don't give on l-k...
> >
> > Why? What is wrong with large devices/file systems? Why do we have to break
> > up everything into multiple devices? Just because the kernel is "too lazy"
> > to implement support for large devices? Nobody cares if 64bit code is
>
>Large filesystem => troubles with backups, even more troubles with restoring
>after disk failure, yadda, yadda.
>
> > database server and we can live with that. At least our applications deal
> > with GiBs of data for each experiment, which is shifted over Gigabit
> > ethernet to/from a SQL database backend stored on a huge RAID array, so we
> > are completely i/o bound.
> >
> > It's one database, and it's huge. And it's going to get bigger as people do
> > more experiments. We need mkfs.<whatever> on a huge device... We are just
> > lucky that our current RAID array is under 2TiB so we haven't hit the
> > "magic" barrier quite yet. But at 1.4TiB we are not far off...
>
>... and backups of your database are done on...?

LTO Ultrium attached via SCSI to the file server. Or they will be once we
get Arcserve for Linux with support for the Ultrium which should hopefully
be in the near future...

>"RAID" doesn't mean that data is safe. It means that some class of
>failures will not be immediately catastrophic, but that's it - both
>hardware and software arrays _do_ go tits up. Just ask hpa for story
>of the troubles on kernel.org.

Indeed. Hence the Ultrium tape based backup...

Best regards,

Anton



2002-07-25 17:55:05

by Rik van Riel

Subject: RE: 2.5.28 and partitions

On Thu, 25 Jul 2002, Alexander Viro wrote:
> On Thu, 25 Jul 2002, Anton Altaparmakov wrote:
> > At 12:44 25/07/02, Alexander Viro wrote:

> > It's one database, and it's huge.
>
> ... and backups of your database are done on...?

LVM snapshot + rsync to an identical machine elsewhere ?

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"

http://www.surriel.com/ http://distro.conectiva.com/

2002-07-25 18:24:36

by Alexander Viro

Subject: RE: 2.5.28 and partitions



On Thu, 25 Jul 2002, Rik van Riel wrote:

> On Thu, 25 Jul 2002, Alexander Viro wrote:
> > On Thu, 25 Jul 2002, Anton Altaparmakov wrote:
> > > At 12:44 25/07/02, Alexander Viro wrote:
>
> > > It's one database, and it's huge.
> >
> > ... and backups of your database are done on...?
>
> LVM snapshot + rsync to an identical machine elsewhere ?

Works fine until you find a nasty bug in (identical) firmware.

<cue story about RAID5 built out of a bunch of Seagates; a year later
6 disks out of 16 went to hell during a weekend - ones that had
serial numbers within a $SMALLNUM from each other>

And that's aside from the "wisdom" of using LVM...

2002-07-26 05:10:25

by Adrian Bunk

Subject: RE: 2.5.28 and partitions

On Thu, 25 Jul 2002, Anton Altaparmakov wrote:

> At 14:24 25/07/02, Petr Vandrovec wrote:
> >But I care whether gcc barfs on code or not, and whether generated code
> >is correct or not.
>
> Everyone cares about that! That has nothing to do with performance. It's
> simply a broken compiler which needs fixing.
>...

Unfortunately the 2.95 branch of gcc is more or less dead: no one maintains
it and no new release is planned. It's perhaps more useful work to get
the kernel compiling with gcc 3.1/3.2 ...

> Best regards,
>
> Anton

cu
Adrian

--

You only think this is a free country. Like the US the UK spends a lot of
time explaining its a free country because its a police state.
Alan Cox


2002-07-27 05:54:52

by Austin Gonyou

Subject: RE: 2.5.28 and partitions

On Thu, 2002-07-25 at 11:50, Alexander Viro wrote:
> On Thu, 25 Jul 2002, Anton Altaparmakov wrote:
>
> > At 12:44 25/07/02, Alexander Viro wrote:
> > >Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> > >device should seek professional help of the kind they don't give on l-k...
> >
> > Why? What is wrong with large devices/file systems? Why do we have to break
> > up everything into multiple devices? Just because the kernel is "too lazy"
> > to implement support for large devices? Nobody cares if 64bit code is
>
> Large filesystem => troubles with backups, even more troubles with restoring
> after disk failure, yadda, yadda.

Right, but that doesn't stop people anyway. Whether you have one or a
hundred file systems, you still back them up and restore them, and the
bottleneck is still, usually, the TBU bus (i.e. TBU speed + SCSI||FC
speed + network speed + disk speed == latency of some kind for
restores).

> > database server and we can live with that. At least our applications deal
> > with GiBs of data for each experiment, which is shifted over Gigabit
> > ethernet to/from a SQL database backend stored on a huge RAID array, so we
> > are completely i/o bound.
> >
> > It's one database, and it's huge. And it's going to get bigger as people do
> > more experiments. We need mkfs.<whatever> on a huge device... We are just
> > lucky that our current RAID array is under 2TiB se we haven't hit the
> > "magic" barrier quite yet. But at 1.4TiB we are not far off...
>
> ... and backups of your database are done on...?

Tape usually, which in itself is a problem. But more and more people
are implementing "third mirrors" of some kind, whether snapshots or
full copies of the data, using FC or fast shared SCSI subsystems, or
Gigabit NFS to a NAS filer or to a centralized system with one large
file system with many directories for backups.

My shop has several TB+ DBs... and no, we don't have TB-sized filesystems
yet, but it will happen. Our largest single FS is 400GB+ and will only
get bigger in the future. I'm well aware, though, that even with every
FS under 2TB we still back up to tape, and we do it from a copy of the
data, by making a mirror of our already mirrored FSes.

We then mount those FSes on a host capable of mounting them, and
back up from that storage to the TBU directly attached to it.

Bottleneck on all of this is in fact the TBU and associated bus.

> "RAID" doesn't mean that data is safe. It means that some class of
> failures will not be immediately catastrophic, but that's it - both
> hardware and software arrays _do_ go tits up. Just ask hpa for story
> of the troubles on kernel.org.

Sure, RAID doesn't mean it's "totally safe", but it does mean it's
"safer" than it would have been otherwise. This is often referred
to as crisis management. The idea is that you do *not* run in a degraded
mode for very long, and that you take care of the problematic
hardware/software ASAP so that you do *not* lose your data.

I say this regardless of the fact that it *can and does* happen
occasionally, because there are plenty of corner cases where RAID and
data protection are concerned.

Support for *large filesystems*, TB size and up, is imperative in the future.
Things will only get bigger and bigger as far as they are able, and with
people like Hitachi and Sony coming out with optical disks the size of
a 1.44MB floppy holding terabytes and up, this is just the beginning.

Mind you, the scenario I just described is not nearline or offline
storage. It is fully rewritable optical media, faster than a typical
hdd, and will be available to consumers, if it isn't vapor, within
12 months (consumers being corporations first, then general
consumption).


--
Austin Gonyou <[email protected]>

2002-07-31 18:13:57

by Pavel Machek

Subject: Re: 2.5.28 and partitions

Hi!

> > Note that there is one place where 64 bits is simply _too_ expensive, and
> > that's the page cache. In particular, the "index" in "struct page". We
> > want to make "struct page" _smaller_, not larger.
> >
> > Right now that means that 16TB really is a hard limit for at least some
> > device access on a 32-bit machine with a 4kB page-size (yes, you could
> > make a filesystem that is bigger, but you very fundamentally cannot make
> > individual files larger than 16TB).
>
> ITYM "8Tb" - indices are signed, IIRC. OTOH, it's not 2^31 * PAGE_SIZE -
> it's 2^31 * PAGE_CACHE_SIZE, which can be bigger.
>
> Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> device should seek professional help of the kind they don't give on
> l-k...

Why?

It's Linux's job to make this work. If I happen to own 20 120GB disks,
what's wrong with just running mkfs on them? If mkfs.ext3 on a 2TB array
is a reason for seeking professional help, then there's something wrong
with Linux.
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?
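
The size limits traded in this thread all come from the same arithmetic:
an index of some width multiplied by a unit size. A 32-bit sector number
with 512-byte sectors gives 2 TiB; a 32-bit page index with 4 kB pages
gives 16 TiB, or 8 TiB if one bit is lost to the sign. A minimal sketch
that checks those figures (the helper name addressable_bytes is made up
for illustration and is not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of bytes addressable with an index of `bits` bits when each
 * index value names a unit of `unit` bytes (a sector, a page, ...).
 */
uint64_t addressable_bytes(unsigned bits, uint64_t unit)
{
    assert(bits < 64 && unit != 0);
    /* The product must itself fit in 64 bits. */
    assert(unit <= (UINT64_MAX >> bits));
    return ((uint64_t)1 << bits) * unit;
}
```

For example, addressable_bytes(32, 512) is 2^41 bytes (2 TiB), and
addressable_bytes(31, 4096) is 2^43 bytes (8 TiB).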

2002-07-31 22:35:43

by Alexander Viro

Subject: Re: 2.5.28 and partitions



On Thu, 1 Aug 2002, Peter Chubb wrote:

> Maybe we need to roll our own? I suggest something like:
> struct linux_volume_header {
> char volname[16];
> __u32 nparts;
> __u32 blocksize;
> struct linux_partition {
> char partname[16]
> __u64 start;
> __u64 len;
> __u32 usage;
> __u32 flags;
> } parts[]
> }

Oh, ferchrissake! WHY??? People, we'd seen a lot of demonstrations
of the reasons why binary structures are *bad* for such stuff.

What the bleedin' hell is wrong with <name> <start> <len>\n - all in ASCII?
Terminated by \0. No need for flags, no need for endianness crap, no
need to worry about field becoming too narrow...

What, parsing that would be too slow? Right. Sure. How many times do
we parse partition table? How many times do we end up reading it from
disk? How does IO time compare to the "overhead" of trivial sscanf loop?

Furrfu... "ASCII is tough, let's go shopping"...
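
A minimal userspace sketch of what such a loop could look like, assuming
one "<name> <start> <len>" record per line and a NUL terminating the
table. The struct, the 31-character name limit and the function names are
assumptions made for illustration; nothing here is taken from the kernel.
It uses strtoull rather than sscanf so that out-of-range numbers are
rejected instead of silently wrapping.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define PART_NAME_MAX 31        /* hypothetical limit, not from the thread */

struct part {
    char name[PART_NAME_MAX + 1];
    unsigned long long start;
    unsigned long long len;
};

/* Parse one unsigned decimal field; return a pointer past it, or NULL. */
static const char *parse_field(const char *s, unsigned long long *out)
{
    char *end;

    if (*s < '0' || *s > '9')   /* no sign, no blanks, no empty field */
        return NULL;
    errno = 0;
    *out = strtoull(s, &end, 10);
    if (errno == ERANGE)        /* value does not fit */
        return NULL;
    return end;
}

/*
 * Parse "<name> <start> <len>\n" records from a NUL-terminated buffer.
 * Returns the number of records stored, or -1 on malformed input.
 */
int parse_table(const char *buf, struct part *out, int max)
{
    int n = 0;

    assert(buf != NULL && out != NULL);
    while (*buf != '\0') {
        size_t namelen = strcspn(buf, " \n");

        if (n == max || namelen == 0 || namelen > PART_NAME_MAX ||
            buf[namelen] != ' ')
            return -1;
        memcpy(out[n].name, buf, namelen);
        out[n].name[namelen] = '\0';

        buf = parse_field(buf + namelen + 1, &out[n].start);
        if (buf == NULL || *buf != ' ')
            return -1;
        buf = parse_field(buf + 1, &out[n].len);
        if (buf == NULL || *buf != '\n')
            return -1;
        buf++;                  /* step over the newline */
        n++;
    }
    return n;
}
```

Trailing garbage, a missing field or a number too large for the type all
make the whole table invalid, which seems the safer choice for something
read from disk.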

2002-07-31 22:44:31

by Matt Domsch

Subject: RE: 2.5.28 and partitions

Hi Peter. Thanks for your work on LBD for 2.5.x. I'm really looking
forward to its inclusion.

> What we really need to be able to do, however, is partition these huge
> discs, if only so that each partition is less than a reasonable number
> of backup tapes/devices/whatever. And at present the only scheme that
> Linux understands for partitioning huge discs is the EFI GUID scheme.

:-)

> Maybe we need to roll our own?

What's wrong with EFI GUID scheme (GPT) (other than it wasn't invented by
Linux folks)?

the 2.5.x kernel understands it today
the 2.4.x kernel could very easily understand it (patch available on
http://domsch.com/linux/patches/gpt against 2.4.19-rc1), and ia64 has had it
for a couple years.
partx understands it today
parted understands it today
(efibootmgr and the EFI environment understand it today, but that's only
relevant to IA-64 at the moment)


> Maybe we need to roll our own? I suggest something like:
> struct linux_volume_header {
> char volname[16];
> __u32 nparts;
> __u32 blocksize;

The disk can tell you its blocksize. The FS will have its own idea anyhow.

> struct linux_partition {
> char partname[16]
> __u64 start;
> __u64 len;
> __u32 usage;
> __u32 flags;
> } parts[]
> }
>
> the whole to fit into a 4k block at the start of the volume, with a
> crc32 at the end.
>
> Usage to be a magic number that says this is a swap, spare,
> whole-disc, filesystem+type, whatever, partition.
>
> flags for whatever we want.

All of this is already done in GPT today, or could be if desired (spare,
etc). Tagging the FS type inside the partition table isn't pretty, and has
led to the huge table of partition type numbers that Andries maintains,
when fs probing isn't hard.

> I can't see anyone booting from a huge array in the near-term future,
> because you need the BIOS to understand the array.

Sure, so we don't have to fix grub or lilo to understand GPT yet. :-)

Unless there's something that GPT doesn't do well, I'd prefer not to make
yet another partitioning scheme. If there is something else it needs, it
can be extended.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer, Architect
Dell Linux Solutions http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
#1 US Linux Server provider for 2001 and Q1/2002! (IDC May 2002)

2002-07-31 22:54:30

by Anton Altaparmakov

Subject: RE: 2.5.28 and partitions

At 23:47 31/07/02, Matt Domsch wrote:
> > Maybe we need to roll our own?
>
>What's wrong with EFI GUID scheme (GPT) (other than it wasn't invented by
>Linux folks)?
[snip]
>Unless there's something that GPT doesn't do well, I'd prefer not to make
>yet another partitioning scheme. If there is something else it needs, it
>can be extended.

And if there is something GPT doesn't do then there is Veritas LDM (also
used in simplified form by Windows LDM) and the kernel understands it
today. Admittedly none of the Linux partitioning tools support it yet but
that is subject to change. (-; LDM is journalled, supports large numbers of
disks, huge disks, all sorts of RAID, etc... I don't think you will find
anything missing in that one...

So I fully agree that inventing yet another partitioning scheme is silly in
view of the multitude of existing ones which do the job just fine. Feel
free to prove me wrong by showing me something that GPT/LDM can't do...

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-07-31 23:35:04

by Matt Domsch

Subject: RE: 2.5.28 and partitions

> Matt> What's wrong with EFI GUID scheme (GPT) (other than it wasn't
> Matt> invented by Linux folks)?
>
> Nothing, except it's not used on all platforms yet.

(set boot issues aside for now)
It could. I use it on x86 and IA-64 now. I think Richard Hirst found the
last (knock on wood) of my endianness bugs about 6 months ago, so I know it
works on BE and LE non-Intel machines. It's in the partitioning menu, not
specific to arch. The only arch dependency in code is on asm-ia64/efi.h for
some typedefs, which is annoying but not hard to fix if desired (move
relevant bits to include/linux/efi.h).

> For my machines the *only* reason for having a legacy partitioning
> scheme is to allow booting.

As you point out, booting is BIOS-specific. So for now boot from a disk with a
native scheme (where your OS resides already) and mount that 64XB file
system for data afterwards. By the time that doesn't work, 32-bit CPUs will
be dead anyhow.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer, Architect
Dell Linux Solutions http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
#1 US Linux Server provider for 2001 and Q1/2002! (IDC May 2002)

2002-07-31 23:38:46

by Alexander Viro

Subject: Re: 2.5.28 and partitions



On Thu, 1 Aug 2002, Peter Chubb wrote:

> >>>>> "Alexander" == Alexander Viro <[email protected]> writes:
>
> Alexander> On Thu, 1 Aug 2002, Peter Chubb wrote:
>
> Alexander> What the bleedin' hell is wrong with <name> <start> <len>\n
> Alexander> - all in ASCII? Terminated by \0. No need for flags, no
> Alexander> need for endianness crap, no need to worry about field
> Alexander> becoming too narrow...
>
> I guess as it won't be used it for booting that'd be fine... except I
> really *don't* like the idea of any kind of parser in the kernel

Please. It's ~6 lines of loop. And if somebody can't write a "parser"
of such kind correctly, I really don't like the idea of having his
code in the kernel - failing C101 doesn't inspire a lot of confidence.

And I don't see what's the problem on the boot side - finding first entry
with name that starts with (say it) '*', skipping to next space and
converting the following digits into a number... <shrug>

2002-08-01 10:10:30

by Marcin Dalecki

Subject: Re: 2.5.28 and partitions

Alexander Viro wrote:
>
> On Thu, 1 Aug 2002, Peter Chubb wrote:
>
>
>>Maybe we need to roll our own? I suggest something like:
>> struct linux_volume_header {
>> char volname[16];
>> __u32 nparts;
>> __u32 blocksize;
>> struct linux_partition {
>> char partname[16]
>> __u64 start;
>> __u64 len;
>> __u32 usage;
>> __u32 flags;
>> } parts[]
>> }
>
>
> Oh, ferchrissake! WHY??? People, we'd seen a lot of demonstrations
> of the reasons why binary structures are *bad* for such stuff.
>
> What the bleedin' hell is wrong with <name> <start> <len>\n - all in ASCII?
> Terminated by \0. No need for flags, no need for endianness crap, no
> need to worry about field becoming too narrow...
>
> What, parsing that would be too slow? Right. Sure. How many times do
> we parse partition table? How many times do we end up reading it from
> disk? How does IO time compare to the "overhead" of trivial sscanf loop?
>
> Furrfu... "ASCII is tough, let's go shopping"...

What's wrong with ASCII processing? Easy to tell:

1. Look at Bugtraq. (http://www.securityfocus.com)

2. It makes the data *not agnostic* to i18n issues. This is
something most people forget about. /proc is LANG=en_US, ISO8859-1. I
do not like this language.

3. For some as yet undiscovered reason, actual application programmers
hate processing it.

4. Answer 1 should actually be sufficient.

2002-08-01 16:42:04

by kaih

Subject: Re: 2.5.28 and partitions

Marcin Dalecki wrote on 01.08.02:

> Alexander Viro wrote:
> >
> > On Thu, 1 Aug 2002, Peter Chubb wrote:
> >
> >
> >>Maybe we need to roll our own? I suggest something like:
> >> struct linux_volume_header {
> >> char volname[16];
> >> __u32 nparts;
> >> __u32 blocksize;
> >> struct linux_partition {
> >> char partname[16]
> >> __u64 start;
> >> __u64 len;
> >> __u32 usage;
> >> __u32 flags;
> >> } parts[]
> >> }
> >
> >
> > Oh, ferchrissake! WHY??? People, we'd seen a lot of demonstrations
> > of the reasons why binary structures are *bad* for such stuff.
> >
> > What the bleedin' hell is wrong with <name> <start> <len>\n - all in
> > ASCII? Terminated by \0. No need for flags, no need for endianness crap,
> > no need to worry about field becoming too narrow...
> >
> > What, parsing that would be too slow? Right. Sure. How many times do
> > we parse partition table? How many times do we end up reading it from
> > disk? How does IO time compare to the "overhead" of trivial sscanf loop?
> >
> > Furrfu... "ASCII is tough, let's go shopping"...
>
> Whats wrong with ASCII processing? Easy to tell:
>
> 1. Look at bagtraq. (http://www.securityfocus.com)

I can't see how that can possibly apply to this case. Getting a parser for
*this* format wrong enough to allow for an attack needs incredible
stupidity.

And remember, this is not something for generic applications to parse:
possibly the bootloader (unless it's LILO), the kernel, and fdisk. That's
it.

> 2. It's making data *not agnostic* against i18n issues. This is
> something most people forgett about. /proc is LANG=en_US. ISO8859-1 - I
> do not like this language.

I18n in partition names? Because that's certainly the only part in there
that seems to even be possible. Just define that text in partition names
is supposed to be UTF-8, and if there's ever anything that needs to be
understood by programs (as opposed to just handed straight through between
user and disk), make that be ASCII.

(However, I'd put the name as the _last_ field, possibly with a different
separator [in case we ever want more fields], so I'd not need to think
about any special characters in there.)

Oh, and don't forget some kind of magic string at the beginning. And
possibly add some (optional) uuid. Helps with partitions moving around if
the filesystem hasn't one.

> 3. For some as of jet undiscovered reason actual application programmers
> hate processing it.

Very few of them need to.

> 4. Answer 1. should be actually sufficient.

Not even remotely.


Ok, that makes a format proposal as follows:

#*Partition table*#
512 156258637 547af-d65e78-8978af =My Volume Label\n
0 4 ptable\n
4 7868 6562-adfea-898809aa =Bootloader\n
7872 150000000 a6f5-c9ba-6532 =Linux root\n
150007872 6250765\n
\0

That would be a 512-bytes-per-block volume of around 80 GB (assuming I
didn't miscalculate), with two data partitions and some free space at the
end, and with both uuid and name fields being optional, except that a
uuid of "ptable" marks the partition table partition.

Every line after the magic parses as /^\s*\d+\s+\d+(\s+[\w-]+)?\s*(=.*)?$/
(in Perl notation). It's free space if it only has the first two fields;
the first line describes the whole volume and uses the first field for the
block size. (Obviously, if you need boot code at the beginning of the
disk, the partition table will need to start at some later sector. So that
field has actual meaning.)

As for finding where to boot from - either have the bootloader define a
partition name it wants to see, or put the relevant name into the boot
loader config. No need to define that in the partition format. That's
trivial: even MS-DOS did that (finding IO.SYS and MSDOS.SYS from the boot
loader)! And neither scanning for '=' and '\n' nor comparing one string
nor converting one number from decimal is any kind of hardship. Maybe half
a screen of assembler, tops.
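
A rough sketch of a parser for one line of the format proposed above:
two decimal numbers, then an optional uuid token, then an optional
"=name" running to the end of the line. The struct, the buffer sizes and
the function names are assumptions for illustration, and the sketch is
slightly more lenient about whitespace than the regular expression.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct entry {
    unsigned long long start, len;
    char uuid[40];              /* empty string if absent */
    char name[64];              /* empty string if absent */
};

static const char *skip_blanks(const char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/* Parse an unsigned decimal number; return pointer past it, or NULL. */
static const char *parse_num(const char *s, unsigned long long *out)
{
    char *end;

    if (!isdigit((unsigned char)*s))
        return NULL;
    errno = 0;
    *out = strtoull(s, &end, 10);
    return errno == ERANGE ? NULL : end;
}

/* Parse one line (without its newline). Returns 0 on success, -1 if bad. */
int parse_entry(const char *line, struct entry *e)
{
    const char *s;
    size_t n;

    assert(line != NULL && e != NULL);
    memset(e, 0, sizeof *e);    /* also NUL-terminates uuid and name */

    s = parse_num(skip_blanks(line), &e->start);
    if (s == NULL || (*s != ' ' && *s != '\t'))
        return -1;
    s = parse_num(skip_blanks(s), &e->len);
    if (s == NULL || (*s != '\0' && *s != ' ' && *s != '\t'))
        return -1;

    s = skip_blanks(s);
    if (*s != '\0' && *s != '=') {      /* optional uuid token */
        n = strcspn(s, " \t=");
        if (n >= sizeof e->uuid)
            return -1;
        memcpy(e->uuid, s, n);
        s = skip_blanks(s + n);
    }
    if (*s == '=') {                    /* optional name: rest of line */
        n = strlen(s + 1);
        if (n >= sizeof e->name)
            return -1;
        memcpy(e->name, s + 1, n);
        s += 1 + n;
    }
    return *s == '\0' ? 0 : -1;
}
```

An entry with neither uuid nor name comes back with both strings empty,
which is how free space is marked in the proposal.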

MfG Kai

2002-08-01 19:26:12

by Thunder from the hill

Subject: Re: 2.5.28 and partitions

Hi,

On Wed, 31 Jul 2002, Alexander Viro wrote:
> What the bleedin' hell is wrong with <name> <start> <len>\n - all in ASCII?
> Terminated by \0. No need for flags, no need for endianness crap, no
> need to worry about field becoming too narrow...

Well, why not long[] fields? Might be more powerful, and possibly not any
slower than ASCII.

Thunder
--
.-../../-./..-/-..- .-./..-/.-.././.../.-.-.-

2002-08-01 20:27:56

by Alexander Viro

Subject: Re: 2.5.28 and partitions



On Thu, 1 Aug 2002, Thunder from the hill wrote:

> Hi,
>
> On Wed, 31 Jul 2002, Alexander Viro wrote:
> > What the bleedin' hell is wrong with <name> <start> <len>\n - all in ASCII?
> > Terminated by \0. No need for flags, no need for endianness crap, no
> > need to worry about field becoming too narrow...
>
> Well, why not long[] fields? Might be more powerful, and possibly not any
> slower than ASCII.

More powerful in which way? I see where it's less powerful - sizeof(long)
is platform-dependent and so is endianness. More powerful? Maybe, if
you have integers that do not have decimal representation. I've never
heard of such beasts, but sure would appreciate some examples.

As for the Martin's comments... Martin, if you can't write a function
that checks whether array of characters has a contents fitting the
description above - stand up and say so. Aloud. In public.

The fact that thousands of selfstyled "programmers" manage to screw that
up says only one thing - that they should not be allowed anywhere near
programming. Because the same guys screw up in _anything_ they do,
no matter what data types are involved. ASCII is tough? Make it "arithmetic
is tough". Examples on demand, including real gems like
	fread(&foo, sizeof(foo), 1, fp);
	if (foo.x >= 100000 || foo.y >= 100000)
		/* fail and exit */
	p = (char *)malloc(foo.x * foo.y);
	if (!p)
		/* fail and exit */
	for (i = 0; i < foo.x; i++)
		fread(p + i*foo.y, 1, foo.y, fp);
and similar wonders (if anybody wonders what's wrong with the code above,
you need to learn how multiplication is defined on int and compare 10^10 with
2^32). And yes, it's real-life code, from often-used programs. Used on
untrusted data, at that.

Should we declare that arithmetics is dangerous?
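
To spell out the bug in the fragment above: both fields may be as large
as 99999, so the product can reach 9999800001, which is more than 2^32
(4294967296). With 32-bit int the multiplication wraps, malloc is handed
a size far smaller than intended, and the fread loop then writes past the
end of the buffer. A minimal sketch of a checked multiplication that would
catch this (checked_area is a hypothetical helper, not part of any
library):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute x * y for use as an allocation size.
 * Returns 0 and stores the product in *out if it fits in size_t,
 * or -1 if the multiplication would overflow.
 */
int checked_area(size_t x, size_t y, size_t *out)
{
    assert(out != NULL);
    if (y != 0 && x > SIZE_MAX / y)
        return -1;              /* x * y would wrap around */
    *out = x * y;
    return 0;
}
```

The caller would pass the result to malloc only when the function returns
0. If the header fields are signed, negative values also need rejecting
before they are converted to size_t.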

2002-08-01 20:42:17

by Thunder from the hill

Subject: Re: 2.5.28 and partitions

Hi,

On Thu, 1 Aug 2002, Alexander Viro wrote:
> More powerful?

Well, compared to ASCII: it's unlikely that you meet a j letter or a \033
in the size string.

I meant something like GMP. We'd need special maths functions for
this crap then, of course, but it might be worth it. The basic idea is just
"If field[n] overflows, push the overflow into field[n+1]"...

But you're of course right that endianness is an issue here.
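
The carry scheme described above amounts to multi-word addition. A small
sketch, assuming 32-bit fields stored least significant first (the
function name and the choice of uint32_t are illustrative; GMP itself
is not used here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Add v to a number stored as n 32-bit fields, least significant first.
 * Whatever does not fit in field[i] is carried into field[i + 1].
 * Returns the carry out of the last field; nonzero means the array
 * was too small to hold the result.
 */
unsigned limbs_add(uint32_t *field, size_t n, uint32_t v)
{
    uint64_t carry = v;
    size_t i;

    assert(field != NULL);
    for (i = 0; i < n && carry != 0; i++) {
        uint64_t sum = (uint64_t)field[i] + carry;

        field[i] = (uint32_t)sum;   /* low 32 bits stay in this field */
        carry = sum >> 32;          /* the rest moves up */
    }
    return (unsigned)carry;
}
```

The order of the fields, and of the bytes within each field, still has to
be fixed by the on-disk format, which is the endianness issue conceded
above.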

Thunder
--
.-../../-./..-/-..- .-./..-/.-.././.../.-.-.-

2002-08-01 21:05:43

by Alexander Viro

Subject: Re: 2.5.28 and partitions



On Thu, 1 Aug 2002, Thunder from the hill wrote:

> Hi,
>
> On Thu, 1 Aug 2002, Alexander Viro wrote:
> > More powerful?
>
> Well, compared to ASCII: it's unlikely that you meet a j letter or a \033
> in the size string.

Huh??? That's a new meaning of "powerful"... If you mean "more compact"
I would certainly agree (binary is base-256 where ASCII is base-10), but if
_that_ becomes a problem with partition tables... IIRC, OP proposed 4096
bytes for the table.

Again, if somebody really can't check if array of characters is a valid
representation of integer or can't implement conversion of known valid
one to its value... What the devil are you doing here?
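
For what it is worth, the check and the conversion fit in one short
function. A sketch, assuming the field is given as a pointer and a length
and must hold an unsigned decimal number that fits in 64 bits (the
function name is made up):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check that s[0..n-1] consists only of decimal digits and convert it.
 * Returns 0 and stores the value in *out on success; returns -1 if the
 * field is empty, contains a non-digit, or does not fit in 64 bits.
 */
int parse_decimal_u64(const char *s, size_t n, uint64_t *out)
{
    uint64_t v = 0;
    size_t i;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return -1;
    for (i = 0; i < n; i++) {
        unsigned d;

        if (s[i] < '0' || s[i] > '9')
            return -1;                  /* a 'j', an ESC, a sign, ... */
        d = (unsigned)(s[i] - '0');
        if (v > (UINT64_MAX - d) / 10)
            return -1;                  /* v * 10 + d would overflow */
        v = v * 10 + d;
    }
    *out = v;
    return 0;
}
```

A stray letter or control character in the size string is rejected by the
digit test, and a number that is too large is rejected before the
multiplication rather than after it.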

2002-08-01 21:03:49

by Marcin Dalecki

Subject: Re: 2.5.28 and partitions

Alexander Viro wrote:
>
> On Thu, 1 Aug 2002, Thunder from the hill wrote:
>
>
>>Hi,
>>
>>On Wed, 31 Jul 2002, Alexander Viro wrote:
>>
>>>What the bleedin' hell is wrong with <name> <start> <len>\n - all in ASCII?
>>>Terminated by \0. No need for flags, no need for endianness crap, no
>>>need to worry about field becoming too narrow...
>>
>>Well, why not long[] fields? Might be more powerful, and possibly not any
>>slower than ASCII.
>
>
> More powerful in which way? I see where it's less powerful - sizeof(long)
> is platform-dependent and so is endianness. More powerful? Maybe, if
> you have integers that do not have decimal representation. I've never
> heard of such beasts, but sure would appreciate some examples.
>
> As for the Martin's comments... Martin, if you can't write a function
> that checks whether array of characters has a contents fitting the
> description above - stand up and say so. Aloud. In public.

Actually you asked me to just shut up. Because, I assume, you guessed
that I'm able to write the corresponding code?

I will answer anyway ;-)

Sure, I'm able to do this. However, when I hear the word "parser" I
immediately think of *complete* parsers in the formal sense,
not a bunch of regexp guessing. Neither do
I think of that error-prone scanning for '\0' or fumbling
with xxx[strlen(xxx)]. And yes, using lex and yacc *is actually* easy
for me.

So unless you provide me with, for example, a *complete* BNF
grammar definition of /proc, I will always claim that using it or ASCII
based interfaces is:

1. Not easy.

2. Like walking on quicksand.

Oh well: I will accept EBNF as well...

Looking at some structs relieves one of this headache.


> The fact that thousands of selfstyled "programmers" manage to screw that
> up says only one thing - that they should not be allowed anywhere near
> programming. Because the same guys screw up in _anything_ they do,
> no matter what data types are involved. ASCII is tough? Make it "arithmetics
> is tough". Examples on demand, including real gems like
> fread(&foo, sizeof(foo), 1, fp);
> if (foo.x >= 100000 || foo.y >= 100000)
> /* fail and exit */
> p = (char *)malloc(foo.x * foo.y);
> if (!p)
> /* fail and exit */
> for (i = 0; i < foo.x; i++)
> fread(p + i*foo.y. 1, foo.y, fp);
> and similar wonders (if anybody wonders what's wrong with the code above,
> you need to learn how multiplication is defined on int and compare 10^10 with
> 2^32). And yes, it's real-life code, from often-used programs. Used on
> untrusted data, at that.

Storing the constants in question in the above code sample
as ASCII at the start of where foo is pointing would hardly have
saved the poor overworked programmer's mind from precisely the same
mistake he made above. (Needless to say, you actually forgot
to mention that the code fails on <= 32-bit systems, instead of
providing the "hint" for guessing where the actual error is.)

It would just have doubled the code size, because he would
have to do the ASCII parsing and additionally he would
have to deal with moving offsets for reading the actual data.
Just more room for more mistakes.

The example above is therefore a bad example to support your point.
Actually it backfires. Like firing live rounds through
an AK47 with a training attachment still fitted to the end of the barrel.

> Should we declare that arithmetics is dangerous?

It is, it is... Dealing with the five Peano axioms leads you to
many, many weird concepts. Like, for example, infinity!!!

2002-08-01 21:23:49

by Alexander Viro

Subject: Re: 2.5.28 and partitions



On Thu, 1 Aug 2002, Marcin Dalecki wrote:

> > As for the Martin's comments... Martin, if you can't write a function
> > that checks whether array of characters has a contents fitting the
> > description above - stand up and say so. Aloud. In public.
>
> Actually you asked me to just shut up. Becouse I assume that you guessed
> that I'm able to write the corresponding code?
>
> I will anser anyway ;-)
>
> Sure I'm able to do this. However if I hear the words parser I
> immediately think *complete* parsers in the formal sense.
> Not a bunch of reg exp guessing. Neither do

Newsflash: for a Chomsky type-3 grammar "reg exp guessing" _IS_ a complete
parser in the formal sense.

> I think about that error prone scanning for '\0' or fumbling

OK. So "check if n bytes starting at address p contain zero and return
the distance of first zero from p if they do and n if they do not" is
error-prone task? Fiiine...
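
That task, written out as a sketch in standard C (POSIX strnlen does the
same job; the name bounded_len is made up here):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look for a zero byte in the n bytes starting at p.
 * Returns the distance of the first zero from p if there is one,
 * and n if there is none. Never reads beyond p[n - 1].
 */
size_t bounded_len(const char *p, size_t n)
{
    const char *zero;

    assert(p != NULL);
    zero = memchr(p, '\0', n);
    return zero != NULL ? (size_t)(zero - p) : n;
}
```

A return value equal to n tells the caller that the buffer holds no
terminator, so the table can be rejected instead of being read past its
end.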

> So unless you provide me with a... well for example, *complete* BNF
> grammar definition of /proc I will always claim that using it or ASCII
> based interfaces is:

What the devil does BNF for everything somebody decided to dump in some
file in procfs have to do with partition tables?

> > is tough". Examples on demand, including real gems like
> > fread(&foo, sizeof(foo), 1, fp);
> > if (foo.x >= 100000 || foo.y >= 100000)
> > /* fail and exit */
> > p = (char *)malloc(foo.x * foo.y);
> > if (!p)
> > /* fail and exit */
> > for (i = 0; i < foo.x; i++)
> > fread(p + i*foo.y. 1, foo.y, fp);
> > and similar wonders (if anybody wonders what's wrong with the code above,
> > you need to learn how multiplication is defined on int and compare 10^10 with
> > 2^32). And yes, it's real-life code, from often-used programs. Used on
> > untrusted data, at that.
>
> Storing the constants in question in the above code sample
> as ASCII at the start of where foo is pointing at, would have hardly
> saved the poor overworked programmers mind from precisely the same
> mistake he did above. (Needless to say that you actually forgott
> to mention that the code fails on <= 32 bit systems. Inestad of
> providing te "hint" for guessing where the actual error is.)

Huh???

you: "it's easy to screw up when working with ASCII strings"
me: "tossers will find a way to screw up on anything, no matter what it is;
see example of tosser screwing up on plain arithmetics"
you: "use of ASCII wouldn't help them in that case"

Sure thing, it wouldn't. _Nothing_ short of acquiring some clue would.
Possible solutions:
A) replace all arithmetics with BIGNUMs (and just you wait for
first out-of-memory)
B) get rid of tossers.

Matter of taste, indeed, but I'd rather go for (B) - has a benefit of
solving many other problems.

2002-08-01 21:21:34

by Albert D. Cahalan

Subject: Re: 2.5.28 and partitions

Alexander Viro writes:
> [...]
>> On Wed, 31 Jul 2002, Alexander Viro wrote:

>>> What the bleedin' hell is wrong with <name> <start> <len>\n - all in ASCII?
>>> Terminated by \0. No need for flags, no need for endianness crap, no
>>> need to worry about field becoming too narrow...

There's just that little overflow problem to worry about,
trailing garbage, encouragement of assumptions about the
maximum size... is that a %d or a %llu or what?

Given n fields and a constant c>5, there will be at least
exp(c,n) ways to parse and generate the data. All will be
implemented. In addition to disagreement over the format,
most parsers will be buggy.

You just like ASCII because that's the Plan 9 way.
There's a time and place for ASCII, and this isn't it.

> More powerful in which way? I see where it's less powerful - sizeof(long)
> is platform-dependent and so is endianness. More powerful? Maybe, if

type safety, given a C struct with proper alignment

> Should we declare that arithmetics is dangerous?

We should use FORTRAN or Pascal, with overflow/underflow
trapping enabled for integer math and array access. :-)

2002-08-01 21:27:33

by Marcin Dalecki

Subject: Re: 2.5.28 and partitions

Alexander Viro wrote:
>
> On Thu, 1 Aug 2002, Thunder from the hill wrote:
>
>
>>Hi,
>>
>>On Thu, 1 Aug 2002, Alexander Viro wrote:
>>
>>>More powerful?
>>
>>Well, compared to ASCII: it's unlikely that you meet a j letter or a \033
>>in the size string.
>
>
> Huh??? That's a new meaning of "powerful"... If you mean "more compact"
> I would certainly agree (base-10 instead of base-256), but if _that_ becomes
> a problem with partition tables... IIRC, OP proposed 4096 bytes for table.
>
> Again, if somebody really can't check if array of characters is a valid
> representation of integer or can't implement conversion of known valid
> one to its value... What the devil are you doing here?

Ahh, we are at the "devil" argument level... So I will ease myself:
Why the hell don't you rewrite the whole kernel in, for example, LISP if
you love string processing that much?
I know, I know, GCC people tried this in C for a compiler...

2002-08-01 21:38:25

by Alexander Viro

Subject: Re: 2.5.28 and partitions



On Thu, 1 Aug 2002, Marcin Dalecki wrote:

> Ahh. we are at "devil" arguemnt level... So I will ease myself:
> Why the hell don't you rewrite the whole kernel for example in LISP if
> you love string processing that much?

Huh?

What the <your pet expletive> does LISP have to do with strings?

> I know I know GCC people tryed this in C for a compiler...

gcc people tried a lot of crap in a lot of ways for a lot of reasons,
but
a) I'm not sure I've parsed your sentence correctly
b) I don't see what the flaming fsck it has to do with _anything_
discussed above.

2002-08-01 21:47:49

by Marcin Dalecki

Subject: Re: 2.5.28 and partitions

Alexander Viro wrote:
>
> Newsflash: for a Chomsky type-3 grammar, "reg exp guessing" _IS_ a complete
> parser in the formal sense.

Usually, only as long as you don't compare it with your *intentions*.
Please don't confuse the definition of a grammar with a parser implementation,
despite the fact that the reg-exp stuff looks like declarative
programming. What it does is *not* always equivalent to what it should do.
OK?

>>>is tough". Examples on demand, including real gems like
>>> fread(&foo, sizeof(foo), 1, fp);
>>> if (foo.x >= 100000 || foo.y >= 100000)
>>> /* fail and exit */
>>> p = (char *)malloc(foo.x * foo.y);
>>> if (!p)
>>> /* fail and exit */
>>> for (i = 0; i < foo.x; i++)
>>> fread(p + i*foo.y, 1, foo.y, fp);
>>>and similar wonders (if anybody wonders what's wrong with the code above,
>>>you need to learn how multiplication is defined on int and compare 10^10 with
>>>2^32). And yes, it's real-life code, from often-used programs. Used on
>>>untrusted data, at that.
>>
>>Storing the constants in question in the above code sample
>>as ASCII at the start of where foo is pointing would hardly have
>>saved the poor overworked programmer's mind from precisely the same
>>mistake he made above. (Needless to say, you actually forgot
>>to mention that the code fails on <= 32 bit systems, instead of
>>providing the "hint" for guessing where the actual error is.)
>
>
> Huh???
>
> you: "it's easy to screw up when working with ASCII strings"
> me: "tossers will find a way to screw up on anything, no matter what it is;
> see example of tosser screwing up on plain arithmetics"
> you: "use of ASCII wouldn't help them in that case"
>

Scratch the above: I tell you: "Not using ASCII greatly
diminishes the probability of the occurrence of the error."
And error rate depends on the size of code. No matter how
perfect you think someone has to be as a coder.
No code - no errors. The same buggy code twice - twice the same errors.

> Sure thing, it wouldn't. _Nothing_ short of acquiring some clue would.
> Possible solutions:
> A) replace all arithmetics with BIGNUMs (and just you wait for
> first out-of-memory)
> B) get rid of tossers.
>
> Matter of taste, indeed, but I'd rather go for (B) - has a benefit of
> solving many other problems.

No. The vomiting only moves to the time when you actually get your ass
up from the kernel and take a look at the code trying to use it. And
then it's more painful. Go look at libproc or its relatives, please. I don't
blame Albert for it! I blame the interface.
I'm not trying to tell you that binary interfaces are the best thing
since sliced bread. They are just less bad for the *actual* user.

That's the point.

2002-08-02 05:18:34

by Ryan Anderson

Subject: Re: 2.5.28 and partitions

On Thu, 1 Aug 2002, Alexander Viro wrote:
> On Thu, 1 Aug 2002, Marcin Dalecki wrote:
>
> you: "it's easy to screw up when working with ASCII strings"
> me: "tossers will find a way to screw up on anything, no matter what it is;
> see example of tosser screwing up on plain arithmetics"
> you: "use of ASCII wouldn't help them in that case"

Ages and ages ago (ok, not that long ago, really) I remember reading a
vaguely similar argument on Fidonet.

The argument was largely over the format of the "next gen" message
format - RFC822 came up a lot, and the ensuing "binary header" vs "ASCII
header" arguments would follow.

All I really learned from that was, "parsing email headers is more
complicated than people suspect" (binary or ascii, doesn't matter), and that
ASCII has one advantage that more compact/binary/etc formats lack:

When the ASCII file/header/partition table/whatever gets fscked
beyond all recognition, I can fix the goddamn thing with a text editor.

That last part is the reason most people utterly detest things like the
Windows registry and prefer the (imo, at least), much saner /etc design
prevalent in Linux distributions.


--
Ryan Anderson
sometimes Pug Majere

2002-08-02 14:51:00

by Jesse Pollard

Subject: Re: 2.5.28 and partitions

Kai Henningsen:
...
> As for finding where to boot from - either have the bootloader define a
> partition name it wants to see, or put the relevant name into the boot
> loader config. No need to define that in the partition format. That's
> trivial: even MS-DOS did that (finding IO.SYS and MSDOS.SYS from the boot
> loader)! And neither scanning for '=' and '\n' nor comparing one string
> nor converting one number from decimal is any kind of hardship. Maybe half
> a screen of assembler, tops.
>

Nope.

The problem is different - which file system is the file stored in?
How many different filesystems are there?
Do you think all of them will fit in a boot loader?
Or even one of them?
How many different logical volume structures are there?

To do this you first have to convince the development people to say that
"only xxxx filesystem shall be bootable".

Very unlikely.

And now, you also have to add possible logical volumes on top (or under :)
of it.

Even more unlikely.

That is why LILO doesn't use file names for boots. It only uses block
numbers.

Another alternative (possibly just as hard) is to have LILO only
load a more complex and dynamic loader, which could be configured for
each filesystem structure. Once that "dynamic loader" is loaded, it
could find and load the kernel (passing, of course, the boot command line
from LILO).

I know IRIX gets around the problem by having a tiny filesystem for the
"disk label". This filesystem contains only contiguous files, and has
references to the drive partition table, the complex boot program (sash -
stand alone shell), optional diagnostic boot, and logical volume membership -
one reference per logical volume type and partition .. I think it is
<lvm type>.<partitionnumber>
The contents of the file is volume name followed by the order of the partition
in the lvm (section 1, 2, 3, ..).

And this is not a "mountable" file system. It is only accessed via special
utilities (like the "mtools" set for non-mounted M$DOS floppies)

At least, I remember IRIX this way - it should be close.

SunOS had something a little different: the initial boot (at the bios level)
used block numbers to locate a "boot" utility. The "boot" utility knew about
the filesystem type. I think it was a link of the boot object with a fs
utility library, where the library was selected by a "makeboot" command
and by the filesystem type that the kernel(s) was(were) stored on. The
"makeboot" utility modified/replaced the "boot" program, then set the
block numbers in the boot sector.

All of this has truly horrible effects on boot times though. At a minimum
I would expect it to take twice as long.

You pay for the additional flexibility though.

-------------------------------------------------------------------------
Jesse I Pollard, II

Any opinions expressed are solely my own.

2002-08-02 19:43:01

by Mike Touloumtzis

Subject: Re: 2.5.28 and partitions

On Thu, Aug 01, 2002 at 05:41:53PM -0400, Alexander Viro wrote:
>
> On Thu, 1 Aug 2002, Marcin Dalecki wrote:
>
> > Ahh, we are at the "devil" argument level... So I will ease myself:
> > Why the hell don't you rewrite the whole kernel in, for example, LISP if
> > you love string processing that much?
>
> Huh?
>
> What the <your pet expletive> does LISP have to do with strings?

Umm... LISP is all about using strings instead of binary representations,
or at least hiding binary representations other than list building
primitives from the programmer.

Your ASCII partition table proposal is _exactly_ what a LISPer would
propose for partition tables: use strings to represent values in a format
that has no implicit size limits on numbers, is endian independent, etc.
The only difference is a LISPer would surround it with parentheses :-).

IMHO s-expressions are severely underrepresented as an
architecture-independent data representation that could more or less
eliminate the need for ad hoc parsers in, say, /proc. Of course
one-ASCII-symbol-per-file accomplishes more or less the same thing,
but for much higher system call overhead. I guess the ideal would be a
multi-file-spanning variation on seq_file (I think that's the name for
the stateful /proc parsing helper?) that would serialize the contents
of a tree into an s-expression, allowing the best of both worlds.

miket

2002-08-02 19:43:46

by Mike Touloumtzis

Subject: Re: 2.5.28 and partitions

On Thu, Aug 01, 2002 at 05:24:37PM -0400, Albert D. Cahalan wrote:
> Alexander Viro writes:
> > [...]
> >> On Wed, 31 Jul 2002, Alexander Viro wrote:
>
> >>> What the bleedin' hell is wrong with <name> <start> <len>\n
> >>> - all in ASCII? Terminated by \0. No need for flags, no need
> >>> for endianness crap, no need to worry about field becoming too
> >>> narrow...
>
> There's just that little overflow problem to worry about,

Ummm:

-- stuff ASCII digits into u64 (or u32, or whatever)
-- if (still more digits)
-- printk("partition too big to mount!\n")
-- return error

How hard is that?

> trailing garbage,

Don't write garbage into your partition table.

> encouragement of assumptions about the maximum size...
> is that a %d or a %llu or what?

See above. Use leading '-' for negative numbers. ASCII has no
2's complement ambiguity issues.

miket

2002-08-02 20:46:04

by Albert D. Cahalan

Subject: Re: 2.5.28 and partitions

Mike Touloumtzis writes:
> On Thu, Aug 01, 2002 at 05:24:37PM -0400, Albert D. Cahalan wrote:

>> There's just that little overflow problem to worry about,
>
> Ummm:
>
> -- stuff ASCII digits into u64 (or u32, or whatever)
> -- if (still more digits)
> -- printk("partition too big to mount!\n")
> -- return error
>
> How hard is that?

I refer to overflowing the space allowed for your
partition table. Programs will generate the data,
then write it out. If the data gets too long, then
you overwrite part of your first filesystem.
Alternately, the partition table gets truncated
at the maximum size -- with or without a '\0'.

But sure, overflowing a u64 is also a problem.
This will not be checked for. Either the u64 will
get overflowed, or the parser will take what fits
and then mis-interpret the remaining digits as
a second number.

>> trailing garbage,
>
> Don't write garbage into your partition table.

I can see multiple ways for this to happen.
Take the length of the new data, with or without
the trailing '\0', and write it out. Write the
whole partition table, including uninitialized
data that happens to be in memory. (some other
program will of course not ignore trailing garbage)

>> encouragement of assumptions about the maximum size...
>> is that a %d or a %llu or what?
>
> See above. Use leading '-' for negative numbers. ASCII has no
> 2's complement ambiguity issues.

You've got to stuff it into something eventually,
unless you want to implement ASCII math. Will you
be using plain C, or C++ operator overloading?

Yeah, just what we need. The /proc mess expanding
into partition tables. That sounds like a great way
to increase filesystem destruction performance.


2002-08-02 20:48:30

by kaih

Subject: Re: 2.5.28 and partitions

Jesse Pollard wrote on 02.08.02:

> Kai Henningsen:
> ...
> > As for finding where to boot from - either have the bootloader define a
> > partition name it wants to see, or put the relevant name into the boot
> > loader config. No need to define that in the partition format. That's
> > trivial: even MS-DOS did that (finding IO.SYS and MSDOS.SYS from the boot
> > loader)! And neither scanning for '=' and '\n' nor comparing one string
> > nor converting one number from decimal is any kind of hardship. Maybe half
> > a screen of assembler, tops.
> >
>
> Nope.
>
> The problem is different - which file system is the file stored in?

Huh?! What file?!

> How many different filesystems are there?

That's not a question for the bootloader.

> Do you think all of them will fit in a boot loader?

Who cares? You can always give it a partition of its own. (The example did
exactly that!)

> Or even one of them?
> How many different logical volume structures are there?

I have no idea what you are talking about here.

> To do this you first have to convince the development people to say that
> "only xxxx filesystem shall be bootable".

Utter nonsense.

> And now, you also have to add possible logical volumes on top (or under :)
> of it.

What are you babbling about?

> That is why LILO doesn't use file names for boots. It only uses block
> numbers.

So?

(By the way, it's the *only* boot loader I know that does this.)

> Another alternative (possibly just as hard) is to have LILO only
> load a more complex and dynamic loader, which could be configured for
> each filesystem structure. Once that "dynamic loader" is loaded, it
> could find and load the kernel (passing, of course, the boot command line
> from LILO).

What on earth does that have to do with the format of a partition table?!

> I know IRIX gets around the problem by having a tiny filesystem for the
> "disk label". This filesystem contains only contiguous files, and has

Around *which* problem?! That's certainly something that's only relevant
after the bootloader is long gone.

Frankly, I have no idea what you're smoking, but it can't be healthy.

MfG Kai

2002-08-02 21:18:35

by Mike Touloumtzis

Subject: Re: 2.5.28 and partitions

On Fri, Aug 02, 2002 at 04:49:17PM -0400, Albert D. Cahalan wrote:
> Mike Touloumtzis writes:
> > On Thu, Aug 01, 2002 at 05:24:37PM -0400, Albert D. Cahalan wrote:
>
> >> There's just that little overflow problem to worry about,
> >
> > Ummm:
> >
> > -- stuff ASCII digits into u64 (or u32, or whatever)
> > -- if (still more digits)
> > -- printk("partition too big to mount!\n")
> > -- return error
> >
> > How hard is that?
>
> I refer to overflowing the space allowed for your
> partition table. Programs will generate the data,
> then write it out. If the data gets too long, then
> you overwrite part of your first filesystem.
> Alternately, the partition table gets truncated
> at the maximum size -- with or without a '\0'.

Writing the partition table would still have to be done with
knowledge of its maximum size (i.e. the need to worry about
maximum partition table size wouldn't go away, just the need to
set a maximum size for every individual component in the table).

A program should write the ASCII representation into a buffer,
testing at that time for overflow. I certainly wouldn't
recommend:

FILE *f = fopen("/dev/hda", "r+");
fprintf(f, "%u %u %u%c", foo, bar, baz, '\0');

:-)
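
Something along these lines would be safer. It is only a sketch, assuming a
zero-filled in-memory table of fixed size that gets written to disk in one
go afterwards; table_append() and its parameters are invented for
illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: append one "<name> <start> <len>\n" record to a
 * zero-filled table buffer of `cap` bytes. `*used` counts the bytes taken
 * so far. Returns 0 on success, -1 if the record would not fit; in that
 * case the table is left unchanged. At least one trailing '\0' is always
 * preserved.
 */
static int table_append(char *table, size_t cap, size_t *used,
                        const char *name,
                        unsigned long long start, unsigned long long len)
{
    char rec[128];
    int n;

    assert(table != NULL && used != NULL && name != NULL && *used < cap);
    n = snprintf(rec, sizeof rec, "%s %llu %llu\n", name, start, len);
    if (n < 0 || (size_t)n >= sizeof rec)
        return -1;              /* record itself is too long */
    if ((size_t)n >= cap - *used)
        return -1;              /* no room for record plus final '\0' */
    memcpy(table + *used, rec, (size_t)n);
    *used += (size_t)n;
    return 0;
}
```

Because the buffer starts out zeroed and is written as a whole, the unused
tail is zero-filled on disk as well.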

> But sure, overflowing a u64 is also a problem.
> This will not be checked for. Either the u64 will
> get overflowed, or the parser will take what fits
> and then mis-interpret the remaining digits as
> a second number.

Are you advocating the use of stupid parsers?

> > Don't write garbage into your partition table.
>
> I can see multiple ways for this to happen.
> Take the length of the new data, with or without
> the trailing '\0', and write it out. Write the
> whole partition table, including uninitialized
> data that happens to be in memory. (some other
> program will of course not ignore trailing garbage)

If programs writing the partition table know the amount of disk
allocated to the table they can zero-fill the rest (see above).

> >> encouragement of assumptions about the maximum size...
> >> is that a %d or a %llu or what?
> >
> > See above. Use leading '-' for negative numbers. ASCII has no
> > 2's complement ambiguity issues.
>
> You've got to stuff it into something eventually,
> unless you want to implement ASCII math. Will you
> be using plain C, or C++ operator overloading?

I think you are seeing phantom problems where obvious solutions
exist.

Of course you have to stuff the values into native binary formats
eventually. I'm just talking about on-disk representation,
not in-memory.

On output, you can use the biggest integer size the machine
supports, e.g. %llu, because you wouldn't be able to handle the
partition at all if it was just too big for your machine. Or you
use bignums and something other than printf(3). Your attempt
to smear this approach by illogically associating it with C++
operator overloading is ridiculous.

On input, if a value is too big to handle, you just
fprintf(stderr, "Partition too big, tough luck for you!\n");
Or, in the kernel, you refuse to mount it.

Or if you really want to handle big numbers, you use a bignum
package for fdisk. It's not like there's a magic solution with
_current_ partition tables for handling numbers that are too big.
The current approach to this kind of problem in the kernel is
more or less:

-- Choose a structure which imposes a size limit for every value.
-- When _any_ of those limits overflows, switch to a whole new
structure. Implement new code branches, syscalls, etc. as
needed to handle both old and new versions.

Frankly, that sucks.

> Yeah, just what we need. The /proc mess expanding
> into partition tables. That sounds like a great way
> to increase filesystem destruction performance.

The /proc mess exists because people chose N ad hoc output
formats for /proc files. If they had a consistent format like
s-expressions or one-value-per-file most problems with /proc
would not exist.

I'm just putting my hope in the belief that sooner or later Al Viro
will realize that there's a lot more similarity between the Plan
9 and Lisp/Scheme approaches to simple, architecture-independent
representations than he thinks, and swoop in to clean up this
mess :-).

miket

2002-08-02 21:34:03

by Thunder from the hill

Subject: Re: [RFC] 2.5.28 and partitions

Hi,

On Fri, 2 Aug 2002, Mike Touloumtzis wrote:
> fprintf(f, "%u %u %u%c", foo, bar, baz, '\0');

Yes, that's crap... However, what about the following partition table
format:

<size>\0<part1>\n<part2>\n<part...>\n<partn>\0

Size represents the complete size of the partition table. partx defines
the partitions that we have, part1 starts at
(size % BLOCK_SIZE ?
drive_start+(size/BLOCK_SIZE)+1 :
drive_start+(size/BLOCK_SIZE))

String encoding is possible, not a requirement...
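
The start-of-first-partition rule is a round-up of the table size to whole
blocks. A small sketch, assuming that rounding up is the intent;
first_partition_block() is a made-up name and the block size is passed in
rather than taken from BLOCK_SIZE:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: first block available to part1, given the block at
 * which the drive starts and the partition table size in bytes. The table
 * size is rounded up to a whole number of blocks.
 */
static uint64_t first_partition_block(uint64_t drive_start,
                                      uint64_t table_size,
                                      uint32_t block_size)
{
    assert(block_size != 0);
    return drive_start + table_size / block_size
                       + (table_size % block_size != 0);
}
```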

Thunder
--
.-../../-./..-/-..- .-./..-/.-.././.../.-.-.-

2002-08-02 22:09:38

by Albert D. Cahalan

Subject: Re: 2.5.28 and partitions

Mike Touloumtzis writes:
> On Fri, Aug 02, 2002 at 04:49:17PM -0400, Albert D. Cahalan wrote:
>> Mike Touloumtzis writes:
>>> On Thu, Aug 01, 2002 at 05:24:37PM -0400, Albert D. Cahalan wrote:

>>>> There's just that little overflow problem to worry about,
>>>
>>> Ummm:
>>>
>>> -- stuff ASCII digits into u64 (or u32, or whatever)
>>> -- if (still more digits)
>>> -- printk("partition too big to mount!\n")
>>> -- return error
>>>
>>> How hard is that?
>>
>> I refer to overflowing the space allowed for your
>> partition table. Programs will generate the data,
>> then write it out. If the data gets too long, then
>> you overwrite part of your first filesystem.
>> Alternately, the partition table gets truncated
>> at the maximum size -- with or without a '\0'.
>
> Writing the partition table would still have to be done with
> knowledge of its maximum size (i.e. the need to worry about
> maximum partition table size wouldn't go away, just the need to
> set a maximum size for every individual component in the table).
>
> A program should write the ASCII representation into a buffer,
> testing at that time for overflow. I certainly wouldn't
> recommend:
>
> FILE *f = fopen("/dev/hda", "r+");
> fprintf(f, "%u %u %u%c", foo, bar, baz, '\0');
>
> :-)

The above, and worse, will be used. Face reality.
Now, what problem were you trying to solve?
The data all ends up as binary data types anyway,
so you're not escaping any limitations. You're
just hiding them, and letting stuff break when
the limitations get hit.

>> But sure, overflowing a u64 is also a problem.
>> This will not be checked for. Either the u64 will
>> get overflowed, or the parser will take what fits
>> and then mis-interpret the remaining digits as
>> a second number.
>
> Are you advocating the use of stupid parsers?

I'm telling you that this isn't your ideal world.
Stupid parsers are damn common.

BTW, many text data formats don't deserve anything
better anyway, because the format itself is ill-defined.

Prime example: /proc/cpuinfo
See also: /proc/*/status SigCgt/SigCat name change

>>> Don't write garbage into your partition table.
>>
>> I can see multiple ways for this to happen.
>> Take the length of the new data, with or without
>> the trailing '\0', and write it out. Write the
>> whole partition table, including uninitialized
>> data that happens to be in memory. (some other
>> program will of course not ignore trailing garbage)
>
> If programs writing the partition table know the amount of disk
> allocated to the table they can zero-fill the rest (see above).

Yeah, they could do that. Many will not. Reality again...
Using ASCII is just asking for bugs. It's begging and
pleading for bugs, especially when you really do have to
fit this variable-size data into a fixed-size space.

>>>> encouragement of assumptions about the maximum size...
>>>> is that a %d or a %llu or what?
>>>
>>> See above. Use leading '-' for negative numbers. ASCII has no
>>> 2's complement ambiguity issues.
>>
>> You've got to stuff it into something eventually,
>> unless you want to implement ASCII math. Will you
>> be using plain C, or C++ operator overloading?
>
> I think you are seeing phantom problems where obvious solutions
> exist.
>
> Of course you have to stuff the values into native binary formats
> eventually. I'm just talking about on-disk representation,
> not in-memory.

Ah, but it has to get into memory at some point.
There it will need a data type. Changing the data
type involves changing the parser and inventing
yet another in-memory struct anyway.

> On output, you can use the biggest integer size the machine
> supports, e.g. %llu, because you wouldn't be able to handle the
> partition at all if it was just too big for your machine. Or you
> use bignums and something other than printf(3). Your attempt
> to smear this approach by illogically associating it with C++
> operator overloading is ridiculous.

I was being kind. I could have mentioned LISP or Scheme.
Oddly, you volunteered them already yourself!

Fine, no operator overloading:

    err = ascii_math_make_number(baz, 512); // baz = 512
    if (err) {
        // handle error here
    }
    err = ascii_math_add(foo, bar, baz); // foo = bar + baz
    if (err) {
        // handle error here
    }
    // ...
    // blah, blah
    // ...
    err = ascii_math_free(baz); // don't forget to free memory
    if (err) {
        // handle error here
    }

> On input, if a value is too big to handle, you just
> fprintf(stderr, "Partition too big, tough luck for you!\n");
> Or, in the kernel, you refuse to mount it.

With a 32-bit binary field, programs will use 32-bit types.
With a 64-bit binary field, programs will use 64-bit types.
With an ASCII format, every program will use a different type.

> Or if you really want to handle big numbers, you use a bignum
> package for fdisk. It's not like there's a magic solution with
> _current_ partition tables for handling numbers that are too big.
> The current approach to this kind of problem in the kernel is
> more or less:
>
> -- Choose a structure which imposes a size limit for every value.
> -- When _any_ of those limits overflows, switch to a whole new
> structure. Implement new code branches, syscalls, etc. as
> needed to handle both old and new versions.
>
> Frankly, that sucks.

It does, a bit, but it sure beats hidden per-program
limits caused by every program converting the ASCII
to a different in-memory structure.

>> Yeah, just what we need. The /proc mess expanding
>> into partition tables. That sounds like a great way
>> to increase filesystem destruction performance.
>
> The /proc mess exists because people chose N ad hoc output
> formats for /proc files. If they had a consistent format like
> s-expressions or one-value-per-file most problems with /proc
> would not exist.

That only solves a superficial problem. It doesn't let
you reliably handle changing data types and keywords.

> I'm just putting my hope in the belief that sooner or later Al Viro
> will realize that there's a lot more similarity between the Plan
> 9 and Lisp/Scheme approaches to simple, architecture-independent
> representations than he thinks, and swoop in to clean up this
> mess :-).

I'm hoping he'll realize the similarity and back away in horror.

2002-08-02 22:50:31

by Mike Touloumtzis

Subject: Re: 2.5.28 and partitions

On Fri, Aug 02, 2002 at 06:12:54PM -0400, Albert D. Cahalan wrote:
>
> > Of course you have to stuff the values into native binary formats
> > eventually. I'm just talking about on-disk representation,
> > not in-memory.
>
> Ah, but it has to get into memory at some point.
> There it will need a data type. Changing the data
> type involves changing the parser and inventing
> yet another in-memory struct anyway.

Right, but if each of your in-memory structs is localized to the
kernel (i.e. not present in the partition table, and not exported
to userspace via /proc or syscalls), then you can just increase
field sizes and recompile the kernel, without the need to support
both structure layouts in the kernel in perpetuity.

For efficiency reasons the kernel will always need to export
structs to userspace but partition information probably shouldn't
get a performance exemption.

> Fine, no operator overloading:
>
>     err = ascii_math_make_number(baz, 512); // baz = 512
>     if (err) {
>         // handle error here
>     }
>     err = ascii_math_add(foo, bar, baz); // foo = bar + baz
>     if (err) {
>         // handle error here
>     }

Yes, this is more or less what a bignum package would implement,
albeit with a much more efficient representation than ASCII
strings. But actually manipulating values in bignum format
should be left to utilities like fdisk that want to be generic.
The kernel and boot loaders would just load values into 'u32
a' and 'u32 b' (or whatever type) and add them with 'a + b'.
I'm not in any way advocating bignum arithmetic in the kernel.

> With a 32-bit binary field, programs will use 32-bit types.
> With a 64-bit binary field, programs will use 64-bit types.
> With an ASCII format, every program will use a different type.

Right, which would allow intelligence on the part of the programs,
like using 32-bit types on 32-bit architectures where values are
known to max out at 32-bits (I'm thinking of, say, /proc here)
and 64-bit values on 64-bit architectures.

> It does, a bit, but it sure beats hidden per-program
> limits caused by every program converting the ASCII
> to a different in-memory structure.

So create a library and flame people who don't link against it
and screw up their parsing. At least this way only some programs
would have hidden limits, not all of them.

> >> Yeah, just what we need. The /proc mess expanding
> >> into partition tables. That sounds like a great way
> >> to increase filesystem destruction performance.
> >
> > The /proc mess exists because people chose N ad hoc output
> > formats for /proc files. If they had a consistent format like
> > s-expressions or one-value-per-file most problems with /proc
> > would not exist.
>
> That only solves a superficial problem. It doesn't let
> you reliably handle changing data types and keywords.

s-expressions do--the first value in each parenthesized expression
is the keyword. For instance, if you have the tree:

/proc
    /sys
        /net
            /ipsec
                /inbound_policy_check (== 1)
            /ipv4
                /icmp_echo_ignore_all (== 0)
                /icmp_echo_ignore_broadcasts (== 0)

Via, say, a magic 'cat /proc/serialize', you could view this as:

(sys (net (ipsec (inbound_policy_check 1))
          (ipv4 (icmp_echo_ignore_all 0)
                (icmp_echo_ignore_broadcasts 0))))

without the pretty-printing, of course. Likewise you could
cat /proc/sys/serialize to see just that subtree.

All this information is available already in the kernel and
'parsing' the resulting s-expressions is mostly a matter of
counting parenthesis nesting depth, matching keywords, and doing
ASCII->numeric conversions.

I'm not religious about s-expressions but they do solve this
problem fairly well. People who are religiously anti-LISP should
pretend I used { } in the above.

Of course this all relies on a one-value-per-file /proc, which is
regrettably not the case now; that's why I chose /proc/sys for
the above example.

miket