I vaguely remember a discussion about this a few months back.
If I remember, the reasoning was it would unnecessarily slow
down smaller systems that would never have block devices in
the 4-28T range attached.
However, isn't it possible there will continue to be a series
of P-IV,V,VI,VII ...etc, addons that will be used for some time
to come. I've even heard it suggested that we might see
2 or more CPUs on a single chip as a way to increase CPU
capacity w/o driving up clock speed. Given the cheapness of
.25T drives now, seeing the possibility of 4T drives doesn't seem
that remote (maybe 5 years?).
Side question: does the 32-bit block number limit also apply to
RAID disks or do they use a different block-nr type?
So...is it the plan, or has it been thought about -- 'abstracting'
block numbers as a typedef 'block_nr', then at compile time
having it be selectable as to whether or not this was to
be a 32-bit or 64 bit quantity -- that way older systems would
lose no efficiency. Drivers that couldn't be or hadn't been
ported to use 'block_nr' could default to being disabled if
64-bit blocks were selected, etc.
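For concreteness, something like the following minimal sketch is what
I have in mind (the config symbol and the example function are
hypothetical, not anything in the current tree):

    /* hypothetical sketch: block numbers behind a typedef, width
     * selected at compile time by a config option */
    #ifdef CONFIG_LARGE_BLOCKDEV
    typedef u64 block_nr;   /* 64-bit block numbers for huge devices   */
    #else
    typedef u32 block_nr;   /* today's 32-bit behaviour, no extra cost */
    #endif

    /* drivers and filesystems would take a block_nr instead of a bare
     * int, so the width is changed in exactly one place */
    int read_block(kdev_t dev, block_nr block, char *buf);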
So has this idea been tossed about and/or previously thrashed?
-l
--
L A Walsh | Trust Technology, Core Linux, SGI
[email protected] | Voice: (650) 933-5338
On Mon, Mar 26, 2001 at 08:39:21AM -0800, LA Walsh wrote:
> I vaguely remember a discussion about this a few months back.
> If I remember, the reasoning was it would unnecessarily slow
> down smaller systems that would never have block devices in
> the 4-28T range attached.
4k page size * 2GB = 8TB.
i consider it much more likely on such systems that the page size will
be increased to maybe 16 or 64k which would give us 32TB or 128TB.
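(for reference, a quick sketch of the arithmetic behind those figures --
2^31 blocks at various block sizes:)

    #include <stdio.h>

    /* capacity reachable with a signed 32-bit block number (2^31 blocks) */
    int main(void)
    {
            unsigned long long nblocks = 1ULL << 31;
            int sizes[] = { 4096, 16384, 65536 };   /* bytes per block */
            int i;

            for (i = 0; i < 3; i++)
                    printf("%2d KB blocks -> %3llu TB\n",
                           sizes[i] >> 10, nblocks * sizes[i] >> 40);
            return 0;
    }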
you keep on trying to increase the size of types without looking at
what gcc outputs in the way of code that manipulates 64-bit types.
seriously, why don't you just try it? see what the performance is.
see what the code size is. then come back with some numbers. and i mean
numbers, not `it doesn't feel any slower'.
personally, i'm going to see what the situation looks like in 5 years time
and try to solve the problem then. there're enough real problems with the
VFS today that i don't feel inclined to fix tomorrow's potential problems.
--
Revolutions do not require corporate support.
Matthew Wilcox wrote:
>
> On Mon, Mar 26, 2001 at 08:39:21AM -0800, LA Walsh wrote:
> > I vaguely remember a discussion about this a few months back.
> > If I remember, the reasoning was it would unnecessarily slow
> > down smaller systems that would never have block devices in
> > the 4-28T range attached.
>
> 4k page size * 2GB = 8TB.
---
Drat...was being more optimistic -- you're right
the block_nr can be negative. Somehow thought page size could
be 8K....living in future land. That just makes the limitations
even closer at hand...:-(
> you keep on trying to increase the size of types without looking at
> what gcc outputs in the way of code that manipulates 64-bit types.
---
Maybe someone will backport some of the features of the
IA-64 code generator into 'gcc'. I've been told that in some
cases it's a 2.5x performance difference. If 'gcc' is generating
bad code, then maybe the 'gcc' people will increase the quality
of their code -- I'm sure they are just as eagerly working on
gcc improvements as we are kernel improvements. When I worked
on the PL/M compiler project at Intel, I know our code-optimization
guy would spend endless cycles trying to get better optimization
out of the code. He got great joy out of doing so. -- and
that was almost 20 years ago -- and code generation has come
a *long* way since then.
> seriously, why don't you just try it? see what the performance is.
> see what the code size is. then come back with some numbers. and i mean
> numbers, not `it doesn't feel any slower'.
---
As for 'trying' it -- would anyone care if we virtualized
the block_nr into a typedef? That seems like it would provide
for cleaner (type-checked) code at no performance penalty and
more easily allow such comparisons.
Well this is my point: if I have disks > 8T, wouldn't
it be at *all* beneficial to be able to *choose* some slight
performance impact and access those large disks vs. having no
choice? Having it as a configurable option would allow a given
installation to make that choice rather than having none.
BTW, are block_nr's on RAID arrays subject to this
limitation?
>
> personally, i'm going to see what the situation looks like in 5 years time
> and try to solve the problem then.
---
It's not the same, but SGI has had customers for over
3 years using >2T *files*. The point I'm looking at is if
the P-X series gets developed enough, and someone is using a
4-16P system, a corp user might be approaching that limit
today or tomorrow. Joe User, might not for 5 years, but that's
what the configurability is about. Keep linux usable for both
ends of the scale -- "I love scalability"....
-l
--
L A Walsh | Trust Technology, Core Linux, SGI
[email protected] | Voice: (650) 933-5338
Matthew Wilcox writes:
> On Mon, Mar 26, 2001 at 08:39:21AM -0800, LA Walsh wrote:
> > I vaguely remember a discussion about this a few months back.
> > If I remember, the reasoning was it would unnecessarily slow
> > down smaller systems that would never have block devices in
> > the 4-28T range attached.
>
> 4k page size * 2GB = 8TB.
>
> i consider it much more likely on such systems that the page size will
> be increased to maybe 16 or 64k which would give us 32TB or 128TB.
>
> personally, i'm going to see what the situation looks like in 5 years time
> and try to solve the problem then.
What do you mean by problems 5 years down the road? The real issue is that
this 32-bit block count limit affects composite devices like MD RAID and
LVM today, not just individual disks. There have been several postings
I have seen with people having a problem _today_ with a 2TB limit on
devices.
There is some hope with LVM (and MD I suspect as well), that it could
do blocksize remapping, so it appears to be a 4k sector device, but
remaps to 512-byte sector disks underneath. This _should_ give us an
upper limit of 16TB, assuming 32-bit unsigned ints for block numbers.
Of course, you would need to only do 4kB block I/O on top of these devices
(not much of an issue for such large devices).
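A rough sketch of the remapping idea (hypothetical code, not what LVM
or MD actually do today):

    /* the volume advertises 4kB blocks, so a 32-bit unsigned block number
     * covers 2^32 * 4kB = 16TB; each block is remapped onto an underlying
     * disk with 512-byte sectors, and only the per-disk sector offset has
     * to stay within 32 bits (i.e. each single disk stays below 2TB) */
    struct phys_extent {
            unsigned int  disk;     /* which underlying disk               */
            unsigned long sector;   /* 512-byte sector offset on that disk */
    };

    static void remap_4k_block(unsigned long block,
                               unsigned long blocks_per_disk,
                               struct phys_extent *out)
    {
            out->disk   = block / blocks_per_disk;
            out->sector = (block % blocks_per_disk) << 3; /* 8 sectors per 4kB */
    }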
Still, this is just a stop-gap measure because next year people will
want >16TB devices, and there won't be an easy way to do this.
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
LA Walsh <[email protected]> writes:
> I vaguely remember a discussion about this a few months back.
> If I remember, the reasoning was it would unnecessarily slow
> down smaller systems that would never have block devices in
> the 4-28T range attached.
With classic 512 byte sectors the top size is right about 2TB.
The basic thought is that 64bit numbers tend to suck, so we don't
want them in any fast paths on a 32bit system.
> However, isn't it possible there will continue to be a series
> of P-IV,V,VI,VII ...etc, addons that will be used for some time
> to come. I've even heard it suggested that we might see
> 2 or more CPUs on a single chip as a way to increase CPU
> capacity w/o driving up clock speed. Given the cheapness of
> .25T drives now, seeing the possibility of 4T drives doesn't seem
> that remote (maybe 5 years?).
>
> Side question: does the 32-bit block number limit also apply to
> RAID disks or do they use a different block-nr type?
For now yes it does.
>
> So...is it the plan, or has it been thought about -- 'abstracting'
> block numbers as a typedef 'block_nr', then at compile time
> having it be selectable as to whether or not this was to
> be a 32-bit or 64 bit quantity -- that way older systems would
> lose no efficiency. Drivers that couldn't be or hadn't been
> ported to use 'block_nr' could default to being disabled if
> 64-bit blocks were selected, etc.
>
> So has this idea been tossed about and/or previously thrashed?
Using a 64bit number on 32bit systems has so far been trashed.
Though this does look like a real problem that needs to be solved
at some point. I doubt we can wait past 2.5 though if we want the
code ready when the hardware is.
Eric
>> I vaguely remember a discussion about this a few months back.
>> If I remember, the reasoning was it would unnecessarily slow
>> down smaller systems that would never have block devices in
>> the 4-28T range attached.
>
>4k page size * 2GB = 8TB.
Try it.
If your drive (array) is larger than 512byte*4G (2TB) linux will eat
your data.
drivers/block/ll_rw_blk.c, in submit_bh()
> bh->b_rsector = bh->b_blocknr * (bh->b_size >> 9);
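(A minimal illustration of the wraparound, with unsigned int standing in
for the kernel's 32-bit unsigned long on ia32 -- with 1k buffers the
block at the 2TB boundary maps to sector 2^32, which wraps to 0:)

    #include <stdio.h>

    int main(void)
    {
            unsigned int blocknr = 0x80000000u;           /* block 2^31, 1k blocks */
            unsigned int rsector = blocknr * (1024 >> 9); /* should be 2^32 ...    */
            printf("%u\n", rsector);                      /* ... but prints 0      */
            return 0;
    }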
But it shouldn't cause data corruptions:
It was discussed a few months ago, and iirc LVM refuses to create too
large volumes.
--
Manfred
On Mon, Mar 26, 2001 at 10:47:13AM -0700, Andreas Dilger wrote:
> What do you mean by problems 5 years down the road? The real issue is that
> this 32-bit block count limit affects composite devices like MD RAID and
> LVM today, not just individual disks. There have been several postings
> I have seen with people having a problem _today_ with a 2TB limit on
> devices.
people who can afford 2TB of disc can afford to buy a 64-bit processor.
--
Revolutions do not require corporate support.
On Mon, Mar 26, 2001 at 08:01:21PM +0200, Manfred Spraul wrote:
> drivers/block/ll_rw_blk.c, in submit_bh()
> > bh->b_rsector = bh->b_blocknr * (bh->b_size >> 9);
>
> But it shouldn't cause data corruptions:
> It was discussed a few months ago, and iirc LVM refuses to create too
> large volumes.
Ah yes, I'd forgotten the block layer still works in terms of 512-byte blocks.
--
Revolutions do not require corporate support.
Matthew Wilcox <[email protected]> writes:
> On Mon, Mar 26, 2001 at 10:47:13AM -0700, Andreas Dilger wrote:
> > What do you mean by problems 5 years down the road? The real issue is that
> > this 32-bit block count limit affects composite devices like MD RAID and
> > LVM today, not just individual disks. There have been several postings
> > I have seen with people having a problem _today_ with a 2TB limit on
> > devices.
>
> people who can afford 2TB of disc can afford to buy a 64-bit processor.
Currently that doesn't solve the problem as block_nr is held in an int.
And as gcc compiles an int to a 32bit number on a 64bit processor, the
problem still isn't solved.
That at least we need to address.
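(A quick illustration of the point, assuming an LP64 platform such as Alpha:)

    #include <stdio.h>

    /* on an LP64 system 'int' stays 32 bits while 'long' is 64 bits, so a
     * block number declared as int still wraps at 2^31 on a 64-bit CPU */
    int main(void)
    {
            printf("sizeof(int)=%u sizeof(long)=%u\n",
                   (unsigned)sizeof(int), (unsigned)sizeof(long));  /* 4 and 8 */
            return 0;
    }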
Eric
On Mon, 26 Mar 2001, Matthew Wilcox wrote:
>
> people who can afford 2TB of disc can afford to buy a 64-bit processor.
>
Sort of. A back-of-the-envelope calculation shows that 2 TB is only 25
80GB IDE drives. Given 4 3ware 8-channel IDE controllers and a large
enough case, you could probably build a cheap 2TB RAID0 array for ~$10k.
You could do RAID5 for only slightly more.
While this isn't exactly a standard, off-the-shelf, general-purpose sort
of configuration, it definitely has its uses. Be careful assuming that
huge amounts of disk storage require a huge amount of money, or a high
level of reliability or performance.
Scott
Matthew Wilcox writes:
> On Mon, Mar 26, 2001 at 10:47:13AM -0700, Andreas Dilger wrote:
> > What do you mean by problems 5 years down the road? The real issue is that
> > this 32-bit block count limit affects composite devices like MD RAID and
> > LVM today, not just individual disks. There have been several postings
> > I have seen with people having a problem _today_ with a 2TB limit on
> > devices.
>
> people who can afford 2TB of disc can afford to buy a 64-bit processor.
Get real. If you buy (cheapest) 40GB IDE disks, I can have 2TB for
U$9200 (not including controllers). In 1 year it will be half, etc.
I expect I will start moving my DVD collection to disk storage in an
ia32 system once price/GB falls by 50% from current levels. This is
just for home use, let alone what large companies want to do. I am
fully expecting hard drive price/GB to keep falling at its current rate.
This whole "64-bit" fallacy has got to stop. First it was "anybody
who needs files > 2GB should use a 64-bit CPU", wrong. Then it was
"anybody who needs > 1GB RAM should use a 64-bit CPU", wrong. Now it is
"anybody who needs > 2TB disk should use a 64-bit CPU", soon to be wrong.
I don't think the millions of 32-bit systems will disappear overnight,
or even in 10 years, yet we already have single IDE disks > 100GB, and
in 2 or 3 years we will have single IDE disks > 1TB that people will
want to use in their 32-bit systems.
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
>
> On Mon, Mar 26, 2001 at 08:39:21AM -0800, LA Walsh wrote:
> > I vaguely remember a discussion about this a few months back.
> > If I remember, the reasoning was it would unnecessarily slow
> > down smaller systems that would never have block devices in
> > the 4-28T range attached.
>
> 4k page size * 2GB = 8TB.
>
> i consider it much more likely on such systems that the page size will
> be increased to maybe 16 or 64k which would give us 32TB or 128TB.
> you keep on trying to increase the size of types without looking at
> what gcc outputs in the way of code that manipulates 64-bit types.
> seriously, why don't you just try it? see what the performance is.
> see what the code size is. then come back with some numbers. and i mean
> numbers, not `it doesn't feel any slower'.
>
> personally, i'm going to see what the situation looks like in 5 years time
> and try to solve the problem then. there're enough real problems with the
> VFS today that i don't feel inclined to fix tomorrow's potential problems.
I don't feel that it is that far away ... IBM has already released a 64 CPU
intel based system (NUMA). We already have systems in that class (though
64 bit based) that use 5 TB file systems. The need is coming, and appears
to be coming fast. It should be resolved during the improvements to the
VFS.
A second reason to include it in the VFS is that the low level filesystem
implementation would NOT be required to use it. If the administrator
CHOOSES to access a 16TB filesystem from a workstation, then it should
be possible (likely something like the GFS, where the administrator is
just monitoring things, would be reasonable for a 32 bit system to do).
As I see it, the VFS itself doesn't really care what the block size is,
it just carries relatively opaque values that the filesystem implementation
uses. Most of the overhead should just be copying an extra 4 bytes around.
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]
Any opinions expressed are solely my own.
On Mon, 26 Mar 2001, Matthew Wilcox wrote:
> people who can afford 2TB of disc can afford to buy a 64-bit processor.
You realise that this'll double the price of storage? ;)
(at least, in a year or two)
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
Manfred Spraul wrote:
>
> >4k page size * 2GB = 8TB.
>
> Try it.
> If your drive (array) is larger than 512byte*4G (2TB) linux will eat
> your data.
---
I have a block device that doesn't use 'sectors'. It
only uses the logical block size (which is currently set for
1K). Seems I could up that to the max blocksize (4k?) and
get 8TB...No?
I don't use the generic block make request (have my
own).
--
L A Walsh | Trust Technology, Core Linux, SGI
[email protected] | Voice: (650) 933-5338
"Eric W. Biederman" wrote:
>
> Matthew Wilcox <[email protected]> writes:
>
> > On Mon, Mar 26, 2001 at 10:47:13AM -0700, Andreas Dilger wrote:
> > > What do you mean by problems 5 years down the road? The real issue is that
> > > this 32-bit block count limit affects composite devices like MD RAID and
> > > LVM today, not just individual disks. There have been several postings
> > > I have seen with people having a problem _today_ with a 2TB limit on
> > > devices.
> >
> > people who can afford 2TB of disc can afford to buy a 64-bit processor.
>
> Currently that doesn't solve the problem as block_nr is held in an int.
> And as gcc compiles an int to a 32bit number on a 64bit processor, the
> problem still isn't solved.
>
> That at least we need to address.
And then you must face the fact that there may be a need for
some off-the-shelf software which isn't well supported on the
corresponding 64-bit architectures as well. So the
argument doesn't hold up to reality in any way.
BTW, for many reasons 32-bit architectures are, for
some application schemes, *faster* than 64-bit ones.
Ultra III in 64-bit mode just crawls in comparison to 32-bit mode.
Alpha - unfortunately an orphaned and dying architecture... which
is not well supported by software vendors...
>>>>> "Matthew" == Matthew Wilcox <[email protected]> writes:
Matthew> On Mon, Mar 26, 2001 at 10:47:13AM -0700, Andreas Dilger
Matthew> wrote:
>> What do you mean by problems 5 years down the road? The real issue
>> is that this 32-bit block count limit affects composite devices
>> like MD RAID and LVM today, not just individual disks. There have
>> been several postings I have seen with people having a problem
>> _today_ with a 2TB limit on devices.
Matthew> people who can afford 2TB of disc can afford to buy a 64-bit
Matthew> processor.
Oh great, and migrating a large application to a new architecture is
soo cheap. Disk costs nothing these days and there is a legitimate
need here.
Jes
On Mon, 26 Mar 2001, Andreas Dilger wrote:
> Matthew Wilcox writes:
> > people who can afford 2TB of disc can afford to buy a 64-bit processor.
> This whole "64-bit" fallacy has got to stop.
Indeed.
> Now it is "anybody who needs > 2TB disk should use a 64-bit CPU", soon
> to be wrong.
It was already wrong in 1995.
-Dan
Martin Dalecki <[email protected]>:
> "Eric W. Biederman" wrote:
> >
> > Matthew Wilcox <[email protected]> writes:
> >
> > > On Mon, Mar 26, 2001 at 10:47:13AM -0700, Andreas Dilger wrote:
> > > > What do you mean by problems 5 years down the road? The real issue is that
> > > > this 32-bit block count limit affects composite devices like MD RAID and
> > > > LVM today, not just individual disks. There have been several postings
> > > > I have seen with people having a problem _today_ with a 2TB limit on
> > > > devices.
> > >
> > > people who can afford 2TB of disc can afford to buy a 64-bit processor.
> >
> > Currently that doesn't solve the problem as block_nr is held in an int.
> > And as gcc compiles an int to a 32bit number on a 64bit processor, the
> > problem still isn't solved.
> >
> > That at least we need to address.
>
> And then you must face the fact that there may be a need for
> some off-the-shelf software which isn't well supported on the
> corresponding 64-bit architectures as well. So the
> argument doesn't hold up to reality in any way.
You are missing the point - I may need to use a 32 bit system to monitor
a large file system. I don't need the compute power of most 64 bit systems
to monitor user file activity.
> BTW, for many reasons 32-bit architectures are, for
> some application schemes, *faster* than 64-bit ones.
Which is why I want to use them with a 64 bit file system. Some of the
weather models run here have been known to exceed 100 GB per data file. Yes,
one file. Most only need 20GB, but there are a couple of hundred of them...
> Ultra III in 64-bit mode just crawls in comparison to 32-bit mode.
Depends on what you are doing. If you need to handle large arrays of
floating point it is reasonable (not great, just reasonable).
> Alpha - unfortunately an orphaned and dying architecture... which
> is not well supported by software vendors...
These are NOT the only 64 bit systems - Intel, PPC, IBM (in various guises).
If you need raw compute power, the Alpha is pretty good (we have over a
1000 in a Cray T3..).
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]
Any opinions expressed are solely my own.
From: "LA Walsh" <[email protected]>
> Manfred Spraul wrote:
> >
> > >4k page size * 2GB = 8TB.
> >
> > Try it.
> > If your drive (array) is larger than 512byte*4G (2TB) linux will eat
> > your data.
> ---
> I have a block device that doesn't use 'sectors'. It
> only uses the logical block size (which is currently set for
> 1K). Seems I could up that to the max blocksize (4k?) and
> get 8TB...No?
>
> I don't use the generic block make request (have my
> own).
>
Which field do you access? bh->b_blocknr instead of bh->b_rsector?
There were plans to split the buffer_head into 2 structures: buffer
cache data and the block io data.
b_blocknr is buffer cache only, no driver should access it.
http://groups.google.com/groups?q=NeilBrown+io_head&hl=en&lr=&safe=off&rnum=1&seld=928643305&ic=1
--
Manfred
>These are NOT the only 64 bit systems - Intel, PPC, IBM (in various guises).
>If you need raw compute power, the Alpha is pretty good (we have over a
>1000 in a Cray T3..).
Best of all, the PowerPC and the POWER are binary-compatible to a very
large degree - just the latter has an extra set of 64-bit instructions.
What was that I was hearing about having to redevelop or recompile your
apps for 64-bit?
I can easily imagine a 64-bit filesystem being accessed by a bunch of
RS/6000s and monitored using an old PowerMac. Goodness, the PowerMac 9600
even has 6 PCI slots to put all those SCSI-RAID and Ethernet cards in. :)
--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: [email protected] (not for attachments)
big-mail: [email protected]
uni-mail: [email protected]
The key to knowledge is not to rely on people to teach you it.
Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/
-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*)
-----END GEEK CODE BLOCK-----
Manfred Spraul wrote:
> Which field do you access? bh->b_blocknr instead of bh->b_rsector?
---
Yes.
>
> There were plans to split the buffer_head into 2 structures: buffer
> cache data and the block io data.
> b_blocknr is buffer cache only, no driver should access it.
---
My 'device' only lives in the buffer cache. Writes
to the device come 95% from kernel space, and the data is then read
out in large 256K reads by a user-land daemon that copies it to a file.
The user-land daemon may also use 'sendfile' to pull the
data out of the device and copy it to a file which should, as I
understand it, result in a kernel only copy from the device
to the output file buffers -- meaning no copy of the data
to user space would be needed. My primary 'dig' in all this is the
32-bit block_nr's in the buffer cache.
-l
--
L A Walsh | Trust Technology, Core Linux, SGI
[email protected] | Voice: (650) 933-5338
On Mon, Mar 26, 2001 at 11:37:52AM -0700, Eric W. Biederman wrote:
> Matthew Wilcox <[email protected]> writes:
>
> > On Mon, Mar 26, 2001 at 10:47:13AM -0700, Andreas Dilger wrote:
> > > What do you mean by problems 5 years down the road? The real issue is that
> > > this 32-bit block count limit affects composite devices like MD RAID and
> > > LVM today, not just individual disks. There have been several postings
> > > I have seen with people having a problem _today_ with a 2TB limit on
> > > devices.
> >
> > people who can afford 2TB of disc can afford to buy a 64-bit processor.
>
> Currently that doesn't solve the problem as block_nr is held in an int.
> And as gcc compiles an int to a 32bit number on a 64bit processor, the
> problem still isn't solved.
>
> That at least we need to address.
What I don't understand is why we can't just put an option in the Linux
config to enable 64-bit block support, as we have with the High Memory
Support option. That way the user could select that option if they want it,
regardless of the processor they are using. Jens Axboe <[email protected]>
already mentioned he had patched the kernel to do something similar earlier
this month on a similar thread on linux-kernel.
It makes sense to have this option when we have an enterprise level LVM
and 64-bit file systems such as the Global File System (GFS) for Linux.
Regards,
--
AJ Lewis
Sistina Software Inc. Voice: 612-379-3951
1313 5th St SE, Suite 111 Fax: 612-379-3952
Minneapolis, MN 55414 E-Mail: [email protected]
http://www.sistina.com
Current GPG fingerprint = 3B5F 6011 5216 76A5 2F6B 52A0 941E 1261 0029 2648
Get my key at: http://www.sistina.com/~lewis/gpgkey
(Unfortunately, the PKS-type keyservers do not work with multiple sub-keys)
-----Begin Obligatory Humorous Quote----------------------------------------
APATHY ERROR: Don't bother striking any key.
-----End Obligatory Humorous Quote------------------------------------------
On Mon, 26 Mar 2001, Jonathan Morton wrote:
>>These are NOT the only 64 bit systems - Intel, PPC, IBM (in various guises).
>>If you need raw compute power, the Alpha is pretty good (we have over a
>>1000 in a Cray T3..).
>
>Best of all, the PowerPC and the POWER are binary-compatible to a very
>large degree - just the latter has an extra set of 64-bit instructions.
>What was that I was hearing about having to redevelop or recompile your
>apps for 64-bit?
>
>I can easily imagine a 64-bit filesystem being accessed by a bunch of
>RS/6000s and monitored using an old PowerMac. Goodness, the PowerMac 9600
>even has 6 PCI slots to put all those SCSI-RAID and Ethernet cards in. :)
Save the money - get one fiber channel and connect to all that through
one interface...
--
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]
Any opinions expressed are solely my own.
Ion Badulescu wrote:
> Are you being deliberately insulting, "L", or are you one of those users
> who bitch and scream for features they *need* at *any cost*, and who
> have never even opened up the book for Computer Architecture 101?
---
Sorry, I was borderline insulting. I'm getting pressure on
personal fronts other than just here. But my degree is in computer
science and I've had almost 20 years experience programming things
as small as 8080's w/ 4K ram on up. I'm familiar with 'cost' of
emulation.
> Let's try to keep the discussion civilized, shall we?
---
Certainly.
>
> Compile option or not, 64-bit arithmetic is unacceptable on IA32. The
> introduction of LFS was bad enough, we don't need yet another proof that
> IA32 sucks. Especially when there *are* better alternatives.
===
So if it is a compile option -- the majority of people
wouldn't be affected, is that in agreement? Since the default would
be to use the same arithmetic as we use now.
In fact, I posit that if anything, the majority of the people
might be helped as the block_nr becomes a 'typed' value -- and
perhaps the sector_nr as well. They remain the same size, but as
a typed value the kernel gains increased integrity from the increased
type checking. At worst, it finds no new bugs and there is no impact
in speed. Are we in agreement so far?
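To make the type-checking point concrete: a bare typedef mostly
documents intent, but wrapping the value in a one-member struct -- purely
a hypothetical sketch, not existing kernel code -- gets real compile-time
checking at no runtime cost:

    typedef struct { unsigned long nr; } block_nr_t;    /* hypothetical */
    typedef struct { unsigned long nr; } sector_nr_t;   /* hypothetical */

    static inline sector_nr_t block_to_sector(block_nr_t b, unsigned int bsize)
    {
            sector_nr_t s = { b.nr * (bsize >> 9) };
            return s;
    }
    /* passing a sector_nr_t where a block_nr_t is expected now fails to
     * compile instead of silently mixing the two kinds of numbers */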
Now let's look at the sites that want to process terabytes of
data -- perhaps file systems up into the petabyte range. Often I
can see these being large multi-node (think 16-1024 clusters as
are in use today for large super-clusters). If I was to characterize
the performance of them, I'd likely see the CPU pegged at 100%
with 99% usage in user space. Let's assume that increasing the
block size decreases disk accesses by as much as 10% (you'll have
to admit -- using a 64bit quantity vs. 32bit quantity isn't going
to even come close to increasing disk access times by 1 millisecond,
really, so it really is going to be a much smaller fraction when
compared to the actual disk latency).
Ok...but for the sake of
argument using 10% -- that's still only 10% of the 1% spent in the system,
or a slowdown of .1%. Now that's using a really liberal figure
of 10%. If you look at the actual speed of 64 bit arithmetic vs.
32, we're likely talking -- upper bound, 10x the clocks for
disk block arithmetic. Disk block arithmetic is a small fraction
of time spent in the kernel. We have to be looking at *maximum*
slowdowns in the range of a few hundred maybe a few thousand extra clocks.
A 1000 extra clocks on a 1G machine is 1 microsecond, or approx
1/5000th your average seek latency on a *fast* hard disk. So
instead of 10% slowdown we are talking slowdowns in the 1/1000 range
or less. Now that's a slowdown in the 1% that was being spent in
the kernel, so now we've slowed the total program speed by .001%
at the increase benefit (to that site) of being able to process
those mega-gigs (petabytes) of information. For a hit that is
not noticeable to human perception, they go from not being able to
use super-clusters of IA32 machines (for which HW and SW is cheap),
to being able to use it. That's quite a cost savings for them.
Is there some logical flaw in the above reasoning?
-linda
--
L A Walsh | Trust Technology, Core Linux, SGI
[email protected] | Voice: (650) 933-5338
On Tue, Mar 27, 2001 at 09:15:08AM -0800, LA Walsh wrote:
> Now let's look at the sites that want to process terabytes of
> data -- perhaps file systems up into the petabyte range. Often I
> can see these being large multi-node (think 16-1024 clusters as
> are in use today for large super-clusters). If I was to characterize
> the performance of them, I'd likely see the CPU pegged at 100%
> with 99% usage in user space. Let's assume that increasing the
> block size decreases disk accesses by as much as 10% (you'll have
> to admit -- using a 64bit quantity vs. 32bit quantity isn't going
> to even come close to increasing disk access times by 1 millisecond,
> really, so it really is going to be a much smaller fraction when
> compared to the actual disk latency).
[snip]
> Is there some logical flaw in the above reasoning?
But those changes will affect even the fastpath, i.e. data that is
already in the page/buffer caches. In which case we don't have to wait
for disk access latency. Why would anyone who is working with a
petabyte of data even consider not relying on essentially always
hitting data that is available in the read-ahead cache?
Using similar numbers as presented: if we are working our way through
every single block in a petabyte filesystem, and the blocksize is 512
bytes, then the 1us in extra CPU cycles because of 64-bit operations
would add, according to my back-of-the-envelope calculation, 2199023
seconds of CPU time, a bit more than 25 days.
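(The same back-of-the-envelope figure, worked out as a quick sketch:)

    #include <stdio.h>

    int main(void)
    {
            unsigned long long blocks = (1ULL << 50) >> 9; /* 1PB of 512-byte blocks  */
            double seconds = blocks * 1e-6;                /* 1us extra CPU per block */
            printf("%.0f seconds = %.1f days\n", seconds, seconds / 86400.0);
            return 0;                                      /* ~2199023 s, ~25.4 days  */
    }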
Seriously, there is a lot more that needs to be done than introducing a
64-bit blocknumber. Effectively 512 byte blocks are far too small for
that kind of data, and going to pagesize blocks (and increasing pagesize
to 64KB or 2MB at the same time) is a solution that is far more likely
to give good results since it reduces both the total number of
'blocks' on the device as well as reducing the total amount of calls
throughout kernel space instead of increasing the cost per call.
Jan
LA Walsh <[email protected]>:
> Ion Badulescu wrote:
> > Compile option or not, 64-bit arithmetic is unacceptable on IA32. The
> > introduction of LFS was bad enough, we don't need yet another proof that
> > IA32 sucks. Especially when there *are* better alternatives.
> ===
> So if it is a compile option -- the majority of people
> wouldn't be affected, is that in agreement? Since the default would
> be to use the same arithmetic as we use now.
>
> In fact, I posit that if anything, the majority of the people
> might be helped as the block_nr becomes a 'typed' value -- and
> perhaps the sector_nr as well. They remain the same size, but as
> a typed value the kernel gains increased integrity from the increased
> type checking. At worst, it finds no new bugs and there is no impact
> in speed. Are we in agreement so far?
>
> Now let's look at the sites that want to process terabytes of
> data -- perhaps file systems up into the petabyte range. Often I
> can see these being large multi-node (think 16-1024 clusters as
> are in use today for large super-clusters). If I was to characterize
> the performance of them, I'd likely see the CPU pegged at 100%
> with 99% usage in user space. Let's assume that increasing the
> block size decreases disk accesses by as much as 10% (you'll have
> to admit -- using a 64bit quantity vs. 32bit quantity isn't going
> to even come close to increasing disk access times by 1 millisecond,
> really, so it really is going to be a much smaller fraction when
> compared to the actual disk latency).
Relatively small quibble - current large clusters (SP3, 330 nodes, 4 cpus/node)
get around 85% to 90% (real user) user mode total cpu. The rest of the user
mode time is attributed to overhead. Why:
1. Inter-node communication/synchronization
2. Memory bus saturation
3. Users usually use only 3 cpus/node and allow the last cpu to handle
filesystem/network/administration/batch handling functions. Using the
last cpu in the node for part of the job reduces the overall throughput
> Ok...but for the sake of
> argument using 10% -- that's still only 10% of the 1% spent in the system,
> or a slowdown of .1%. Now that's using a really liberal figure
> of 10%. If you look at the actual speed of 64 bit arithmetic vs.
> 32, we're likely talking -- upper bound, 10x the clocks for
> disk block arithmetic. Disk block arithmetic is a small fraction
> of time spent in the kernel. We have to be looking at *maximum*
> slowdowns in the range of a few hundred maybe a few thousand extra clocks.
> A 1000 extra clocks on a 1G machine is 1 microsecond, or approx
> 1/5000th your average seek latency on a *fast* hard disk. So
> instead of 10% slowdown we are talking slowdowns in the 1/1000 range
> or less. Now that's a slowdown in the 1% that was being spent in
> the kernel, so now we've slowed the total program speed by .001%
> at the increase benefit (to that site) of being able to process
> those mega-gigs (petabytes) of information. For a hit that is
> not noticeable to human perception, they go from not being able to
> use super-clusters of IA32 machines (for which HW and SW is cheap),
> to being able to use it. That's quite a cost savings for them.
>
> Is there some logical flaw in the above reasoning?
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]
Any opinions expressed are solely my own.
>
> On Tue, Mar 27, 2001 at 09:15:08AM -0800, LA Walsh wrote:
> > Now let's look at the sites that want to process terabytes of
> > data -- perhaps file systems up into the petabyte range. Often I
> > can see these being large multi-node (think 16-1024 clusters as
> > are in use today for large super-clusters). If I was to characterize
> > the performance of them, I'd likely see the CPU pegged at 100%
> > with 99% usage in user space. Let's assume that increasing the
> > block size decreases disk accesses by as much as 10% (you'll have
> > to admit -- using a 64bit quantity vs. 32bit quantity isn't going
> > to even come close to increasing disk access times by 1 millisecond,
> > really, so it really is going to be a much smaller fraction when
> > compared to the actual disk latency).
> [snip]
> > Is there some logical flaw in the above reasoning?
>
> But those changes will affect even the fastpath, i.e. data that is
> already in the page/buffer caches. In which case we don't have to wait
> for disk access latency. Why would anyone who is working with a
> petabyte of data even consider not relying on essentially always
> hitting data that is available in the read-ahead cache?
It depends entirely on the application. Where the cache can contain
20% of the data, most accesses should already be in memory. If the
data is significantly larger, there is a high chance that the data
will not be there.
>
> Using similar numbers as presented: if we are working our way through
> every single block in a petabyte filesystem, and the blocksize is 512
> bytes, then the 1us in extra CPU cycles because of 64-bit operations
> would add, according to my back-of-the-envelope calculation, 2199023
> seconds of CPU time, a bit more than 25 days.
Ummm... I don't think it adds that much. You seem to be leaving out the
overlap disk/IO and computation for read-ahead. This should eliminate the
majority of the delay effect.
> Seriously, there is a lot more that needs to be done than introducing a
> 64-bit blocknumber. Effectively 512 byte blocks are far too small for
> that kind of data, and going to pagesize blocks (and increasing pagesize
> to 64KB or 2MB at the same time) is a solution that is far more likely
> to give good results since it reduces both the total number of
> 'blocks' on the device as well as reducing the total amount of calls
> throughout kernel space instead of increasing the cost per call.
Talk about adding overhead... How long do you think it takes to read a
2MB block (not to mention the time to update that page..) The additional
contention on the fiberchannel I/O alone might kill it if the filesystem
is busy.
Granted, 512 bytes could be considered too small for some things, but
once you pass 32K you start adding a lot of rotational delay problems.
I've used file systems with 256K blocks - they are slow when compared
to the throughput using 32K. I wasn't the one running the benchmarks,
but with a MaxStrat 400GB raid with 256K sized data transfer was much
slower (around 3 times slower) than 32K. (The target application was
a GIS server using Oracle).
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]
Any opinions expressed are solely my own.
On Tue, Mar 27, 2001 at 01:57:42PM -0600, Jesse Pollard wrote:
> > Using similar numbers as presented: if we are working our way through
> > every single block in a petabyte filesystem, and the blocksize is 512
> > bytes, then the 1us in extra CPU cycles because of 64-bit operations
> > would add, according to my back-of-the-envelope calculation, 2199023
> > seconds of CPU time, a bit more than 25 days.
>
> Ummm... I don't think it adds that much. You seem to be leaving out the
> overlap disk/IO and computation for read-ahead. This should eliminate the
> majority of the delay effect.
1024 TB should be around 2*10^12 512-byte blocks; multiply by the 1us
of "assumed" overhead per block operation and you get 2*10^6 seconds. No, I
believe I'm pretty close there. I am considering everything being
"available in the cache", i.e. no waiting for disk access.
> > Seriously, there is a lot more that needs to be done than introducing a
> > 64-bit blocknumber. Effectively 512 byte blocks are far too small for
> > that kind of data, and going to pagesize blocks (and increasing pagesize
> > to 64KB or 2MB at the same time) is a solution that is far more likely
> > to give good results since it reduces both the total number of
> > 'blocks' on the device as well as reducing the total amount of calls
> > throughout kernel space instead of increasing the cost per call.
>
> Talk about adding overhead... How long do you think it takes to read a
> 2MB block (not to mention the time to update that page..) The additional
> contention on the fiberchannel I/O alone might kill it if the filesystem
> is busy.
The time to update the pagetables is identical to the time to update a
4KB page when the OS is using a 2MB pagesize. Of course it will take more
time to load the data into the page, however it should be a consecutive
stretch of data on disk, which should give a more efficient transfer
than small blocks scattered around the disk.
> Granted, 512 bytes could be considered too small for some things, but
> once you pass 32K you start adding a lot of rotational delay problems.
> I've used file systems with 256K blocks - they are slow when compared
> to the throughput using 32K. I wasn't the one running the benchmarks,
> but with a MaxStrat 400GB raid with 256K sized data transfer was much
> slower (around 3 times slower) than 32K. (The target application was
> a GIS server using Oracle).
But your subsystem (the disk) was probably still using 512 byte blocks,
possibly scattered. And the OS was still using 4KB pages, it takes more
time to reclaim and gather 64 pages per IO operation than one, that's
why I'm saying that the pagesize needs to scale along with the blocksize.
The application might have been assuming a small block size as well, and
the OS was told to do several read/modify/write cycles, perhaps even 512
times as much as necessary.
I'm not saying that the current system will perform well when working
with large blocks, but compared to increasing the size of block_t, a
larger blocksize has more potential to give improvements in the long
term without adding an unrecoverable performance hit.
Jan
Jan Harkes wrote:
>
> On Tue, Mar 27, 2001 at 01:57:42PM -0600, Jesse Pollard wrote:
> > > Using similar numbers as presented: if we are working our way through
> > > every single block in a petabyte filesystem, and the blocksize is 512
> > > bytes, then the 1us in extra CPU cycles because of 64-bit operations
> > > would add, according to my back-of-the-envelope calculation, 2199023
> > > seconds of CPU time, a bit more than 25 days.
> >
> > Ummm... I don't think it adds that much. You seem to be leaving out the
> > overlap disk/IO and computation for read-ahead. This should eliminate the
> > majority of the delay effect.
>
> 1024 TB should be around 2*10^12 512-byte blocks; multiply by the 1us
> of "assumed" overhead per block operation and you get 2*10^6 seconds. No, I
> believe I'm pretty close there. I am considering everything being
> "available in the cache", i.e. no waiting for disk access.
---
If everything being used is only used from the cache, then
the application probably doesn't need 64-bit block support.
I submit that your argument may be flawed in the assumption that
if an application needs multi-terabyte files and devices, that most
of the data will be in the in-memory cache.
> The time to update the pagetables is identical to the time to update a
> 4KB page when the OS is using a 2MB pagesize. Of course it will take more
> time to load the data into the page, however it should be a consecutive
> stretch of data on disk, which should give a more efficient transfer
> than small blocks scattered around the disk.
---
Not if you were doing a lot of random reads where you only
needed 1-2K of data. The read-time of the extra 2M-1K would seem
to eat into any performance boost gained by the large pagesize.
>
> > Granted, 512 bytes could be considered too small for some things, but
> > once you pass 32K you start adding a lot of rotational delay problems.
> > I've used file systems with 256K blocks - they are slow when compared
> > to the throughput using 32K. I wasn't the one running the benchmarks,
> > but with a MaxStrat 400GB raid with 256K sized data transfer was much
> > slower (around 3 times slower) than 32K. (The target application was
> > a GIS server using Oracle).
>
> But your subsystem (the disk) was probably still using 512 byte blocks,
> possibly scattered. And the OS was still using 4KB pages, it takes more
> time to reclaim and gather 64 pages per IO operation than one, that's
> why I'm saying that the pagesize needs to scale along with the blocksize.
>
> The application might have been assuming a small block size as well, and
> the OS was told to do several read/modify/write cycles, perhaps even 512
> times as much as necessary.
>
> I'm not saying that the current system will perform well when working
> with large blocks, but compared to increasing the size of block_t, a
> larger blocksize has more potential to give improvements in the long
> term without adding an unrecoverable performance hit.
---
That's totally application dependent. Database applications
might tend to skip around in the data and do short reads/writes over
a very large file. Large block sizes will degrade their performance.
This was the idea of making it a *configurable* option. If
you need it, configure it. Same with block size -- that should
likely have a wider range for configuration as well. But
configuration (and ideally auto-configuration where possible)
seems the ultimate win-win situation.
-l
--
The above thoughts are my own and do not necessarily represent those
of my employer.
L A Walsh | Trust Technology, Core Linux, SGI
[email protected] | Voice: (650) 933-5338
Jan Harkes <[email protected]>:
>
> On Tue, Mar 27, 2001 at 01:57:42PM -0600, Jesse Pollard wrote:
> > > Using similar numbers as presented: if we are working our way through
> > > every single block in a petabyte filesystem, and the blocksize is 512
> > > bytes, then the 1us in extra CPU cycles because of 64-bit operations
> > > would add, according to my back-of-the-envelope calculation, 2199023
> > > seconds of CPU time, a bit more than 25 days.
> >
> > Ummm... I don't think it adds that much. You seem to be leaving out the
> > overlap disk/IO and computation for read-ahead. This should eliminate the
> > majority of the delay effect.
>
> 1024 TB should be around 2*10^12 512-byte blocks; multiply by the 1us
> of "assumed" overhead per block operation and you get 2*10^6 seconds. No, I
> believe I'm pretty close there. I am considering everything being
> "available in the cache", i.e. no waiting for disk access.
That would be true for small files (< 5GB). I have to deal with files that
may be 20-100 GB. Except for the largest systems (200GB of main memory)
the data will NOT be in the cache except for ~50% of the time. (assuming
only one user....)
> > > Seriously, there is a lot more that needs to be done than introducing a
> > > 64-bit blocknumber. Effectively 512 byte blocks are far too small for
> > > that kind of data, and going to pagesize blocks (and increasing pagesize
> > > to 64KB or 2MB at the same time) is a solution that is far more likely
> > > to give good results since it reduces both the total number of
> > > 'blocks' on the device as well as reducing the total amount of calls
> > > throughout kernel space instead of increasing the cost per call.
> >
> > Talk about adding overhead... How long do you think it takes to read a
> > 2MB block (not to mention the time to update that page..) The additional
> > contention on the fiberchannel I/O alone might kill it if the filesystem
> > is busy.
>
> The time to update the pagetables is identical to the time to update a
> 4KB page when the OS is using a 2MB pagesize. Of course it will take more
> time to load the data into the page, however it should be a consecutive
> stretch of data on disk, which should give a more efficient transfer
> than small blocks scattered around the disk.
You assume the file is accessed sequentially. The weather models don't do
that. They do have some locality, but only in a 3D sense. When you include
time it becomes closer to a random disk block reference when everything has to
be linearized.
>
> > Granted, 512 bytes could be considered too small for some things, but
> > once you pass 32K you start adding a lot of rotational delay problems.
> > I've used file systems with 256K blocks - they are slow when compared
> > to the throughput using 32K. I wasn't the one running the benchmarks,
> > but with a MaxStrat 400GB raid with 256K sized data transfer was much
> > slower (around 3 times slower) than 32K. (The target application was
> > a GIS server using Oracle).
>
> But your subsystem (the disk) was probably still using 512 byte blocks,
> possibly scattered. And the OS was still using 4KB pages, it takes more
> time to reclaim and gather 64 pages per IO operation than one, that's
> why I'm saying that the pagesize needs to scale along with the blocksize.
It wasn't - the "disks" were composed of groups of 5 drives in a raid striped
for speed and spread across 5 SCSI III controllers. Each raid attached had
16MB internal cache. I think the controllers were using an entire sector
read (32K).
> The application might have been assuming a small block size as well, and
> the OS was told to do several read/modify/write cycles, perhaps even 512
> times as much as necessary.
There was some of that, but not much. Oracle (as I recall) allows for the
specification of transfer size.
This also brings up the problem of small files. Allocating 2MB per file
would waste quite a bit of disk space (assuming 5 - 10 million files
with only 15% having 25GB or more).
> I'm not saying that the current system will perform well when working
> with large blocks, but compared to increasing the size of block_t, a
> larger blocksize has more potential to give improvements in the long
> term without adding an unrecoverable performance hit.
Not when the filesystem is required for general use. It only makes it
simpler to actually have a large filesystem. It doesn't help when it
must be used.
Now you are saying that the throughput WILL go down, but only if you use
large block sizes.
I can go along with making block sizes up to 8K. Even 32K for special
circumstances (even 64K for dedicated use). But not larger. NFS overhead on
file I/O becomes way too excessive (...worst example now is having to read
a 2MB block to update 512 bytes, then write it back... :-)
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]
Any opinions expressed are solely my own.
Hi,
Just a brief add to the discussion, besides which I have a vested interest
in this!
I do not believe that you can make the addressability of a device larger at
the expense of granularity of address space at the bottom end. Just because
ext2 has a single size for metadata does not mean everything you put on the
disks does. XFS filesystems, for example, can be made with block sizes from
512 bytes to 64Kbytes (ok not working on linux across this range yet, but it
will).
In all of these cases we have chunks of metadata which are 512 bytes
long, and we have chunks bigger than the blocksize. The 512 byte chunks
are the superblock and the heads of the freespace structures, there
are multiples of them throughout the filesystem.
To top that, we have disk write ordering constraints that could mean that
for two of the 512 byte chunks next to each other one must be written to
disk now to free log space, the other must not be written to disk because it
is in a transaction. We would be forced to do read-modify-write down at
some lower level - wait the lower levels would not have the addressability.
There are probably other things which will not fly if you lose the
addressing granularity. Volume headers and such like would be one
possibility.
No I don't have a magic bullet solution, but I do not think that just
increasing the granularity of the addressing is the correct answer,
and yes I do agree that just growing the buffer_head fields is not
perfect either.
Steve Lord
p.s. there was mention of bigger page size, it is not hard to fix, but the
swap path will not even work with 64K pages right now.
Steve Lord wrote:
> Just a brief add to the discussion, besides which I have a vested interest
> in this!
I'll add my little comments as well, and hopefully not start a flamewar... :)
[snip comments about blocksize, etc.]
Here's a real-life example of something that most of you will probably hate
me for mentioning:
HFS uses variable sized blocks (made up of multiple 512 byte sectors), but
stores block numbers as a 16 bit value. (I know, everyone will say, "We're
talking about moving from 32 to 64 bits." Keep listening.) This gave great
performance on the then current massive storage of a 20M drive. However,
when it became possible to get the absolutely gigantic hard drive of 1G,
it became more and more obvious that it was a drawback that was causing
a huge amount of wasted space. Apple had to design a new filesystem (HFS+)
that was able to represent blocks with a 32 bit number to overcome the
effective limitation on how big a filesystem could be. It's getting to
the point now that it's easily possible to put together a disk array that
is large enough that even referring to blocks with a 32 bit value requires
relatively large blocks. I don't know if we have very many filesystems that
would support this feature, but it will become important a lot sooner than
anyone may be thinking.
Obviously this case isn't a perfect fit for the situation, since HFS was
designed to be read by 32 bit machines, and the upgrade to 32 bits didn't
give a CPU penalty, just a bus bandwidth problem. Also, I'm coming from
a platform that actually can do a decent job of 64 bit, unlike x86, but
we shouldn't disallow people from doing bigger and better things. It's
become very popular lately to position Linux as an enterprise-ready system,
and this is something that will be expected. People will want to access
a multi-TB database as a single file, as well as other things that may
seem crazy to most people now.
I understand people's aversion to the #ifdefs in the code, but if the changes
are made in a sane way, it can still be clean and easy to maintain. It's
worth it to add a little complexity (particularly as an option) to add a
feature that people will be demanding in the relatively near future. It
might be a good idea to wait for 2.5, tho...
Brad Boyer
[email protected]
P.S.: No, I have no personal reason to need any of this 64 bit filesystem
stuff. Just trying to point out possibilities. Don't expect me to actually
be writing this stuff...
On Mon, Mar 26, 2001 at 08:39:21AM -0800, LA Walsh wrote:
> So...is it the plan, or has it been thought about -- 'abstracting'
> block numbers as a typedef 'block_nr', then at compile time
> having it be selectable as to whether or not this was to
> be a 32-bit or 64 bit quantity -- that way older systems would
Oh, did no-one mention the words `Module ABI' yet?
--
Revolutions do not require corporate support.
My turn to chime in.
JFS was designed around a 4K meta-data page size. It would require some
major re-design to use larger block sizes. On the other hand, JFS could
take advantage of 64-bit block addresses immediately. JFS internally
stores the block address in 40 bits. (Sorry, file size & volume size are
both limited to 4 petabytes on JFS.)
At the rate that storage hardware and requirements are increasing,
increasing the block size is a short-term solution that is only going to
delay the inevitable requirement for 64-bit block addressability. There
is a practical limit to a usable block-size. Someone threw out 64K,
which seems reasonable to me.
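(The arithmetic behind the 4 petabyte figure, assuming the 4K page size
mentioned above as the block size -- just a sketch:)

    #include <stdio.h>

    int main(void)
    {
            unsigned long long bytes = (1ULL << 40) * 4096; /* 2^40 blocks * 4kB */
            printf("%llu PB\n", bytes >> 50);               /* prints 4          */
            return 0;
    }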
--
David Kleikamp
IBM Linux Technology Center