2001-11-12 19:40:14

by Peter T. Breuer

Subject: what is the current meaning of blk_size?

Is blk_size[][] supposed to contain the size in KB or blocks?

I'm confused because it looks to me as though it's still in KB,
but people say that devices can be up to 8 or 16TB, and there's
not enough room for that within a signed int containing KB
(2^31 * 2^10 = 2^41 = 2TB).

Either the rumour is true and it's in blocks, or the rumour
is false and it's in KB.

Clue, please!

Peter


2001-11-12 19:59:06

by Martin Dalecki

Subject: Re: what is the current meaning of blk_size?

"Peter T. Breuer" wrote:
>
> Is blk_size[][] supposed to contain the size in KB or blocks?
>
> I'm confused because it looks to me as though it's still in KB,
> but people say that devices can be up to 8 or 16TB, and there's
> not enough room for that within a signed int containing KB
> (2^31 * 2^10 = 2^41 = 2TB).
>
> Either the rumour is true and it's in blocks, or the rumour
> is false and it's in KB.
>
> Clue, please!

There is no rumor it's in blocks.

There are three levels of block size measurement in Linux:

1. FS level. (Page-sized chunks most of the time; the main exceptions
are dos and isofs.)
2. Driver level. (Nearly always 1024; the main exception is ATAPI
cdrom.)
3. Hardware device level. (Nearly always 512; only prehistoric SCSI
drives from the stone age are exceptional and provide 256-byte
blocks. We discovered them during our archeological efforts recently,
under a thick layer of mud...)

It's left as an exercise to the reader which one of the sparse
matrices in ll_rw_blk.c is declaring which ;-).
...

OK I was too fast to figure it out:


/*
 * blk_size contains the size of all block-devices in units of 1024 byte
 * sectors:
 *
 * blk_size[MAJOR][MINOR]
 *
 * if (!blk_size[MAJOR]) then no minor size checking is done.
 */
int * blk_size[MAX_BLKDEV];

/*
 * blksize_size contains the size of all block-devices:
 *
 * blksize_size[MAJOR][MINOR]
 *
 * if (!blksize_size[MAJOR]) then 1024 bytes is assumed.
 */
int * blksize_size[MAX_BLKDEV];

/*
 * hardsect_size contains the size of the hardware sector of a device.
 *
 * hardsect_size[MAJOR][MINOR]
 *
 * if (!hardsect_size[MAJOR])
 *	then 512 bytes is assumed.
 * else
 *	sector_size is hardsect_size[MAJOR][MINOR]
 * This is currently set by some scsi devices and read by the msdos fs driver.
 * Other uses may appear later.
 */
int * hardsect_size[MAX_BLKDEV];
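
As a rough illustration of the fallbacks those comments describe, here is a user-space sketch. The demo_* names and the four-entry tables are made up for the example; only the 1024 and 512 defaults and the "no table means no size check" rule come from the comments above. Returning -1 for "no size known" is this sketch's own convention.

```c
#include <stddef.h>

enum { DEMO_MAJORS = 4 };   /* hypothetical; far smaller than MAX_BLKDEV */

static int *demo_blk_size[DEMO_MAJORS];      /* device size, 1024-byte units */
static int *demo_blksize_size[DEMO_MAJORS];  /* soft block size, bytes */
static int *demo_hardsect_size[DEMO_MAJORS]; /* hardware sector size, bytes */

/* Soft block size, falling back to 1024 when the major has no table. */
static int demo_block_size(int major, int minor)
{
    return demo_blksize_size[major] ? demo_blksize_size[major][minor] : 1024;
}

/* Hardware sector size, falling back to 512 when the major has no table. */
static int demo_hardsect(int major, int minor)
{
    return demo_hardsect_size[major] ? demo_hardsect_size[major][minor] : 512;
}

/* Device size in KB, or -1 when the major has no table (no size checking). */
static long demo_size_kb(int major, int minor)
{
    return demo_blk_size[major] ? demo_blk_size[major][minor] : -1;
}
```

With all three tables left empty, a lookup for any major reports a 1024-byte block, a 512-byte hardware sector and no known size.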


read_ahead there is a blunder: it's a "write-only" array anyway.
Initialize it to the system default and don't worry about it.

2001-11-12 22:14:53

by Peter T. Breuer

Subject: Re: what is the current meaning of blk_size?

"A month of sundays ago Martin Dalecki wrote:"
> "Peter T. Breuer" wrote:
> > Is blk_size[][] supposed to contain the size in KB or blocks?
> There is no rumor it's in blocks.
>
> There are three levels of block size measurement in Linux:
>
> 1. FS level. (Page-sized chunks most of the time; the main exceptions
> are dos and isofs.)
> 2. Driver level. (Nearly always 1024; the main exception is ATAPI
> cdrom.)
> 3. Hardware device level. (Nearly always 512; only prehistoric SCSI
> drives from the stone age are exceptional and provide 256-byte
> blocks. We discovered them during our archeological efforts recently,
> under a thick layer of mud...)
>
> It's left as an exercise to the reader which one of the sparse
> matrices in ll_rw_blk.c is declaring which ;-).
> ...

Uh, thanks! I was looking at fs/block_dev.c.

if (blk_size[MAJOR(dev)])
size = ((loff_t) blk_size[MAJOR(dev)][MINOR(dev)] << BLOCK_SIZE_BITS) >> blocksize_bits;

which sets the size to the entered blk_size << (10 - blocksize_bits).

I missed that BLOCK_SIZE_BITS was constant but blocksize_bits is variable.
Amongst other things.
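
To see what that expression yields for different block sizes, a minimal sketch, assuming the blk_size entry is a count of 1024-byte units as the arithmetic suggests. demo_size_in_blocks and DEMO_BLOCK_SIZE_BITS are illustrative names, not kernel identifiers; the 64-bit cast plays the role of the loff_t cast in the quoted line.

```c
#include <stdint.h>

#define DEMO_BLOCK_SIZE_BITS 10   /* mirrors BLOCK_SIZE_BITS in linux/fs.h */

/* Number of soft blocks on a device whose blk_size entry is size_kb,
 * for a soft block size of 2^blocksize_bits bytes. */
static int64_t demo_size_in_blocks(int size_kb, unsigned blocksize_bits)
{
    /* Widen first so the left shift cannot overflow a 32-bit int. */
    return ((int64_t)size_kb << DEMO_BLOCK_SIZE_BITS) >> blocksize_bits;
}
```

For a 1000 KB device this gives 2000 blocks of 512 bytes, 1000 blocks of 1024 bytes and 250 blocks of 4096 bytes: the stored number stays the same while the block count follows the block size.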


> OK I was too fast to figure it out:
>
> /*
> * blk_size contains the size of all block-devices in units of 1024 byte
> * sectors:

But this is not so .. it is the default, not the rule. And it is only
the default if the block size is the default value.

> int * blk_size[MAX_BLKDEV];
>
> /*
> * blksize_size contains the size of all block-devices:

Err .... they mean the BLOCK SIZE of all ...

> int * blksize_size[MAX_BLKDEV];
>
> /*
> * hardsect_size contains the size of the hardware sector of a device.

Never used. Thanks for clearing that up!

If you knew if the meaning of blk_size had ever changed, and when in
terms of kernel version, that would also be very very helpful.

Peter

2001-11-13 15:09:23

by Peter T. Breuer

Subject: Re: what is the current meaning of blk_size?

"ptb wrote:"
> "Martin Dalecki wrote:"
> > "Peter T. Breuer" wrote:
> > > Is blk_size[][] supposed to contain the size in KB or blocks?
> > There is no rumor it's in blocks.

Nevertheless, experiments on 2.4.3 appear to show it is still in KB
there.

> Uh, thanks! I was looking at fs/block_dev.c.
>
> if (blk_size[MAJOR(dev)])
> size = ((loff_t) blk_size[MAJOR(dev)][MINOR(dev)] << BLOCK_SIZE_BITS) >> blocksize_bits;
>
> which sets the size to the entered blk_size << (10 - blocksize_bits).
>
> I missed that BLOCK_SIZE_BITS was constant but blocksize_bits is variable.
> Amongst other things.

Thing is, in my driver I have now changed from setting blk_size in KB
to setting it in blocks instead (while keeping the blksize the same) and
the result is that, using lseek, the device measures 1/4 the size
it really is. This is in kernel 2.4.3.

If I look in ll_rw_blk.c, I see, for example:

if (blk_size[major]) {
unsigned long maxsector = (blk_size[major][MINOR(bh->b_rdev)] << 1) + 1;
// (ptb) 1ST SECTOR BEYOND END OF DISK

which implies to me that blk_size is still in KB there.

BTW, I don't know why there should be a +1 at the end. The code goes on
to say:

unsigned long sector = bh->b_rsector; // (ptb) 1ST SECTOR ON DISK
unsigned int count = bh->b_size >> 9; // (ptb) SECTORS IN BUFFER

if (maxsector < count || maxsector - count < sector) {
bh->b_state &= (1 << BH_Lock) | (1 << BH_Mapped);
... good stuff ...

So we look for the number of sectors in the buffer to be _greater_than_
the number of sectors in the device _plus 1_. It should be
_greater_than_or_equal_to_ ... _plus_1_. But even so it's meaningless.
What we want is to check whether the buffer contents will overflow
the disk.

I'm not too sure about the other half of the condition either. This
is surely what I mentioned above: sector + count > maxsector?

But again it should be >=. If we are on sector 0 of a 2 sector disk,
and we try to write 3 sectors, then sector=0, count=3, and maxsector=3,
and 0+3 is not > 3, so the condition would not trigger, while we want it to.
So it should be >=, not >.

I believe the 1st check is merely a faster calculation and is backed up
by the second check. However, the second check must be right!
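
The case described above can be replayed in user space. This is only a sketch of the quoted condition with hypothetical helper names; it takes the device size in KB and compares the limit as computed in the quoted code against a limit without the extra sector.

```c
#include <stdbool.h>

/* True when a request for `count` sectors starting at `sector` is
 * rejected by a check of the quoted form, given `maxsector`. */
static bool demo_rejected(unsigned long maxsector, unsigned long sector,
                          unsigned long count)
{
    return maxsector < count || maxsector - count < sector;
}

/* Limit as in the quoted code: two sectors per KB, plus one. */
static unsigned long demo_limit_plus_one(unsigned long size_kb)
{
    return (size_kb << 1) + 1;
}

/* Limit without the extra sector: two sectors per KB. */
static unsigned long demo_limit_exact(unsigned long size_kb)
{
    return size_kb << 1;
}
```

For a 1 KB (2 sector) device, a 3 sector request at sector 0 passes the check with the "+ 1" limit and is rejected with the exact limit, while a 2 sector request passes in both cases. Dropping the "+ 1" or switching the comparison to >= would each close that gap in this toy model.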


> > OK I was too fast to figure it out:
> >
> > /*
> > * blk_size contains the size of all block-devices in units of 1024 byte
> > * sectors:
>
> But this is not so .. it is the default, not the rule. And it is only
> the default if the block size is the default value.
>
> > int * blk_size[MAX_BLKDEV];
> >
> > /*
> > * blksize_size contains the size of all block-devices:
>
> Err .... they mean the BLOCK SIZE of all ...

> If you knew if the meaning of blk_size had ever changed, and when in
> terms of kernel version, that would also be very very helpful.


Peter

2001-11-13 18:52:02

by Peter T. Breuer

Subject: blocks or KB? (was: .. current meaning of blk_size array)

Let me put it more plainly. Martin Dalecki + rumour assure me that the
blk_size array nowadays measures in blocks not KB, yet to me it seems that
it doesn't. Look at this code from ll_rw_blk.c in 2.4.13:

unsigned long maxsector = (blk_size[major][MINOR(bh->b_rdev)] << 1) + 1;

and this comment:

* blk_size contains the size of all block-devices in units of 1024
* byte sectors:

so blk_size measures in KB. Where do I see it wrong? Is everybody
talking about 2.4.14 and 2.4.15? No .. it's just the same in 2.4.14:

if (blk_size[major])
minorsize = blk_size[major][MINOR(bh->b_rdev)];
if (minorsize) {
unsigned long maxsector = (minorsize << 1) + 1;

KB! Or is it the case that sectors don't mean 512B?


Peter

2001-11-14 09:54:14

by Martin Dalecki

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

"Peter T. Breuer" wrote:
>
> Let me put it more plainly. Martin Dalecki + rumour assure me that the
> blk_size array nowadays measures in blocks not KB, yet to me it seems that

sectors = 512 per default
blocks = 1024 per default.


Never said anything else.

Look at the initialization point for the arrays. They all use constants
which you can look up in the kernel headers.

./linux/fs.h:#define BLOCK_SIZE_BITS 10
./linux/fs.h:#define BLOCK_SIZE (1<<BLOCK_SIZE_BITS)

Which means 1024 bytes for blk_size as default value.

> it doesn't. Look at this code from ll_rw_blk.c in 2.4.13:


--
- phone: +49 214 8656 283
- job: eVision-Ventures AG, LEV .de (MY OPINIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R

2001-11-14 20:41:30

by Peter T. Breuer

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

"Martin Dalecki wrote:"
> "Peter T. Breuer" wrote:
> >
> > Let me put it more plainly. Martin Dalecki + rumour assure me that the
> > blk_size array nowadays measures in blocks not KB, yet to me it seems that
>
> sectors = 512 per default
> blocks = 1024 per default.

I know that! But it's irrelevant. What I need to know is if blk_size is
still counting in KB, or if it has switched to blocks.

> Never said anything else.

Err .. you said that blk_size is now measured in blocks, not KB. You
said that the rumour is true.

"A month of sundays ago Martin Dalecki wrote:"
> "Peter T. Breuer" wrote:
> > Is blk_size[][] supposed to contain the size in KB or blocks?
> There is no rumor it's in blocks.

Maybe I misinterpret what you write. I interpret it as meaning "the
rumour is not a rumour but a fact. It is in blocks".

> Look at the initialization point for the arrays. They all use constants
> which you can look up in the kernel headers.

I _know_ that. It's irrelevant.

The point is that if blk_size counts in KB, then the size of a device
cannot reach more than 2^32 * 2^10 = 2^42 = 4TB. I'd personally say
2TB, because the int blk_size number is signed.

That's rumoured not to be the case, and the max size of a device is
supposed to be about 8 to 16TB. Let's suppose the rumour is true ..

So we deduce that one has to assign a different meaning for the blk_size
array. "count in blocks" is how the rumour goes. That way you can get
4 times higher sizes .. all the way to 8 or 16TB per device. And this
is what is rumoured to be the case.
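
The two readings can be put side by side in a small sketch. The function names are hypothetical, and the 4096-byte block size is an assumption chosen because it is the one that reproduces the "4 times higher" figure.

```c
#include <stdint.h>

/* Upper bound in bytes if the count is in 1024-byte units. */
static uint64_t demo_limit_if_kb(unsigned count_bits)
{
    return ((uint64_t)1 << count_bits) * 1024u;
}

/* Upper bound in bytes if the count is in blocks of `blksize` bytes. */
static uint64_t demo_limit_if_blocks(unsigned count_bits, unsigned blksize)
{
    return ((uint64_t)1 << count_bits) * blksize;
}
```

With 31 usable bits the KB reading tops out at 2TB and the 4096-byte block reading at 8TB; with 32 bits the figures are 4TB and 16TB.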

Is it or is it not so? A straight answer from the list would be nice!

> ./linux/fs.h:#define BLOCK_SIZE_BITS 10
> ./linux/fs.h:#define BLOCK_SIZE (1<<BLOCK_SIZE_BITS)

These are _defaults_ for _blksize_. Sure you can change it as you like,
but according to the "blk_size is in KB" hypothesis, this matters not one
iota to the size limit on devices. Change blksize and size does not
change. But according to the "blk_size is in blocks" hypothesis, yes
changing blksize will change the size of the device. Testing shows
that scenario "blk_size is in KB" is true.

Am I making plain the difference between blk_size and blksize?

blk_size is the number of blocks or KB (which?) in a device. blksize is
the size of the blocks. Is blk_size in KB or blocks?

It should be in blocks if the size of a device is to reach 8 or 16TB.
If it is in KB, we are limited to 2 or 4TB.

> Which means 1024 bytes for blk_size as default value.

But so what? That doesn't answer the question of whether blk_size
is in blocks or not.


> > it doesn't. Look at this code from ll_rw_blk.c in 2.4.[14]:

Look at it:

if (blk_size[major])
minorsize = blk_size[major][MINOR(bh->b_rdev)];
if (minorsize) {
unsigned long maxsector = (minorsize << 1) + 1;

This clearly hardcodes blk_size as measuring in units of 2 sectors, no
matter what we set for blksize. It should be, in my view

unsigned long maxsector =
minorsize * blksize_size[major][MINOR(bh->b_rdev)] + 1;

or no device can be larger than 4TB. And neither can a filesystem, and
neither can a file ...

Now, I know I can write my own generic_make_request() code, but I have
no intention of maintaining it through different kernel versions just
to get the right size measurement. Besides, it's everyone's problem.

Persuade me that this is not a bug, and an important one at that :-)
Hellloooooo everybody! Linux cannot manage partitions greater than
4TB, ha ha ha hhhhhaaaa! ;-)

I at least am getting up to device sizes in the 8TB range.

Best wishes!

Peter

2001-11-14 21:01:10

by Martin Dalecki

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

"Peter T. Breuer" wrote:
>
> "Martin Dalecki wrote:"
> > "Peter T. Breuer" wrote:
> > >
> > > Let me put it more plainly. Martin Dalecki + rumour assure me that the
> > > blk_size array nowadays measures in blocks not KB, yet to me it seems that
> >
> > sectors = 512 per default
> > blocks = 1024 per default.
>
> I know that! But it's irrelevant. What I need to know is if blk_size is
> still counting in KB, or if it has switched to blocks.
>
> > Never said anything else.
>
> Err .. you said that blk_size is now measured in blocks, not KB. You
> said that the rumour is true.
>
> "A month of sundays ago Martin Dalecki wrote:"
> > "Peter T. Breuer" wrote:
> > > Is blk_size[][] supposed to contain the size in KB or blocks?
> > There is no rumor it's in blocks.
>
> Maybe I misinterpret what you write. I interpret it as meaning "the
> rumour is not a rumour but a fact. It is in blocks".
>
> > Look at the initialization point for the arrays. They all use constants
> > which you can look up in the kernel headers.
>
> I _know_ that. It's irrelevant.
>
> The point is that if blk_size counts in KB, then the size of a device
> cannot reach more than 2^32 * 2^10 = 2^42 = 4TB. I'd personally say
> 2TB, because the int blk_size number is signed.
>
> That's rumoured not to be the case, and the max size of a device is
> supposed to be about 8 to 16TB. Let's suppose the rumour is true ..
>
> So we deduce that one has to assign a different meaning for the blk_size
> array. "count in blocks" is how the rumour goes. That way you can get
> 4 times higher sizes .. all the way to 8 or 16TB per device. And this
> is what is rumoured to be the case.
>
> Is it or is it not so? A straight answer from the list would be nice!
>
> > ./linux/fs.h:#define BLOCK_SIZE_BITS 10
> > ./linux/fs.h:#define BLOCK_SIZE (1<<BLOCK_SIZE_BITS)
>
> These are _defaults_ for _blksize_. Sure you can change it as you like,
> but according to the "blk_size is in KB" hypothesis, this matters not one
> iota to the size limit on devices. Change blksize and size does not
> change. But according to the "blk_size is in blocks" hypothesis, yes
> changing blksize will change the size of the device. Testing shows
> that scenario "blk_size is in KB" is true.
>
> Am I making plain the difference between blk_size and blksize?
>
> blk_size is the number of blocks or KB (which?) in a device. blksize is
> the size of the blocks. Is blk_size in KB or blocks?
>
> It should be in blocks if the size of a device is to reach 8 or 16TB.
> If it is in KB, we are limited to 2 or 4TB.
>
> > Which means 1024 bytes for blk_size as default value.
>
> But so what? That doesn't answer the question of whether blk_size
> is in blocks or not.
>
> > > it doesn't. Look at this code from ll_rw_blk.c in 2.4.[14]:
>
> Look at it:
>
> if (blk_size[major])
> minorsize = blk_size[major][MINOR(bh->b_rdev)];
> if (minorsize) {
> unsigned long maxsector = (minorsize << 1) + 1;
>
> This clearly hardcodes blk_size as measuring in units of 2 sectors, no
> matter what we set for blksize. It should be, in my view
>
> unsigned long maxsector =
> minorsize * blksize_size[major][MINOR(bh->b_rdev)] + 1;
>
> or no device can be larger than 4TB. And neither can a filesystem, and
> neither can a file ...
>
> Now, I know I can write my own generic_make_request() code, but I have
> no intention of maintaining it through different kernel versions just
> to get the right size measurement. Besides, it's everyone's problem.
>
> Persuade me that this is not a bug, and an important one at that :-)
> Hellloooooo everybody! Linux cannot manage partitions greater than
> 4TB, ha ha ha hhhhhaaaa! ;-)
>
> I at least am getting up to device sizes in the 8TB range.
>

The usage of it in block_dev.c shows that in fact matters are more
complicated than all your hypotheses together...

	if (blk_size[MAJOR(dev)])
		size = ((loff_t) blk_size[MAJOR(dev)][MINOR(dev)] << BLOCK_SIZE_BITS)
			>> blocksize_bits;
	else
		size = INT_MAX;

In 90 out of 100 cases blk_size is in units of 1024 bytes,
which is the default *logical* block size used by Linux.
When this overflows, the block device layer simply will not care
a damn bit about it and will rely on the driver to notice the overflow.
Therefore the answer is that yes, it is in units of KB, but Linux will
still happily work with devices bigger than this.

Correct me please if I'm wrong... Slowly I am starting to look puzzled myself.
If this is the case, you can regard blk_size as the same kind of silly
blunder as the read_ahead array.



> Best wishes!
>
> Peter

--
- phone: +49 214 8656 283
- job: eVision-Ventures AG, LEV .de (MY OPINIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R

2001-11-14 21:22:21

by Andreas Dilger

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

On Nov 14, 2001 21:41 +0100, Peter T. Breuer wrote:
> "A month of sundays ago Martin Dalecki wrote:"
> > "Peter T. Breuer" wrote:
> > > Is blk_size[][] supposed to contain the size in KB or blocks?
> > There is no rumor it's in blocks.
>
> Maybe I misinterpret what you write. I interpret it as meaning "the
> rumour is not a rumour but a fact. It is in blocks".

Check what /proc/partitions shows us. #blocks, with units of 1kB.
This has been standard in the kernel for a loooong time.

> The point is that if blk_size counts in KB, then the size of a device
> cannot reach more than 2^32 * 2^10 = 2^42 = 4TB. I'd personally say
> 2TB, because the int blk_size number is signed.
>
> That's rumoured not to be the case, and the max size of a device is
> supposed to be about 8 to 16TB. Let's suppose the rumour is true ..

Well, the rumor is wrong. There has always been a single-device 1TB/2TB
limit in the kernel (2^31 or 2^32 * 512 byte sector size), and until
recently it has not been a problem. To remove the problem Jens Axboe
(I think, or Ben LaHaise, can't remember) has a patch to support 64-bit
block counts and has been tested with > 2TB devices.

> So we deduce that one has to assign a different meaning for the blk_size
> array. "count in blocks" is how the rumour goes. That way you can get
> 4 times higher sizes .. all the way to 8 or 16TB per device. And this
> is what is rumoured to be the case.

Where do you get these rumors?

> It should be in blocks if the size of a device is to reach 8 or 16TB.
> If it is in KB, we are limited to 2 or 4TB.

In theory this is possible (it was discussed on the LVM list a bit), but
it would take a bunch of work to make it real. For LVM (and MD RAID),
since we are dealing with multiple real devices < 2TB in size, we could
use a blocksize of 4kB to get a larger virtual device. In the end this
only wins for a short time and you need 64-bit block numbers anyways.

> Persuade me that this is not a bug, and an important one at that :-)
> Hellloooooo everybody! Linux cannot manage partitions greater than
> 4TB, ha ha ha hhhhhaaaa! ;-)

And it can't handle more than 64GB of RAM on ia32 (was previously 1GB).
So what? When a limit is reached for any reasonable number of people,
it is fixed.

> I at least am getting up to device sizes in the 8TB range.

If you are in that ballpark, then get the 64-bit blocknumber patch, and
start testing/fixing, instead of complaining.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-14 21:50:25

by Benjamin LaHaise

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

On Wed, Nov 14, 2001 at 02:16:39PM -0700, Andreas Dilger wrote:
> Well, the rumor is wrong. There has always been a single-device 1TB/2TB
> limit in the kernel (2^31 or 2^32 * 512 byte sector size), and until
> recently it has not been a problem. To remove the problem Jens Axboe
> (I think, or Ben LaHaise, can't remember) has a patch to support 64-bit
> block counts and has been tested with > 2TB devices.

It was tested with a 10TB loopback raid, not a real device. Strangely,
nobody made any effort to test on real physical hardware (or offer any
hardware for me to test on ;-). The patch was against ~2.4.6 and will
need to get dusted off again soon.

-ben
--
Fish.

2001-11-14 22:33:45

by Scott Laird

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)



On Wed, 14 Nov 2001, Benjamin LaHaise wrote:
>
> On Wed, Nov 14, 2001 at 02:16:39PM -0700, Andreas Dilger wrote:
> > Well, the rumor is wrong. There has always been a single-device 1TB/2TB
> > limit in the kernel (2^31 or 2^32 * 512 byte sector size), and until
> > recently it has not been a problem. To remove the problem Jens Axboe
> > (I think, or Ben LaHaise, can't remember) has a patch to support 64-bit
> > block counts and has been tested with > 2TB devices.
>
> It was tested with a 10TB loopback raid, not a real device. Strangely,
> nobody made any effort to test on real physical hardware (or offer any
> hardware for me to test on ;-). The patch was against ~2.4.6 and will
> need to get dusted off again soon.
>

Interesting. I have a couple of 14x 100GB IDE boxes scheduled to show
up next week. If I can get a patch for a reasonably recent kernel, I
could do a few tests on a ~1.2 TB FS, and maybe on one a bit bigger.

Once 160GB drives start shipping, it should be possible to make a 2TB
software RAID5 box in a 4U case for around $7k.

Interesting question: does Linux have problems with large NFS imports?


Scott

2001-11-15 01:50:10

by William Park

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

On Wed, Nov 14, 2001 at 02:16:39PM -0700, Andreas Dilger wrote:
> > I at least am getting up to device sizes in the 8TB range.
>
> If you are in that ballpark, then get the 64-bit blocknumber patch,
> and start testing/fixing, instead of complaining.
>
> Cheers, Andreas

Hi Andreas, can you give us URL for this 64-bit patch? I also want to
go past 1TB (512 * 2^31) filesystem size.

--
William Park, Open Geometry Consulting, <[email protected]>.
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin

2001-11-15 05:29:24

by Andreas Dilger

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

On Nov 14, 2001 20:48 -0500, William Park wrote:
> On Wed, Nov 14, 2001 at 02:16:39PM -0700, Andreas Dilger wrote:
> > > I at least am getting up to device sizes in the 8TB range.
> >
> > If you are in that ballpark, then get the 64-bit blocknumber patch,
> > and start testing/fixing, instead of complaining.
>
> Hi Andreas, can you give us URL for this 64-bit patch? I also want to
> go past 1TB (512 * 2^31) filesystem size.

I don't have it, try a search of the l-k archives, around June of this year.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-15 05:35:23

by William Park

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

On Wed, Nov 14, 2001 at 09:41:11PM +0100, Peter T. Breuer wrote:
> Am I making plain the difference between blk_size and blksize?
>
> blk_size is the number of blocks or KB (which?) in a device. blksize is
> the size of the blocks. Is blk_size in KB or blocks?
>
> It should be in blocks if the size of a device is to reach 8 or 16TB.
> If it is in KB, we are limited to 2 or 4TB.

I've been following this thread intensely. I need to use Network Block
Device to get very large network-RAID. And, resolution to this issue is
of great interest to me.

Judging by 'drivers/block/nbd.c', it counts by BLOCK_SIZE=1024
(BLOCK_SIZE_BITS=10), even though you can set the block size to
[512,1024,...,PAGE_SIZE=4096]. Since NBD counts this 1KB block using
'u64' integer, the ultimate size of filesystem is determined by the
kernel block device support.

Looking at 'fs/block_dev.c', you can set the block size to
[512,1024,...,PAGE_SIZE=4096] also. But, 'max_block()' returns block
count in whatever block size of the device, not in BLOCK_SIZE:

static unsigned long max_block(kdev_t dev)
{
	unsigned int retval = ~0U;
	int major = MAJOR(dev);

	if (blk_size[major]) {
		int minor = MINOR(dev);
		unsigned int blocks = blk_size[major][minor];
		if (blocks) {
			unsigned int size = block_size(dev);
			unsigned int sizebits = blksize_bits(size);
			blocks += (size-1) >> BLOCK_SIZE_BITS;
			retval = blocks << (BLOCK_SIZE_BITS - sizebits);
			if (sizebits > BLOCK_SIZE_BITS)
				retval = blocks >> (sizebits - BLOCK_SIZE_BITS);
		}
	}
	return retval;
}

In particular, if block size is 512, then the block count is multiplied
by 2; and if block size is 4096, then the block count is divided by 4.
It thinks that 'blk_size[][]' is block count in KB. So, I can only
deduce that block count is in KB.
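
That reading can be checked with a user-space copy of the arithmetic. This is a sketch, not the kernel function: the KB count and block size are passed in directly, demo_blksize_bits is a hypothetical stand-in for the kernel's blksize_bits(), and the two shifts are split into an if/else so that no shift by a negative amount is evaluated.

```c
#define DEMO_BLOCK_SIZE_BITS 10   /* mirrors BLOCK_SIZE_BITS */

/* log2 of a power-of-two block size (stand-in for blksize_bits()). */
static unsigned demo_blksize_bits(unsigned size)
{
    unsigned bits = 0;
    while (size > 1) {
        size >>= 1;
        bits++;
    }
    return bits;
}

/* Block count for a device of size_kb KB and the given block size,
 * following the arithmetic of the quoted max_block(). */
static unsigned demo_max_block(unsigned size_kb, unsigned block_size)
{
    unsigned retval = ~0U;   /* "no limit" when the size is unknown */

    if (size_kb) {
        unsigned sizebits = demo_blksize_bits(block_size);
        /* Round up so a trailing partial block is still counted. */
        unsigned blocks = size_kb + ((block_size - 1) >> DEMO_BLOCK_SIZE_BITS);

        if (sizebits > DEMO_BLOCK_SIZE_BITS)
            retval = blocks >> (sizebits - DEMO_BLOCK_SIZE_BITS);
        else
            retval = blocks << (DEMO_BLOCK_SIZE_BITS - sizebits);
    }
    return retval;
}
```

For a 1000 KB device the sketch returns 2000 with 512-byte blocks, 1000 with 1024-byte blocks and 250 with 4096-byte blocks, which is only consistent if the stored value is a KB count.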

Also, from 'include/linux/blkdev.h',
extern int * blk_size[MAX_BLKDEV];
'blk_size[][]' is 'int', which means maximum size of block device is
2^10 x 2^31 = 2^41 = 2TB. However, because it is always converted to
'unsigned int' for block count calculation, I think you can take it as
4TB.

Am I right?

--
William Park, Open Geometry Consulting, <[email protected]>.
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin

2001-11-15 05:56:00

by Andreas Dilger

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

On Nov 15, 2001 00:34 -0500, William Park wrote:
> Judging by 'drivers/block/nbd.c', it counts by BLOCK_SIZE=1024
> (BLOCK_SIZE_BITS=10), even though you can set the block size to
> [512,1024,...,PAGE_SIZE=4096]. Since NBD counts this 1KB block using
> 'u64' integer, the ultimate size of filesystem is determined by the
> kernel block device support.
>
> Looking at 'fs/block_dev.c', you can set the block size to
> [512,1024,...,PAGE_SIZE=4096] also. But, 'max_block()' returns block
> count in whatever block size of the device, not in BLOCK_SIZE:

Sadly, while you _might_ be able to change the BLOCK_SIZE to be something
other than 1kB, there are probably so many places that assume a 1kB size
that you will need a lot of fixing. I'm not saying that fixing these
things is bad (it would actually be good for many reasons), but just a
heads-up that changing the BLOCK_SIZE define _probably_ won't get you 8TB
devices (maybe a broken system, or corrupt fs instead). Use caution.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-15 10:43:16

by Anton Altaparmakov

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

At 05:55 15/11/01, Andreas Dilger wrote:
>On Nov 15, 2001 00:34 -0500, William Park wrote:
> > Judging by 'drivers/block/nbd.c', it counts by BLOCK_SIZE=1024
> > (BLOCK_SIZE_BITS=10), even though you can set the block size to
> > [512,1024,...,PAGE_SIZE=4096]. Since NBD counts this 1KB block using
> > 'u64' integer, the ultimate size of filesystem is determined by the
> > kernel block device support.
> >
> > Looking at 'fs/block_dev.c', you can set the block size to
> > [512,1024,...,PAGE_SIZE=4096] also. But, 'max_block()' returns block
> > count in whatever block size of the device, not in BLOCK_SIZE:
>
>Sadly, while you _might_ be able to change the BLOCK_SIZE to be something
>other than 1kB, there are probably so many places that assume a 1kB size
>that you will need a lot of fixing. I'm not saying that fixing these
>things is bad (it would actually be good for many reasons), but just a
>heads-up that changing the BLOCK_SIZE define _probably_ won't get you 8TB
>devices (maybe a broken system, or corrupt fs instead). Use caution.

I changed BLOCK_SIZE to 512 back in 2.4.0-test8 and had to do some
modifications to drivers/ide, drivers/scsi, fs/partitions and to fs/ext2 to
get it to work (patch is 10kiB so not too bad but it doesn't deal with the
MD driver nor with any of the devices/fs I don't actually use). It then
worked nicely for me. (Only minor problem with floppy disk resulting in a
block size error from ll_rw_block but it always went ahead and worked after
outputting the error.)

And yes, the fixes needed are mostly because of assumptions about
BLOCK_SIZE being 1024 bytes... If anyone is interested in having a look,
the now outdated patch is available on the web:

http://www-stu.christs.cam.ac.uk/~aia21/linux/blksize512.patch

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2001-11-15 12:36:16

by Peter T. Breuer

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

"A month of sundays ago William Park wrote:"
> On Wed, Nov 14, 2001 at 09:41:11PM +0100, Peter T. Breuer wrote:
> > blk_size is the number of blocks or KB (which?) in a device. blksize is
> > the size of the blocks. Is blk_size in KB or blocks?
> >
> > It should be in blocks if the size of a device is to reach 8 or 16TB.
> > If it is in KB, we are limited to 2 or 4TB.
>
> I've been following this thread intensely. I need to use Network Block
> Device to get very large network-RAID. And, resolution to this issue is
> of great interest to me.

Yes, well, you may be the person on whose behalf I started checking it
out.

To put your mind at rest, all devices (partitions, etc.) are limited by
the 32 bit int that holds the number of sectors on a device. This is
31+9 bits of space (I don't know whether negative sectors are counted
as positive :), which is 1 (or 2, if you use unsigned interpretation)
TB.

So the blk_size/blksize business is irrelevant.

> Judging by 'drivers/block/nbd.c', it counts by BLOCK_SIZE=1024
> (BLOCK_SIZE_BITS=10), even though you can set the block size to
> [512,1024,...,PAGE_SIZE=4096]. Since NBD counts this 1KB block using
> 'u64' integer, the ultimate size of filesystem is determined by the
> kernel block device support.

This is correct, but it's quite a deep dependence in the kernel.
Though nbd (and my enbd) use 64 bit sizes in their network protocols,
you can't get rid of the limitation just like that. The kernel is
infested with the limit associated with a 32bit sector count. The
32bit KB count in blk_size is also a limit, but never an active one
as the sector count bites first, before the other is reached. The
kernel's VM thinks those sector counts are 32 bit and that sectors
are 512B, which means the person in charge of the VM must handle any
changes.
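
A small sketch of why the sector count is the limit that bites first, assuming both counts are signed 32-bit values (31 usable bits) as discussed above. The function names are illustrative only.

```c
#include <stdint.h>

/* Bytes addressable by a 31-bit count of 512-byte sectors. */
static uint64_t demo_sector_limit(void)
{
    return ((uint64_t)1 << 31) * 512u;
}

/* Bytes describable by a 31-bit count of 1024-byte units. */
static uint64_t demo_kb_limit(void)
{
    return ((uint64_t)1 << 31) * 1024u;
}

/* The limit that applies in practice is whichever is reached first. */
static uint64_t demo_effective_limit(void)
{
    uint64_t s = demo_sector_limit();
    uint64_t k = demo_kb_limit();
    return s < k ? s : k;
}
```

The sector limit works out to 1TB and the KB limit to 2TB, so the effective ceiling is 1TB regardless of how blk_size is interpreted.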

To tell the truth, counting in 512B sectors is the only sane way to
go. It sends you mad counting in units of blocks, because that
can be a variable size.

> Looking at 'fs/block_dev.c', you can set the block size to
> [512,1024,...,PAGE_SIZE=4096] also. But, 'max_block()' returns block
> count in whatever block size of the device, not in BLOCK_SIZE:

It looks to me as though block_dev.c has been "prepared" to be more
flexible, and that it will be a short job to either use 64bit sector
counts for it, or move to counting in blocks. The same work
has not gone into ll_rw_blk.c yet.

> In particular, if block size is 512, then the block count is multiplied
> by 2; and if block size is 4096, then the block count is divided by 4.
> It thinks that 'blk_size[][]' is block count in KB. So, I can only
> deduce that block count is in KB.

It still is, yes.
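
For illustration, the conversion described in the quoted text (multiply
by 2 for 512-byte blocks, divide by 4 for 4096-byte blocks) can be
sketched as below. This is not the code of 'max_block()'; the function
name kb_to_blocks is hypothetical, and rounding down for large blocks is
an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a device size given in 1 KB units (the unit blk_size[][] is
 * said to use) into a count of blocks of `block_bytes` bytes each.
 * block_bytes must be a power of two between 512 and 4096.
 * (Hypothetical helper; rounds down when block_bytes > 1024.) */
static uint64_t kb_to_blocks(uint32_t size_kb, uint32_t block_bytes)
{
    assert(block_bytes >= 512 && block_bytes <= 4096);
    assert((block_bytes & (block_bytes - 1)) == 0);

    if (block_bytes < 1024)
        return (uint64_t)size_kb * (1024 / block_bytes);  /* 512 B: times 2 */
    return size_kb / (block_bytes / 1024);                /* 4096 B: divide by 4 */
}
```

A device of 1000 KB would then come out as 2000 blocks of 512 bytes,
1000 blocks of 1024 bytes, or 250 blocks of 4096 bytes.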

> Also, from 'include/linux/blkdev.h',
> extern int * blk_size[MAX_BLKDEV];
> 'blk_size[][]' is 'int', which means maximum size of block device is
> 2^10 x 2^31 = 2^41 = 2TB. However, because it is always converted to
> 'unsigned int' for block count calculation, I think you can take it as
> 4TB.
>
> Am I right?

As far as I can tell. I was trying to ask here if it had changed, but
evidently it has not.

What is the forward strategy? I see no alternative but moving to 64bit
sector counts.

Peter

2001-11-15 18:31:58

by William Park

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

On Thu, Nov 15, 2001 at 01:35:26PM +0100, Peter T. Breuer wrote:
> What is the forward strategy? I see no alternative but moving to 64bit
> sector counts.

Me too.

I looked around, and the 1KB block size is hard-coded in too many places.
For example, function 'generic_make_request()' in
'drivers/block/ll_rw_blk.c' assumes a 512-byte sector and a 1024-byte
block size:

    if (blk_size[major])
        minorsize = blk_size[major][MINOR(bh->b_rdev)];
    if (minorsize) {
        unsigned long maxsector = (minorsize << 1) + 1;    <--
        unsigned long sector = bh->b_rsector;
        unsigned int count = bh->b_size >> 9;

So, using 'u64 *blk_size[][]' seems to be the most straightforward
solution, leaving BLOCK_SIZE alone.
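
To make the point concrete, here is a small sketch of the same
KB-to-sector conversion done in 32-bit and in 64-bit arithmetic. The
function names are hypothetical and this is not the kernel's code; the
range check is only an assumption about how such a bound would be used.

```c
#include <stdint.h>

/* Sector bound for a device of `size_kb` units of 1 KB, following the
 * quoted `(minorsize << 1) + 1` but with a 64-bit size. */
static uint64_t max_sector(uint64_t size_kb)
{
    return (size_kb << 1) + 1;
}

/* The same expression in 32-bit arithmetic: wraps once size_kb
 * reaches 2^31 (i.e. 2 TB). */
static uint32_t max_sector32(uint32_t size_kb)
{
    return (uint32_t)(size_kb << 1) + 1u;
}

/* Returns 1 if a request of `bytes` bytes starting at `sector` stays
 * within the bound computed above (assumed check, for illustration). */
static int request_in_range(uint64_t size_kb, uint64_t sector, uint32_t bytes)
{
    uint64_t count = bytes >> 9;             /* length in 512-byte sectors */
    uint64_t maxsector = max_sector(size_kb);

    return maxsector >= count && sector <= maxsector - count;
}
```

For a size of 2^31 KB, max_sector32() wraps around to 1, whereas
max_sector() gives 2^32 + 1. That is the kind of spot where widening
only the array type would not be enough; the dependent arithmetic has to
be widened too.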

I thought 'drivers/block/nbd.c' was already using a 64-bit count,
according to the comment at its top. But, curiously, it reverts to an
'int' count of BLOCK_SIZE units. I tried searching the list archives for
a 64-bit patch, but had no luck.

Any URL would be helpful.

Is changing 'int' to 'u64' (and all the dependent code) enough to get
64-bit block devices? I'm willing to do the work. I don't care about
filesystem; that's the job for maintainer of particular filesystem. I
understand XFS is 64-bit, so I can use that.

--
William Park, Open Geometry Consulting, <[email protected]>.
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin

2001-11-15 20:20:05

by Andreas Dilger

Subject: Re: blocks or KB? (was: .. current meaning of blk_size array)

On Nov 15, 2001 13:31 -0500, William Park wrote:
> I looked around, and the 1KB block size is hard-coded in too many places.
> For example, function 'generic_make_request()' in
> 'drivers/block/ll_rw_blk.c' assumes a 512-byte sector and a 1024-byte
> block size:

Yes, it _would_ be nice to clean this up, but it is a lot of work. You
could check out Anton's patch (posted today) for this as a starting point.

> Is changing 'int' to 'u64' (and all the dependent code) enough to get
> 64-bit block devices? I'm willing to do the work.

It is already done, please don't duplicate. Search for 64 bit block
devices around June of this year for a URL to Jens'/Ben's patch. Please
repost the URL, as several people have asked.

> I don't care about filesystem; that's the job for maintainer of particular
> filesystem. I understand XFS is 64-bit, so I can use that.

FYI, ext2/ext3 _should_ be OK for filesystems up to 8TB (possibly 16TB,
depending on sign issues), with individual files limited to 2TB, when
using a 4kB block size. However, there appear to be other issues, such
as in the VFS and page cache, which may cause problems at this point as
well.
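
A quick sketch of where the 8TB and 16TB figures come from, assuming
they are simply a 31-bit or 32-bit block count multiplied by the 4kB
block size. The helper name fs_limit_tb is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TB ((uint64_t)1 << 40)

/* Size in TB of a filesystem holding 2^count_bits blocks of
 * `block_bytes` bytes each. (Hypothetical helper, for illustration.) */
static uint64_t fs_limit_tb(unsigned count_bits, uint32_t block_bytes)
{
    assert(count_bits <= 32);   /* keeps the product inside 64 bits */
    return (((uint64_t)1 << count_bits) * block_bytes) / TB;
}
```

fs_limit_tb(31, 4096) gives 8 and fs_limit_tb(32, 4096) gives 16. The
2TB per-file figure matches a 32-bit count of 512-byte units
(fs_limit_tb(32, 512) gives 2), though that explanation is an assumption
here and is not stated in the message.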

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-15 22:05:29

by William Park

Subject: Re: blocks or KB?

On Thu, Nov 15, 2001 at 01:19:38PM -0700, Andreas Dilger wrote:
> > Is changing 'int' to 'u64' (and all the dependent code) enough to
> > get 64-bit block devices? I'm willing to do the work.
>
> It is already done, please don't duplicate. Search for 64 bit block
> devices around June of this year for a URL to Jens'/Ben's patch.
> Please repost the URL, as several people have asked.

Found it -- http://people.redhat.com/bcrl/lb/. Strangely, it wasn't
in the linux-kernel list.

--
William Park, Open Geometry Consulting, <[email protected]>.
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin