Has anybody looked into working on NVMHCI support? It is a new
controller + new command set for direct interaction with non-volatile
memory devices:
http://download.intel.com/standards/nvmhci/spec.pdf
Although NVMHCI is nice from a hardware design perspective, it is a bit
problematic for Linux because
* NVMHCI might be implemented as part of an AHCI controller's
register set, much like how Marvell's AHCI clones implement
a PATA port: with wholly different per-port registers
and DMA data structures, buried inside the standard AHCI
per-port interrupt dispatch mechanism.
Or, NVMHCI might be implemented as its own PCI device,
wholly independent from the AHCI PCI device.
The per-port registers and DMA data structures remain the same
whether or not NVMHCI is embedded within AHCI.
* NVMHCI introduces a brand new command set, completely
incompatible with ATA or SCSI. Presumably it is tuned
specifically for non-volatile memory.
* The sector size can vary wildly from device to device. There
is no 512-byte legacy to deal with, for a brand new
command set. We should handle this OK, but...... who knows
until you try.
The spec describes the sector size as
"512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach
"etc" territory.
Here is my initial idea:
- Move 95% of ahci.c into libahci.c.
This will make implementation of AHCI-and-more devices like
NVMHCI (AHCI 1.3) and Marvell much easier, while avoiding
the cost of NVMHCI or Marvell support, for those users without
such hardware.
- ahci.c becomes a tiny stub with a pci_device_id match table,
calling functions in libahci.c.
- I can move my libata-dev.git#mv-ahci-pata work, recently refreshed,
into mv-ahci.c.
- nvmhci.c implements the NVMHCI controller standard. Maybe referenced
from ahci.c, or used standalone.
- nvmhci-blk.c implements a block device for NVMHCI-attached devices,
using the new NVMHCI command set.
With a brand new command set, might as well avoid SCSI completely IMO,
and create a brand new block device.
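To make the nvmhci-blk piece a bit more concrete, here is a very rough
sketch of what the registration side might look like against today's
block API -- every nvmhci_* name, the struct nvmhci_port fields and the
request handler are made up for illustration, and error unwinding is
omitted:

#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/blkdev.h>
#include <linux/genhd.h>

static DEFINE_SPINLOCK(nvmhci_lock);

static struct block_device_operations nvmhci_blk_fops = {
        .owner = THIS_MODULE,
};

/* translate queued requests into NVMHCI commands (stub) */
static void nvmhci_request(struct request_queue *q)
{
}

static int nvmhci_blk_register(struct nvmhci_port *port)
{
        struct gendisk *disk;
        int major;

        major = register_blkdev(0, "nvmhci");
        if (major < 0)
                return major;

        port->queue = blk_init_queue(nvmhci_request, &nvmhci_lock);
        if (!port->queue)
                return -ENOMEM;

        /* advertise whatever sector size the device reports */
        blk_queue_hardsect_size(port->queue, port->nvm_sector_size);

        disk = alloc_disk(16);
        if (!disk)
                return -ENOMEM;

        disk->major        = major;
        disk->first_minor  = 0;
        disk->fops         = &nvmhci_blk_fops;
        disk->queue        = port->queue;
        disk->private_data = port;
        sprintf(disk->disk_name, "nvmhci0");

        /* set_capacity() always takes 512-byte units */
        set_capacity(disk, port->nr_nvm_sectors *
                           (port->nvm_sector_size >> 9));
        add_disk(disk);
        return 0;
}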
Open questions are...
1) When will we see hardware? This is a feature newly introduced in
AHCI 1.3. AHCI 1.3 spec is public, but I have not seen any machines
yet. http://download.intel.com/technology/serialata/pdf/rev1_3.pdf
My ICH10 box uses AHCI 1.2 (dmesg | grep '^ahci'):
> ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA mode
> ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems
2) Has anyone else started working on this? All relevant specs are
public on intel.com.
3) Are there major objections to doing this as a native block device (as
opposed to faking SCSI, for example...) ?
Thanks,
Jeff (engaging in some light Saturday reading...)
> The spec describes the sector size as
> "512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach
> "etc" territory.
Over 4K will be fun.
> - ahci.c becomes a tiny stub with a pci_device_id match table,
> calling functions in libahci.c.
It needs to be a little bit bigger because of the folks wanting to do
non-PCI AHCI, so you need a little bit of PCI wrapping etc
> With a brand new command set, might as well avoid SCSI completely IMO,
> and create a brand new block device.
Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8)
Alan
Alan Cox wrote:
>> The spec describes the sector size as
>> "512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach
>> "etc" territory.
>
> Over 4K will be fun.
>
>> - ahci.c becomes a tiny stub with a pci_device_id match table,
>> calling functions in libahci.c.
>
> It needs to be a little bit bigger because of the folks wanting to do
> non-PCI AHCI, so you need a little bit of PCI wrapping etc
True...
>> With a brand new command set, might as well avoid SCSI completely IMO,
>> and create a brand new block device.
>
> Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8)
Perhaps... from what I can tell, this is a direct, asynchronous NVM
interface. It appears to lack any concept of bus or bus enumeration.
No worries about link up/down, storage device hotplug, etc. (you still
have PCI hotplug case, of course)
Jeff
On Sat, 11 Apr 2009, Alan Cox wrote:
>
> > The spec describes the sector size as
> > "512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach
> > "etc" territory.
>
> Over 4K will be fun.
And by "fun", you mean "irrelevant".
If anybody does that, they'll simply not work. And it's not worth it even
trying to handle it.
That said, I'm pretty certain Windows has the same 4k issue, so we can
hope nobody will ever do that kind of idiotically broken hardware. Of
course, hardware people often do incredibly stupid things, so no
guarantees.
Linus
Linus Torvalds wrote:
>
> On Sat, 11 Apr 2009, Alan Cox wrote:
>>> The spec describes the sector size as
>>> "512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach
>>> "etc" territory.
>> Over 4K will be fun.
>
> And by "fun", you mean "irrelevant".
>
> If anybody does that, they'll simply not work. And it's not worth it even
> trying to handle it.
FSVO trying to handle...
At the driver level, it would be easy to clamp sector size to 4k, and
point the scatterlist to a zero-filled region for the >4k portion of
each sector. Inefficient, sure, but it is low-cost to the driver and
gives the user something other than a brick.
if (too_large_sector_size)
        nvmhci_fill_sg_clamped_interleave();
else
        nvmhci_fill_sg();
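For illustration, the interleaved fill might look something like the
following on the write path (all names, the pages[] array and the zero
buffer are assumptions; reads would want a throwaway scratch area for
the tail rather than a shared zero region):

#include <linux/scatterlist.h>

/* Real data goes in the first 4k of each oversized hardware sector;
 * the remainder points at a shared, physically contiguous zero-filled
 * buffer of at least (sect_size - 4096) bytes, e.g. from alloc_pages(). */
static void nvmhci_fill_sg_clamped_interleave(struct scatterlist *sg,
                                              struct page **pages,
                                              unsigned int nsect,
                                              unsigned int sect_size,
                                              struct page *zero_pages)
{
        unsigned int i;

        sg_init_table(sg, nsect * 2);
        for (i = 0; i < nsect; i++) {
                /* the 4k the OS actually cares about */
                sg_set_page(&sg[2 * i], pages[i], 4096, 0);
                /* zero fill for the rest of the hardware sector */
                sg_set_page(&sg[2 * i + 1], zero_pages,
                            sect_size - 4096, 0);
        }
}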
Regards,
Jeff
>>>>> "Jeff" == Jeff Garzik <[email protected]> writes:
Jeff> Alan Cox wrote:
>>> With a brand new command set, might as well avoid SCSI completely
>>> IMO, and create a brand new block device.
>>
>> Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8)
Jeff> Perhaps... from what I can tell, this is a direct, asynchronous
Jeff> NVM interface. It appears to lack any concept of bus or bus
Jeff> enumeration. No worries about link up/down, storage device
Jeff> hotplug, etc. (you still have PCI hotplug case, of course)
Didn't we just spend years merging the old IDE PATA block devices into
the libata/scsi block device setup to get a more unified userspace and
to share common code?
I'm a total ignoramus here, but it would seem that it would be nice
to keep the /dev/sd# stuff around for this, esp since it is supported
through/with/around AHCI and libata stuff.
Honestly, I don't care as long as userspace isn't too affected and I
can just format it using ext3. :] Which I realize would be silly
since it's probably nothing like regular disk access, but more like
the NVRAM used on Netapps for caching writes to disk so they can be
acknowledged quicker to the clients. Or like the old PrestoServe
NVRAM modules on DECsystems and Alphas.
John
>>>>> "John" == John Stoffel <[email protected]> writes:
>>>>> "Jeff" == Jeff Garzik <[email protected]> writes:
Jeff> Alan Cox wrote:
>>>> With a brand new command set, might as well avoid SCSI completely
>>>> IMO, and create a brand new block device.
>>>
>>> Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8)
Jeff> Perhaps... from what I can tell, this is a direct, asynchronous
Jeff> NVM interface. It appears to lack any concept of bus or bus
Jeff> enumeration. No worries about link up/down, storage device
Jeff> hotplug, etc. (you still have PCI hotplug case, of course)
John> Didn't we just spend years merging the old IDE PATA block devices into
John> the libata/scsi block device setup to get a more unified userspace and
John> to share common code?
John> I'm a total ignoramus here, but it would seem that it would be nice
John> to keep the /dev/sd# stuff around for this, esp since it is supported
John> through/with/around AHCI and libata stuff.
John> Honestly, I don't care as long as userspace isn't too affected and I
John> can just format it using ext3. :] Which I realize would be silly
John> since it's probably nothing like regular disk access, but more like
John> the NVRAM used on Netapps for caching writes to disk so they can be
John> acknowledged quicker to the clients. Or like the old PrestoServe
John> NVRAM modules on DECsystems and Alphas.
And actually spending some thought on this, I'm thinking that this
will be like the MTD block device and such... separate specialized
block devices, but still usable. So maybe I'll just shutup now. :]
John
On Sat, Apr 11, 2009 at 12:52 PM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Sat, 11 Apr 2009, Alan Cox wrote:
>>
>> > The spec describes the sector size as
>> > "512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach
>> > "etc" territory.
>>
>> Over 4K will be fun.
>
> And by "fun", you mean "irrelevant".
>
> If anybody does that, they'll simply not work. And it's not worth it even
> trying to handle it.
Why does it matter what the sector size is?
I'm failing to see what the fuss is about.
We've abstracted the DMA mapping/SG list handling enough that the
block size should make no more difference than it does for the
MTU size of a network.
And the linux VM does handle bigger than 4k pages (several architectures
have implemented it) - even if x86 only supports 4k as base page size.
Block size just defines the granularity of the device's address space in
the same way the VM base page size defines the Virtual address space.
> That said, I'm pretty certain Windows has the same 4k issue, so we can
> hope nobody will ever do that kind of idiotically broken hardware. Of
> course, hardware people often do incredibly stupid things, so no
> guarantees.
That's just flame-bait. Not touching that.
thanks,
grant
>
> Linus
On Sat, 11 Apr 2009, Grant Grundler wrote:
>
> Why does it matter what the sector size is?
> I'm failing to see what the fuss is about.
>
> We've abstracted the DMA mapping/SG list handling enough that the
> block size should make no more difference than it does for the
> MTU size of a network.
The VM is not ready or willing to do more than 4kB pages for any normal
caching scheme.
> And the linux VM does handle bigger than 4k pages (several architectures
> have implemented it) - even if x86 only supports 4k as base page size.
4k is not just the "supported" base page size, it's the only sane one.
Bigger pages waste memory like mad on any normal load due to
fragmentation. Only basically single-purpose servers are worth doing
bigger pages for.
> Block size just defines the granularity of the device's address space in
> the same way the VM base page size defines the Virtual address space.
.. and the point is, if you have granularity that is bigger than 4kB, you
lose binary compatibility on x86, for example. The 4kB thing is encoded in
mmap() semantics.
In other words, if you have sector size >4kB, your hardware is CRAP. It's
unusable sh*t. No ifs, buts or maybe's about it.
Sure, we can work around it. We can work around it by doing things like
read-modify-write cycles with bounce buffers (and where DMA remapping can
be used to avoid the copy). Or we can work around it by saying that if you
mmap files on such a filesystem, your mmap's will have to have 8kB
alignment semantics, and the hardware is only useful for servers.
Or we can just tell people what a total piece of shit the hardware is.
So if you're involved with any such hardware or know people who are, you
might give people strong hints that sector sizes >4kB will not be taken
seriously by a huge number of people. Maybe it's not too late to head the
crap off at the pass.
Btw, this is not a new issue. Sandisk and some other totally clueless SSD
manufacturers tried to convince people that 64kB access sizes were the
RightThing(tm) to do. The reason? Their SSD's were crap, and couldn't do
anything better, so they tried to blame software.
Then Intel came out with their controller, and now the same people who
tried to sell their sh*t-for-brain SSD's are finally admitting that
it was crap hardware.
Do you really want to go through that one more time?
Linus
> We've abstracted the DMA mapping/SG list handling enough that the
> block size should make no more difference than it does for the
> MTU size of a network.
You need to start managing groups of pages in the vm and keeping them
together and writing them out together and paging them together even if
one of them is dirty and the other isn't. You have to deal with cases
where a process forks and the two pages are dirtied one in each but still
have to be written together.
Alternatively you go for read-modify-write (nasty performance hit
especially for RAID or a log structured fs).
Yes you can do it but it sure won't be pretty with a conventional fs.
Some of the log structured file systems have no problems with this and
some kinds of journalling can help but for a typical block file system
it'll suck.
Alan Cox wrote:
>> We've abstracted the DMA mapping/SG list handling enough that the
>> block size should make no more difference than it does for the
>> MTU size of a network.
>
> You need to start managing groups of pages in the vm and keeping them
> together and writing them out together and paging them together even if
> one of them is dirty and the other isn't. You have to deal with cases
> where a process forks and the two pages are dirtied one in each but still
> have to be written together.
>
> Alternatively you go for read-modify-write (nasty performance hit
> especially for RAID or a log structured fs).
Or just ignore the extra length, thereby excising the 'read-modify'
step... Total storage is halved or worse, but you don't take as much of
a performance hit.
Jeff
On Sat, 11 Apr 2009, Jeff Garzik wrote:
>
> Or just ignore the extra length, thereby excising the 'read-modify' step...
> Total storage is halved or worse, but you don't take as much of a performance
> hit.
Well, the people who want > 4kB sectors usually want _much_ bigger (ie
32kB sectors), and if you end up doing the "just use the first part"
thing, you're wasting 7/8ths of the space.
Yes, it's doable, and yes, it obviously makes for a simple driver thing,
but no, I don't think people will consider it acceptable to lose that much
of their effective size of the disk.
I suspect people would scream even with an 8kB sector.
Treating all writes as read-modify-write cycles on a driver level (and
then opportunistically avoiding the read part when you are lucky and see
bigger contiguous writes) is likely more acceptable. But it _will_ suck
dick from a performance angle, because no regular filesystem will care
enough, so even with nicely behaved big writes, the two end-points will
have a fairly high chance of requiring a rmw cycle.
Even the journaling ones that might have nice logging write behavior tend
to have a non-logging part that then will behave badly. Rather few
filesystems are _purely_ log-based, and the ones that are tend to have
various limitations. Most commonly read performance just sucks.
We just merged nilfs2, and I _think_ that one is a pure logging filesystem
with just linear writes (within a segment). But I think random read
performance (think: loading executables off the disk) is bad.
And people tend to really dislike hardware that forces a particular
filesystem on them. Guess how big the user base is going to be if you
cannot format the device as NTFS, for example? Hint: if a piece of
hardware only works well with special filesystems, that piece of hardware
won't be a big seller.
Modern technology needs big volume to become cheap and relevant.
And maybe I'm wrong, and NTFS works fine as-is with sectors >4kB. But let
me doubt that.
Linus
On Sun, 12 Apr 2009, Alan Cox wrote:
>> We've abstracted the DMA mapping/SG list handling enough that the
>> block size should make no more difference than it does for the
>> MTU size of a network.
>
> You need to start managing groups of pages in the vm and keeping them
> together and writing them out together and paging them together even if
> one of them is dirty and the other isn't. You have to deal with cases
> where a process forks and the two pages are dirtied one in each but still
> have to be written together.
gaining this sort of ability would not be a bad thing. with current
hardware (SSDs and raid arrays) you can very easily be in a situation
where it's much cheaper to deal with a group of related pages as one group
rather than processing them individually. this is just an extension of the
same issue.
David Lang
> Alternatively you go for read-modify-write (nasty performance hit
> especially for RAID or a log structured fs).
>
> Yes you can do it but it sure won't be pretty with a conventional fs.
> Some of the log structured file systems have no problems with this and
> some kinds of journalling can help but for a typical block file system
> it'll suck.
Linus Torvalds wrote:
> And maybe I'm wrong, and NTFS works fine as-is with sectors >4kB. But let
> me doubt that.
FWIW... No clue about sector size, but NTFS cluster size (i.e. block
size) goes up to 64k. Compression is disabled after 4k.
Jeff
On Sat, 11 Apr 2009, [email protected] wrote:
>
> gaining this sort of ability would not be a bad thing.
.. and if my house was built of gold, that wouldn't be a bad thing either.
What's your point?
Are you going to create the magical patches that make that happen? Are you
going to maintain the added complexity that comes from suddenly having
multiple dirty bits per "page"? Are you going to create the mythical
filesystems that magically start doing tail packing in order to not waste
tons of disk-space with small files, even if they have a 32kB block-size?
In other words, your whole argument is built in "wouldn't it be nice".
And I'm just the grumpy old guy who tells you that there's this small
thing called REALITY that comes and bites you in the *ss. And I'm sorry,
but the very nature of "reality" is that it doesn't care one whit whether
you believe me or not.
The fact is, >4kB sectors just aren't realistic right now, and I don't
think you have any _clue_ about the pain of trying to make them so. You're
just throwing pennies down a wishing well.
Linus
Alan Cox wrote:
..
> Alternatively you go for read-modify-write (nasty performance hit
> especially for RAID or a log structured fs).
..
Initially, at least, I'd guess that this NVM-HCI thing is all about
built-in flash memory on motherboards, to hold the "instant-boot"
software that hardware companies (eg. ASUS) are rapidly growing fond of.
At present, that means a mostly read-only Linux installation,
though MS for sure are hoping for Moore's Law to kick in and
provide sufficient space for a copy of Vista there or something.
The point being, its probable *initial* intended use is for a
run-time read-only filesystem, so having to do dirty R-M-W sequences
for writes might not be a significant issue.
At present. And even if it were, it might not be much worse than
having the hardware itself do it internally, which is what would
have to happen if it always only ever showed 4KB to us.
Longer term, as flash densities increase, we're going to end up
with motherboards that have huge SSDs built-in, through an interface
like this one, or over a virtual SATA link or something.
I wonder how long until "desktop/notebook" computers no longer
have replaceable "hard disks" at all?
Cheers
> The atomic building units (sector size, block size, etc) of NTFS are
> entirely parametric. The maximum values could be bigger than the
> currently "configured" maximum limits.
That isn't what bites you - you can run 8K-32K ext2 file systems but if
your physical page size is smaller than the fs page size you have a
problem.
The question is whether the NT VM can cope rather than the fs.
Alan
On Sun, 12 Apr 2009, Szabolcs Szakacsits wrote:
>
> I did not hear about NTFS using >4kB sectors yet but technically
> it should work.
>
> The atomic building units (sector size, block size, etc) of NTFS are
> entirely parametric. The maximum values could be bigger than the
> currently "configured" maximum limits.
It's probably trivial to make ext3 support 16kB blocksizes (if it doesn't
already).
That's not the problem. The "filesystem layout" part is just a parameter.
The problem is then trying to actually access such a filesystem, in
particular trying to write to it, or trying to mmap() small chunks of it.
The FS layout is the trivial part.
> At present the limits are set in the BIOS Parameter Block in the NTFS
> Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for
> "Sectors Per Block". So >4kB sector size should work since 1993.
>
> 64kB+ sector size could be possible by bootstrapping NTFS drivers
> in a different way.
Try it. And I don't mean "try to create that kind of filesystem". Try to
_use_ it. Does Windows actually support using it, or is it just a matter
of "the filesystem layout is _specified_ for up to 64kB block sizes"?
And I really don't know. Maybe Windows does support it. I'm just very
suspicious. I think there's a damn good reason why NTFS supports larger
block sizes in theory, BUT EVERYBODY USES A 4kB BLOCKSIZE DESPITE THAT!
Because it really is a hard problem. It's really pretty nasty to have your
cache blocking be smaller than the actual filesystem blocksize (the other
way is much easier, although it's certainly not pleasant either - Linux
supports it because we _have_ to, but if the hardware sector size had
traditionally been 4kB, I'd certainly also argue against adding complexity
just to make it smaller, the same way I argue against making it much
larger).
And don't get me wrong - we could (fairly) trivially make the
PAGE_CACHE_SIZE be bigger - even eventually go so far as to make it a
per-mapping thing, so that you could have some filesystems with that
bigger sector size and some with smaller ones. I think Andrea had patches
that did a fair chunk of it, and that _almost_ worked.
But it ABSOLUTELY SUCKS. If we did a 16kB page-cache-size, it would
absolutely blow chunks. It would be disgustingly horrible. Putting the
kernel source tree on such a filesystem would waste about 75% of all
memory (the median size of a source file is just about 4kB), so your page
cache would be effectively cut in a quarter for a lot of real loads.
And to fix up _that_, you'd need to now do things like sub-page
allocations, and now your page-cache size isn't even fixed per filesystem,
it would be per-file, and the filesystem (and the drivers!) would have to
handle the cases of getting those 4kB partial pages (and do r-m-w IO after
all if your hardware sector size is >4kB).
IOW, there are simple things we can do - but they would SUCK. And there
are really complicated things we could do - and they would _still_ SUCK,
plus now I pretty much guarantee that your system would also be a lot less
stable.
It really isn't worth it. It's much better for everybody to just be aware
of the incredible level of pure suckage of a general-purpose disk that has
hardware sectors >4kB. Just educate people that it's not good. Avoid the
whole insane suckage early, rather than be disappointed in hardware that
is total and utter CRAP and just causes untold problems.
Now, for specialty uses, things are different. CD-ROM's have had 2kB
sector sizes for a long time, and the reason it was never as big of a
problem isn't that they are still smaller than 4kB - it's that they are
read-only, and use special filesystems. And people _know_ they are
special. Yes, even when you write to them, it's a very special op. You'd
never try to put NTFS on a CD-ROM, and everybody knows it's not a disk
replacement.
In _those_ kinds of situations, a 64kB block isn't much of a problem. We
can do read-only media (where "read-only" doesn't have to be absolute: the
important part is that writing is special), and never have problems.
That's easy. Almost all the problems with block-size go away if you think
reading is 99.9% of the load.
But if you want to see it as a _disk_ (ie replacing SSD's or rotational
media), 4kB blocksize is the maximum sane one for Linux/x86 (or, indeed,
any "Linux/not-just-database-server" - it really isn't so much about x86,
as it is about large cache granularity causing huge memory fragmentation
issues).
Linus
Linus Torvalds wrote:
> And people tend to really dislike hardware that forces a particular
> filesystem on them. Guess how big the user base is going to be if you
> cannot format the device as NTFS, for example? Hint: if a piece of
> hardware only works well with special filesystems, that piece of hardware
> won't be a big seller.
>
> Modern technology needs big volume to become cheap and relevant.
>
> And maybe I'm wrong, and NTFS works fine as-is with sectors >4kB. But let
> me doubt that.
I did not hear about NTFS using >4kB sectors yet but technically
it should work.
The atomic building units (sector size, block size, etc) of NTFS are
entirely parametric. The maximum values could be bigger than the
currently "configured" maximum limits.
At present the limits are set in the BIOS Parameter Block in the NTFS
Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for
"Sectors Per Block". So >4kB sector size should work since 1993.
64kB+ sector size could be possible by bootstrapping NTFS drivers
in a different way.
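For reference, the two on-disk fields in question look roughly like
this (an illustrative excerpt only, not a complete boot sector layout):

/* excerpt of the NTFS boot sector BPB fields mentioned above */
struct ntfs_bpb_excerpt {
        __le16 bytes_per_sector;        /* offset 0x0b: 512, 1024, 2048, ... */
        __u8   sectors_per_cluster;     /* offset 0x0d */
} __attribute__((packed));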
Szaka
--
NTFS-3G: http://ntfs-3g.org
Alan Cox wrote:
>> The atomic building units (sector size, block size, etc) of NTFS are
>> entirely parametric. The maximum values could be bigger than the
>> currently "configured" maximum limits.
>>
>
> That isn't what bites you - you can run 8K-32K ext2 file systems but if
> your physical page size is smaller than the fs page size you have a
> problem.
>
> The question is whether the NT VM can cope rather than the fs.
>
A quick test shows that it can. I didn't try mmap(), but copying files
around worked.
Did you expect it not to work?
--
error compiling committee.c: too many arguments to function
Linus Torvalds wrote:
>
> On Sun, 12 Apr 2009, Szabolcs Szakacsits wrote:
>> I did not hear about NTFS using >4kB sectors yet but technically
>> it should work.
>>
>> The atomic building units (sector size, block size, etc) of NTFS are
>> entirely parametric. The maximum values could be bigger than the
>> currently "configured" maximum limits.
>
> It's probably trivial to make ext3 support 16kB blocksizes (if it doesn't
> already).
>
> That's not the problem. The "filesystem layout" part is just a parameter.
>
> The problem is then trying to actually access such a filesystem, in
> particular trying to write to it, or trying to mmap() small chunks of it.
> The FS layout is the trivial part.
>
>> At present the limits are set in the BIOS Parameter Block in the NTFS
>> Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for
>> "Sectors Per Block". So >4kB sector size should work since 1993.
>>
>> 64kB+ sector size could be possible by bootstrapping NTFS drivers
>> in a different way.
>
> Try it. And I don't mean "try to create that kind of filesystem". Try to
> _use_ it. Does Windows actually support using it, or is it just a matter
> of "the filesystem layout is _specified_ for up to 64kB block sizes"?
>
> And I really don't know. Maybe Windows does support it. I'm just very
> suspicious. I think there's a damn good reason why NTFS supports larger
> block sizes in theory, BUT EVERYBODY USES A 4kB BLOCKSIZE DESPITE THAT!
I can't find any mention that any formattable block size can't be used,
other than the fact that "The maximum default cluster size under Windows
NT 3.51 and later is 4K due to the fact that NTFS file compression is
not possible on drives with a larger allocation size. So format will
never use larger than 4k clusters unless the user specifically overrides
the defaults".
It could be there are other downsides to >4K cluster sizes as well, but
that's the reason they state.
What about FAT? It supports cluster sizes up to 32K at least (possibly
up to 256K as well, although somewhat nonstandard), and that works.. We
support that in Linux, don't we?
On Sun, 12 Apr 2009, Avi Kivity wrote:
>
> A quick test shows that it can. I didn't try mmap(), but copying files around
> worked.
You being who you are, I'm assuming you're doing this in a virtual
environment, so you might be able to see the IO patterns..
Can you tell if it does the IO in chunks of 16kB or smaller? That can be
hard to see with trivial tests (since any filesystem will try to chunk up
writes regardless of how small the cache entry is, and on file creation it
will have to write the full 16kB anyway just to initialize the newly
allocated blocks on disk), but there's a couple of things that should be
reasonably good litmus tests of what WNT does internally:
- create a big file, then rewrite just a few bytes in it, and look at the
IO pattern of the result. Does it actually do the rewrite IO as one
16kB IO, or does it do sub-blocking?
If the latter, then the 16kB thing is just a filesystem layout issue,
not an internal block-size issue, and WNT would likely have exactly the
same issues as Linux.
- can you tell how many small files it will cache in RAM without doing
IO? If it always uses 16kB blocks for caching, it will be able to cache
a _lot_ fewer files in the same amount of RAM than with a smaller block
size.
Of course, the _really_ conclusive thing (in a virtualized environment) is
to just make the virtual disk only able to do 16kB IO accesses (and with
16kB alignment). IOW, actually emulate a disk with a 16kB hard sector
size, and reporting a 16kB sector size to the READ CAPACITY command. If it
works then, then clearly WNT has no issues with bigger sectors.
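(The rewrite half of the first test is trivial to generate; on the
Linux side it would just be something like the sketch below -- file
name and offset are arbitrary -- and then you watch the resulting IO
from the host, e.g. with blktrace or the hypervisor's trace.)

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char buf[8] = "rewrite";
        int fd = open("bigfile", O_WRONLY);

        if (fd < 0)
                return 1;
        pwrite(fd, buf, sizeof(buf), 1234567);  /* small, unaligned */
        fsync(fd);
        close(fd);
        return 0;
}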
Linus
On Sun, 2009-04-12 at 08:41 -0700, Linus Torvalds wrote:
> [...]
>
> And to fix up _that_, you'd need to now do things like sub-page
> allocations, and now your page-cache size isn't even fixed per filesystem,
> it would be per-file, and the filesystem (and the drivers!) would have to
> handle the cases of getting those 4kB partial pages (and do r-m-w IO after
> all if your hardware sector size is >4kB).
We might not have to go that far for a device with these special
characteristics. It should be possible to build a block size remapping
Read Modify Write type device to present a 4k block size to the OS while
operating in n*4k blocks for the device. We could implement the read
operations as readahead in the page cache, so if we're lucky we mostly
end up operating on full n*4k blocks anyway. For the cases where we've
lost pieces of the n*4k native block and we have to do a write, we'd
just suck it up and do a read modify write on a separate memory area, a
bit like the new 4k sector devices do emulating 512 byte blocks. The
suck factor of this double I/O plus memory copy overhead should be
mitigated partially by the fact that the underlying device is very fast.
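A rough sketch of that remapping core, just to show the shape of it
(everything here -- struct remap_dev, the cache check, the native
read/write helpers -- is hypothetical):

/* OS-visible 4k blocks on top of a device-native block of n * 4k.
 * Fast path: the native block can be assembled from cached pages.
 * Slow path: read-modify-write through a per-device bounce buffer. */
static int remap_write_4k(struct remap_dev *dev, sector_t lba_4k,
                          const void *data)
{
        sector_t native = lba_4k;
        unsigned int rem = sector_div(native, dev->blocks_per_native);
        char *buf = dev->bounce;        /* one native block in size */

        if (!remap_native_block_cached(dev, native, buf)) {
                int err = remap_read_native(dev, native, buf);
                if (err)
                        return err;
        }
        memcpy(buf + rem * 4096, data, 4096);
        return remap_write_native(dev, native, buf);
}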
James
On Sun, 12 Apr 2009, Robert Hancock wrote:
>
> What about FAT? It supports cluster sizes up to 32K at least (possibly up to
> 256K as well, although somewhat nonstandard), and that works.. We support that
> in Linux, don't we?
Sure.
The thing is, "cluster size" in an FS is totally different from sector
size.
People are missing the point here. You can trivially implement bigger
cluster sizes by just writing multiple sectors. In fact, even just a 4kB
cluster size is actually writing 8 512-byte hardware sectors on all normal
disks.
So you can support big clusters without having big sectors. A 32kB cluster
size in FAT is absolutely trivial to do: it's really purely an allocation
size. So a fat filesystem allocates disk-space in 32kB chunks, but then
when you actually do IO to it, you can still write things 4kB at a time
(or smaller), because once the allocation has been made, you still treat
the disk as a series of smaller blocks.
IOW, when you allocate a new 32kB cluster, you will have to allocate 8
pages to do IO on it (since you'll have to initialize the diskspace), but
you can still literally treat those pages as _individual_ pages, and you
can write them out in any order, and you can free them (and then look them
up) one at a time.
Notice? The cluster size really only ends up being a disk-space allocation
issue, not an issue for actually caching the end result or for the actual
size of the IO.
The hardware sector size is very different. If you have a 32kB hardware
sector size, that implies that _all_ IO has to be done with that
granularity. Now you can no longer treat the eight pages as individual
pages - you _have_ to write them out and read them in as one entity. If
you dirty one page, you effectively dirty them all. You can not drop and
re-allocate pages one at a time any more.
Linus
Mark Lord wrote:
> Alan Cox wrote:
> ..
>> Alternatively you go for read-modify-write (nasty performance hit
>> especially for RAID or a log structured fs).
> ..
>
> Initially, at least, I'd guess that this NVM-HCI thing is all about
> built-in flash memory on motherboards, to hold the "instant-boot"
> software that hardware companies (eg. ASUS) are rapidly growing fond of.
>
> At present, that means a mostly read-only Linux installation,
> though MS for sure are hoping for Moore's Law to kick in and
> provide sufficient space for a copy of Vista there or something.
Yeah... instant boot, and "trusted boot" (booting a signed image),
storage of useful details like boot drive layouts, etc.
I'm sure we can come up with other fun uses, too...
Jeff
Linus Torvalds wrote:
> IOW, when you allocate a new 32kB cluster, you will have to allocate 8
> pages to do IO on it (since you'll have to initialize the diskspace), but
> you can still literally treat those pages as _individual_ pages, and you
> can write them out in any order, and you can free them (and then look them
> up) one at a time.
>
> Notice? The cluster size really only ends up being a disk-space allocation
> issue, not an issue for actually caching the end result or for the actual
> size of the IO.
Right.. I didn't realize we were actually that smart (not writing out
the entire cluster when dirtying one page) but I guess it makes sense.
>
> The hardware sector size is very different. If you have a 32kB hardware
> sector size, that implies that _all_ IO has to be done with that
> granularity. Now you can no longer treat the eight pages as individual
> pages - you _have_ to write them out and read them in as one entity. If
> you dirty one page, you effectively dirty them all. You can not drop and
> re-allocate pages one at a time any more.
>
> Linus
I suspect that in this case trying to gang together multiple pages
inside the VM to actually handle it this way all the way through would
be insanity. My guess is the only way you could sanely do it is the
read-modify-write approach when writing out the data (in the block layer
maybe?) where the read can be optimized away if the pages for the entire
hardware sector are already in cache or the write is large enough to
replace the entire sector. I assume we already do this in the md code
somewhere for cases like software RAID 5 with a stripe size of >4KB..
That obviously would have some performance drawbacks compared to a
smaller sector size, but if the device is bound and determined to use
bigger sectors internally one way or the other and the alternative is
the drive does R-M-W internally to emulate smaller sectors - which for
some devices seems to be the case - maybe it makes more sense to do it
in the kernel if we have more information to allow us to do it more
efficiently. (Though, at least on the normal ATA disk side of things, 4K
is the biggest number I've heard tossed about for a future expanded
sector size, but flash devices like this may be another story..)
Linus Torvalds wrote:
> On Sun, 12 Apr 2009, Avi Kivity wrote:
>
>> A quick test shows that it can. I didn't try mmap(), but copying files around
>> worked.
>>
>
> You being who you are, I'm assuming you're doing this in a virtual
> environment, so you might be able to see the IO patterns..
>
>
Yes. I just used the Windows performance counters rather than mess with
qemu for the test below.
> Can you tell if it does the IO in chunks of 16kB or smaller? That can be
> hard to see with trivial tests (since any filesystem will try to chunk up
> writes regardless of how small the cache entry is, and on file creation it
> will have to write the full 16kB anyway just to initialize the newly
> allocated blocks on disk), but there's a couple of things that should be
> reasonably good litmus tests of what WNT does internally:
>
> - create a big file,
Just creating a 5GB file in a 64KB filesystem was interesting - Windows
was throwing out 256KB I/Os even though I was generating 1MB writes
(and cached too). Looks like a paranoid IDE driver (qemu exposes a PIIX4).
> then rewrite just a few bytes in it, and look at the
> IO pattern of the result. Does it actually do the rewrite IO as one
> 16kB IO, or does it do sub-blocking?
>
It generates 4KB writes (I was generating aligned 512 byte overwrites).
What's more interesting, it was also issuing 32KB reads to fill the
cache, not 64KB. Since the number of reads and writes per second is
almost equal, it's not splitting a 64KB read into two.
> If the latter, then the 16kB thing is just a filesystem layout issue,
> not an internal block-size issue, and WNT would likely have exactly the
> same issues as Linux.
>
A 1 byte write on an ordinary file generates a RMW, same as a 4KB write
on a 16KB block. So long as the filesystem is just a layer behind the
pagecache (which I think is the case on Windows), I don't see what
issues it can have.
> - can you tell how many small files it will cache in RAM without doing
> IO? If it always uses 16kB blocks for caching, it will be able to cache
> a _lot_ fewer files in the same amount of RAM than with a smaller block
> size.
>
I'll do this later, but given the 32KB reads for the test above, I'm
guessing it will cache pages, not blocks.
> Of course, the _really_ conclusive thing (in a virtualized environment) is
> to just make the virtual disk only able to do 16kB IO accesses (and with
> 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector
> size, and reporting a 16kB sector size to the READ CAPACITY command. If it
> works then, then clearly WNT has no issues with bigger sectors.
>
I don't think IDE supports this? And Windows 2008 doesn't like the LSI
emulated device we expose.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Linus Torvalds wrote:
> The hardware sector size is very different. If you have a 32kB hardware
> sector size, that implies that _all_ IO has to be done with that
> granularity. Now you can no longer treat the eight pages as individual
> pages - you _have_ to write them out and read them in as one entity. If
> you dirty one page, you effectively dirty them all. You can not drop and
> re-allocate pages one at a time any more.
>
You can still drop clean pages. Sure, that costs you performance as
you'll have to do re-read them in order to write a dirty page, but in
the common case, the clean pages around would still be available and
you'd avoid it.
Applications that randomly write to large files can be tuned to use the
disk sector size. As for the rest, they're either read-only (executable
mappings) or sequential.
--
error compiling committee.c: too many arguments to function
On Mon, 13 Apr 2009, Avi Kivity wrote:
> >
> > - create a big file,
>
> Just creating a 5GB file in a 64KB filesystem was interesting - Windows
> was throwing out 256KB I/Os even though I was generating 1MB writes (and
> cached too). Looks like a paranoid IDE driver (qemu exposes a PIIX4).
Heh, ok. So the "big file" really only needed to be big enough to not be
cached, and 5GB was probably overkill. In fact, if there's some way to
blow the cache, you could have made it much smaller. But 5G certainly
works ;)
And yeah, I'm not surprised it limits the size of the IO. Linux will
generally do the same. I forget what our default maximum bio size is, but
I suspect it is in that same kind of range.
There are often problems with bigger IO's (latency being one, actual
controller bugs being another), and even if the hardware has no bugs and
its limits are higher, you usually don't want to have excessively large
DMA mapping tables _and_ the advantage of bigger IO is usually not that
big once you pass the "reasonably sized" limit (which is 64kB+). Plus they
happen seldom enough in practice anyway that it's often not worth
optimizing for.
> > then rewrite just a few bytes in it, and look at the IO pattern of the
> > result. Does it actually do the rewrite IO as one 16kB IO, or does it
> > do sub-blocking?
>
> It generates 4KB writes (I was generating aligned 512 byte overwrites).
> What's more interesting, it was also issuing 32KB reads to fill the
> cache, not 64KB. Since the number of reads and writes per second is
> almost equal, it's not splitting a 64KB read into two.
Ok, that sounds pretty much _exactly_ like the Linux IO patterns would
likely be.
The 32kB read has likely nothing to do with any filesystem layout issues
(especially as you used a 64kB cluster size), but is simply because
(a) Windows caches things with a 4kB granularity, so the 512-byte write
turned into a read-modify-write
(b) the read was really for just 4kB, but once you start reading you want
to do read-ahead anyway since it hardly gets any more expensive to
read a few pages than to read just one.
So once it had to do the read anyway, windows just read 8 pages instead of
one - very reasonable.
> > If the latter, then the 16kB thing is just a filesystem layout
> > issue, not an internal block-size issue, and WNT would likely have
> > exactly the same issues as Linux.
>
> A 1 byte write on an ordinary file generates a RMW, same as a 4KB write on a
> 16KB block. So long as the filesystem is just a layer behind the pagecache
> (which I think is the case on Windows), I don't see what issues it can have.
Right. It's all very straightforward from a filesystem layout issue. The
problem is all about managing memory.
You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for
your example!). It's a total disaster. Imagine what would happen to user
application performance if kmalloc() always returned 16kB-aligned chunks
of memory, all sized as integer multiples of 16kB? It would absolutely
_suck_. Sure, it would be fine for your large allocations, but any time
you handle strings, you'd allocate 16kB of memory for any small 5-byte
string. You'd have horrible cache behavior, and you'd run out of memory
much too quickly.
The same is true in the kernel. The single biggest memory user under
almost all normal loads is the disk cache. That _is_ the normal allocator
for any OS kernel. Everything else is almost details (ok, so Linux in
particular does cache metadata very aggressively, so the dcache and inode
cache are seldom "just details", but the page cache is still generally the
most important part).
So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane
system does that. It's only useful if you absolutely _only_ work with
large files - ie you're a database server. For just about any other
workload, that kind of granularity is totally unacceptable.
So doing a read-modify-write on a 1-byte (or 512-byte) write, when the
block size is 4kB is easy - we just have to do it anyway.
Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is
also _doable_, and from the IO pattern standpoint it is no different. But
from a memory allocation pattern standpoint it's a disaster - because now
you're always working with chunks that are just 'too big' to be good
building blocks of a reasonable allocator.
If you always allocate 64kB for file caches, and you work with lots of
small files (like a source tree), you will literally waste all your
memory.
And if you have some "dynamic" scheme, you'll have tons and tons of really
nasty cases when you have to grow a 4kB allocation to a 64kB one when the
file grows. Imagine doing "realloc()", but doing it in a _threaded_
environment, where any number of threads may be using the old allocation
at the same time. And that's a kernel - it has to be _the_ most
threaded program on the whole machine, because otherwise the kernel
would be the scaling bottleneck.
And THAT is why 64kB blocks is such a disaster.
> > - can you tell how many small files it will cache in RAM without doing
> > IO? If it always uses 16kB blocks for caching, it will be able to cache a
> > _lot_ fewer files in the same amount of RAM than with a smaller block
> > size.
>
> I'll do this later, but given the 32KB reads for the test above, I'm guessing
> it will cache pages, not blocks.
Yeah, you don't need to.
I can already guarantee that Windows does caching on a page granularity.
I can also pretty much guarantee that that is also why Windows stops
compressing files once the blocksize is bigger than 4kB: because at that
point, the block compressions would need to handle _multiple_ cache
entities, and that's really painful for all the same reasons that bigger
sectors would be really painful - you'd always need to make sure that you
always have all of those cache entries in memory together, and you could
never treat your cache entries as individual entities.
> > Of course, the _really_ conclusive thing (in a virtualized environment) is
> > to just make the virtual disk only able to do 16kB IO accesses (and with
> > 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector size,
> > and reporting a 16kB sector size to the READ CAPACITY command. If it works
> > then, then clearly WNT has no issues with bigger sectors.
>
> I don't think IDE supports this? And Windows 2008 doesn't like the LSI
> emulated device we expose.
Yeah, you'd have to have the OS use the SCSI commands for disk discovery,
so at least a SATA interface. With IDE disks, the sector size always has
to be 512 bytes, I think.
Linus
On Mon, 2009-04-13 at 08:10 -0700, Linus Torvalds wrote:
> On Mon, 13 Apr 2009, Avi Kivity wrote:
> > > Of course, the _really_ conclusive thing (in a virtualized environment) is
> > > to just make the virtual disk only able to do 16kB IO accesses (and with
> > > 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector size,
> > > and reporting a 16kB sector size to the READ CAPACITY command. If it works
> > > then, then clearly WNT has no issues with bigger sectors.
> >
> > I don't think IDE supports this? And Windows 2008 doesn't like the LSI
> > emulated device we expose.
>
> Yeah, you'd have to have the OS use the SCSI commands for disk discovery,
> so at least a SATA interface. With IDE disks, the sector size always has
> to be 512 bytes, I think.
Actually, the latest ATA rev supports different sector sizes in
preparation for native 4k sector size SATA disks (words 117-118 of
IDENTIFY). Matthew Wilcox already has the patches for libata ready.
James
Linus Torvalds <[email protected]> writes:
>
> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for
> your example!).
AFAIK at least for user visible anonymous memory Windows uses 64k
chunks. At least that is what Cygwin's mmap exposes. I don't know
if it does the same for disk cache.
-Andi
--
[email protected] -- Speaking for myself only.
Linus Torvalds wrote:
> On Mon, 13 Apr 2009, Avi Kivity wrote:
>
>>> - create a big file,
>>>
>> Just creating a 5GB file in a 64KB filesystem was interesting - Windows
>> was throwing out 256KB I/Os even though I was generating 1MB writes (and
>> cached too). Looks like a paranoid IDE driver (qemu exposes a PIIX4).
>>
>
> Heh, ok. So the "big file" really only needed to be big enough to not be
> cached, and 5GB was probably overkill. In fact, if there's some way to
> blow the cache, you could have made it much smaller. But 5G certainly
> works ;)
>
I wanted to make sure my random writes later don't get coalesced. A 1GB
file, half of which is cached (I used a 1GB guest), offers lots of
chances for coalescing if Windows delays the writes sufficiently. At
5GB, Windows can only cache 10% of the file, so it will be continuously
flushing.
>
> (a) Windows caches things with a 4kB granularity, so the 512-byte write
> turned into a read-modify-write
>
>
[...]
> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for
> your example!). It's a total disaster. Imagine what would happen to user
> application performance if kmalloc() always returned 16kB-aligned chunks
> of memory, all sized as integer multiples of 16kB? It would absolutely
> _suck_. Sure, it would be fine for your large allocations, but any time
> you handle strings, you'd allocate 16kB of memory for any small 5-byte
> string. You'd have horrible cache behavior, and you'd run out of memory
> much too quickly.
>
> The same is true in the kernel. The single biggest memory user under
> almost all normal loads is the disk cache. That _is_ the normal allocator
> for any OS kernel. Everything else is almost details (ok, so Linux in
> particular does cache metadata very aggressively, so the dcache and inode
> cache are seldom "just details", but the page cache is still generally the
> most important part).
>
> So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane
> system does that. It's only useful if you absolutely _only_ work with
> large files - ie you're a database server. For just about any other
> workload, that kind of granularity is totally unnacceptable.
>
> So doing a read-modify-write on a 1-byte (or 512-byte) write, when the
> block size is 4kB is easy - we just have to do it anyway.
>
> Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is
> also _doable_, and from the IO pattern standpoint it is no different. But
> from a memory allocation pattern standpoint it's a disaster - because now
> you're always working with chunks that are just 'too big' to be good
> building blocks of a reasonable allocator.
>
> If you always allocate 64kB for file caches, and you work with lots of
> small files (like a source tree), you will literally waste all your
> memory.
>
>
Well, no one is talking about 64KB granularity for in-core files. Like
you noticed, Windows uses the mmu page size. We could keep doing that,
and still have 16KB+ sector sizes. It just means a RMW if you don't
happen to have the adjoining clean pages in cache.
Sure, on a rotating disk that's a disaster, but we're talking SSD here,
so while you're doubling your access time, you're doubling a fairly
small quantity. The controller would do the same if it exposed smaller
sectors, so there's no huge loss.
We still lose on disk storage efficiency, but I'm guessing that for a
modern tree with some object files with debug information and a .git
directory, it won't be such a great hit. For more mainstream uses, it
would be negligible.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Andi Kleen wrote:
> Linus Torvalds <[email protected]> writes:
>
>> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for
>> your example!).
>>
>
> AFAIK at least for user visible anonymous memory Windows uses 64k
> chunks. At least that is what Cygwin's mmap exposes. I don't know
> if it does the same for disk cache.
>
I think that's just the region address and size granularity (as in
vmas). For paging they still use the mmu page size.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity wrote:
> Well, no one is talking about 64KB granularity for in-core files. Like
> you noticed, Windows uses the mmu page size. We could keep doing that,
> and still have 16KB+ sector sizes. It just means a RMW if you don't
> happen to have the adjoining clean pages in cache.
>
> Sure, on a rotating disk that's a disaster, but we're talking SSD here,
> so while you're doubling your access time, you're doubling a fairly
> small quantity. The controller would do the same if it exposed smaller
> sectors, so there's no huge loss.
>
> We still lose on disk storage efficiency, but I'm guessing that for a modern
> tree with some object files carrying debug information and a .git directory
> it won't be such a great hit. For more mainstream uses, it would be
> negligible.
Speaking of RMW... in one sense, we have to deal with RMW anyway.
Upcoming ATA hard drives will be configured with a normal 512b sector
API interface, but underlying physical sector size is 1k or 4k.
The disk performs the RMW for us, but we must be aware of physical
sector size in order to determine proper alignment of on-disk data, to
minimize RMW cycles.
At the moment, it seems like most of the effort to get these ATA devices
to perform efficiently is in getting partition / RAID stripe offsets set
up properly.
So perhaps for NVMHCI we could
(a) hardcode NVM sector size maximum at 4k
(b) do RMW in the driver for sector size >4k (see the sketch below), and
(c) export information indicating the true sector size, in a manner
similar to how the ATA driver passes that info to userland partitioning
tools.
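To make (b) concrete, the arithmetic would look roughly like the toy
sketch below. The "device" is just an in-memory array standing in for
the hardware, and rmw_write_4k() is a made-up name; the mapping from
emulated LBA to native sector is the only interesting part.

/*
 * Toy model of option (b): emulate 4k logical sectors on top of a larger
 * native NVM sector by read-modify-write.  The "device" is an in-memory
 * array so the arithmetic can be shown end to end.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define EMULATED_SECTOR	4096u
#define NATIVE_SECTOR	16384u			/* assume a 16k NVM sector */
#define NUM_NATIVE	16u

static unsigned char dev[NATIVE_SECTOR * NUM_NATIVE];	/* fake media */

static void rmw_write_4k(uint64_t lba4k, const void *data)
{
	unsigned ratio = NATIVE_SECTOR / EMULATED_SECTOR;
	uint64_t native_lba = lba4k / ratio;
	size_t offset = (size_t)(lba4k % ratio) * EMULATED_SECTOR;
	unsigned char bounce[NATIVE_SECTOR];

	/* read the whole native sector... */
	memcpy(bounce, dev + native_lba * NATIVE_SECTOR, NATIVE_SECTOR);
	/* ...modify only the 4k slice the caller asked for... */
	memcpy(bounce + offset, data, EMULATED_SECTOR);
	/* ...and write the full native sector back. */
	memcpy(dev + native_lba * NATIVE_SECTOR, bounce, NATIVE_SECTOR);
}

int main(void)
{
	unsigned char buf[EMULATED_SECTOR];

	memset(buf, 0xab, sizeof(buf));
	rmw_write_4k(5, buf);	/* emulated LBA 5 -> native sector 1, offset 4k */
	printf("native sector 1, byte 4096: %#x\n", dev[NATIVE_SECTOR + 4096]);
	return 0;
}

A real driver would additionally need a bounce buffer per request and
locking around the read-modify-write, which is exactly the overhead being
traded against exposing the large sectors directly.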
Jeff
Jeff Garzik wrote:
> Speaking of RMW... in one sense, we have to deal with RMW anyway.
> Upcoming ATA hard drives will be configured with a normal 512b sector
> API interface, but underlying physical sector size is 1k or 4k.
>
> The disk performs the RMW for us, but we must be aware of physical
> sector size in order to determine proper alignment of on-disk data, to
> minimize RMW cycles.
>
Virtualization has the same issue. OS installers will typically set up
the first partition at sector 63, and that means every page-sized block
access will be misaligned. It's particularly bad when the guest's disk is
backed by a regular file.
Windows 2008 aligns partitions on a 1MB boundary, IIRC.
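To put numbers on it (a trivial standalone demo, nothing qemu-specific):
a partition starting at LBA 63 lives at byte offset 63 * 512 = 32256,
which is not a multiple of 4096, so every 4k block in the guest straddles
two pages (or physical sectors) in the backing store.

/* The classic sector-63 misalignment, in numbers. */
#include <stdio.h>

int main(void)
{
	unsigned long long part_start = 63ULL * 512;	/* traditional DOS offset */
	unsigned long long align = 4096;		/* host page / physical sector */
	unsigned long long blk;

	for (blk = 0; blk < 4; blk++) {
		unsigned long long off = part_start + blk * 4096;
		printf("guest 4k block %llu -> host offset %llu, misaligned by %llu\n",
		       blk, off, off % align);
	}
	return 0;
}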
> At the moment, it seems like most of the effort to get these ATA
> devices to perform efficiently is in getting partition / RAID stripe
> offsets set up properly.
>
> So perhaps for NVMHCI we could
> (a) hardcode NVM sector size maximum at 4k
> (b) do RMW in the driver for sector size >4k, and
Why not do it in the block layer? That way it isn't limited to one driver.
> (c) export information indicating the true sector size, in a manner
> similar to how the ATA driver passes that info to userland
> partitioning tools.
Eventually we'll want to allow filesystems to make use of the native
sector size.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity wrote:
> Jeff Garzik wrote:
>> Speaking of RMW... in one sense, we have to deal with RMW anyway.
>> Upcoming ATA hard drives will be configured with a normal 512b sector
>> API interface, but underlying physical sector size is 1k or 4k.
>>
>> The disk performs the RMW for us, but we must be aware of physical
>> sector size in order to determine proper alignment of on-disk data, to
>> minimize RMW cycles.
>>
>
> Virtualization has the same issue. OS installers will typically set up
> the first partition at sector 63, and that means every page-sized block
> access will be misaligned. It's particularly bad when the guest's disk is
> backed by a regular file.
>
> Windows 2008 aligns partitions on a 1MB boundary, IIRC.
Makes a lot of sense...
>> At the moment, it seems like most of the effort to get these ATA
>> devices to perform efficiently is in getting partition / RAID stripe
>> offsets set up properly.
>>
>> So perhaps for NVMHCI we could
>> (a) hardcode NVM sector size maximum at 4k
>> (b) do RMW in the driver for sector size >4k, and
>
> Why not do it in the block layer? That way it isn't limited to one driver.
Sure. "in the driver" is a highly relative phrase :) If there is code
to be shared among multiple callsites, let's share it.
>> (c) export information indicating the true sector size, in a manner
>> similar to how the ATA driver passes that info to userland
>> partitioning tools.
>
> Eventually we'll want to allow filesystems to make use of the native
> sector size.
At the kernel level, you mean?
Filesystems already must deal with issues such as avoiding RAID stripe
boundaries (man mke2fs, search for 'RAID').
So I hope that the same code will be applicable to cases where the
"logical sector size" (as exported by the storage interface) differs from
the "physical sector size" (the underlying hardware sector size, not
directly accessible by the OS).
But if you are talking about filesystems directly supporting sector
sizes >4kb, well, I'll let Linus and others settle that debate :) I
will just write the driver once the dust settles...
Jeff
On Tue, 14 Apr 2009, Jeff Garzik wrote:
> Avi Kivity wrote:
> > Jeff Garzik wrote:
> > > Speaking of RMW... in one sense, we have to deal with RMW anyway.
> > > Upcoming ATA hard drives will be configured with a normal 512b sector API
> > > interface, but underlying physical sector size is 1k or 4k.
> > >
> > > The disk performs the RMW for us, but we must be aware of physical sector
> > > size in order to determine proper alignment of on-disk data, to minimize
> > > RMW cycles.
> > >
> >
> > Virtualization has the same issue. OS installers will typically set up the
> > first partition at sector 63, and that means every page-sized block access
> > will be misaligned. It's particularly bad when the guest's disk is backed by
> > a regular file.
> >
> > Windows 2008 aligns partitions on a 1MB boundary, IIRC.
>
> Makes a lot of sense...
Since Vista, at least, the first partition is 2048-sector aligned.
Szaka
--
NTFS-3G: http://ntfs-3g.org
Jeff Garzik wrote:
>>> (c) export information indicating the true sector size, in a manner
>>> similar to how the ATA driver passes that info to userland
>>> partitioning tools.
>>
>> Eventually we'll want to allow filesystems to make use of the native
>> sector size.
>
> At the kernel level, you mean?
>
Yes. You'll want to align extents and I/O requests on that boundary.
>
> But if you are talking about filesystems directly supporting sector
> sizes >4kb, well, I'll let Linus and others settle that debate :) I
> will just write the driver once the dust settles...
IMO drivers should expose whatever sector size the device has,
filesystems should expose their block size, and the block layer should
correct any impedance mismatches by doing RMW.
Unfortunately, sector size > fs block size means a lot of pointless
locking for the RMW, so if large sector sizes take off, we'll have to
adjust filesystems to use larger block sizes.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity wrote:
> Jeff Garzik wrote:
>>>> (c) export information indicating the true sector size, in a manner
>>>> similar to how the ATA driver passes that info to userland
>>>> partitioning tools.
>>>
>>> Eventually we'll want to allow filesystems to make use of the native
>>> sector size.
>>
>> At the kernel level, you mean?
>>
>
> Yes. You'll want to align extents and I/O requests on that boundary.
Sure. And RAID today presents these issues to the filesystem...
man mke2fs(8), and look at extended options 'stride' and 'stripe-width'.
It includes mention of RMW issues.
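For reference, the arithmetic behind those two options is simple (this
follows the mke2fs man page; the RAID-5 numbers below are just an example):

/*
 * The stride / stripe-width arithmetic from mke2fs(8):
 *   stride       = RAID chunk size / filesystem block size
 *   stripe-width = stride * number of data-bearing disks
 * Example: 64k chunks, 4k blocks, RAID-5 over 5 disks (4 data + 1 parity).
 */
#include <stdio.h>

int main(void)
{
	unsigned chunk_kb = 64, block_kb = 4;
	unsigned disks = 5, parity_disks = 1;		/* RAID-5 over 5 disks */
	unsigned stride = chunk_kb / block_kb;
	unsigned stripe_width = stride * (disks - parity_disks);

	printf("mke2fs -E stride=%u,stripe-width=%u /dev/mdX\n",
	       stride, stripe_width);
	return 0;
}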
>> But if you are talking about filesystems directly supporting sector
>> sizes >4kb, well, I'll let Linus and others settle that debate :) I
>> will just write the driver once the dust settles...
>
> IMO drivers should expose whatever sector size the device has,
> filesystems should expose their block size, and the block layer should
> correct any impedance mismatches by doing RMW.
>
> Unfortunately, sector size > fs block size means a lot of pointless
> locking for the RMW, so if large sector sizes take off, we'll have to
> adjust filesystems to use larger block sizes.
Don't forget the case where the device does RMW for you, and does not
permit direct access to physical sector size (all operations are in
terms of logical sector size).
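Userland still needs the two numbers to do the alignment, though. Here is
a sketch of what a partitioning tool could do, assuming the kernel exports
them via sysfs attributes along the lines of queue/logical_block_size and
queue/physical_block_size (the attribute names and the device are
assumptions for this sketch):

/* Read the exported sizes and suggest an alignment granule. */
#include <stdio.h>

static unsigned long read_attr(const char *path)
{
	FILE *f = fopen(path, "r");
	unsigned long val = 0;

	if (f) {
		if (fscanf(f, "%lu", &val) != 1)
			val = 0;
		fclose(f);
	}
	return val;
}

int main(void)
{
	unsigned long lbs = read_attr("/sys/block/sda/queue/logical_block_size");
	unsigned long pbs = read_attr("/sys/block/sda/queue/physical_block_size");

	if (!lbs || !pbs) {
		fprintf(stderr, "sizes not exported for this device\n");
		return 1;
	}
	printf("logical %lu, physical %lu\n", lbs, pbs);
	if (pbs > lbs)
		printf("align partitions to multiples of %lu logical sectors\n",
		       pbs / lbs);
	return 0;
}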
Jeff
On Tue, 2009-04-14 at 10:52 -0700, Jared Hulbert wrote:
> > It really isn't worth it. It's much better for everybody to just be aware
> > of the incredible level of pure suckage of a general-purpose disk that has
> > hardware sectors >4kB. Just educate people that it's not good. Avoid the
> > whole insane suckage early, rather than be disappointed in hardware that
> > is total and utter CRAP and just causes untold problems.
>
> I don't disagree that >4KB DISKS are a bad idea. But I don't think
> that's what's going on here. As I read it, NVMHCI would plug into the
> MTD subsystem, not the block subsystem.
>
>
> NVMHCI, as far as I understand the spec, is not trying to be a
> general-purpose disk; it's for exposing more or less raw NAND. As
> far as I can tell it's a DMA engine spec for large arrays of NAND.
> BTW, has anybody actually seen an NVMHCI device, or does anyone plan on
> making one?
I briefly glanced at the doc, and it does not look like this is an
interface to expose raw NAND. E.g., I could not find an "erase" operation.
I could not find information about bad eraseblocks.
It looks like it is not about raw NAND. Maybe it is about "managed" NAND.
Also, the following sentences from the "Outside of Scope" sub-section
suggest I'm right:
"NVMHCI is also specified above any non-volatile memory management, like
wear leveling. Erases and other management tasks for NVM technologies
like NAND are abstracted."
So it says NVMHCI is _above_ wear levelling, which means the FTL would be
_inside_ the NVMHCI device, i.e. this is not raw NAND.
But I may be wrong; I spent less than 10 minutes looking at the doc,
sorry.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
Szabolcs Szakacsits wrote:
>>>
>>> Windows 2008 aligns partitions on a 1MB boundary, IIRC.
>> Makes a lot of sense...
>
> Since Vista at least the first partition is 2048 sector aligned.
>
> Szaka
>
2048 * 512 = 1 MB, yes.
I *think* it's actually 1 MB and not 2048 sectors, but yes, they've
finally dumped the idiotic DOS misalignment. Unfortunately the GNU
parted people have said that the parted code is too fragile to be fixed
in this way. Sigh.
-hpa
Hi!
>> Well, no one is talking about 64KB granularity for in-core files. Like
>> you noticed, Windows uses the mmu page size. We could keep doing that,
>> and still have 16KB+ sector sizes. It just means a RMW if you don't
>> happen to have the adjoining clean pages in cache.
>>
>> Sure, on a rotating disk that's a disaster, but we're talking SSD here,
>> so while you're doubling your access time, you're doubling a fairly
>> small quantity. The controller would do the same if it exposed smaller
>> sectors, so there's no huge loss.
>>
>> We still lose on disk storage efficiency, but I'm guessing that a
>> modern tree with some object files with debug information and a .git
>> directory it won't be such a great hit. For more mainstream uses, it
>> would be negligible.
>
>
> Speaking of RMW... in one sense, we have to deal with RMW anyway.
> Upcoming ATA hard drives will be configured with a normal 512b sector
> API interface, but underlying physical sector size is 1k or 4k.
>
> The disk performs the RMW for us, but we must be aware of physical
> sector size in order to determine proper alignment of on-disk data, to
> minimize RMW cycles.
Also... RMW has some nasty reliability implications. If we use ext3
with a 1KB block size (or something like that), unrelated data may now
be damaged during a power failure. Filesystems cannot handle that :-(.
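Just to quantify the blast radius (plain arithmetic, no driver specifics):

/*
 * How many unrelated filesystem blocks get rewritten (and can be lost
 * if the RMW is torn by a power failure) for a given physical sector
 * size and a 1k-block filesystem.
 */
#include <stdio.h>

int main(void)
{
	unsigned fs_block = 1024;		/* e.g. ext3 with 1k blocks */
	unsigned sectors[] = { 4096, 16384, 65536 };
	unsigned i;

	for (i = 0; i < sizeof(sectors) / sizeof(sectors[0]); i++)
		printf("%6u-byte physical sector: one RMW rewrites %u unrelated 1k blocks\n",
		       sectors[i], sectors[i] / fs_block - 1);
	return 0;
}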
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Wed, 15 April 2009 09:37:50 +0300, Artem Bityutskiy wrote:
>
> I briefly glanced at the doc, and it does not look like this is an
> interface to expose raw NAND. E.g., I could not find an "erase" operation.
> I could not find information about bad eraseblocks.
>
> It looks like it is not about raw NAND. Maybe it is about "managed" NAND.
I'm not sure whether your distinction is exactly valid anymore. "raw
NAND" used to mean two things. 1) A single chip of silicon without
additional hardware. 2) NAND without FTL.
Traditionally the FTL was implemented either in software or in a
controller chip. So you could not get "cooked" flash (as in FTL) without
"cooked" flash (as in extra hardware). Today you can, which makes "raw
NAND" a less useful term.
And I'm not sure what to think about flash chips with the (likely
crappy) FTL inside either. Not having to deal with bad blocks anymore
is bliss. Not having to deal with wear leveling anymore is a lie.
Not knowing whether errors occurred and whether uncorrected data was
left on the device or replaced with corrected data is a pain.
But like it or not, the market seems to be moving in that direction.
Which means we will have "block devices" that have all the interfaces of
disks and behave much like flash - modulo the crap FTL.
Jörn
--
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon
Jörn Engel wrote:
> But like it or not, the market seems to be moving in that direction.
> Which means we will have "block devices" that have all the interfaces of
> disks and behave much like flash - modulo the crap FTL.
One driving goal behind NVMHCI was to avoid disk-originated interfaces,
because they are not as well suited to flash storage.
The NVMHCI command set (distinguished from NVMHCI, the silicon) is
specifically targeted towards flash.
Jeff