Hi list,
Block allocation is a key component of a file system. Every file system tries
to improve performance by optimizing the block allocation of a file. But no
matter what the file system does, it is only guessing what the user expects,
so it is not very accurate. fadvise(2) provides a way for the user to give a
hint to the file system. However, only a few flags are provided so far. So we
could provide more flags to tell the file system how to allocate the blocks
for a file.
For example, we could add these flags to fadvise(2):
FADV_ALLOC_READ_SEQ
FADV_ALLOC_READ_RANDOM
FADV_ALLOC_WRITE_ONCE
FADV_ALLOC_WRITE_APPEND
FADV_ALLOC_READ_* are not the same as FADV_SEQUENTIAL and FADV_RANDOM.
FADV_ALLOC_READ_SEQ tells the file system that this file needs sequential
blocks allocated, and FADV_ALLOC_READ_RANDOM tells the file system that this
file can tolerate fragmentation.
FADV_ALLOC_WRITE_ONCE indicates that this file is written only once, so the
file system can allocate sequential blocks for it to improve read
performance. FADV_ALLOC_WRITE_APPEND indicates that data will be appended to
the end of this file, so the file system can reserve some blocks for it to
keep the file as sequential as possible.
A file system can support a subset of these flags according to its design.
These flags provide a rich interface that lets the user control the block
allocation of files. Users could precisely control the allocation of their
files to improve the performance of applications.
Any comments or suggestions are appreciated. Thank you.
Regards,
Zheng
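For context, the existing fadvise(2) hints that the proposal contrasts with
describe access patterns rather than allocation. A minimal sketch of passing
one of them for a whole file, assuming Linux with glibc; the helper name
advise_access_pattern is hypothetical, and none of the proposed
FADV_ALLOC_* flags exist, so they are not used here:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: pass an existing access-pattern hint for the
 * whole file. Returns 0 on success or an error number; posix_fadvise()
 * reports errors through its return value, not through errno.
 */
int advise_access_pattern(int fd, int sequential)
{
    int advice = sequential ? POSIX_FADV_SEQUENTIAL : POSIX_FADV_RANDOM;

    /* offset 0 with len 0 covers the file from start to end */
    return posix_fadvise(fd, 0, 0, advice);
}
```

These hints are advisory: the kernel is free to ignore them, and they say
nothing about where blocks are placed on disk.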
On 03/05/2012 04:50 AM, Zheng Liu wrote:
> Hi list,
>
> Block allocation is a key component of file system. Every file systems try to
> improve the performance with optimizing the block allocation of a file. But no
> matter what file system does, it just guesses what the user expects. Thus, it
> is not very accurate. fadvise(2) provides a method to let the user to give a
> hint to file system. However, until now, only few flags are provided. So we
> can provide more flags to tell file system how to allocate the blocks for a
> file.
>
> For example:
> we can add these flags into fadvise(2):
> FADV_ALLOC_READ_SEQ
> FADV_ALLOC_READ_RANDOM
> FADV_ALLOC_WRITE_ONCE
> FADV_ALLOC_WRITE_APPEND
>
> FADV_ALLOC_READ_* are not similar with FADV_SEQUENTIAL and FADV_RANDOM.
> FADV_ALLOC_READ_SEQ tells file system that this file need to allocate some
> sequential blocks, and FADV_ALLOC_READ_RADOM tells file system that this file
> can endure the fragmentation.
File systems typically allocate the best layout they can for a file
at the time of write. Does _RANDOM mean do not do that? Find single
bits scattered around the disk? If so, why will people use it? I mean,
random IOs are slow. What you are proposing is a further slowdown.
Hardly a feature that will be attractive to users.
> FADV_ALLOC_WRITE_ONCE indicates that this file just is written once. So file
> system can allocate some sequential blocks for it to improve the read
> performance. FADV_ALLOC_WRITE_APPEND flag is set to point out that data will be
> appended to the end of this file, and file system can reserve some blocks for it
> to guarantee the sequence as much as possible.
Define ONCE. Is it one write(2)? I guess not. You probably mean
that once the file descriptor is closed, it will not be written
to. But we have no way of knowing how many writes there will be.
So it will be treated the same as APPEND. And file systems already
provide allocation reservation and/or delayed allocation to handle
APPEND write loads. So this flag does not offer much to the user
or the fs.
On Mon, Mar 05, 2012 at 11:48:43AM -0800, Sunil Mushran wrote:
> On 03/05/2012 04:50 AM, Zheng Liu wrote:
> >Hi list,
> >
> >Block allocation is a key component of file system. Every file systems try to
> >improve the performance with optimizing the block allocation of a file. But no
> >matter what file system does, it just guesses what the user expects. Thus, it
> >is not very accurate. fadvise(2) provides a method to let the user to give a
> >hint to file system. However, until now, only few flags are provided. So we
> >can provide more flags to tell file system how to allocate the blocks for a
> >file.
> >
> >For example:
> >we can add these flags into fadvise(2):
> >FADV_ALLOC_READ_SEQ
> >FADV_ALLOC_READ_RANDOM
> >FADV_ALLOC_WRITE_ONCE
> >FADV_ALLOC_WRITE_APPEND
> >
> >FADV_ALLOC_READ_* are not similar with FADV_SEQUENTIAL and FADV_RANDOM.
> >FADV_ALLOC_READ_SEQ tells file system that this file need to allocate some
> >sequential blocks, and FADV_ALLOC_READ_RADOM tells file system that this file
> >can endure the fragmentation.
Hi Sunil,
Thank you for your feedback.
>
>
> File systems typically allocate the best layout they can for a file
> at the time of write. Does _RANDOM mean do not do that. Find single
> bits scattered around the disk. If so, why will people use it. I
> mean, random IOs are slow. What you are proposing it is a further
> slowdown.
> Hardly a feature that will be attractive to users.
No, _RANDOM means that the file system doesn't need to try its best to
find a good position when allocating blocks for this file. Furthermore,
random IOs currently do not seem to be noticeably slower than sequential
IOs on Flash/SSD devices. For example, when users know that a file is
accessed infrequently, they can put this file in a corner, such as in
some discontiguous blocks. Sequential blocks are then reserved for the
files that need to be accessed frequently, and users get better
performance.
>
>
> >FADV_ALLOC_WRITE_ONCE indicates that this file just is written once. So file
> >system can allocate some sequential blocks for it to improve the read
> >performance. FADV_ALLOC_WRITE_APPEND flag is set to point out that data will be
> >appended to the end of this file, and file system can reserve some blocks for it
> >to guarantee the sequence as much as possible.
>
>
> Define ONCE. Is it one write(2)? I guess not. You probably mean
> that once the file descriptor is closed, it will not be written
> to. But we have no way of knowing how many writes there will be.
> So it will be treated the same as APPEND. And file systems already
> provide allocation reservation and/or delayed allocation to handle
> APPEND write loads. So this flag does not offer much to the user
> or the fs.
Sorry, I didn't express myself clearly. _ONCE means that the size of a
file doesn't change after it has been created. Certainly, you are right.
We can use fallocate(2) to obtain the same result. ;-)
Regards,
Zheng
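A minimal sketch of the fallocate-based approach for a file whose size is
known up front and will not change, assuming Linux with glibc. The helper
name preallocate_fixed_size is hypothetical, and posix_fallocate() is used
here as the portable wrapper rather than the raw fallocate(2) call:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: reserve final_size bytes for a file before any
 * data is written, so the allocator sees the whole size at once.
 * Returns 0 on success or an error number (errno is not set).
 */
int preallocate_fixed_size(int fd, off_t final_size)
{
    /* Allocates the range [0, final_size) and extends the file size. */
    return posix_fallocate(fd, 0, final_size);
}
```

Unlike a hint, this changes the file: after a successful call the file size
is at least final_size bytes.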
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 3/5/2012 6:35 PM, Zheng Liu wrote:
> No, _RANDOM means that file system doesn't need to try its best to find
> a proper position to allocate some blocks for this file. Furthermore,
> currently random IOs seem that they are not obviously slower than
> sequential IOs in Flash/SSD device. For example, when users know a file
> that is accessed infrequently, they can put this file in a corner, such
> as in some discontinuously blocks. Then sequential blocks are reserved
> for the file that needs to be accessed frequently and users can obtain
> the better performance.
Then FADV_ALLOC_HOT_REGION and FADV_ALLOC_COLD_REGION are
probably better terms.
On Mon, 5 Mar 2012, Zheng Liu wrote:
> Hi list,
>
> Block allocation is a key component of file system. Every file systems try to
> improve the performance with optimizing the block allocation of a file. But no
> matter what file system does, it just guesses what the user expects. Thus, it
> is not very accurate. fadvise(2) provides a method to let the user to give a
> hint to file system. However, until now, only few flags are provided. So we
> can provide more flags to tell file system how to allocate the blocks for a
> file.
>
> For example:
> we can add these flags into fadvise(2):
> FADV_ALLOC_READ_SEQ
> FADV_ALLOC_READ_RANDOM
> FADV_ALLOC_WRITE_ONCE
> FADV_ALLOC_WRITE_APPEND
>
> FADV_ALLOC_READ_* are not similar with FADV_SEQUENTIAL and FADV_RANDOM.
> FADV_ALLOC_READ_SEQ tells file system that this file need to allocate some
> sequential blocks, and FADV_ALLOC_READ_RADOM tells file system that this file
> can endure the fragmentation.
>
> FADV_ALLOC_WRITE_ONCE indicates that this file just is written once. So file
> system can allocate some sequential blocks for it to improve the read
> performance. FADV_ALLOC_WRITE_APPEND flag is set to point out that data will be
> appended to the end of this file, and file system can reserve some blocks for it
> to guarantee the sequence as much as possible.
Hi Zheng,
those two flags do not make sense to me. FADV_ALLOC_WRITE_ONCE is
actually the same as fallocate, and we certainly do not need more ways
to do fallocate; one is more than enough.
FADV_ALLOC_WRITE_APPEND seems weird. File systems already do some
preallocation for files, so we do not fragment them as much. What
might be more interesting is to be able to set how much space we
want to keep preallocated for a particular file. Strictly speaking,
that is not something we could not achieve with fallocate, but it
would certainly be more convenient.
-Lukas
>
> File systems can support a subset of these flags according to its design. These
> flags provide a rich interface that lets the user to control block allocation of
> files. The user could precisely control the allocation of their files to
> improve the performance of appliatons.
>
> Any comments or suggestions are appreciated. Thank you.
>
> Regards,
> Zheng
On Mon, Mar 05, 2012 at 08:26:03PM -0800, Sunil Mushran wrote:
> On 3/5/2012 6:35 PM, Zheng Liu wrote:
> >No, _RANDOM means that file system doesn't need to try its best to find
> >a proper position to allocate some blocks for this file. Furthermore,
> >currently random IOs seem that they are not obviously slower than
> >sequential IOs in Flash/SSD device. For example, when users know a file
> >that is accessed infrequently, they can put this file in a corner, such
> >as in some discontinuously blocks. Then sequential blocks are reserved
> >for the file that needs to be accessed frequently and users can obtain
> >the better performance.
>
> Then FADV_ALLOC_HOT_REGION and FADV_ALLOC_COLD_REGION are
> probably better terms.
Makes sense to me. Thanks a lot. ;-)
Regards,
Zheng
On Tue, Mar 06, 2012 at 09:27:16AM +0100, Lukas Czerner wrote:
> On Mon, 5 Mar 2012, Zheng Liu wrote:
>
> > Hi list,
> >
> > Block allocation is a key component of file system. Every file systems try to
> > improve the performance with optimizing the block allocation of a file. But no
> > matter what file system does, it just guesses what the user expects. Thus, it
> > is not very accurate. fadvise(2) provides a method to let the user to give a
> > hint to file system. However, until now, only few flags are provided. So we
> > can provide more flags to tell file system how to allocate the blocks for a
> > file.
> >
> > For example:
> > we can add these flags into fadvise(2):
> > FADV_ALLOC_READ_SEQ
> > FADV_ALLOC_READ_RANDOM
> > FADV_ALLOC_WRITE_ONCE
> > FADV_ALLOC_WRITE_APPEND
> >
> > FADV_ALLOC_READ_* are not similar with FADV_SEQUENTIAL and FADV_RANDOM.
> > FADV_ALLOC_READ_SEQ tells file system that this file need to allocate some
> > sequential blocks, and FADV_ALLOC_READ_RADOM tells file system that this file
> > can endure the fragmentation.
> >
> > FADV_ALLOC_WRITE_ONCE indicates that this file just is written once. So file
> > system can allocate some sequential blocks for it to improve the read
> > performance. FADV_ALLOC_WRITE_APPEND flag is set to point out that data will be
> > appended to the end of this file, and file system can reserve some blocks for it
> > to guarantee the sequence as much as possible.
>
> Hi Zheng,
>
> those two flags does not make sense to me. The FADV_ALLOC_WRITE_ONCE is
> actually the same as fallocate, and we certainly do not need more ways
> to do fallocate, one is more than enough.
>
> FADV_ALLOC_WRITE_APPEND seems weird. File systems already do some
> preallocations for the files, so we do not fragment them as much. So
> what might be more interesting is to be able to set how much space we
> want to keep preallocated for the particular file, however strictly
> speaking it is not something we would not achieve with fallocate, but it
> would certainly be more convenient.
>
> -Lukas
>
Hi Lukas,
I have realized that these two flags seem redundant, and we don't need
them.
As we discussed previously, and following Sunil's suggestions, the key
issue is that the user provides a hint to the file system, so that the
file system knows whether this file can be stored in a corner or be
allocated in non-sequential blocks. The sequential blocks are then
reserved for the particular files that have a *_HOT* flag. Although
fallocate(2) can preallocate some blocks for a file, it cannot put a
file at the beginning of the disk to obtain better performance. So maybe
the file system can use these flags to optimize the layout of a file.
Regards,
Zheng
> >
> > File systems can support a subset of these flags according to its design. These
> > flags provide a rich interface that lets the user to control block allocation of
> > files. The user could precisely control the allocation of their files to
> > improve the performance of appliatons.
> >
> > Any comments or suggestions are appreciated. Thank you.
> >
> > Regards,
> > Zheng
On Tue, 6 Mar 2012, Zheng Liu wrote:
> On Tue, Mar 06, 2012 at 09:27:16AM +0100, Lukas Czerner wrote:
> > On Mon, 5 Mar 2012, Zheng Liu wrote:
> >
> > > Hi list,
> > >
> > > Block allocation is a key component of file system. Every file systems try to
> > > improve the performance with optimizing the block allocation of a file. But no
> > > matter what file system does, it just guesses what the user expects. Thus, it
> > > is not very accurate. fadvise(2) provides a method to let the user to give a
> > > hint to file system. However, until now, only few flags are provided. So we
> > > can provide more flags to tell file system how to allocate the blocks for a
> > > file.
> > >
> > > For example:
> > > we can add these flags into fadvise(2):
> > > FADV_ALLOC_READ_SEQ
> > > FADV_ALLOC_READ_RANDOM
> > > FADV_ALLOC_WRITE_ONCE
> > > FADV_ALLOC_WRITE_APPEND
> > >
> > > FADV_ALLOC_READ_* are not similar with FADV_SEQUENTIAL and FADV_RANDOM.
> > > FADV_ALLOC_READ_SEQ tells file system that this file need to allocate some
> > > sequential blocks, and FADV_ALLOC_READ_RADOM tells file system that this file
> > > can endure the fragmentation.
> > >
> > > FADV_ALLOC_WRITE_ONCE indicates that this file just is written once. So file
> > > system can allocate some sequential blocks for it to improve the read
> > > performance. FADV_ALLOC_WRITE_APPEND flag is set to point out that data will be
> > > appended to the end of this file, and file system can reserve some blocks for it
> > > to guarantee the sequence as much as possible.
> >
> > Hi Zheng,
> >
> > those two flags does not make sense to me. The FADV_ALLOC_WRITE_ONCE is
> > actually the same as fallocate, and we certainly do not need more ways
> > to do fallocate, one is more than enough.
> >
> > FADV_ALLOC_WRITE_APPEND seems weird. File systems already do some
> > preallocations for the files, so we do not fragment them as much. So
> > what might be more interesting is to be able to set how much space we
> > want to keep preallocated for the particular file, however strictly
> > speaking it is not something we would not achieve with fallocate, but it
> > would certainly be more convenient.
> >
> > -Lukas
> >
>
> Hi Lukas,
>
> I have realized that these two flags seem redundant, and we don't need
> them.
>
> As we discussed previously and Sunil's suggestions. The key issue is
> that user provides a hint to file system, and file system can know
> whether or not this file can be stored in a corner or be allocated in
> non-sequential blocks. Then the sequential blocks are reserved for the
> particular file that has a *_HOT* flag. Although fallocate(2) can
> preallocate some blocks for a file, it cannot put a file at the
> beginning of the disk to obtain a better performance. So maybe file
> system can use these flags to optimize the layout of a file.
However, the file system does not have information about which part of
the device it resides on is faster. It might be the beginning of the file
system, but that might not be the case at all.
Moreover, a flag stating that the file does not have to be allocated
sequentially is not particularly helpful; I cannot imagine people using
it. Why would someone want to lower their performance? Well, they might
think that it will increase the performance of other files, but that is
highly disputable, and there are better solutions, like using faster
storage for the files that actually need it.
Additionally, a *_HOT* flag does not say anything about the allocation
policy. The file might be accessed often, but not in a sequential manner,
or it might be written to a lot, or appended to a lot, or its content
might be changed without changing its size, etc. *Hot* might mean so
many things that it is just not useful for the file system. It would
certainly be better to come up with something less esoteric which would
actually address concrete user issues and help the file system deal with
them better, like, I do not know, do not fsync/force allocation on
rename maybe... (or whatever we are doing right now).
Thanks!
-Lukas
>
> Regards,
> Zheng
>
> > >
> > > File systems can support a subset of these flags according to its design. These
> > > flags provide a rich interface that lets the user to control block allocation of
> > > files. The user could precisely control the allocation of their files to
> > > improve the performance of appliatons.
> > >
> > > Any comments or suggestions are appreciated. Thank you.
> > >
> > > Regards,
> > > Zheng
On 03/06/2012 06:29 AM, Lukas Czerner wrote:
> However the file system do not have the information which part of the
> device it resides on is faster. It might be the beginning of the file
> system, but it might not be the case at all.
Think HSM and flash storage as the hot region. Remember these are
hints and not guaranteed to work in all cases.
> Moreover the flag which is stating that the file does not have to be
> allocated sequentially is not particularly helpful, I can not imagine
> people using it. Why would someone want to lower their performance ?
> Well, they might think that it will increase performance of the other
> files, but that is highly disputable and there are better solutions like
> using faster storage for the files that actually needs it.
>
> Additionally *_HOT* flag does not say anything about the allocation
> policy. It might be accessed often ,but no in sequential manner, or it
> can be written to a lot, it can be appended a lot, or it the content
> might be changed without changing its size etc... *Hot* might mean so
> many thing that this is just not useful for the file system. It would
> certainly be better to come up with something less esoteric which would
> actually address concrete user issues and help file system to deal with
> them better, like, I do not know, do not fsync/force allocation on
> rename maybe...(or whatever we are doing right now).
_HOT/_COLD is descriptive for allocation policy though fadvise() is
the wrong call as it pertains to access patterns.
Sunil
On Mon, Mar 05, 2012 at 08:50:29PM +0800, Zheng Liu wrote:
> Hi list,
>
> Block allocation is a key component of file system. Every file systems try to
> improve the performance with optimizing the block allocation of a file. But no
> matter what file system does, it just guesses what the user expects. Thus, it
> is not very accurate. fadvise(2) provides a method to let the user to give a
> hint to file system. However, until now, only few flags are provided. So we
> can provide more flags to tell file system how to allocate the blocks for a
> file.
>
> For example:
> we can add these flags into fadvise(2):
> FADV_ALLOC_READ_SEQ
fallocate()
> FADV_ALLOC_READ_RANDOM
Allocation can't be optimised as the read pattern cannot be defined.
> FADV_ALLOC_WRITE_ONCE
fallocate()
> FADV_ALLOC_WRITE_APPEND
chattr +a
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 2012-03-07, at 8:51 AM, Dave Chinner wrote:
> On Mon, Mar 05, 2012 at 08:50:29PM +0800, Zheng Liu wrote:
>> Block allocation is a key component of file system. Every file systems try to
>> improve the performance with optimizing the block allocation of a file. But no
>> matter what file system does, it just guesses what the user expects. Thus, it
>> is not very accurate. fadvise(2) provides a method to let the user to give a
>> hint to file system. However, until now, only few flags are provided. So we
>> can provide more flags to tell file system how to allocate the blocks for a
>> file.
>>
>> For example:
>> we can add these flags into fadvise(2):
>> FADV_ALLOC_READ_SEQ
>
> fallocate()
I think this is already the assumed default for any file IO, but is included for completeness (e.g. to be able to turn off READ_RANDOM).
>> FADV_ALLOC_READ_RANDOM
>
> Allocation can't be optimised as the read pattern cannot be defined.
I think what this is intended for is to tell the filesystem "don't work very hard to find optimum allocation, it will have a random read pattern anyway".
>> FADV_ALLOC_WRITE_ONCE
>
> fallocate()
>
>> FADV_ALLOC_WRITE_APPEND
>
> chattr +a
and/or fallocate(KEEP_SIZE)
Having a consistent API definitely makes sense.
This proposal definitely needs to have some clear explanation of how the flags are intended to be used by applications, and why they will help filesystems to improve allocation. I'm not for adding gratuitous APIs, but at the same time I think that filesystems are often working in the dark and could benefit from more information being passed from the application.
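A minimal sketch of the fallocate(KEEP_SIZE) idea for append workloads,
assuming Linux with glibc and a file system that supports
FALLOC_FL_KEEP_SIZE. The helper name reserve_for_append is hypothetical, and
how much space to reserve is left to the caller:

```c
#define _GNU_SOURCE
#include <errno.h>   /* callers inspect errno on failure */
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: reserve reserve_bytes of space past the current
 * end of file without changing the visible file size.
 * Returns 0 on success, -1 with errno set on failure (for example
 * EOPNOTSUPP when the file system does not support the operation).
 */
int reserve_for_append(int fd, off_t reserve_bytes)
{
    off_t end = lseek(fd, 0, SEEK_END);

    if (end < 0)
        return -1;

    /* KEEP_SIZE: allocate blocks beyond EOF, leave st_size untouched */
    return fallocate(fd, FALLOC_FL_KEEP_SIZE, end, reserve_bytes);
}
```

Because the size is unchanged, readers never see the reserved range; later
appends can then land in the space that was already allocated.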
Cheers, Andreas
>>>>> "Andreas" == Andreas Dilger <[email protected]> writes:
Andreas> This proposal definitely needs to have some clear explanation
Andreas> of how the flags are intended to be used by applications, and
Andreas> why they will help filesystems to improve allocation.
This goes a bit deeper than just filesystem block allocation strategy.
With SMR drives lurking on the horizon it is becoming increasingly
important for us to classify anticipated future access patterns as we
send I/Os out to storage. We'll need something much smarter than just
REQ_META for these devices. Tiered storage arrays and tiered flash also
benefit from this information.
There's lots of work going on in the standards space in this department
right now and I was hoping we could spend some time discussing the
current proposals in one of the plenary sessions at LSF. Ideally we'd
tie fadvise() and any filesystem internal knowledge into appropriate
storage hints at the bottom of the stack.
--
Martin K. Petersen Oracle Linux Engineering
On Tue, 6 Mar 2012, Sunil Mushran wrote:
> On 03/06/2012 06:29 AM, Lukas Czerner wrote:
> > However the file system do not have the information which part of the
> > device it resides on is faster. It might be the beginning of the file
> > system, but it might not be the case at all.
>
>
> Think HSM and flash storage as the hot region. Remember these are
> hints and not guaranteed to work in all cases.
Exactly, first we have to define what we actually need to achieve with
it, not just randomly make up stupid pseudo-optimizations. Moreover,
there is _no way_ the file system has information about HSM, flash
regions, fast regions or whatever; it does not even know where the
beginning of the disk is. Stop constructing the building from the roof!!
There just is not any interface for the file system to use to get such
information!
I also believe that, regarding HSM, the user is in no damn position to
decide whether his file will be on flash or not. It just does not work
that way: from their own point of view, every user's or application's
files have to be accessed faster than others.
>
>
> > Moreover the flag which is stating that the file does not have to be
> > allocated sequentially is not particularly helpful, I can not imagine
> > people using it. Why would someone want to lower their performance ?
> > Well, they might think that it will increase performance of the other
> > files, but that is highly disputable and there are better solutions like
> > using faster storage for the files that actually needs it.
> >
> > Additionally *_HOT* flag does not say anything about the allocation
> > policy. It might be accessed often ,but no in sequential manner, or it
> > can be written to a lot, it can be appended a lot, or it the content
> > might be changed without changing its size etc... *Hot* might mean so
> > many thing that this is just not useful for the file system. It would
> > certainly be better to come up with something less esoteric which would
> > actually address concrete user issues and help file system to deal with
> > them better, like, I do not know, do not fsync/force allocation on
> > rename maybe...(or whatever we are doing right now).
>
> _HOT/_COLD is descriptive for allocation policy though fadvise() is
> the wrong call as it pertains to access patterns.
Of course _HOT/_COLD are totally stupid flags from both the user and file
system POV. They could mean whatever you can imagine behind HOT/COLD. In
this case it is so damn esoteric that I cannot imagine even file systems
agreeing on the meaning of it. But when it comes to the user it will be
even worse - total disaster - no one would be able to say what benefit it
should actually bring.
Just come up with concrete optimizations and give them concrete names.
If this is going to be of any use to file systems and users, both should
know exactly what workload will be applied to the file, or what the user
actually intends to do with it, so that the file system can take concrete
action. What you are proposing is a flag which should spawn ponies all
around; it does not work.
And if you cannot come up with any flag like that, well then it certainly
tells you something about this feature as a whole.
-Lukas
>
> Sunil
>
--
On Wed, Mar 07, 2012 at 12:02:19AM -0500, Martin K. Petersen wrote:
> >>>>> "Andreas" == Andreas Dilger <[email protected]> writes:
>
> Andreas> This proposal definitely needs to have some clear explanation
> Andreas> of how the flags are intended to be used by applications, and
> Andreas> why they will help filesystems to improve allocation.
>
> This goes a bit deeper than just filesystem block allocation strategy.
>
> With SMR drives lurking on the horizon it is becoming increasingly
> important for us to classify anticipated future access patterns as we
> send I/Os out to storage. We'll need something much smarter than just
> REQ_META for these devices. Tiered storage arrays and tiered flash also
> benefit from this information.
From what I've seen of the proposed SMR device standards, we're
going to have to redesign filesystem allocation policies completely
to use anything other than a single emulated random read/write
region in a SMR drive. Filesystems are going to need to know about
the different regions and their attributes to determine how they can
allocate space and what type of write IO can be directed to
such areas. e.g. a filesystem that overwrites metadata in place must
use a random RW region for all its metadata - there is no other
choice. And for regions that are append only, they cannot have their
space reused until the entire region has had all active data moved
out of it first.
From that perspective, I don't see fadvise as the best interface for
this - per-file access pattern/allocation policy information needs
to be kept persistent in the filesystem. Indeed, there is no end of
different allocation policies a filesystem could define, so I don't
think that iterating them in fadvise() is a good thing to do. I'm
not sure that fallocate() is even the right place for this, though
it is a much better match for such extensions because it is for
persistent changes to file allocation ranges.
> There's lots of work going on in the standards space in this department
> right now and I was hoping we could spend some time discussing the
> current proposals in one of the plenary sessions at LSF. Ideally we'd
> tie fadvise() and any filesystem internal knowledge into appropriate
> storage hints at the bottom of the stack.
I didn't see much in way of scope for hints at the bottom of the
stack for SMR devices - once the filesystem has allocated space in
the region for the given access type, there is no additional
information that needs to be supplied by the storage stack. I
suspect the same is true for tiered storage....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Mar 07, 2012 at 09:51:27AM +0100, Lukas Czerner wrote:
> Exactly, first we have to define what we actually need to achieve with
> it. Not just randomly making up stupid pseudo-optimizations. Moreover
> there is _no way_ file system has the information about the HSM nor the
> flash regions, fast regions or whatever, it does not even know where is
> the beginning of the disk. Stop constructing building from the roof!!
> There just is not any interface for the file system to use to get such
> information!
I'm not really worried about this problem. This is something which
can easily be set by the system administrator via mkfs or tune2fs.
And just as we now have /sys/block/sda/queue/rotational so that upper
layers can make optimizations based on whether or not a disk is an
SSD, as we can prove that manual configuration of storage attributes
can make a measurable difference, it can be a spur to the standards
bodies to eventually (years and years and years later) come up with a
standardized way for the file system to get such interfaces
automatically.
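A minimal sketch of how an upper layer might read the rotational attribute
mentioned above, assuming the sysfs file contains a single 0 or 1. The
helper name read_rotational is hypothetical, and the path is passed in so
the caller chooses the device:

```c
#include <stdio.h>

/*
 * Hypothetical helper: read a sysfs "rotational" attribute.
 * Returns 1 for a rotational device, 0 for a non-rotational one,
 * and -1 if the file cannot be read or holds an unexpected value.
 */
int read_rotational(const char *sysfs_path)
{
    FILE *f = fopen(sysfs_path, "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);

    return (value == 0 || value == 1) ? value : -1;
}
```

A caller would pass something like "/sys/block/sda/queue/rotational" and
treat -1 as "unknown" rather than guessing.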
In the meantime, we have flash vendors (or at least one flash vendor
who works with embedded/handset customers) interested in potentially
providing private interfaces to make additional storage attributes
available, or for file systems to provide information to the storage
devices so they can better optimize their behavior.
So I'm not too worried about the fact that we don't have a way to
specify all of these things yet. If we can find a way to make things
faster, eventually the rest of the infrastructure can get plumbed in.
(Even standards bodies that move in geologic time scales. :-)
> I also believe that regarding HSM user is in no damn position to decide
> whether his file will be on flash or not. It just does not work that
> way, every user's, or application's files has to be accessed faster than
> others from their point of view.
Access control is going to be an interesting problem, and what the
requirements are for a file system running on an HPC system, or an
Android device, or a generic time-sharing system are quite different.
Given that many of us have grown up in an environment where there are
mutually suspicious (and untrustworthy) time sharing users, or equally
untrustworthy application writers who tend to optimize their
application without considering anything else, it's easy for us to
assume that if we can't solve the authorization problem completely,
that it's hopeless.
But the same argument can be made for real-time scheduling priorities
(which are even easier for untrustworthy application authors to abuse),
and yet those have turned out to be extremely important in allowing
Linux to break through in various new fields --- including Naval
Warships and laser-wielding industrial robots. :-)
- Ted
>>>>> "Dave" == Dave Chinner <[email protected]> writes:
Dave> From what I've seen of the proposed SMR device standards, we're
Dave> going to have to redesign filesystem allocation policies
[...]
The initial proposal involved SMR disks having a sparse LBA map carved
into 2GB chunks. However, that was shot down pretty hard.
The approach currently being worked uses either dynamic (flash, tiered
storage) or static hints (SMR) to put things in an appropriate area
given the nature of the I/O.
This puts the burden of virtual to physical LBA management on the device
rather than in the filesystem allocators. And gives us the benefit of
having a single interface that can be used for many different device
types.
That said, the current proposal is crazy complex and clearly written
with Windows in mind. They are creating different access profiles for
.DLLs, .INI files, apps in the startup folder, and so on.
Dave> Indeed, there is no end of different allocation policies a
Dave> filesystem could define, so I don't think that iterating them in
Dave> fadvise() is a good thing to do.
I have no particular opinion on the proposed fadvise() flags. Just
saying that no matter whether we like it or not we'll have to be able to
pass information about expected access patterns down to the storage.
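For reference, this is roughly how an application passes an access-pattern
hint today. The sketch uses only the existing POSIX_FADV_SEQUENTIAL advice
through Python's os.posix_fadvise; the FADV_ALLOC_* flags proposed earlier
in the thread do not exist and are deliberately not used. The helper name
advise_sequential is an assumption for illustration.

```python
import os


def advise_sequential(path):
    """Hint that `path` will be read sequentially.

    Returns True if the hint was accepted, False if the platform has no
    posix_fadvise or the call failed. The hint is advisory only; the
    kernel is free to ignore it.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # e.g. platforms without posix_fadvise(2)
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 together mean "from the start to end of file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
```

Note that this hint affects readahead behaviour, not block allocation,
which is the gap the original proposal was trying to address.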
--
Martin K. Petersen Oracle Linux Engineering
On Wed, Mar 07, 2012 at 11:23:49PM -0500, Martin K. Petersen wrote:
> >>>>> "Dave" == Dave Chinner <[email protected]> writes:
>
> Dave> From what I've seen of the proposed SMR device standards, we're
> Dave> going to have to redesign filesystem allocation policies
>
> [...]
>
> The initial proposal involved SMR disks having a sparse LBA map carved
> into 2GB chunks.
2TB chunks, IIRC - the lower 32 bits of the 48bit LBA was intended
to be the relative offset into the region (RBA), with the upper 16
bits being the region number.
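The split described above can be sketched as a couple of bit operations.
The 32/16 bit layout comes from the message; the function names and the
512-byte logical block size are assumptions made for this example (with
512-byte blocks, 2^32 blocks per region works out to 2 TiB, which is
consistent with the 2TB figure).

```python
RBA_BITS = 32       # lower bits: relative offset into the region (RBA)
REGION_BITS = 16    # upper bits: region number
SECTOR_BYTES = 512  # assumption: 512-byte logical blocks


def split_lba(lba):
    """Split a 48-bit LBA into (region number, RBA)."""
    if not 0 <= lba < 1 << (RBA_BITS + REGION_BITS):
        raise ValueError("LBA does not fit in 48 bits")
    region = lba >> RBA_BITS
    rba = lba & ((1 << RBA_BITS) - 1)
    return region, rba


def region_size_bytes():
    """Size of one region in bytes under the assumed sector size."""
    return (1 << RBA_BITS) * SECTOR_BYTES
```

For example, split_lba((3 << 32) | 5) yields region 3 and RBA 5.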
> However, that was shot down pretty hard.
That's unfortunate - it maps really well to how XFS uses allocation
groups. XFS already uses sparse regions for breaking up allocation
to enable parallelism. XFS could map to this sort of layout pretty
easily by placing an allocation group per region. That immediately
separates the SMR regions into discrete regions in the filesystem,
and just requires some tweaking to make use of the different
characteristics of the regions.
For example, use of the standard btree freespace allocator for the
random write regions, and use of the bitmap allocator (used by the
realtime device) for regions that are sequential write because its
metadata is held externally to the region it is tracking. i.e. it
can be located in the random write regions. This could all be
handled by mkfs.xfs, including setting up the regions on the SMR
drives....
IOWs, XFS already has most of the allocation infrastructure to
handle the proposed region based SMR devices, and would only need a
bit of modification and extension to fully support sequential write
regions along with random write regions. The allocation policy
stuff (deciding what sort of region to allocate from and aggregating
writes appropriately) is where all the new complexity lies, but
we have to do that anyway to handle all the different sorts of
access hints we are likely to see.
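The mapping described above (one allocation group per region, btree
free-space tracking for random-write regions, bitmap tracking for
sequential-write regions with the bitmap stored elsewhere) can be
illustrated with a toy model. This is not XFS code: the Region type, the
plan_layout function and the choice of the first random-write region as
the home for bitmap metadata are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    number: int
    sequential_only: bool  # True for a sequential-write region


def plan_layout(regions):
    """Map each region number to (allocator, region holding its metadata).

    Random-write regions track free space with a btree kept in the
    region itself. Sequential-write regions use a bitmap whose metadata
    lives in a random-write region, since it cannot be updated in place
    inside the sequential region.
    """
    random_regions = [r.number for r in regions if not r.sequential_only]
    plan = {}
    for r in regions:
        if r.sequential_only:
            if not random_regions:
                raise ValueError("no random-write region for bitmap metadata")
            plan[r.number] = ("bitmap", random_regions[0])
        else:
            plan[r.number] = ("btree", r.number)
    return plan
```

A real implementation would also have to decide which kind of region each
write goes to, which is the allocation policy work mentioned above and is
not modelled here.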
> The approach currently being worked uses either dynamic (flash, tiered
> storage) or static hints (SMR) to put things in an appropriate area
> given the nature of the I/O.
> This puts the burden of virtual to physical LBA management on the device
> rather than in the filesystem allocators. And gives us the benefit of
> having a single interface that can be used for many different device
> types.
So the current proposal hides all the physical characteristics of
the devices from the file system and remaps the LBA internally based
on the IO hint? But that is the opposite direction to what we've
been taking over the past couple of years - we want more visibility
of device characteristics at the filesystem level so we can optimise
the filesystem better, not less.
> That said, the current proposal is crazy complex and clearly written
> with Windows in mind. They are creating different access profiles for
> .DLLs, .INI files, apps in the startup folder, and so on.
I'll pass judgement when I see it.
To tell the truth, I'd much prefer that we have direct control of
physical layout in the filesystem rather than have the storage
device virtualise it with some unknown algorithm. Every device will
have different algorithms, so we won't get relatively consistent
behaviour across devices from different manufacturers like we have
now. If that is all hidden in the drive firmware and is different
for each different device we see, then we've got no hope of being
able to diagnose why two files with identical filesystem layouts at
adjacent LBAs have vastly different performance for the same access
pattern....
Cheers,
Dave.
--
Dave Chinner
[email protected]
>>>>> "Dave" == Dave Chinner <[email protected]> writes:
Dave> 2TB chunks, IIRC - the lower 32 bits of the 48bit LBA was intended
Dave> to be the relative offset into the region (RBA), with the upper 16
Dave> bits being the region number.
Correct.
>> However, that was shot down pretty hard.
Dave> That's unfortunate - it maps really well to how XFS uses
Dave> allocation groups.
The proposal met a lot of resistance, to the extent that the SMR folks
were asked to develop a new command set instead of using the standard
SCSI Block Commands.
I still think we can get most of what you want out of the static access
hints, however.
Dave> So the current proposal hides all the physical characteristics of
Dave> the devices from the file system and remaps the LBA internally
Dave> based on the IO hint? But that is the opposite direction to what
Dave> we've been taking over the past couple of years - we want more
Dave> visibility of device characteristics at the filesystem level so we
Dave> can optimise the filesystem better, not less.
The standards bodies are trying to avoid having to special-case handling
of shingled drives since they are only a transitional technology with a
short life expectancy.
We're getting close to the 8-year mark for the 4K logical block size
transition and it hasn't happened yet. And at this stage it looks like
it might not happen at all (in the consumer space at least).
So I am not entirely convinced that SMR drives will still be relevant
when the standards have been ratified and the filesystems of the world
adapted to work with them.
--
Martin K. Petersen Oracle Linux Engineering