When a file system becomes fragmented (using MythTV, for example), the
bigalloc window searching ends up causing huge performance problems. In
a file system presented by a user experiencing this bug, the file system
was 90% free, but no 32-block free windows existed on the entire file system.
This causes the allocator to scan the entire file system for each 128k write
before backing down to searching for individual blocks.
In the end, finding a contiguous window for all the blocks in a write is
an advantageous special case, but one that can be found naturally when
such a window exists anyway.
This patch removes the bigalloc window searching, and has been proven to fix
the test case described above.
Signed-off-by: Jeff Mahoney <[email protected]>
diff -ruNp linux-2.6.18-rc4.orig/fs/reiserfs/bitmap.c linux-2.6.18-rc4.orig.devel/fs/reiserfs/bitmap.c
--- linux-2.6.18-rc4.orig/fs/reiserfs/bitmap.c 2006-08-22 09:49:45.000000000 -0400
+++ linux-2.6.18-rc4.orig.devel/fs/reiserfs/bitmap.c 2006-08-22 10:19:35.000000000 -0400
@@ -1019,7 +1019,6 @@ static inline int blocknrs_and_prealloc_
b_blocknr_t finish = SB_BLOCK_COUNT(s) - 1;
int passno = 0;
int nr_allocated = 0;
- int bigalloc = 0;
determine_prealloc_size(hint);
if (!hint->formatted_node) {
@@ -1046,28 +1045,9 @@ static inline int blocknrs_and_prealloc_
hint->preallocate = hint->prealloc_size = 0;
}
/* for unformatted nodes, force large allocations */
- bigalloc = amount_needed;
}
do {
- /* in bigalloc mode, nr_allocated should stay zero until
- * the entire allocation is filled
- */
- if (unlikely(bigalloc && nr_allocated)) {
- reiserfs_warning(s, "bigalloc is %d, nr_allocated %d\n",
- bigalloc, nr_allocated);
- /* reset things to a sane value */
- bigalloc = amount_needed - nr_allocated;
- }
- /*
- * try pass 0 and pass 1 looking for a nice big
- * contiguous allocation. Then reset and look
- * for anything you can find.
- */
- if (passno == 2 && bigalloc) {
- passno = 0;
- bigalloc = 0;
- }
switch (passno++) {
case 0: /* Search from hint->search_start to end of disk */
start = hint->search_start;
@@ -1105,8 +1085,7 @@ static inline int blocknrs_and_prealloc_
new_blocknrs +
nr_allocated,
start, finish,
- bigalloc ?
- bigalloc : 1,
+ 1,
amount_needed -
nr_allocated,
hint->
--
Jeff Mahoney
SUSE Labs
Jeff Mahoney wrote:
> When a file system becomes fragmented (using MythTV, for example), the
> bigalloc window searching ends up causing huge performance problems. In
> a file system presented by a user experiencing this bug, the file system
> was 90% free, but no 32-block free windows existed on the entire file system.
> This causes the allocator to scan the entire file system for each 128k write
> before backing down to searching for individual blocks.
Question: Would it be better to take that performance hit once, then
cache the result for awhile? If we can't find enough consecutive space,
such space isn't likely to appear until a lot of space is freed or a
repacker is run.
> In the end, finding a contiguous window for all the blocks in a write is
> an advantageous special case, but one that can be found naturally when
> such a window exists anyway.
Hmm. Ok, I don't understand how this works, so I'll shut up.
David Masover wrote:
> Jeff Mahoney wrote:
>> When a file system becomes fragmented (using MythTV, for example), the
>> bigalloc window searching ends up causing huge performance problems. In
>> a file system presented by a user experiencing this bug, the file system
>> was 90% free, but no 32-block free windows existed on the entire file
>> system.
>> This causes the allocator to scan the entire file system for each
>> 128k write
>> before backing down to searching for individual blocks.
>
> Question: Would it be better to take that performance hit once, then
> cache the result for awhile? If we can't find enough consecutive space,
> such space isn't likely to appear until a lot of space is freed or a
> repacker is run.
The problem is that finding the window isn't really a direct function of
free space, it's a function of fragmentation. You could have a 50% full
file system that still can't find a 32 block window by having every
other block used. I know it's an extremely unlikely case, but it
demonstrates the point perfectly.
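To make that concrete, here's a throwaway userspace illustration -- not
reiserfs code, just the worst-case bitmap pattern described above:

#include <stdio.h>

#define NBLOCKS 1024
#define WINDOW  32

int main(void)
{
        char used[NBLOCKS];
        int i, run = 0, best = 0;

        for (i = 0; i < NBLOCKS; i++)
                used[i] = i & 1;        /* every other block allocated: 50% full */

        for (i = 0; i < NBLOCKS; i++) {
                run = used[i] ? 0 : run + 1;
                if (run > best)
                        best = run;
        }

        printf("50%% free, largest free window: %d blocks (bigalloc wants %d)\n",
               best, WINDOW);
        return 0;
}

The largest window there is a single block, so every 32-block search
walks the whole bitmap and still comes back empty-handed.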
>> In the end, finding a contiguous window for all the blocks in a write is
>> an advantageous special case, but one that can be found naturally when
>> such a window exists anyway.
>
> Hmm. Ok, I don't understand how this works, so I'll shut up.
If the space after the end of the file has 32 or more blocks free, even
without the bigalloc behavior, those blocks will be used.
Also, I think the bigalloc behavior just ultimately ends up introducing
even more fragmentation on an already fragmented file system. It'll keep
contiguous chunks together, but those chunks can end up being spread all
over the disk.
-Jeff
--
Jeff Mahoney
SUSE Labs
Jeff Mahoney wrote:
> David Masover wrote:
>> Jeff Mahoney wrote:
>>> When a file system becomes fragmented (using MythTV, for example), the
>>> bigalloc window searching ends up causing huge performance problems. In
>>> a file system presented by a user experiencing this bug, the file system
>>> was 90% free, but no 32-block free windows existed on the entire file
>>> system.
>>> This causes the allocator to scan the entire file system for each
>>> 128k write
>>> before backing down to searching for individual blocks.
>> Question: Would it be better to take that performance hit once, then
>> cache the result for awhile? If we can't find enough consecutive space,
>> such space isn't likely to appear until a lot of space is freed or a
>> repacker is run.
>
> The problem is that finding the window isn't really a direct function of
> free space, it's a function of fragmentation. You could have a 50% full
> file system that still can't find a 32 block window by having every
> other block used. I know it's an extremely unlikely case, but it
> demonstrates the point perfectly.
Maybe, but it's still not a counterpoint. No matter how fragmented a
filesystem is, freeing space can open up contiguous space, whereas if
space is not freed, you won't open up contiguous space.
Thus, if your FS is 50% full and 100% fragmented, then you wait till
space is freed, because if nothing happens, or if more space is filled
in, you'll have the same problem at 60% as you did at 50%. If,
however, you're at 60% full, and 10% of the space is freed, then it's
fairly unlikely that you still don't have contiguous space, and it's
worth it to scan once more at 50%, and again if it then drops to 40%.
So, if your FS is 90% full and space is being freed, I'd think it would
be worth it to scan again at 80%, 70%, and so on. I'd also imagine it
would do little or nothing to constantly monitor an FS that stays mostly
full -- maybe give it a certain amount of time, but if we're repacking
anyway, just wait for a repacker run. It seems very unlikely that
between repacker runs, activity between 86% and 94% would open up
contiguous space.
It's still not a direct function of freed space (as opposed to free
space), but it starts to look better.
I'm not endorsing one way or the other without benchmarks, though.
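Roughly what I have in mind, as a hypothetical sketch -- the structure,
the field names, and the 10% threshold are all invented for
illustration, nothing like this exists in reiserfs today:

#include <stdio.h>

struct window_cache {
        int no_window;                  /* last scan found no 32-block window */
        unsigned long free_at_scan;     /* free blocks at that scan */
        unsigned long total_blocks;
};

static int should_rescan(const struct window_cache *c, unsigned long free_now)
{
        if (!c->no_window)              /* last scan succeeded, scan as usual */
                return 1;
        /* only rescan once ~10% of the fs has been freed since the failure */
        return free_now > c->free_at_scan + c->total_blocks / 10;
}

int main(void)
{
        struct window_cache c = { 1, 100000, 1000000 };

        printf("freed a little: rescan=%d\n", should_rescan(&c, 120000));
        printf("freed a lot:    rescan=%d\n", should_rescan(&c, 210000));
        return 0;
}

The point is just that a failed window search gets remembered and only
retried after enough space has actually been freed.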
>>> In the end, finding a contiguous window for all the blocks in a write is
>>> an advantageous special case, but one that can be found naturally when
>>> such a window exists anyway.
>> Hmm. Ok, I don't understand how this works, so I'll shut up.
>
> If the space after the end of the file has 32 or more blocks free, even
> without the bigalloc behavior, those blocks will be used.
For what behavior -- appending?
> Also, I think the bigalloc behavior just ultimately ends up introducing
> even more fragmentation on an already fragmented file system. It'll keep
> contiguous chunks together, but those chunks can end up being spread all
> over the disk.
This sounds like the NTFS strategy, which was basically to allow all
hell to break loose -- above a certain chunk size. Keep chunks of a
certain size contiguous, and you limit the number of seeks by quite a lot.
David Masover wrote:
> Jeff Mahoney wrote:
>> The problem is that finding the window isn't really a direct function of
>> free space, it's a function of fragmentation. You could have a 50% full
>> file system that still can't find a 32 block window by having every
>> other block used. I know it's an extremely unlikely case, but it
>> demonstrates the point perfectly.
>
> Maybe, but it's still not a counterpoint. No matter how fragmented a
> filesystem is, freeing space can open up contiguous space, whereas if
> space is not freed, you won't open up contiguous space.
>
> Thus, if your FS is 50% full and 100% fragmented, then you wait till
> space is freed, because if nothing happens, or if more space is filled
> in, you'll have the same problem at 60% as you did at 50%. If,
> however, you're at 60% full, and 10% of the space is freed, then it's
> fairly unlikely that you still don't have contiguous space, and it's
> worth it to scan once more at 50%, and again if it then drops to 40%.
>
> So, if your FS is 90% full and space is being freed, I'd think it would
> be worth it to scan again at 80%, 70%, and so on. I'd also imagine it
> would do little or nothing to constantly monitor an FS that stays mostly
> full -- maybe give it a certain amount of time, but if we're repacking
> anyway, just wait for a repacker run. It seems very unlikely that
> between repacker runs, activity between 86% and 94% would open up
> contiguous space.
>
> It's still not a direct function of freed space (as opposed to free
> space), but it starts to look better.
>
> I'm not endorsing one way or the other without benchmarks, though.
I'd like to see benchmarks too. The goal is obviously to minimize seeks,
but my feeling is that blocks that aren't entirely contiguous, yet close
enough together that they all end up in the drive's cache anyway, will
perform better than 128k chunks spread all over the disk.
Your solution is one possible approach, but I'd rather kill off bigalloc
for reasons described below.
Also, for clarification, the 128k I keep quoting is just what
reiserfs_file_write() breaks larger writes into. It seems MythTV writes
in large chunks (go figure, it's a streaming media application ;), so
those get split up. Smaller writes go to the allocator with a request
for however many blocks they actually need.
reiserfs_{writepage,prepare_write,commit_write} all operate on one page
(and so one block, usually) at a time.
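For concreteness: with 4k blocks, that 128k chunking works out like
this (toy userspace illustration, not the actual reiserfs_file_write()
code):

#include <stdio.h>

#define BLOCK_SIZE      4096                    /* assuming 4k blocks */
#define WRITE_CHUNK     (128 * 1024)            /* reiserfs_file_write() chunking */

int main(void)
{
        size_t write_len = 1024 * 1024;         /* e.g. a 1MB streaming write */
        size_t off;

        for (off = 0; off < write_len; off += WRITE_CHUNK) {
                size_t chunk = write_len - off < WRITE_CHUNK ?
                               write_len - off : WRITE_CHUNK;
                printf("allocator request: %zu blocks\n", chunk / BLOCK_SIZE);
        }
        return 0;
}

Each of those 32-block requests is what used to trigger a full bigalloc
window search.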
>>>> In the end, finding a contiguous window for all the blocks in a
>>>> write is
>>>> an advantageous special case, but one that can be found naturally when
>>>> such a window exists anyway.
>>> Hmm. Ok, I don't understand how this works, so I'll shut up.
>>
>> If the space after the end of the file has 32 or more blocks free, even
>> without the bigalloc behavior, those blocks will be used.
>
> For what behavior -- appending?
For any allocation after the first one. The allocator chooses a starting
position based on the last block it knows about before the position of
the write. This applies for both appends and sparse files.
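Very roughly (the helper name below is made up; the real decision is
made in determine_search_start() and considers more than this):

static unsigned long pick_search_start(unsigned long last_known_block,
                                       unsigned long default_start)
{
        /* start just past the last block we already know about, if any */
        return last_known_block ? last_known_block + 1 : default_start;
}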
>> Also, I think the bigalloc behavior just ultimately ends up introducing
>> even more fragmentation on an already fragmented file system. It'll keep
>> contiguous chunks together, but those chunks can end up being spread all
>> over the disk.
>
> This sounds like the NTFS strategy, which was basically to allow all
> hell to break loose -- above a certain chunk size. Keep chunks of a
> certain size contiguous, and you limit the number of seeks by quite a lot.
The bigalloc behavior ends up reducing local fragmentation at the
expense of global fragmentation. The free space of the test file system
that prompted this patch was *loaded* with 31-block chunks. All of these
were skipped until we backed off and searched for single-block chunks -
or worse, the nearby chunks were ignored in favor of a contiguous chunk
elsewhere. I don't think this is ideal behavior at all. Certainly it's
better to have a contiguous chunk of 63 blocks and one block elsewhere
than two 32-block chunks in different parts of the disk.
That lone block might only be a few blocks away and in the disk's cache
already, but bigalloc doesn't take that into account either. The start
of the allocation could be at the end of a bitmap group, leaving empty
space where we naturally should have just grown the file.
Without bigalloc, we still end up getting as many blocks together as we
can in a particular bitmap before moving on to another one. It will
group as many free blocks together as it can, and then try to find the
next window. Bigalloc just meant that two windows of 16 blocks, a block
apart, weren't good enough. Once it's time to move on to another bitmap,
the skip_busy behavior (enabled by default) will search for bitmap
groups that are at least 10% free until the file system is 95% full[1].
We're already seeking anyway so this gives us the best chance of finding
a group with room to grow. It also leaves room in bitmaps for existing
files to grow, avoiding fragmentation there as well. It could stand to
be a bit smarter though, perhaps taking into account its proximity to a
neighboring bitmap group in making that determination.
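As a rough sketch of that skip_busy decision -- made-up names and
illustrative thresholds, not the actual scan loop:

#include <stdio.h>

struct bitmap_group {
        unsigned int free_blocks;
        unsigned int total_blocks;
};

static int group_worth_searching(const struct bitmap_group *bm,
                                 unsigned long fs_free, unsigned long fs_total)
{
        /* once the fs is ~95% full, take whatever group still has room */
        if (fs_free * 20 < fs_total)
                return bm->free_blocks > 0;
        /* otherwise insist on a group that is at least 10% free */
        return bm->free_blocks * 10 >= bm->total_blocks;
}

int main(void)
{
        struct bitmap_group busy = { 100, 32768 }, roomy = { 4000, 32768 };

        printf("busy group:  %d\n", group_worth_searching(&busy, 500000, 1000000));
        printf("roomy group: %d\n", group_worth_searching(&roomy, 500000, 1000000));
        return 0;
}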
-Jeff
[1]: The comment, though, says 80%. One or the other is a bug. Mea culpa.
--
Jeff Mahoney
SUSE Labs
Jeff Mahoney wrote:
>
>
> Also, I think the bigalloc behavior just ultimately ends up introducing
> even more fragmentation on an already fragmented file system. It'll keep
> contiguous chunks together, but those chunks can end up being spread all
> over the disk.
>
> -Jeff
>
Yes, and almost as important, it makes it difficult to understand and
predict the allocator, which means other optimizations become harder to do.
I can see that the bigalloc code as-is is pretty sad, but this is a scary
patch. It has the potential to cause people significant performance
problems way, way ahead in the future.
For example, suppose userspace is growing two files concurrently. It could
be that the bigalloc code would cause one file's allocation cursor to
repeatedly jump far away from the second. ie: a beneficial side-effect.
Without bigalloc that side-effect is lost and the two files' blocks end up
all intermingled.
I don't know if that scenario is realistic, but I bet there are similar
accidental oddities which can occur as a result of this change.
But what are they?
Andrew Morton wrote:
> I can see that the bigalloc code as-is is pretty sad, but this is a scary
> patch. It has the potential to cause people significant performance
> problems way, way ahead in the future.
>
> For example, suppose userspace is growing two files concurrently. It could
> be that the bigalloc code would cause one file's allocation cursor to
> repeatedly jump far away from the second. ie: a beneficial side-effect.
> Without bigalloc that side-effect is lost and the two files' blocks end up
> all intermingled.
>
> I don't know if that scenario is realistic, but I bet there are similar
> accidental oddities which can occur as a result of this change.
>
> But what are they?
Bigalloc doesn't cause that effect one way or the other. You'll end up
with blocks still intermingled, just in 32-block[1] chunks. It doesn't
throw the cursor way out, just until the next 32-block free window.
Another thread writing will do the same thing, and the blocks can end up
getting intermingled in the same manner on a different part of the disk.
The behavior you're describing can only be caused by bad hinting: Two
files that are placed too close to each other. This patch changes the
part of the allocator that is *only* responsible for finding the free
bits. Where it should start looking for them is a decision made earlier
in determine_search_start().
This patch just reverts the change that Chris and I submitted ages ago
as part of a number of block allocator enhancements, not as a bug fix. I
think I traced it to the 2.5 days, but I can't find that particular
email. Neither of us anticipated the problem that MythTV users are
hitting with it. Reverting it just makes that part of the allocator
behave similarly to the ext[23] allocator, which simply collects
available blocks from a starting point. For everyday use, I don't think
performance should be terribly affected, and it definitely fixes the
pathological case that the MythTV users were seeing.
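In toy form, the behavior after the revert is basically "walk forward
from the hint and take what's free" (illustration only, not the real
allocator code):

#include <stdio.h>

#define NBLOCKS 64

static int collect_blocks(const char *used, int start, int want, int *out)
{
        int i, got = 0;

        /* take free blocks one at a time, walking forward from the hint */
        for (i = start; i < NBLOCKS && got < want; i++)
                if (!used[i])
                        out[got++] = i;
        return got;
}

int main(void)
{
        char used[NBLOCKS] = { 0 };
        int found[8], n, i;

        for (i = 0; i < NBLOCKS; i += 3)        /* sprinkle in some used blocks */
                used[i] = 1;

        n = collect_blocks(used, 10, 8, found);
        printf("allocated %d blocks:", n);
        for (i = 0; i < n; i++)
                printf(" %d", found[i]);
        printf("\n");
        return 0;
}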
-Jeff
[1]: For simplicity, I'll continue to reference 32 blocks as the chunk
size. In reality, it can be anything up to 32 blocks.
--
Jeff Mahoney
SUSE Labs
On 17:11 Tue 22 Aug, Andrew Morton wrote:
>
> I can see that the bigalloc code as-is is pretty sad, but this is a scary
> patch. It has the potential to cause people significant performance
> problems way, way ahead in the future.
>
> For example, suppose userspace is growing two files concurrently. It could
> be that the bigalloc code would cause one file's allocation cursor to
> repeatedly jump far away from the second. ie: a beneficial side-effect.
> Without bigalloc that side-effect is lost and the two files' blocks end up
> all intermingled.
Perhaps I misremember, but shouldn't delayed writes (or something along
those lines) make the case where two files are being written
incrementally a rare one? It seems that this case would only occur if
two processes were writing two files in small chunks and calling fsync
constantly (*cough* evolution column resizing bug *cough*), PLUS the two
writes would have to be at the same (or nearly the same) offset.
It seems that the risk of fragmentation is a lesser danger than the
near lock-up of the whole system caused by the old behavior.
--Clay
>
> I don't know if that scenario is realistic, but I bet there are similar
> accidental oddities which can occur as a result of this change.
>
> But what are they?
On Tue, 22 Aug 2006 19:46:34 -0700
Clay Barnes <[email protected]> wrote:
> Perhaps I mis-recall, but shouldn't delayed writes (or something along
> those lines) cause a case where two files are being incrementally
> written rare?
If we did delayed allocation, yes. But we generally don't. (Exceptions:
XFS, reiser4, ext4, ext2 prototype circa 2001).