I can trigger latencies up to ~1.1 ms with a CVS checkout. It looks
like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
loop:
ext3_test_allocatable (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)
ext3_test_allocatable (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)
etc.
Lee
* Lee Revell <[email protected]> wrote:
> I can trigger latencies up to ~1.1 ms with a CVS checkout. It looks
> like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> loop:
>
> ext3_test_allocatable (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
>
> ext3_test_allocatable (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
Breaking the lock is not really possible at that point, and it doesnt
look too easy to make that path preemptable either. (To make it
preemptable rsv_lock would need to become a semaphore (this could be
fine, as it's only used when a new reservation window is created).)
The hard part is the seqlock - the read side is performance-critical,
maybe it could be solved via a preemptable but still scalable seqlock
variant that uses a semaphore for the write side? It all depends on what
the scalability impact of using a semaphore for the new-window code
would be.
the best longterm solution for these types of tradeoffs seems to be to
add a locking primitive that is a spinlock on !PREEMPT kernels and a
semaphore on PREEMPT kernels. I.e. not as drastic as a full PREEMPT_RT
kernel, but good enough to make latency-critical codepaths of ext3
preemptable, without having to hurt scalability on !PREEMPT. The
PREEMPT_RT kernel has all the 'compile-time type-switching'
infrastructure for such tricks, all that needs to be changed to switch a
lock's type is to change the spinlock definition - all the
spin_lock(&lock) uses can remain unchanged. (The same method is used on
PREEMPT_RT to have 'dual-type' spinlocks.)
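For illustration, a minimal sketch of such a compile-time switched
primitive (the auto_lock_* names are hypothetical; PREEMPT_RT actually
switches the type behind the unchanged spin_lock() API rather than
introducing new names):

        #ifdef CONFIG_PREEMPT
        typedef struct semaphore auto_lock_t;           /* sleeping lock */
        #define auto_lock_init(l)       init_MUTEX(l)
        #define auto_lock(l)            down(l)
        #define auto_unlock(l)          up(l)
        #else
        typedef spinlock_t auto_lock_t;                 /* plain spinlock */
        #define auto_lock_init(l)       spin_lock_init(l)
        #define auto_lock(l)            spin_lock(l)
        #define auto_unlock(l)          spin_unlock(l)
        #endif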
the same thing could then also be used for things like the mm lock, and
other longer-held locks that PREEMPT would like to see preemptable. It
would also be a good first step towards merging the PREEMPT_RT
infrastructure ;-) I'll cook up something.
Ingo
On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > I can trigger latencies up to ~1.1 ms with a CVS checkout. It looks
> > like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> > loop:
> >
We have not modified the reservation creation algorithm for a long time.
Sorry, I missed the background here: on which kernel did you see this
latency, and what tests were you running?
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> >
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
>
The loop in bitmap_search_next_usable_block() tries to find the next zero
bit on the bitmap; if it finds one, it then needs to check whether that
bit is also zero in the journal's last-committed copy of the bitmap. The
double check is there to prevent re-using a just-freed block on disk
before the free is committed in the journal. If the bit is not marked
free in the committed copy, it will find the next free bit on that copy
first, then loop back, and so on.
The bitmap_search_next_usable_block() code is quite simple:
static int
bitmap_search_next_usable_block(int start, struct buffer_head *bh,
                                int maxblocks)
{
        int next;
        struct journal_head *jh = bh2jh(bh);

        /*
         * The bitmap search --- search forward alternately through the actual
         * bitmap and the last-committed copy until we find a bit free in
         * both
         */
        while (start < maxblocks) {
                next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
                if (next >= maxblocks)
                        return -1;
                if (ext3_test_allocatable(next, bh))
                        return next;

                jbd_lock_bh_state(bh);
                if (jh->b_committed_data)
                        start = ext3_find_next_zero_bit(jh->b_committed_data,
                                                        maxblocks, next);
                jbd_unlock_bh_state(bh);
        }
        return -1;
}
The latency report shows the pattern of two calls to
ext3_find_next_zero_bit() after each call to ext3_test_allocatable(),
which matches what I suspect above.
I wonder if re-running the test with the -o noreservation mount option
makes any difference?
> Breaking the lock is not really possible at that point, and it doesnt
> look too easy to make that path preemptable either. (To make it
> preemptable rsv_lock would need to become a semaphore (this could be
> fine, as it's only used when a new reservation window is created).)
>
> The hard part is the seqlock - the read side is performance-critical,
> maybe it could be solved via a preemptable but still scalable seqlock
> variant that uses a semaphore for the write side? It all depends on what
> the scalability impact of using a semaphore for the new-window code
> would be.
>
The seqlock was there to prevent an SMP race, but it actually turned out
to be unnecessary and has been removed recently. Even if it were still
there, it would not cause much latency here: the whole of
ext3_try_to_allocate_with_rsv() is guarded by ei->truncate_sem, so there
are no multiple readers/writers at the same time.
On Mon, 2005-04-04 at 23:10 -0700, Mingming Cao wrote:
> On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> > * Lee Revell <[email protected]> wrote:
> >
> > > I can trigger latencies up to ~1.1 ms with a CVS checkout. It looks
> > > like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> > > loop:
> > >
>
> We have not modified the reservation creation algorithm for a long time.
> Sorry, I missed the background here: on which kernel did you see this
> latency, and what tests were you running?
Makes sense; I have been seeing this one for a long time. I get it with
dbench, or when doing a CVS checkout of a large module.
The kernel is 2.6.12-rc1-RT-V0.7.43-05, with PREEMPT_DESKTOP, which AFAIK
should be equivalent to mainline.
Lee
On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > I can trigger latencies up to ~1.1 ms with a CVS checkout. It looks
> > like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> > loop:
> >
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> >
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
>
> Breaking the lock is not really possible at that point, and it doesnt
> look too easy to make that path preemptable either. (To make it
> preemptable rsv_lock would need to become a semaphore (this could be
> fine, as it's only used when a new reservation window is created).)
It seems we are holding the rsv_lock while searching the bitmap for a
free bit. In alloc_new_reservation(), we first find an available space in
which to create a reservation window, then we check the bitmap to see if
it contains any free block. If not, we search for the next available
window, and so on. During the whole process we are holding the global
rsv_lock. We could, and probably should, avoid that: just unlock the
rsv_lock before the bitmap search and re-grab it afterwards. We then need
to make sure the space is still available after we re-grab the lock.
Another option is to hold that available window before we release the
rsv_lock, and if there is no free bit inside that window, we will remove
it from the tree in the next round of searching for next available
window.
I prefer the second option, and plan to code it up soon. Any comments?
Thanks,
Mingming
Hi,
On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:
> It seems we are holding the rsv_lock while searching the bitmap for a
> free bit.
Probably something to avoid!
> In alloc_new_reservation(), we first find a available to
> create a reservation window, then we check the bitmap to see if it
> contains any free block. If not, we will search for next available
> window, so on and on. During the whole process we are holding the global
> rsv_lock. We could, and probably should, avoid that. Just unlock the
> rsv_lock before the bitmap search and re-grab it after it. We need to
> make sure that the available space that are still available after we re-
> grab the lock.
Not necessarily. As long as the windows remain mutually exclusive in
the rbtree, it doesn't matter too much if we occasionally allocate a
full one --- as long as that case is rare, the worst that happens is
that we fail to allocate from the window and have to repeat the window
reserve.
The difficulty will be in keeping it rare. What we need to avoid is the
case where multiple tasks need a new window, they all drop the lock,
find the same bits free in the bitmap, then all try to take that
window. One will succeed, the others will fail; but as the files in
that bit of the disk continue to grow, we risk those processes
*continually* repeating the same stomp-on-each-others'-windows operation
and raising the allocation overhead significantly.
> Another option is to hold that available window before we release the
> rsv_lock, and if there is no free bit inside that window, we will remove
> it from the tree in the next round of searching for next available
> window.
Possible, but not necessarily nice. If you've got a nearly-full disk,
most bits will be already allocated. As you scan the bitmaps, it may
take quite a while to find a free bit; do you really want to (a) lock
the whole block group with a temporary window just to do the scan, or
(b) keep allocating multiple smaller windows until you finally find a
free bit? The former is bad for concurrency if you have multiple tasks
trying to allocate nearby on disk --- you'll force them into different
block groups. The latter is high overhead.
--Stephen
On Wed, 2005-04-06 at 10:51 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:
>
> > It seems we are holding the rsv_lock while searching the bitmap for a
> > free bit.
>
> Probably something to avoid!
>
> > In alloc_new_reservation(), we first find a available to
> > create a reservation window, then we check the bitmap to see if it
> > contains any free block. If not, we will search for next available
> > window, so on and on. During the whole process we are holding the global
> > rsv_lock. We could, and probably should, avoid that. Just unlock the
> > rsv_lock before the bitmap search and re-grab it after it. We need to
> > make sure that the available space that are still available after we re-
> > grab the lock.
>
> Not necessarily. As long as the windows remain mutually exclusive in
> the rbtree, it doesn't matter too much if we occasionally allocate a
> full one --- as long as that case is rare, the worst that happens is
> that we fail to allocate from the window and have to repeat the window
> reserve.
>
> The difficulty will be in keeping it rare. What we need to avoid is the
> case where multiple tasks need a new window, they all drop the lock,
> find the same bits free in the bitmap, then all try to take that
> window. One will succeed, the others will fail; but as the files in
> that bit of the disk continue to grow, we risk those processes
> *continually* repeating the same stomp-on-each-others'-windows operation
> and raising the allocation overhead significantly.
>
Agreed. That's why the second option is preferred. When a file finds a
window open to book, it books it, then drops the lock and does the bitmap
search. So other files will not attempt to book the same window.
> > Another option is to hold that available window before we release the
> > rsv_lock, and if there is no free bit inside that window, we will remove
> > it from the tree in the next round of searching for next available
> > window.
>
> Possible, but not necessarily nice. If you've got a nearly-full disk,
> most bits will be already allocated. As you scan the bitmaps, it may
> take quite a while to find a free bit; do you really want to (a) lock
> the whole block group with a temporary window just to do the scan, or
> (b) keep allocating multiple smaller windows until you finally find a
> free bit? The former is bad for concurrency if you have multiple tasks
> trying to allocate nearby on disk --- you'll force them into different
> block groups. The latter is high overhead.
>
I do not quite understand what you mean by (a). In this proposal, we
will drop the lock before the scan.
And for (b), maybe I did not make myself clear: I am not proposing to
keep allocating multiple smaller windows until we finally find a free
bit. I mean, we book the window (just link the node into the tree) before
we drop the lock; if there is no free bit inside that window, we go back
and search for another window (call find_next_reservable_window()), which
will remove the temporary window we just created and find the next one.
So we only have one temporary window at a time. In the case where we do
find a free block inside the temporarily created window, we make it
official: we do the rb tree coloring/rotation stuff then.
Does it make sense? Let me know, thanks.
Maybe I should post my proposed patch here. (I am testing it now.)
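In outline, the new alloc_new_reservation() flow in the patch is roughly
the following (a simplified sketch using the names from the diff below;
the retry and window-shrinking details are omitted):

        spin_lock(rsv_lock);
        /* find a candidate range and "book" it: link my_rsv into the rb
         * tree, but without the rb_insert_color() rebalancing yet */
        ret = find_next_reservable_window(search_head, my_rsv, sb,
                                          start_block, group_end_block);
        if (ret == -1)
                goto failed;
        spin_unlock(rsv_lock);

        /* scan the block bitmap with no lock held */
        first_free_block = bitmap_search_next_usable_block(
                        my_rsv->rsv_start - group_first_block,
                        bitmap_bh, group_end_block - group_first_block + 1);

        spin_lock(rsv_lock);
        if (first_free_block < 0)
                goto failed;    /* the booked window gets cleaned up */
        /* a free bit lies inside the booked window: make it official */
        rb_insert_color(&my_rsv->rsv_node, fs_rsv_root);
        spin_unlock(rsv_lock);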
---
linux-2.6.11-ming/fs/ext3/balloc.c | 107 +++++++++++++++++++++++--------------
1 files changed, 68 insertions(+), 39 deletions(-)
diff -puN fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency fs/ext3/balloc.c
--- linux-2.6.11/fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency 2005-04-05 10:48:31.000000000 -0700
+++ linux-2.6.11-ming/fs/ext3/balloc.c 2005-04-06 09:24:24.203028968 -0700
@@ -241,8 +241,13 @@ void ext3_rsv_window_add(struct super_bl
BUG();
}
+ /*
+ * we only link the node into the rb tree here
+ * the real color balancing will be done
+ * after we confirm the window has some free
+ * block.
+ */
rb_link_node(node, parent, p);
- rb_insert_color(node, root);
}
static void rsv_window_remove(struct super_block *sb,
@@ -749,24 +754,24 @@ fail_access:
* to find a free region that is of my size and has not
* been reserved.
*
- * on succeed, it returns the reservation window to be appended to.
- * failed, return NULL.
*/
-static struct ext3_reserve_window_node *find_next_reservable_window(
+static int find_next_reservable_window(
struct ext3_reserve_window_node *search_head,
- unsigned long size, int *start_block,
+ struct ext3_reserve_window_node *my_rsv,
+ struct super_block * sb, int start_block,
int last_block)
{
struct rb_node *next;
struct ext3_reserve_window_node *rsv, *prev;
int cur;
+ int size = my_rsv->rsv_goal_size;
/* TODO: make the start of the reservation window byte-aligned */
/* cur = *start_block & ~7;*/
- cur = *start_block;
+ cur = start_block;
rsv = search_head;
if (!rsv)
- return NULL;
+ return -1;
while (1) {
if (cur <= rsv->rsv_end)
@@ -782,7 +787,7 @@ static struct ext3_reserve_window_node *
* space with expected-size (or more)...
*/
if (cur > last_block)
- return NULL; /* fail */
+ return -1; /* fail */
prev = rsv;
next = rb_next(&rsv->rsv_node);
@@ -813,8 +818,29 @@ static struct ext3_reserve_window_node *
* return the reservation window that we could append to.
* succeed.
*/
- *start_block = cur;
- return prev;
+
+ if ((prev != my_rsv) && (!rsv_is_empty(&my_rsv->rsv_window)))
+ rsv_window_remove(sb, my_rsv);
+
+ /* let's book the whole available window for now
+ * we will check the
+ * disk bitmap later and then, if there are free blocks,
+ * we adjust the window size if it's
+ * larger than requested.
+ * Otherwise, we will remove this node from the tree the next time
+ * we call find_next_reservable_window.
+ */
+ my_rsv->rsv_start = cur;
+ if (rsv)
+ my_rsv->rsv_end = rsv->rsv_start - 1;
+ else
+ my_rsv->rsv_end = last_block;
+ my_rsv->rsv_alloc_hit = 0;
+
+ if (prev != my_rsv)
+ ext3_rsv_window_add(sb, my_rsv);
+
+ return 0;
}
/**
@@ -852,6 +878,7 @@ static struct ext3_reserve_window_node *
* @sb: the super block
* @group: the group we are trying to allocate in
* @bitmap_bh: the block group block bitmap
+ *
*/
static int alloc_new_reservation(struct ext3_reserve_window_node *my_rsv,
int goal, struct super_block *sb,
@@ -860,10 +887,10 @@ static int alloc_new_reservation(struct
struct ext3_reserve_window_node *search_head;
int group_first_block, group_end_block, start_block;
int first_free_block;
- int reservable_space_start;
- struct ext3_reserve_window_node *prev_rsv;
struct rb_root *fs_rsv_root = &EXT3_SB(sb)->s_rsv_window_root;
unsigned long size;
+ int ret;
+ spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
group * EXT3_BLOCKS_PER_GROUP(sb);
@@ -875,6 +902,8 @@ static int alloc_new_reservation(struct
start_block = goal + group_first_block;
size = my_rsv->rsv_goal_size;
+
+ spin_lock(rsv_lock);
if (!rsv_is_empty(&my_rsv->rsv_window)) {
/*
* if the old reservation is cross group boundary
@@ -908,6 +937,7 @@ static int alloc_new_reservation(struct
my_rsv->rsv_goal_size= size;
}
}
+
/*
* shift the search start to the window near the goal block
*/
@@ -921,11 +951,11 @@ static int alloc_new_reservation(struct
* need to check the bitmap after we found a reservable window.
*/
retry:
- prev_rsv = find_next_reservable_window(search_head, size,
- &start_block, group_end_block);
- if (prev_rsv == NULL)
+ ret = find_next_reservable_window(search_head, my_rsv, sb,
+ start_block, group_end_block);
+ if (ret == -1)
goto failed;
- reservable_space_start = start_block;
+
/*
* On success, find_next_reservable_window() returns the
* reservation window where there is a reservable space after it.
@@ -937,10 +967,13 @@ retry:
* block. Search start from the start block of the reservable space
* we just found.
*/
+ spin_unlock(rsv_lock);
first_free_block = bitmap_search_next_usable_block(
- reservable_space_start - group_first_block,
+ my_rsv->rsv_start - group_first_block,
bitmap_bh, group_end_block - group_first_block + 1);
+ spin_lock(rsv_lock);
+
if (first_free_block < 0) {
/*
* no free block left on the bitmap, no point
@@ -948,13 +981,20 @@ retry:
*/
goto failed;
}
+ /*
+ * it is possible someone else freed the window while we were
+ * checking the bitmap
+ */
+ if (rsv_is_empty(&my_rsv->rsv_window))
+ goto retry;
+
start_block = first_free_block + group_first_block;
/*
* check if the first free block is within the
* free space we just found
*/
- if ((start_block >= reservable_space_start) &&
- (start_block < reservable_space_start + size))
+ if ((start_block >= my_rsv->rsv_start) &&
+ (start_block < my_rsv->rsv_end))
goto found_rsv_window;
/*
* if the first free bit we found is out of the reservable space
@@ -963,27 +1003,19 @@ retry:
* start from where the free block is,
* we also shift the list head to where we stopped last time
*/
- search_head = prev_rsv;
+ search_head = my_rsv;
goto retry;
found_rsv_window:
/*
* great! the reservable space contains some free blocks.
- * if the search returns that we should add the new
- * window just next to where the old window, we don't
- * need to remove the old window first then add it to the
- * same place, just update the new start and new end.
- */
- if (my_rsv != prev_rsv) {
- if (!rsv_is_empty(&my_rsv->rsv_window))
- rsv_window_remove(sb, my_rsv);
- }
- my_rsv->rsv_start = reservable_space_start;
- my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
- my_rsv->rsv_alloc_hit = 0;
- if (my_rsv != prev_rsv) {
- ext3_rsv_window_add(sb, my_rsv);
- }
+ */
+ my_rsv->rsv_start = start_block;
+ if ((my_rsv->rsv_end) > (start_block + size -1 ))
+ my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
+
+ rb_insert_color(&my_rsv->rsv_node, fs_rsv_root);
+ spin_unlock(rsv_lock);
return 0; /* succeed */
failed:
/*
@@ -993,6 +1025,7 @@ failed:
*/
if (!rsv_is_empty(&my_rsv->rsv_window))
rsv_window_remove(sb, my_rsv);
+ spin_unlock(rsv_lock);
return -1; /* failed */
}
@@ -1023,7 +1056,6 @@ ext3_try_to_allocate_with_rsv(struct sup
int goal, struct ext3_reserve_window_node * my_rsv,
int *errp)
{
- spinlock_t *rsv_lock;
unsigned long group_first_block;
int ret = 0;
int fatal;
@@ -1052,7 +1084,6 @@ ext3_try_to_allocate_with_rsv(struct sup
ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
goto out;
}
- rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
/*
* goal is a group relative block number (if there is a goal)
* 0 < goal < EXT3_BLOCKS_PER_GROUP(sb)
@@ -1085,12 +1116,10 @@ ext3_try_to_allocate_with_rsv(struct sup
if (rsv_is_empty(&rsv_copy) || (ret < 0) ||
!goal_in_my_reservation(&rsv_copy, goal, group, sb)) {
- spin_lock(rsv_lock);
ret = alloc_new_reservation(my_rsv, goal, sb,
group, bitmap_bh);
rsv_copy._rsv_start = my_rsv->rsv_start;
rsv_copy._rsv_end = my_rsv->rsv_end;
- spin_unlock(rsv_lock);
if (ret < 0)
break; /* failed */
_
Hi,
On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:
> > Possible, but not necessarily nice. If you've got a nearly-full disk,
> > most bits will be already allocated. As you scan the bitmaps, it may
> > take quite a while to find a free bit; do you really want to (a) lock
> > the whole block group with a temporary window just to do the scan, or
> > (b) keep allocating multiple smaller windows until you finally find a
> > free bit? The former is bad for concurrency if you have multiple tasks
> > trying to allocate nearby on disk --- you'll force them into different
> > block groups. The latter is high overhead.
> I am not quite understand what you mean about (a). In this proposal, we
> will drop the lock before the scan.
s/lock/reserve/.
> And for (b), maybe I did not make myself clear: I am not proposing to
> keeping allocating multiple smaller windows until finally find a free
> bit. I mean, we book the window(just link the node into the tree) before
> we drop the lock, if there is no free bit inside that window, we will go
> back search for another window(call find_next_reserveable_window()),
> inside it, we will remove the temporary window we just created and find
> next window. SO we only have one temporary window at a time.
And that's the problem. Either we create small temporary windows, in
which case we may end up thrashing through vast numbers of them before
we find a bit that's available --- very expensive as the disk gets full
--- or we use large windows but get worse layout when there are parallel
allocators going on.
--Stephen
On Wed, 2005-04-06 at 19:22 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:
>
> > > Possible, but not necessarily nice. If you've got a nearly-full disk,
> > > most bits will be already allocated. As you scan the bitmaps, it may
> > > take quite a while to find a free bit; do you really want to (a) lock
> > > the whole block group with a temporary window just to do the scan, or
> > > (b) keep allocating multiple smaller windows until you finally find a
> > > free bit? The former is bad for concurrency if you have multiple tasks
> > > trying to allocate nearby on disk --- you'll force them into different
> > > block groups. The latter is high overhead.
>
> > I am not quite understand what you mean about (a). In this proposal, we
> > will drop the lock before the scan.
>
> s/lock/reserve/.
>
> > And for (b), maybe I did not make myself clear: I am not proposing to
> > keeping allocating multiple smaller windows until finally find a free
> > bit. I mean, we book the window(just link the node into the tree) before
> > we drop the lock, if there is no free bit inside that window, we will go
> > back search for another window(call find_next_reserveable_window()),
> > inside it, we will remove the temporary window we just created and find
> > next window. SO we only have one temporary window at a time.
>
> And that's the problem. Either we create small temporary windows, in
> which case we may end up thrashing through vast numbers of them before
> we find a bit that's available --- very expensive as the disk gets full
> --- or we use large windows but get worse layout when there are parallel
> allocators going on.
>
Ah... I see your points. (a) is certainly not a good option. (b) is not
very nice, but not necessarily that bad when the disk is really full: we
are not only scanning the bitmap within the reserved window range;
instead, we scan the bitmap starting from the reserved window's start
block up to the last block of the group to find the next free bit. So in
the case where the group is really full, we could reduce the number of
small windows to try.
Mingming
* Mingming Cao <[email protected]> wrote:
> It seems we are holding the rsv_lock while searching the bitmap for a
> free bit. In alloc_new_reservation(), we first find a available to
> create a reservation window, then we check the bitmap to see if it
> contains any free block. If not, we will search for next available
> window, so on and on. During the whole process we are holding the
> global rsv_lock. We could, and probably should, avoid that. Just
> unlock the rsv_lock before the bitmap search and re-grab it after it.
> We need to make sure that the available space that are still available
> after we re- grab the lock.
>
> Another option is to hold that available window before we release the
> rsv_lock, and if there is no free bit inside that window, we will
> remove it from the tree in the next round of searching for next
> available window.
>
> I prefer the second option, and plan to code it up soon. Any comments?
doesnt the first option also allow searches to be in parallel? This
particular one took over 1 msec, so it seems there's a fair room for
parallellizing multiple writers and thus improving scalability. (or is
this codepath serialized globally anyway?)
Ingo
Hi,
On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:
> doesnt the first option also allow searches to be in parallel?
In terms of CPU usage, yes. But either we use large windows, in which
case we *can't* search remotely near areas of the disk in parallel; or
we use small windows, in which case failure to find a free bit becomes
disproportionately expensive (we have to do an rbtree link and unlink
for each window we search.)
> This
> particular one took over 1 msec, so it seems there's a fair room for
> parallellizing multiple writers and thus improving scalability. (or is
> this codepath serialized globally anyway?)
No, the rbtree ops are serialised per-sb, and allocations within any one
inode are serialised, but the bitmap searches need not be serialised
globally. (They are in the case of new window searches right now, which
is what we're trying to fix.)
--Stephen
On Thu, 2005-04-07 at 14:08 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:
>
> > doesnt the first option also allow searches to be in parallel?
>
> In terms of CPU usage, yes. But either we use large windows, in which
> case we *can't* search remotely near areas of the disk in parallel; or
> we use small windows, in which case failure to find a free bit becomes
> disproportionately expensive (we have to do an rbtree link and unlink
> for each window we search.)
>
I was thinking that at the end of find_next_reservable_window(), since
we already know the previous node in the rbtree, we could just link the
window there; that way the rbtree link cost would be minimized. Then I
realized that in an rbtree the previous node is not necessarily the
parent node, so we still have to search part of the rbtree to find the
parent, which is really expensive when the insertion operations are very
frequent. Hmmm...
> > This particular one took over 1 msec,
If you look at the pattern in the latency report:
.........
cvs-14981 0.n.2 1044us : find_next_zero_bit (bitmap_search_next_usable_block)
cvs-14981 0.n.2 1045us : ext3_test_allocatable (bitmap_search_next_usable_block)
cvs-14981 0.n.2 1045us : find_next_zero_bit (bitmap_search_next_usable_block)
cvs-14981 0.n.2 1046us : find_next_zero_bit (bitmap_search_next_usable_block)
cvs-14981 0.n.2 1047us : ext3_test_allocatable (bitmap_search_next_usable_block)
cvs-14981 0.n.2 1048us : find_next_zero_bit (bitmap_search_next_usable_block)
cvs-14981 0.n.2 1049us : find_next_zero_bit (bitmap_search_next_usable_block)
cvs-14981 0.n.2 1049us : ext3_test_allocatable (bitmap_search_next_usable_block)
........
The pattern is: two calls to find_next_zero_bit() followed by an
ext3_test_allocatable(). And we have about 440 iterations in this report.
That means we are really unlucky: we checked the on-disk bitmap and found
a free bit, but it's not free in the journal copy; then we check the
journal copy, find a free bit, but it's not free on disk... I could
imagine needing to check the journal copy once or twice (to prevent
re-using a just-freed bit before the free is committed), but how could
this many (440) iterations happen?
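(For scale: about 440 iterations in the ~1.1 ms trace works out to
roughly 2.5 us per iteration, which is consistent with the 1-2 us spacing
of the trace entries quoted above.)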
Mingming
On Thu, 2005-04-07 at 14:08 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:
>
> > doesnt the first option also allow searches to be in parallel?
>
> In terms of CPU usage, yes. But either we use large windows, in which
> case we *can't* search remotely near areas of the disk in parallel; or
> we use small windows, in which case failure to find a free bit becomes
> disproportionately expensive (we have to do an rbtree link and unlink
> for each window we search.)
Actually, we do not have to do an rbtree link and unlink for every
window we search. If the old reserved window has no free bit and the new
reservable window is right after the old one, there is no need to unlink
the old window from the rbtree and then link the new window; we just
update the start and end of the old one. That's in the current design
already, to reduce the cost of rbtree link and unlink.
Mingming
Hi,
On Fri, 2005-04-08 at 00:37, Mingming Cao wrote:
> Actually, we do not have to do an rbtree link and unlink for every
> window we search. If the reserved window(old) has no free bit and the
> new reservable window's is right after the old one, no need to unlink
> the old window from the rbtree and then link the new window, just update
> the start and end of the old one.
It still needs to be done under locking to prevent us from expanding
over the next window, though. And having to take and drop a spinlock a
dozen times or more just to find out that there are no usable free
blocks in the current block group is still expensive, even if we're not
actually fully unlinking the window each time.
I wonder if this can possibly be done safely without locking? It would
be really good if we could rotate windows forward with no global locks.
At the very least, a fair rwlock would let us freeze the total layout of
the tree, while still letting us modify individual windows safely. As
long as we use wmb() to make sure that we always extend the end of the
window before we shrink the start of it, I think we could get away with
such shared locking. And rw locking is much better for concurrency, so
we might be able to hold it over the whole bitmap search rather than
taking it and dropping it at each window advance.
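For illustration, that ordering might look roughly like this (a sketch
only; new_start and new_end are hypothetical locals, rsv_start/rsv_end
are the existing window bounds):

        /* grow the window at its end first ... */
        my_rsv->rsv_end = new_end;
        /* ... make sure other CPUs see the new end before the start moves */
        wmb();
        /* ... then shrink it at the front; at no point does the window
         * appear to cover less than either the old or the new range */
        my_rsv->rsv_start = new_start;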
--Stephen
> And rw locking is much better for concurrency, so
> we might be able to hold it over the whole bitmap search rather than
> taking it and dropping it at each window advance.
rw locks only help if readers are 10x more common than writers generally
speaking. They are quite expensive locks, so they really should be used
with the utmost care.
(if you have really long hold times the dynamics are different, but if
you have really long hold times your latency sucks too and wasn't that
what this thread was trying to fix?)
On Fri, 2005-04-08 at 15:40 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2005-04-08 at 00:37, Mingming Cao wrote:
>
> > Actually, we do not have to do an rbtree link and unlink for every
> > window we search. If the reserved window(old) has no free bit and the
> > new reservable window's is right after the old one, no need to unlink
> > the old window from the rbtree and then link the new window, just update
> > the start and end of the old one.
>
> It still needs to be done under locking to prevent us from expanding
> over the next window, though. And having to take and drop a spinlock a
> dozen times or more just to find out that there are no usable free
> blocks in the current block group is still expensive, even if we're not
> actually fully unlinking the window each time.
>
Isn't this a rare case? The whole group is relatively full and the free
bits are all reserved by other files. Probably we should avoid trying
to make a reservation in this block group in the first place, if we could
find a way to detect that the number of _usable_ free bits is less than
the requested window size.
> I wonder if this can possibly be done safely without locking? It would
> be really good if we could rotate windows forward with no global locks.
> At the very least, a fair rwlock would let us freeze the total layout of
> the tree, while still letting us modify individual windows safely. As
> long as we use wmb() to make sure that we always extend the end of the
> window before we shrink the start of it, I think we could get away with
> such shared locking. And rw locking is much better for concurrency, so
> we might be able to hold it over the whole bitmap search rather than
> taking it and dropping it at each window advance.
>
You are proposing that we hold the read lock first, do the window search
and bitmap scan, and then, once we confirm there is a free bit in the
candidate window, grab the write lock and update the tree?
I think this is a good idea to address the case you are concerned about:
when we need to do a lot of window searching before settling down. Also,
if we later decide (I think we have discussed this before) to always try
to reserve a window with at least 8 contiguous free blocks, the search
will be more expensive and the read lock will help.
However, I am still worried that the rw lock will allow concurrent files
to try to lock the same window at the same time. Only one succeeds,
though, so we get high CPU usage. And also, in the normal case the
filesystem is not really full, so the rw lock probably becomes expensive.
On Fri, 2005-04-08 at 11:10 -0700, Mingming Cao wrote:
> However I am still worried that the rw lock will allow concurrent files
> trying to lock the same window at the same time. Only one succeed
> though., high cpu usage then. And also, in the normal case the
> filesystem is not really full, probably rw lock becomes expensive.
FWIW this was a 125GB filesystem, about 70% full.
Lee
Hi,
On Fri, 2005-04-08 at 19:10, Mingming Cao wrote:
> > It still needs to be done under locking to prevent us from expanding
> > over the next window, though. And having to take and drop a spinlock a
> > dozen times or more just to find out that there are no usable free
> > blocks in the current block group is still expensive, even if we're not
> > actually fully unlinking the window each time.
> Isn't this a rare case? The whole group is relatively full and the free
> bits are all reserved by other files.
Well, that's not much different from the case where there _are_ usable
bits but they are all near the end of the bitmap. And that's common
enough as you fill up a block group with small files, for example. Once
the bg becomes nearly full, new file opens-for-write will still try to
allocate in that bg (because that's where the inode was allocated), but
as it's a new fd we have no prior knowledge of _where_ in the bh to
allocate, and we'll have to walk it to the end to find any free space.
This is the access pattern I'd expect of (for example) "tar xvjf
linux-2.6.12.tar.bz2", not exactly a rare case.
> Probably we should avoid trying
> to make reservation in this block group at the first place
Not in this case, because it's the "home" of the file in question, and
skipping to another bg would just leave useful space empty --- and that
leads to fragmentation.
> You are proposing that we hold the read lock first, do the window search
> and bitmap scan, then once we confirm there is free bit in the candidate
> window, we grab the write lock and update the tree?
No, I'm suggesting that if we need the write lock for tree updates, we
may still be able to get away with just a read lock when updating an
individual window. If all we're doing is winding the window forwards a
bit, that's not actually changing the structure of the tree.
> However I am still worried that the rw lock will allow concurrent files
> trying to lock the same window at the same time.
Adding a new window will need the write lock, always. But with the read
lock, we can still check the neighbouring windows (the structure is
pinned so those remain valid), so advancing a window with just that lock
can still be done without risking overlapping the next window.
--Stephen
On Mon, 2005-04-11 at 12:48 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2005-04-08 at 19:10, Mingming Cao wrote:
>
> > > It still needs to be done under locking to prevent us from expanding
> > > over the next window, though. And having to take and drop a spinlock a
> > > dozen times or more just to find out that there are no usable free
> > > blocks in the current block group is still expensive, even if we're not
> > > actually fully unlinking the window each time.
>
> > Isn't this a rare case? The whole group is relatively full and the free
> > bits are all reserved by other files.
>
> Well, that's not much different from the case where there _are_ usable
> bits but they are all near the end of the bitmap. And that's common
> enough as you fill up a block group with small files, for example. Once
> the bg becomes nearly full, new file opens-for-write will still try to
> allocate in that bg (because that's where the inode was allocated), but
> as it's a new fd we have no prior knowledge of _where_ in the bh to
> allocate, and we'll have to walk it to the end to find any free space.
> This is the access pattern I'd expect of (for example) "tar xvjf
> linux-2.6.12.tar.bz2", not exactly a rare case.
>
Okay.
> > Probably we should avoid trying
> > to make reservation in this block group at the first place
>
> Not in this case, because it's the "home" of the file in question, and
> skipping to another bg would just leave useful space empty --- and that
> leads to fragmentation.
>
I agree. We should not skip the home block group of the file. I guess
what I was suggesting is: if allocation from the home group fails and
we continue the linear search through the rest of the block groups, we
could probably try to skip the block groups without enough usable free
blocks to make a reservation. Checking the number of free blocks left in
the query bg is a good start, but probably not good enough, since those
free blocks might already be reserved.
> > You are proposing that we hold the read lock first, do the window search
> > and bitmap scan, then once we confirm there is free bit in the candidate
> > window, we grab the write lock and update the tree?
>
> No, I'm suggesting that if we need the write lock for tree updates, we
> may still be able to get away with just a read lock when updating an
> individual window. If all we're doing is winding the window forwards a
> bit, that's not actually changing the structure of the tree.
>
> > However I am still worried that the rw lock will allow concurrent files
> > trying to lock the same window at the same time.
>
> Adding a new window will need the write lock, always. But with the read
> lock, we can still check the neighbouring windows (the structure is
> pinned so those remain valid), so advancing a window with just that lock
> can still be done without risking overlapping the next window.
>
Ah... I see the win with the read lock now: once a reservation window
is added, updating it (either winding it forward or searching for an
available window) is probably the majority of the operations on the
tree, and using a read lock for that should help reduce the latency.
I will work on a patch for Lee to try sometime tonight.
Thanks,
Mingming
On Mon, 2005-04-11 at 11:38 -0700, Mingming Cao wrote:
> I will work on a patch for Lee to try sometime tonight.
>
Just FYI, this will take a while to test, because this latency seems
quite rare. I haven't seen it again since my original report.
Hopefully I can reproduce it with enough dbench processes.
This seems to agree with your earlier observation that we seem to have
been quite unlucky.
Anyway, if I understand the thread correctly, it seems like this could
improve performance as well as latency.
Lee
On Mon, 2005-04-11 at 15:12 -0400, Lee Revell wrote:
> On Mon, 2005-04-11 at 11:38 -0700, Mingming Cao wrote:
> > I will work on a patch for Lee to try sometime tonight.
> >
>
> Just FYI, this will take a while to test, because this latency seems
> quite rare. I haven't seen it again since my original report.
Never mind, I spoke too soon. As soon as I posted this, I hit it again.
Seems like doing a CVS checkout reliably triggers the problem.
Lee
Hi,
On Mon, 2005-04-11 at 19:38, Mingming Cao wrote:
> I agree. We should not skip the home block group of the file. I guess
> what I was suggesting is, if allocation from the home group failed and
> we continuing the linear search the rest of block groups, we could
> probably try to skip the block groups without enough usable free blocks
> to make a reservation.
Fair enough. Once those are the only bgs left, performance is going to
drop pretty quickly, but that's not really avoidable on a very full
filesystem.
> Ah.. I see the win with the read lock now: once the a reservation window
> is added, updating it (either winding it forward and or searching for a
> avaliable window) probably is the majorirty of the operations on the
> tree, and using read lock for that should help reduce the latency.
Right. The down side is that for things like a kernel "tar xf", we'll
be doing lots of small file unpacks, and hopefully most files will be
just one or two reservations --- so there's little winding forward going
on. The searching will still be improved in that case.
Note that this may improve average case latencies, but it's not likely
to improve worst-case ones. We still need a write lock to install a new
window, and that's going to have to wait for us to finish finding a free
bit even if that operation starts using a read lock.
I'm not really sure what to do about worst-case here. For that, we
really do want to drop the lock entirely while we do the bitmap scan.
That leaves two options. Hold a reservation while we do that; or don't.
Holding one poses the problems we discussed before: either you hold a
large reservation (bad for disk layout in the presence of concurrent
allocators), or you hold smaller ones (high cost as you continually
advance the window, which requires some read lock on the tree to avoid
bumping into the next window.)
Just how bad would it be if we didn't hold a lock _or_ a window at all
while doing the search for new window space? I didn't like that
alternative at first because of the problem when you've got multiple
tasks trying to allocate in the same space at the same time; but given
the locking overhead of the alternatives, I'm wondering if this is
actually the lesser evil.
--Stephen
On Mon, 2005-04-11 at 20:57 +0100, Stephen C. Tweedie wrote:
> Hi,
Hi Stephen,
>
> On Mon, 2005-04-11 at 19:38, Mingming Cao wrote:
> > Ah.. I see the win with the read lock now: once the a reservation window
> > is added, updating it (either winding it forward and or searching for a
> > avaliable window) probably is the majorirty of the operations on the
> > tree, and using read lock for that should help reduce the latency.
>
> Right. The down side is that for things like a kernel "tar xf", we'll
> be doing lots of small file unpacks, and hopefully most files will be
> just one or two reservations --- so there's little winding forward going
> on. The searching will still be improved in that case.
Just a side note: "tar xf" should know the file size before unpacking
it, so it could set the reservation window size large enough to fit the
entire file, via an ioctl, before doing the file writes.
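For illustration only, and assuming the EXT3_IOC_SETRSVSZ ioctl from the
reservation patches is available and takes the desired window size in
blocks (the exact name, header location, and argument type should be
checked against the kernel headers), an unpacker could hint the allocator
like this:

        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/ext3_fs.h>      /* EXT3_IOC_SETRSVSZ (assumed header) */

        /* ask ext3 for a reservation window big enough for the whole file */
        static int hint_reservation(int fd, long file_size, long blocksize)
        {
                int rsv_blocks = (file_size + blocksize - 1) / blocksize;

                if (ioctl(fd, EXT3_IOC_SETRSVSZ, &rsv_blocks) < 0) {
                        perror("EXT3_IOC_SETRSVSZ");    /* non-fatal, fall back */
                        return -1;
                }
                return 0;
        }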
> Note that this may improve average case latencies, but it's not likely
> to improve worst-case ones. We still need a write lock to install a new
> window, and that's going to have to wait for us to finish finding a free
> bit even if that operation starts using a read lock.
>
Yes indeed. However, nothing is free and there are always trade-offs. :)
By worst case, do you mean multiple writers trying to allocate blocks
around the same area?
But I wonder if the latency seen by Lee belongs to this worst case: the
latency comes mostly from the loop calling find_next_zero_bit() in
bitmap_search_next_usable_block(). Even if we take out the whole
reservation scheme, we could still run into this kind of latency: the
bitmap on disk and the one in the journal are extremely inconsistent, so
we need to keep searching them both until we find a bit that is free on
both maps.
> I'm not really sure what to do about worst-case here. For that, we
> really do want to drop the lock entirely while we do the bitmap scan.
>
Hmm... if we drop the lock entirely while scanning the bitmap (assuming
you mean drop the read lock), then I am afraid we have to re-check the
tree (requiring a read or write lock) to see if the new window space is
still there after the scan succeeds. This is probably not very
interesting for the window-rotating case.
> That leaves two options. Hold a reservation while we do that; or don't.
> Holding one poses the problems we discussed before: either you hold a
> large reservation (bad for disk layout in the presence of concurrent
> allocators), or you hold smaller ones (high cost as you continually
> advance the window, which requires some read lock on the tree to avoid
> bumping into the next window.)
>
Well, we cannot hold a reservation (which needs an update of the tree)
without a write lock. I guess if we want to improve the average-case
latency by replacing the current spinlock with a read lock for the
new-window-space search, we don't have much choice here.
> Just how bad would it be if we didn't hold a lock _or_ a window at all
> while doing the search for new window space?
I wonder if this is feasible: Walk through the rb tree without a lock?
What if some node is being removed by another thread while we are
walking through the tree and trying to get the next node from it?
Thanks,
Mingming
Hi,
On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:
> > Note that this may improve average case latencies, but it's not likely
> > to improve worst-case ones. We still need a write lock to install a new
> > window, and that's going to have to wait for us to finish finding a free
> > bit even if that operation starts using a read lock.
> >
> Yes indeed. However nothing is free and there are always trade-offs.:)
>
> By worse case you mean multiple writes trying to allocate blocks around
> same area?
It doesn't matter where they are; multiple new file opens will all be
looking for a write lock. You only need one long-held read lock and all
the writers still block. The worst-case latencies can't be properly
solved with r/w locks --- those let the readers go more quickly
(assuming they are in the majority), which helps the average case, but
writers still have to wait for exclusive access. We only really help
them by dropping the lock entirely.
> Even if we take out the whole
> reservation, we still possibility run into this kind of latency: the
> bitmap on disk and on journal are extremely inconsistent so we need to
> keep searching them both until we find a free bit on both map.
Quite possibly. But as long as that code is running without locks, it's
much easier to deal with those latencies: they won't impact other CPUs,
cond_resched() is easier, and there's even CONFIG_PREEMPT.
> > I'm not really sure what to do about worst-case here. For that, we
> > really do want to drop the lock entirely while we do the bitmap scan.
> Hmm...if we drop the lock entirely while scan the bitmap, assuming you
> mean drop the read lock, then I am afraid we have to re-check the tree
> (require a read or write lock ) to see if the new window space is still
> there after the scan succeed.
Sure. You basically start off with a provisional window, and then if
necessary roll it forward just the same way you roll normal windows
forward when they get to their end. That means you can still drop the
lock while you search for new space. When you get there, reacquire the
lock and check that the intervening space is still available.
That's really cheap for the common case. The difficulty is when you
have many parallel allocations hitting the same bg: they allocate
provisional windows, find the same free area later on in the bg, and
then stomp on each other as they try to move their windows there.
I wonder if there's not a simple solution for this --- mark the window
as "provisional", and if any other task tries to allocate in the space
immediately following such a window, it needs to block until that window
is released.
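In code, the check might look something like this (a sketch only;
rsv_provisional and rsv_wait are hypothetical fields, and the lifetime of
the window we sleep on is glossed over here):

        /* instead of racing for the space right after a provisional
         * window, sleep until its owner fixes or removes it */
        if (prev && prev->rsv_provisional) {
                spin_unlock(rsv_lock);
                wait_event(prev->rsv_wait, !prev->rsv_provisional);
                spin_lock(rsv_lock);
                goto retry;
        }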
--Stephen
On Tue, 2005-04-12 at 12:18 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:
>
> > > Note that this may improve average case latencies, but it's not likely
> > > to improve worst-case ones. We still need a write lock to install a new
> > > window, and that's going to have to wait for us to finish finding a free
> > > bit even if that operation starts using a read lock.
> > >
> > Yes indeed. However nothing is free and there are always trade-offs.:)
> >
> > By worse case you mean multiple writes trying to allocate blocks around
> > same area?
>
> It doesn't matter where they are; multiple new file opens will all be
> looking for a write lock. You only need one long-held read lock and all
> the writers still block. The worst-case latencies can't be properly
> solved with r/w locks --- those let the readers go more quickly
> (assuming they are in the majority), which helps the average case, but
> writers still have to wait for exclusive access. We only really help
> them by dropping the lock entirely.
>
> > Even if we take out the whole
> > reservation, we still possibility run into this kind of latency: the
> > bitmap on disk and on journal are extremely inconsistent so we need to
> > keep searching them both until we find a free bit on both map.
>
> Quite possibly. But as long as that code is running without locks, it's
> much easier to deal with those latencies: they won't impact other CPUs,
> cond_resched() is easier, and there's even CONFIG_PREEMPT.
>
Okay, I agree.
> > > I'm not really sure what to do about worst-case here. For that, we
> > > really do want to drop the lock entirely while we do the bitmap scan.
>
> > Hmm...if we drop the lock entirely while scan the bitmap, assuming you
> > mean drop the read lock, then I am afraid we have to re-check the tree
> > (require a read or write lock ) to see if the new window space is still
> > there after the scan succeed.
>
> Sure. You basically start off with a provisional window, and then if
> necessary roll it forward just the same way you roll normal windows
> forward when they get to their end. That means you can still drop the
> lock while you search for new space. When you get there, reacquire the
> lock and check that the intervening space is still available.
>
Please note that the bitmap scan does not only scan the provisional
window range; it returns the first free bit on the bitmap, starting from
the start block of the provisional window and going up to the end of the
whole block group.
So in the case where the new window's neighbor is the same as the old
one (that is, the window is rolled forward), we only need to roll forward
once to find it. Either we find a free bit inside the provisional window,
or we find a free bit outside it. In the first case we just roll the
window forward and we are done. In the second case, it's possible the
free bit is inside someone else's window (which means we can't take that
space), or it is inside the free space after an already reserved window.
Either way we have to lock up the whole tree to remove the old window and
insert the new one.
> That's really cheap for the common case. The difficulty is when you
> have many parallel allocations hitting the same bg: they allocate
> provisional windows, find the same free area later on in the bg, and
> then stomp on each other as they try to move their windows there.
>
> I wonder if there's not a simple solution for this --- mark the window
> as "provisional", and if any other task tries to allocate in the space
> immediately following such a window, it needs to block until that window
> is released.
>
Sounds interesting. However, that implies we need a write lock to mark
the window as provisional and to block other files looking for windows
near it: we need to insert the provisional window into the tree and then
mark it as a temporary window, to really let other files notice this kind
of "hold".
I wonder if the benefit of the read/write lock is worth all the hassle
now. If the new window's neighbor stays the same, we only need to roll
forward once. If not, after a successful scan we need to grab the write
lock and make sure the space is still there. If we dropped the lock
without holding the new space, we have to search the whole tree again to
find out whether the space is still there; we cannot use the previous
node returned by find_next_reservable_window(), since that node could be
gone while we scan the bitmap without the lock. Basically we do the tree
search twice (just once in the roll-forward case) and take the lock twice
in the normal case. Also, I am concerned about possible starvation of
writers.
I am thinking that maybe going back to the spin_lock is not that bad,
with the "mark provisional" suggestion you made. It allows us to mark the
new space as provisional once we find it (preventing other window
searches from running into the same neighborhood). We could then release
the lock and scan the bitmap without worrying that the new space will be
taken by others. If there is a free bit, we can just clear the
provisional flag and simply return (we don't have to re-grab the lock
there). In the case where the final new window just rolls the old window
forward, we only need to grab the spin_lock once and move the window
forward. This seems to me the simplest way to fix the latency that Lee
reported, and it balances both the average and worst cases. Does it still
look too bad to you?
Thanks,
Mingming
Hi,
On Wed, 2005-04-13 at 00:27, Mingming Cao wrote:
> > I wonder if there's not a simple solution for this --- mark the window
> > as "provisional", and if any other task tries to allocate in the space
> > immediately following such a window, it needs to block until that window
> > is released.
> Sounds interesting. However that implies we need a write lock to mark
> the window as provisional and block other files looking for windows near
> it: we need to insert the provisional window into the tree and then mark
> it as a temporary window, to really let other file notice this kind of
> "hold".
We need a lock for the tree modification, yes.
> I wonder if the benefit of read/write lock is worth all the hassles now.
The point of the provisional window is that you don't need read/write
locks at all. The provisional window lets you unlock the tree
completely while you do the bitmap scan, so there's really no reason to
have rwlocks for the tree any more.
> If the new window's neighbor stays the same, we only need to roll
> forward once. If not, after a successful scan, we need to grab the
> write lock, and make sure the window is still there.
When we take the provisional window, we can make a note of how much
space we have before the next window. And because all future allocators
will stall if they try to allocate at this point due to the provisional
window, we know that that space will remain outside any other window
until we come to fix the provisional window and potentially roll it
forward to the space we found.
> If we dropped the
> lock without holding the new space, we have to search the whole tree
> again to find out if the space is still there
As long as the space is within the area between the provisional window
and its successor, we don't need to do that. (It gets more complex if
the bitmap search returns a bit _beyond_ the next window, though.)
> Also I am concerned about the possible
> starvation on writers.
In what way?
> I am thinking, maybe back to the spin_lock is not that bad with the
> "mark provisional" suggest you made?
Right, that was the intent --- sorry if I didn't make it clear.
> It allows us to mark the new space
> as provisional if we find a new space(prevent other window searching run
> into the same neighborhood). We could release the lock and scan the
> bitmap without worry about the new space will be taking by others.
Exactly.
Note that there _is_ some additional complexity here. It's not entirely
free. Specifically, if we have other tasks waiting on our provisional
window, then we need to be very careful about the life expectancy of the
wait queues involved, so that if the first task fixes its window and
then deletes it before the waiters have woken up, they don't get
confused by the provisional window vanishing while being waited for.
The easy solution is a global wait queue for that, but that doesn't
scale well. The complex solution is a per-window wait queue and
reference count, which is obviously a bit of bloat, though probably
worth it for the high-load case.
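To illustrate what I mean (purely a sketch, none of this exists
today): the window node would grow a waitqueue and a reference count,
and a waiter would pin the node before sleeping on it, so the node
can safely outlive its removal from the tree:

struct ext3_reserve_window_node {
        struct rb_node                  rsv_node;
        __u32                           rsv_goal_size;
        __u32                           rsv_alloc_hit;
        struct ext3_reserve_window      rsv_window;
        /* hypothetical additions for the provisional-window scheme: */
        unsigned int                    rsv_provisional:1;
        wait_queue_head_t               rsv_wait;     /* tasks waiting on this window */
        atomic_t                        rsv_refcount; /* keeps node alive for waiters */
};

static void wait_for_provisional(struct ext3_reserve_window_node *rsv,
                                 spinlock_t *rsv_lock)
{
        atomic_inc(&rsv->rsv_refcount);         /* pin the node */
        spin_unlock(rsv_lock);
        wait_event(rsv->rsv_wait, !rsv->rsv_provisional);
        if (atomic_dec_and_test(&rsv->rsv_refcount))
                kfree(rsv);             /* owner already dropped its reference */
        spin_lock(rsv_lock);
}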
Cheers,
Stephen
Hi,
On Fri, 2005-04-15 at 21:32, Mingming Cao wrote:
> Sorry for the delaying. I was not in office these days.
No problem.
> > > Also I am concerned about the possible
> > > starvation on writers.
> > In what way?
> I was worried about the rw lock case.:)
OK, so we're both on the same track here. :-)
> > Note that there _is_ some additional complexity here. It's not entirely
> > free. Specifically, if we have other tasks waiting on our provisional
> > window, then we need to be very careful about the life expectancy of the
> > wait queues involved, so that if the first task fixes its window and
> > then deletes it before the waiters have woken up, they don't get
> > confused by the provisional window vanishing while being waited for.
> This approach(provisional window) sounds interesting to me, but it's a
> little more complicated than I thought:(
Yep. Once you have other tasks waiting on your window while it's not
locked, object lifetime becomes a much bigger problem.
> alloc_new_reservation()
> retry:
> lock the tree
> search for a new space start from the given startblock
> found a new space
> if the new space is not a "provisional window"
I was thinking rather that we _start_ with the window, and _then_ look
for new space.
So we'd start with:
    if we already have a window,
        mark it provisional;
    else,
        do
            search for a new window;
            if the immediately preceding window is provisional,
                wait for that window;
                continue;
            allocate the window and mark it provisional;
            break
At this point we have a provisional window, and we know that nobody else
is going to allocate either in it, or in the empty space following it
(because if they tried to, they'd bump into our own provisional window
as their predecessor and would wait for us.) So even if the window
doesn't occupy the _whole_ unreserved area, we can still keep searching
without fear of multiple tasks trying to do so in the same space at the
same time.
--Stephen
On Mon, 2005-04-18 at 19:00 +0100, Stephen C. Tweedie wrote:
> > > Note that there _is_ some additional complexity here. It's not entirely
> > > free. Specifically, if we have other tasks waiting on our provisional
> > > window, then we need to be very careful about the life expectancy of the
> > > wait queues involved, so that if the first task fixes its window and
> > > then deletes it before the waiters have woken up, they don't get
> > > confused by the provisional window vanishing while being waited for.
>
> > This approach(provisional window) sounds interesting to me, but it's a
> > little more complicated than I thought:(
>
> Yep. Once you have other tasks waiting on your window while it's not
> locked, object lifetime becomes a much bigger problem.
>
> > alloc_new_reservation()
> > retry:
> > lock the tree
> > search for a new space start from the given startblock
> > found a new space
> > if the new space is not a "provisional window"
> I was thinking rather that we _start_ with the window, and _then_ look
> for new space.
>
It seems I am lost here. Could you elaborate a bit more, please? What
is "the window" you are referring to here? The old (stale)
reservation window? I thought we were discussing the algorithm for
how to allocate a new window.
> So we'd start with:
>
>     if we already have a window,
>         mark it provisional;
>     else,
>         do
>             search for a new window;
>             if the immediately preceding window is provisional,
>                 wait for that window;
>                 continue;
>             allocate the window and mark it provisional;
>             break
>
> At this point we have a provisional window, and we know that nobody else
> is going to allocate either in it, or in the empty space following it
> (because if they tried to, they'd bump into our own provisional window
> as their predecessor and would wait for us.) So even if the window
> doesn't occupy the _whole_ unreserved area, we can still keep searching
> without fear of multiple tasks trying to do so in the same space at the
> same time.
>
> --Stephen
>
>
Hmm...... This thread was to address the latency of holding the global
per-fs reservation lock while scanning the block group bitmap. And the
whole point of the "provisional window" is to prevent other inodes from
being forced into other block groups when one inode holds/reserves the
whole block group as a temporary window to do the bitmap scan. I
clearly see it's a win if there are no multiple threads allocating
blocks nearby at the same time. However, the whole reservation scheme
is there to address the case where multiple allocations happen at the
same time near the same place.
With the provisional method, if multiple threads are trying to allocate
new windows in the same area, they still have to wait for the other
new-window-allocation-and-bitmap-scan to finish. After that they will
probably compete for the same window again and again....:( I am worried
that the latency is not really improved over before (holding the lock
for the bitmap scan), and we have paid extra cpu time or context
switches. Also the changes to the algorithm are not going to be
trivial.
Now we all agree that the right thing to fix the latency is to drop the
lock and then scan the bitmap. Before that we need to reserve the open
window in case someone else is trying to target the same window. The
question was whether we should reserve the whole free reservable space
or just the window size we need. Now that we have explored and
discussed so many possible solutions, I think the least evil and less
intrusive way is to just lock the small window. We were worried that,
in the case where the scanned free bit is not inside the temporarily
reserved window, we may have to do many window unlink and link
operations on the rb tree; however, this is not avoidable with the
provisional window proposal either. Probably we could start with that
simple proposal first, and if benchmarks show the cpu usage is really
high and a concern, we could address that later?
Thanks,
Mingming
Andrew, Stephen,
Currently in the ext3 block reservation code, the global filesystem
reservation tree lock (rsv_block) is held during the whole process of
searching for a space to make a new reservation window, including
while scanning the block bitmap to verify that the available window
has a free block. Holding the lock during the bitmap scan is
unnecessary and could possibly cause scalability and latency issues.
This patch tries to address this by dropping the lock before scanning
the bitmap. Before that we need to reserve the open window in case
someone else is targeting the same window. The question was whether we
should reserve the whole free reservable space or just the window size
we need. Reserving the whole free reservable space would possibly
force other threads that intend to do block allocation nearby to move
to another block group (causing bad layout). In this patch, we just
reserve the desired size before dropping the lock and scanning the
block bitmap.
Please review. I have tested with fsx on an SMP box. Lee, if you have
time, please try this patch.
Signed-Off-By: Mingming Cao <[email protected]>
---
linux-2.6.11-ming/fs/ext3/balloc.c | 135 +++++++++++++++++--------------------
linux-2.6.11-ming/fs/ext3/file.c | 4 +
2 files changed, 68 insertions(+), 71 deletions(-)
diff -puN fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency fs/ext3/balloc.c
--- linux-2.6.11/fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency 2005-04-22 10:49:04.000000000 -0700
+++ linux-2.6.11-ming/fs/ext3/balloc.c 2005-04-22 14:42:35.518751070 -0700
@@ -749,24 +749,24 @@ fail_access:
* to find a free region that is of my size and has not
* been reserved.
*
- * on succeed, it returns the reservation window to be appended to.
- * failed, return NULL.
*/
-static struct ext3_reserve_window_node *find_next_reservable_window(
+static int find_next_reservable_window(
struct ext3_reserve_window_node *search_head,
- unsigned long size, int *start_block,
+ struct ext3_reserve_window_node *my_rsv,
+ struct super_block * sb, int start_block,
int last_block)
{
struct rb_node *next;
struct ext3_reserve_window_node *rsv, *prev;
int cur;
+ int size = my_rsv->rsv_goal_size;
/* TODO: make the start of the reservation window byte-aligned */
/* cur = *start_block & ~7;*/
- cur = *start_block;
+ cur = start_block;
rsv = search_head;
if (!rsv)
- return NULL;
+ return -1;
while (1) {
if (cur <= rsv->rsv_end)
@@ -782,7 +782,7 @@ static struct ext3_reserve_window_node *
* space with expected-size (or more)...
*/
if (cur > last_block)
- return NULL; /* fail */
+ return -1; /* fail */
prev = rsv;
next = rb_next(&rsv->rsv_node);
@@ -813,8 +813,26 @@ static struct ext3_reserve_window_node *
* return the reservation window that we could append to.
* succeed.
*/
- *start_block = cur;
- return prev;
+
+ if ((prev != my_rsv) && (!rsv_is_empty(&my_rsv->rsv_window)))
+ rsv_window_remove(sb, my_rsv);
+
+ /* let's book the whole avaliable window for now
+ * we will check the
+ * disk bitmap later and then, if there are free block
+ * then we adjust the window size if the it's
+ * larger than requested.
+ * Otherwise, we will remove this node from the tree next time
+ * call find_next_reservable_window.
+ */
+ my_rsv->rsv_start = cur;
+ my_rsv->rsv_end = cur + size - 1;
+ my_rsv->rsv_alloc_hit = 0;
+
+ if (prev != my_rsv)
+ ext3_rsv_window_add(sb, my_rsv);
+
+ return 0;
}
/**
@@ -852,6 +870,7 @@ static struct ext3_reserve_window_node *
* @sb: the super block
* @group: the group we are trying to allocate in
* @bitmap_bh: the block group block bitmap
+ *
*/
static int alloc_new_reservation(struct ext3_reserve_window_node *my_rsv,
int goal, struct super_block *sb,
@@ -860,10 +879,10 @@ static int alloc_new_reservation(struct
struct ext3_reserve_window_node *search_head;
int group_first_block, group_end_block, start_block;
int first_free_block;
- int reservable_space_start;
- struct ext3_reserve_window_node *prev_rsv;
struct rb_root *fs_rsv_root = &EXT3_SB(sb)->s_rsv_window_root;
unsigned long size;
+ int ret;
+ spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
group * EXT3_BLOCKS_PER_GROUP(sb);
@@ -875,6 +894,7 @@ static int alloc_new_reservation(struct
start_block = goal + group_first_block;
size = my_rsv->rsv_goal_size;
+
if (!rsv_is_empty(&my_rsv->rsv_window)) {
/*
* if the old reservation is cross group boundary
@@ -908,6 +928,8 @@ static int alloc_new_reservation(struct
my_rsv->rsv_goal_size= size;
}
}
+
+ spin_lock(rsv_lock);
/*
* shift the search start to the window near the goal block
*/
@@ -921,11 +943,16 @@ static int alloc_new_reservation(struct
* need to check the bitmap after we found a reservable window.
*/
retry:
- prev_rsv = find_next_reservable_window(search_head, size,
- &start_block, group_end_block);
- if (prev_rsv == NULL)
- goto failed;
- reservable_space_start = start_block;
+ ret = find_next_reservable_window(search_head, my_rsv, sb,
+ start_block, group_end_block);
+
+ if (ret == -1) {
+ if (!rsv_is_empty(&my_rsv->rsv_window))
+ rsv_window_remove(sb, my_rsv);
+ spin_unlock(rsv_lock);
+ return -1;
+ }
+
/*
* On success, find_next_reservable_window() returns the
* reservation window where there is a reservable space after it.
@@ -937,8 +964,9 @@ retry:
* block. Search start from the start block of the reservable space
* we just found.
*/
+ spin_unlock(rsv_lock);
first_free_block = bitmap_search_next_usable_block(
- reservable_space_start - group_first_block,
+ my_rsv->rsv_start - group_first_block,
bitmap_bh, group_end_block - group_first_block + 1);
if (first_free_block < 0) {
@@ -946,54 +974,30 @@ retry:
* no free block left on the bitmap, no point
* to reserve the space. return failed.
*/
- goto failed;
+ spin_lock(rsv_lock);
+ if (!rsv_is_empty(&my_rsv->rsv_window))
+ rsv_window_remove(sb, my_rsv);
+ spin_unlock(rsv_lock);
+ return -1; /* failed */
}
+
start_block = first_free_block + group_first_block;
/*
* check if the first free block is within the
- * free space we just found
+ * free space we just reserved
*/
- if ((start_block >= reservable_space_start) &&
- (start_block < reservable_space_start + size))
- goto found_rsv_window;
+ if ((start_block >= my_rsv->rsv_start) &&
+ (start_block < my_rsv->rsv_end))
+ return 0; /* succeed */
/*
* if the first free bit we found is out of the reservable space
- * this means there is no free block on the reservable space
- * we should continue search for next reservable space,
+ * continue search for next reservable space,
* start from where the free block is,
* we also shift the list head to where we stopped last time
*/
- search_head = prev_rsv;
+ search_head = my_rsv;
+ spin_lock(rsv_lock);
goto retry;
-
-found_rsv_window:
- /*
- * great! the reservable space contains some free blocks.
- * if the search returns that we should add the new
- * window just next to where the old window, we don't
- * need to remove the old window first then add it to the
- * same place, just update the new start and new end.
- */
- if (my_rsv != prev_rsv) {
- if (!rsv_is_empty(&my_rsv->rsv_window))
- rsv_window_remove(sb, my_rsv);
- }
- my_rsv->rsv_start = reservable_space_start;
- my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
- my_rsv->rsv_alloc_hit = 0;
- if (my_rsv != prev_rsv) {
- ext3_rsv_window_add(sb, my_rsv);
- }
- return 0; /* succeed */
-failed:
- /*
- * failed to find a new reservation window in the current
- * group, remove the current(stale) reservation window
- * if there is any
- */
- if (!rsv_is_empty(&my_rsv->rsv_window))
- rsv_window_remove(sb, my_rsv);
- return -1; /* failed */
}
/*
@@ -1023,7 +1027,6 @@ ext3_try_to_allocate_with_rsv(struct sup
int goal, struct ext3_reserve_window_node * my_rsv,
int *errp)
{
- spinlock_t *rsv_lock;
unsigned long group_first_block;
int ret = 0;
int fatal;
@@ -1052,7 +1055,6 @@ ext3_try_to_allocate_with_rsv(struct sup
ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
goto out;
}
- rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
/*
* goal is a group relative block number (if there is a goal)
* 0 < goal < EXT3_BLOCKS_PER_GROUP(sb)
@@ -1078,30 +1080,21 @@ ext3_try_to_allocate_with_rsv(struct sup
* then we could go to allocate from the reservation window directly.
*/
while (1) {
- struct ext3_reserve_window rsv_copy;
-
- rsv_copy._rsv_start = my_rsv->rsv_start;
- rsv_copy._rsv_end = my_rsv->rsv_end;
-
- if (rsv_is_empty(&rsv_copy) || (ret < 0) ||
- !goal_in_my_reservation(&rsv_copy, goal, group, sb)) {
- spin_lock(rsv_lock);
+ if (rsv_is_empty(&my_rsv->rsv_window) || (ret < 0) ||
+ !goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb)) {
ret = alloc_new_reservation(my_rsv, goal, sb,
group, bitmap_bh);
- rsv_copy._rsv_start = my_rsv->rsv_start;
- rsv_copy._rsv_end = my_rsv->rsv_end;
- spin_unlock(rsv_lock);
if (ret < 0)
break; /* failed */
- if (!goal_in_my_reservation(&rsv_copy, goal, group, sb))
+ if (!goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb))
goal = -1;
}
- if ((rsv_copy._rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb))
- || (rsv_copy._rsv_end < group_first_block))
+ if ((my_rsv->rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb))
+ || (my_rsv->rsv_end < group_first_block))
BUG();
ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
- &rsv_copy);
+ &my_rsv->rsv_window);
if (ret >= 0) {
my_rsv->rsv_alloc_hit++;
break; /* succeed */
diff -puN fs/ext3/file.c~ext3-reduce-reservation-lock-latency fs/ext3/file.c
--- linux-2.6.11/fs/ext3/file.c~ext3-reduce-reservation-lock-latency 2005-04-22 10:49:04.000000000 -0700
+++ linux-2.6.11-ming/fs/ext3/file.c 2005-04-22 10:49:04.000000000 -0700
@@ -36,7 +36,11 @@ static int ext3_release_file (struct ino
/* if we are the last writer on the inode, drop the block reservation */
if ((filp->f_mode & FMODE_WRITE) &&
(atomic_read(&inode->i_writecount) == 1))
+ {
+ down(&EXT3_I(inode)->truncate_sem);
ext3_discard_reservation(inode);
+ up(&EXT3_I(inode)->truncate_sem);
+ }
if (is_dx(inode) && filp->private_data)
ext3_htree_free_dir_info(filp->private_data);
_
On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote:
> Please review. I have tested on fsx on SMP box. Lee, if you got time,
> please try this patch.
I have tested and this does fix the problem. I ran my tests and no ext3
code paths showed up on the latency tracer at all, it never went above
33 usecs.
Please apply.
Lee
On Wed, 2005-04-27 at 23:45 -0400, Lee Revell wrote:
> On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote:
> > Please review. I have tested on fsx on SMP box. Lee, if you got time,
> > please try this patch.
>
> I have tested and this does fix the problem. I ran my tests and no ext3
> code paths showed up on the latency tracer at all, it never went above
> 33 usecs.
>
Thanks, Lee.
The patch survived many fsx tests over 20 hours on a 2-cpu machine.
I tested the patch on the same machine with tiobench (1-64 threads)
and an untar-a-kernel-tree test; no regression there.
However, I see about a 5-7% throughput drop on dbench with 16 threads.
It is probably due to the cpu cost that we have discussed.
Mingming
On Thu, 2005-04-28 at 00:37 -0700, Mingming Cao wrote:
> On Wed, 2005-04-27 at 23:45 -0400, Lee Revell wrote:
> > On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote:
> > > Please review. I have tested on fsx on SMP box. Lee, if you got time,
> > > please try this patch.
> >
> > I have tested and this does fix the problem. I ran my tests and no ext3
> > code paths showed up on the latency tracer at all, it never went above
> > 33 usecs.
> >
> Thanks, Lee.
>
> The patch survived on many fsx test over 20 hours on a 2cpu machine.
> Tested the patch on the same machine with tiobench (1-64 threads), and
> untar a kernel tree test, no regression there.
> However I see about 5-7% throughput drop on dbench with 16 threads. It
> probably due to the cpu cost that we have discussed.
Hmm, I guess someone needs to test it on a bigger system. AFAICT this
should improve SMP scalability quite a bit. Maybe that lock is rarely
contended.
Lee
On Thu, 2005-04-28 at 12:12 -0400, Lee Revell wrote:
> On Thu, 2005-04-28 at 00:37 -0700, Mingming Cao wrote:
> > On Wed, 2005-04-27 at 23:45 -0400, Lee Revell wrote:
> > > On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote:
> > > > Please review. I have tested on fsx on SMP box. Lee, if you got time,
> > > > please try this patch.
> > >
> > > I have tested and this does fix the problem. I ran my tests and no ext3
> > > code paths showed up on the latency tracer at all, it never went above
> > > 33 usecs.
> > >
> > Thanks, Lee.
> >
> > The patch survived on many fsx test over 20 hours on a 2cpu machine.
> > Tested the patch on the same machine with tiobench (1-64 threads), and
> > untar a kernel tree test, no regression there.
> > However I see about 5-7% throughput drop on dbench with 16 threads. It
> > probably due to the cpu cost that we have discussed.
>
> Hmm, I guess someone needs to test it on a bigger system. AFAICT this
> should improve SMP scalability quite a bit. Maybe that lock is rarely
> contended.
Probably. I will try it on a relatively full filesystem (where the
bitmap scan to find a free block takes time), and on a 4-cpu box with
many threads allocating at the same time.
Mingming
Currently ext3_get_block()/ext3_new_block() only allocate one block at
a time. To allocate multiple blocks, the caller, for example the ext3
direct IO routine, has to invoke ext3_get_block() many times. This is
quite inefficient for sequential IO workloads.
The benefits of a real get_blocks() include
1) Increases the chance of getting contiguous blocks and reduces the
possibility of fragmentation due to interleaved allocations from other
threads (should be good for the non-reservation case)
2) Reduces CPU cycles spent in repeated get_block() calls
3) Batches metadata updates and journaling in one shot
4) Could possibly speed up future get_blocks() lookups by caching the
last mapped blocks in the inode.
Multiple block allocation for ext3 is very useful for supporting
delayed allocation, and also useful for direct IO.
This effort tries to fill these requirements on top of the current
ext3 disk format. Given a file offset and a maximum number of blocks
to map, ext3_get_blocks() will find or allocate the corresponding
chunk of contiguous physical blocks on disk. It will return the buffer
head mapped to the first physical block corresponding to the file
offset, and the number of contiguous blocks just mapped or allocated.
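To make the intended calling convention concrete, a caller would do
something like the sketch below (illustrative only; the return/err
conventions and the b_size encoding follow the patch below):

static int map_a_chunk(struct inode *inode, sector_t iblock)
{
        struct buffer_head bh_map = { .b_state = 0 };
        unsigned long maxblocks = 16;   /* ask for up to 16 blocks in one call */
        int err;

        err = ext3_get_blocks(inode, iblock, maxblocks, &bh_map, 1 /* create */);
        if (err)
                return err;
        /*
         * bh_map.b_blocknr now holds the physical block backing 'iblock',
         * and b_size encodes how many contiguous blocks were mapped.
         */
        return bh_map.b_size >> inode->i_blkbits;
}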
First, we search the [i,d,t]direct blocks indexed by the file offset
to find the physical block. If it is already allocated, we continue
mapping the next logical block until the mapped physical blocks are no
longer adjacent, or until there is no corresponding physical block
allocated.
In the case where block allocation is needed, it first walks through
the inode's block mapping tree to count the number of adjacent blocks
to allocate. Then it passes that number to ext3_new_blocks(), where
the real block allocation is performed. After allocating the first
block, ext3_new_blocks() will always attempt to allocate the next few
adjacent blocks (up to the requested size and not beyond the
reservation window) at the same time. Once the blocks are allocated
and the bitmap is updated, it finally updates the metadata blocks with
the new block info and splices the whole branch into the block mapping
tree.
Partial read map and partial write map are not supported, since there
is no guarantee that the block allocation will return contiguous
physical blocks. Allocation across an indirect block boundary is not
supported, due to the concern that it may place a metadata block far
from its data blocks.
Below is just a first cut of the proposed patch, attached here to show
the effort/direction.
I would appreciate any comments/suggestions/criticism!
Thanks,
Mingming
---
linux-2.6.11-ming/fs/ext3/balloc.c | 89 ++++++---
linux-2.6.11-ming/fs/ext3/inode.c | 284 ++++++++++++++++++++++++++++--
linux-2.6.11-ming/fs/ext3/xattr.c | 3
linux-2.6.11-ming/include/linux/ext3_fs.h | 2
4 files changed, 333 insertions(+), 45 deletions(-)
diff -puN fs/ext3/balloc.c~ext3_multiple_blocks_allocation fs/ext3/balloc.c
--- linux-2.6.11/fs/ext3/balloc.c~ext3_multiple_blocks_allocation 2005-04-28 12:12:06.168475152 -0700
+++ linux-2.6.11-ming/fs/ext3/balloc.c 2005-04-28 12:12:06.186473049 -0700
@@ -652,9 +652,11 @@ claim_block(spinlock_t *lock, int block,
*/
static int
ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
- struct buffer_head *bitmap_bh, int goal, struct ext3_reserve_window *my_rsv)
+ struct buffer_head *bitmap_bh, int goal, unsigned long *count,
+ struct ext3_reserve_window *my_rsv)
{
int group_first_block, start, end;
+ unsigned long num = 0;
/* we do allocation within the reservation window if we have a window */
if (my_rsv) {
@@ -712,8 +714,18 @@ repeat:
goto fail_access;
goto repeat;
}
- return goal;
+ num++;
+ goal++;
+ while (num < *count && goal < end
+ && ext3_test_allocatable(goal, bitmap_bh)
+ && claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh)) {
+ num++;
+ goal++;
+ }
+ *count = num;
+ return goal - num;
fail_access:
+ *count = num;
return -1;
}
@@ -1025,11 +1037,13 @@ static int
ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
unsigned int group, struct buffer_head *bitmap_bh,
int goal, struct ext3_reserve_window_node * my_rsv,
- int *errp)
+ unsigned long *count, int *errp)
{
unsigned long group_first_block;
int ret = 0;
int fatal;
+ int original_goal=goal;
+ unsigned long num = *count;
*errp = 0;
@@ -1052,7 +1066,8 @@ ext3_try_to_allocate_with_rsv(struct sup
* or last attempt to allocate a block with reservation turned on failed
*/
if (my_rsv == NULL ) {
- ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
+ ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
+ count, NULL);
goto out;
}
/*
@@ -1094,13 +1109,17 @@ ext3_try_to_allocate_with_rsv(struct sup
|| (my_rsv->rsv_end < group_first_block))
BUG();
ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
- &my_rsv->rsv_window);
+ &num, &my_rsv->rsv_window);
if (ret >= 0) {
- my_rsv->rsv_alloc_hit++;
+ my_rsv->rsv_alloc_hit += num;
+ *count = num;
break; /* succeed */
}
+ num = *count;
}
out:
+ if (my_rsv)
+ printk("Allocated blocks from %d ,count is %d, goal was %d, group is %d, inode is %, reservation window (%d,%d)\n", ret, *count, original_goal, group, my_rsv->rsv_start, my_rsv->rsv_end);
if (ret >= 0) {
BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for "
"bitmap block");
@@ -1155,8 +1174,8 @@ int ext3_should_retry_alloc(struct super
* bitmap, and then for any free bit if that fails.
* This function also updates quota and i_blocks field.
*/
-int ext3_new_block(handle_t *handle, struct inode *inode,
- unsigned long goal, int *errp)
+int ext3_new_blocks(handle_t *handle, struct inode *inode,
+ unsigned long goal, unsigned long* count, int *errp)
{
struct buffer_head *bitmap_bh = NULL;
struct buffer_head *gdp_bh;
@@ -1179,7 +1198,8 @@ int ext3_new_block(handle_t *handle, str
static int goal_hits, goal_attempts;
#endif
unsigned long ngroups;
-
+ unsigned long num = *count;
+ int i;
*errp = -ENOSPC;
sb = inode->i_sb;
if (!sb) {
@@ -1190,7 +1210,7 @@ int ext3_new_block(handle_t *handle, str
/*
* Check quota for allocation of this block.
*/
- if (DQUOT_ALLOC_BLOCK(inode, 1)) {
+ if (DQUOT_ALLOC_BLOCK(inode, num)) {
*errp = -EDQUOT;
return 0;
}
@@ -1245,7 +1265,7 @@ retry:
if (!bitmap_bh)
goto io_error;
ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
- bitmap_bh, ret_block, my_rsv, &fatal);
+ bitmap_bh, ret_block, my_rsv, &num, &fatal);
if (fatal)
goto out;
if (ret_block >= 0)
@@ -1282,7 +1302,7 @@ retry:
if (!bitmap_bh)
goto io_error;
ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
- bitmap_bh, -1, my_rsv, &fatal);
+ bitmap_bh, -1, my_rsv, &num, &fatal);
if (fatal)
goto out;
if (ret_block >= 0)
@@ -1317,14 +1337,17 @@ allocated:
target_block = ret_block + group_no * EXT3_BLOCKS_PER_GROUP(sb)
+ le32_to_cpu(es->s_first_data_block);
- if (target_block == le32_to_cpu(gdp->bg_block_bitmap) ||
- target_block == le32_to_cpu(gdp->bg_inode_bitmap) ||
- in_range(target_block, le32_to_cpu(gdp->bg_inode_table),
- EXT3_SB(sb)->s_itb_per_group))
- ext3_error(sb, "ext3_new_block",
- "Allocating block in system zone - "
- "block = %u", target_block);
-
+ for (i = 0; i < num; i++, target_block++) {
+ if (target_block == le32_to_cpu(gdp->bg_block_bitmap) ||
+ target_block == le32_to_cpu(gdp->bg_inode_bitmap) ||
+ in_range(target_block, le32_to_cpu(gdp->bg_inode_table),
+ EXT3_SB(sb)->s_itb_per_group)) {
+ ext3_error(sb, "ext3_new_block",
+ "Allocating block in system zone - "
+ "block = %u", target_block);
+ goto out;
+ }
+ }
performed_allocation = 1;
#ifdef CONFIG_JBD_DEBUG
@@ -1342,10 +1365,12 @@ allocated:
jbd_lock_bh_state(bitmap_bh);
spin_lock(sb_bgl_lock(sbi, group_no));
if (buffer_jbd(bitmap_bh) && bh2jh(bitmap_bh)->b_committed_data) {
- if (ext3_test_bit(ret_block,
- bh2jh(bitmap_bh)->b_committed_data)) {
- printk("%s: block was unexpectedly set in "
- "b_committed_data\n", __FUNCTION__);
+ for (i = 0; i < num ; i++) {
+ if (ext3_test_bit(ret_block++,
+ bh2jh(bitmap_bh)->b_committed_data)) {
+ printk("%s: block was unexpectedly set in "
+ "b_committed_data\n", __FUNCTION__);
+ }
}
}
ext3_debug("found bit %d\n", ret_block);
@@ -1354,12 +1379,12 @@ allocated:
#endif
/* ret_block was blockgroup-relative. Now it becomes fs-relative */
- ret_block = target_block;
+ ret_block = target_block - num;
- if (ret_block >= le32_to_cpu(es->s_blocks_count)) {
+ if (target_block - 1>= le32_to_cpu(es->s_blocks_count)) {
ext3_error(sb, "ext3_new_block",
- "block(%d) >= blocks count(%d) - "
- "block_group = %d, es == %p ", ret_block,
+ "block(%d) >= fs blocks count(%d) - "
+ "block_group = %d, es == %p ", target_block - 1,
le32_to_cpu(es->s_blocks_count), group_no, es);
goto out;
}
@@ -1374,9 +1399,9 @@ allocated:
spin_lock(sb_bgl_lock(sbi, group_no));
gdp->bg_free_blocks_count =
- cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) - 1);
+ cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
spin_unlock(sb_bgl_lock(sbi, group_no));
- percpu_counter_mod(&sbi->s_freeblocks_counter, -1);
+ percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
err = ext3_journal_dirty_metadata(handle, gdp_bh);
@@ -1388,6 +1413,8 @@ allocated:
goto out;
*errp = 0;
+ DQUOT_FREE_BLOCK(inode, *count-num);
+ *count = num;
brelse(bitmap_bh);
return ret_block;
@@ -1402,7 +1429,7 @@ out:
* Undo the block allocation
*/
if (!performed_allocation)
- DQUOT_FREE_BLOCK(inode, 1);
+ DQUOT_FREE_BLOCK(inode, *count);
brelse(bitmap_bh);
return 0;
}
diff -puN fs/ext3/inode.c~ext3_multiple_blocks_allocation fs/ext3/inode.c
--- linux-2.6.11/fs/ext3/inode.c~ext3_multiple_blocks_allocation 2005-04-28 12:12:06.172474685 -0700
+++ linux-2.6.11-ming/fs/ext3/inode.c 2005-04-28 12:12:06.192472348 -0700
@@ -237,12 +237,12 @@ static int ext3_alloc_block (handle_t *h
struct inode * inode, unsigned long goal, int *err)
{
unsigned long result;
+ unsigned long count = 1;
- result = ext3_new_block(handle, inode, goal, err);
+ result = ext3_new_blocks(handle, inode, goal, &count, err);
return result;
}
-
typedef struct {
__le32 *p;
__le32 key;
@@ -328,7 +328,7 @@ static int ext3_block_to_path(struct ino
ext3_warning (inode->i_sb, "ext3_block_to_path", "block > big");
}
if (boundary)
- *boundary = (i_block & (ptrs - 1)) == (final - 1);
+ *boundary = final - 1 - (i_block & (ptrs - 1));
return n;
}
@@ -526,7 +526,7 @@ static int ext3_alloc_branch(handle_t *h
/*
* Get buffer_head for parent block, zero it out
* and set the pointer to new one, then send
- * parent to disk.
+ * parent to disk.
*/
bh = sb_getblk(inode->i_sb, parent);
branch[n].bh = bh;
@@ -566,6 +566,111 @@ static int ext3_alloc_branch(handle_t *h
ext3_free_blocks(handle, inode, le32_to_cpu(branch[i].key), 1);
return err;
}
+#define GBS_DEBUG 0
+static int ext3_alloc_splice_branch(handle_t *handle, struct inode *inode,
+ unsigned long num, unsigned long first_block,
+ int *offsets, Indirect *branch, int depth)
+{
+ int blocksize = inode->i_sb->s_blocksize;
+ int n = 0;
+ int err = 0;
+ int i;
+ unsigned long current_block;
+ struct buffer_head *bh;
+
+ current_block = first_block;
+ branch[0].key = cpu_to_le32(current_block);
+
+ if (GBS_DEBUG)
+ printk("block %d, branch[0].p :%x\n", current_block, branch[0].p);
+ for (n = 1; n < depth; n++) {
+ current_block++;
+ branch[n].key = cpu_to_le32(current_block);
+
+ /*
+ * Get buffer_head for parent block, zero it out
+ * and set the pointer to new one, then send
+ * parent to disk.
+ */
+ bh = sb_getblk(inode->i_sb, current_block - 1);
+ branch[n].bh = bh;
+ lock_buffer(bh);
+ BUFFER_TRACE(bh, "call get_create_access");
+ err = ext3_journal_get_create_access(handle, bh);
+ if (err) {
+ unlock_buffer(bh);
+ brelse(bh);
+ goto failed;
+ }
+
+ memset(bh->b_data, 0, blocksize);
+ branch[n].p = (__le32*) bh->b_data + offsets[n];
+ *branch[n].p = branch[n].key;
+ /*
+ * allocate blocks to the rest missing data blocks
+ */
+ if (n == depth -1 ) {
+ i = 0;
+ while (current_block - first_block + 1 < num) {
+ *(branch[n].p + i + 1) = ++current_block;
+ i++;
+ }
+ }
+ BUFFER_TRACE(bh, "marking uptodate");
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+
+ BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
+ err = ext3_journal_dirty_metadata(handle, bh);
+ if (err)
+ goto failed;
+ }
+
+ bh = branch[0].bh;
+ if (bh){
+ BUFFER_TRACE(bh, "call get_write_access");
+ err = ext3_journal_get_write_access(handle, bh);
+ if (err)
+ goto failed;
+ }
+
+ *(branch[0].p) = branch[0].key;
+
+ /* allocate blocks to the rest missing data blocks */
+ i = 0;
+ while (current_block - first_block + 1< num) {
+ *(branch[0].p + i + 1) = ++current_block;
+ i++;
+ }
+
+ if (GBS_DEBUG) {
+ for (i=0; i< n; i++)
+ printk("inode %x, branch[%d].p:%x, branch[%d].key:%d,\n", inode, i, branch[i].p, i, branch[i].key);
+ for (i=0; i< num-n; i++)
+ printk("inode %x, branch[%d].p + %d + 1:%x, *(branch[%d].p+%d+1):%d,\n, branch[%d].bh:%x\n", inode, n-1, i, branch[n-1].p + i +1, n-1, i, *(branch[n-1].p+i+1),n-1, branch[n-1].bh);
+ }
+ if (bh) {
+ BUFFER_TRACE(bh, "marking uptodate");
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+
+ BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
+ err = ext3_journal_dirty_metadata(handle, bh);
+ if (err)
+ goto failed;
+ }
+
+ return err;
+failed:
+ /* Allocation failed, free what we already allocated */
+ for (i = 0; i < n ; i++) {
+ BUFFER_TRACE(branch[i].bh, "call journal_forget");
+ ext3_journal_forget(handle, branch[n].bh);
+ }
+ for (i = 0; i < num; i++)
+ ext3_free_blocks(handle, inode, first_block + i, 1);
+ return err;
+}
/**
* ext3_splice_branch - splice the allocated branch onto inode.
@@ -777,8 +882,153 @@ out:
return err;
}
-static int ext3_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
+static int
+ext3_count_blocks_to_allocate(Indirect * branch, int k,
+ unsigned long maxblocks, int blocks_to_boundary)
+{
+ unsigned long count = 0;
+
+ if (k == 0) return 0;
+ /*
+ * Simple case, [t,d]Indirect block(s) has not allocated yet
+ * then it's clear blocks on that path have not allocated
+ */
+ if (GBS_DEBUG)
+ printk("maxblocks: %d, k: %d, boundary : %d \n",maxblocks, k,
+ blocks_to_boundary);
+ if (k > 1) {
+ /* right now don't hanel cross boundary allocation */
+ if ((maxblocks - count) < blocks_to_boundary)
+ count += maxblocks;
+ else
+ count += blocks_to_boundary;
+ count += k - 1; /* blocks for [t,d]indirect blocks */
+ return count;
+ }
+
+ count++;
+ while (count < maxblocks && count <= blocks_to_boundary
+ && *(branch[0].p + count) == 0) {
+ count++;
+ }
+ return count;
+}
+static int
+ext3_get_blocks_handle(handle_t *handle, struct inode *inode, sector_t iblock,
+ unsigned long *maxblocks, struct buffer_head *bh_result,
+ int create, int extend_disksize)
+{
+ int err = -EIO;
+ int offsets[4];
+ Indirect chain[4];
+ Indirect *partial = NULL;
+ unsigned long goal;
+ int left;
+ int blocks_to_boundary = 0;
+ int depth;
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ unsigned long count = 0;
+ unsigned long first_block;
+ struct ext3_block_alloc_info *block_i;
+
+ J_ASSERT(handle != NULL || create == 0);
+
+ if (GBS_DEBUG)
+ printk("ext3_get_blocks_handle: maxblocks= %d, iblock = %d\n", *maxblocks, iblock);
+ down(&ei->truncate_sem);
+ depth = ext3_block_to_path(inode, iblock, offsets, &blocks_to_boundary);
+ if (depth == 0)
+ goto out;
+ partial = ext3_get_branch(inode, depth,
+ offsets, chain, &err);
+ /* Simplest case - block found */
+ if (!err && !partial) {
+ first_block = chain[depth-1].key;
+ clear_buffer_new(bh_result);
+ count ++;
+ /* map more blocks */
+ while (count < *maxblocks && count <= blocks_to_boundary
+ && (*(chain[depth-1].p+count) == first_block + count)) {
+ count ++;
+ }
+ up(&ei->truncate_sem);
+ goto got_it;
+ }
+ /* got mapped blocks or plain lookup or failed read of indirect block */
+ if (!create || err == -EIO){
+ up(&ei->truncate_sem);
+ goto out;
+ }
+ /* Okey, we need to do block allocation */
+ /* lazy initialize the block allocation info here if necessary */
+ if (S_ISREG(inode->i_mode) && (!ei->i_block_alloc_info)) {
+ ext3_init_block_alloc_info(inode);
+ }
+
+ goal = ext3_find_goal(inode, iblock, chain, partial);
+
+ /* number of missing meta data blocks need to allocate for this branch */
+ left = chain + depth - partial;
+ count = ext3_count_blocks_to_allocate(partial, left, *maxblocks, blocks_to_boundary);
+ if (GBS_DEBUG)
+ printk("blocks to allocate: %d\n", count);
+
+ first_block = ext3_new_blocks(handle, inode, goal, &count, &err);
+ if (GBS_DEBUG)
+ printk("blocks allocated(%d,%d,%d)\n", (int)iblock, (int)first_block, count);
+
+ if (!err)
+ err = ext3_alloc_splice_branch(handle, inode, count,
+ first_block, offsets+(partial-chain), partial, left);
+ if (err) {
+ up(&ei->truncate_sem);
+ goto cleanup;
+ }
+ /* i_disksize growing is protected by truncate_sem
+ * don't forget to protect it if you're about to implement
+ * concurrent ext3_get_block() -bzzz */
+ if (extend_disksize && inode->i_size > ei->i_disksize)
+ ei->i_disksize = inode->i_size;
+ /*
+ * update the most recently allocated logical & physical block
+ * in i_block_alloc_info, to assist find the proper goal block for next
+ * allocation
+ */
+ block_i = ei->i_block_alloc_info;
+ if (block_i) {
+ block_i->last_alloc_logical_block = iblock + count - left;
+ block_i->last_alloc_physical_block = first_block + count - 1;
+ }
+
+ inode->i_ctime = CURRENT_TIME_SEC;
+ ext3_mark_inode_dirty(handle, inode);
+
+ up(&ei->truncate_sem);
+ if (err)
+ goto cleanup;
+
+ set_buffer_new(bh_result);
+got_it:
+ map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
+ if (blocks_to_boundary == 0)
+ set_buffer_boundary(bh_result);
+ /* Clean up and exit */
+ partial = chain+depth-1; /* the whole chain */
+cleanup:
+ while (partial > chain) {
+ BUFFER_TRACE(partial->bh, "call brelse");
+ brelse(partial->bh);
+ partial--;
+ }
+ BUFFER_TRACE(bh_result, "returned");
+out:
+ *maxblocks = count;
+ return err;
+}
+
+static int ext3_get_blocks(struct inode *inode, sector_t iblock,
+ unsigned long maxblocks, struct buffer_head *bh_result,
+ int create)
{
handle_t *handle = NULL;
int ret;
@@ -787,15 +1037,21 @@ static int ext3_get_block(struct inode *
handle = ext3_journal_current_handle();
J_ASSERT(handle != 0);
}
- ret = ext3_get_block_handle(handle, inode, iblock,
+ ret = ext3_get_blocks_handle(handle, inode, iblock, &maxblocks,
bh_result, create, 1);
- return ret;
+ bh_result->b_size = (maxblocks << inode->i_blkbits);
+ return ret;
+}
+
+static int ext3_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ return ext3_get_blocks(inode, iblock, 1, bh_result, create);
}
#define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32)
-static int
-ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock,
+static int ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
int create)
{
@@ -831,10 +1087,14 @@ ext3_direct_io_get_blocks(struct inode *
}
get_block:
+ if (GBS_DEBUG)
+ printk("Calling ext3_get_blocks_handle from dio: maxblocks= %d, iblock = %d", (int)max_blocks, (int)iblock);
if (ret == 0)
- ret = ext3_get_block_handle(handle, inode, iblock,
+ ret = ext3_get_blocks_handle(handle, inode, iblock, &max_blocks,
bh_result, create, 0);
- bh_result->b_size = (1 << inode->i_blkbits);
+ bh_result->b_size = (max_blocks << inode->i_blkbits);
+ if (GBS_DEBUG)
+ printk("ext3_get_blocks_handle returns to dio: maxblocks= %d, iblock = %d", (int)max_blocks, (int)iblock);
return ret;
}
diff -puN fs/ext3/xattr.c~ext3_multiple_blocks_allocation fs/ext3/xattr.c
--- linux-2.6.11/fs/ext3/xattr.c~ext3_multiple_blocks_allocation 2005-04-28 12:12:06.177474100 -0700
+++ linux-2.6.11-ming/fs/ext3/xattr.c 2005-04-28 12:12:06.194472114 -0700
@@ -796,7 +796,8 @@ inserted:
EXT3_SB(sb)->s_es->s_first_data_block) +
EXT3_I(inode)->i_block_group *
EXT3_BLOCKS_PER_GROUP(sb);
- int block = ext3_new_block(handle, inode, goal, &error);
+ unsigned long count = 1;
+ int block = ext3_new_blocks(handle, inode, goal, &count, &error);
if (error)
goto cleanup;
ea_idebug(inode, "creating block %d", block);
diff -puN include/linux/ext3_fs.h~ext3_multiple_blocks_allocation include/linux/ext3_fs.h
--- linux-2.6.11/include/linux/ext3_fs.h~ext3_multiple_blocks_allocation 2005-04-28 12:12:06.180473750 -0700
+++ linux-2.6.11-ming/include/linux/ext3_fs.h 2005-04-28 12:12:06.196471880 -0700
@@ -714,7 +714,7 @@ struct dir_private_info {
/* balloc.c */
extern int ext3_bg_has_super(struct super_block *sb, int group);
extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
-extern int ext3_new_block (handle_t *, struct inode *, unsigned long, int *);
+extern int ext3_new_blocks (handle_t *, struct inode *, unsigned long, unsigned long*, int *);
extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long,
unsigned long);
extern void ext3_free_blocks_sb (handle_t *, struct super_block *,
_
On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote:
> Currently ext3_get_block()/ext3_new_block() only allocate one block at a
> time. To allocate multiple blocks, the caller, for example, ext3 direct
> IO routine, has to invoke ext3_get_block() many times. This is quite
> inefficient for sequential IO workload.
>
> The benefit of a real get_blocks() include
> 1) increase the possibility to get contiguous blocks, reduce possibility
> of fragmentation due to interleaved allocations from other threads.
> (should good for non reservation case)
> 2) Reduces CPU cycles spent in repeated get_block() calls
> 3) Batch meta data update and journaling in one short
> 4) Could possibly speed up future get_blocks() look up by cache the last
> mapped blocks in inode.
>
And here is the patch to make mpage_writepages use get_blocks() for
multiple block lookup/allocation. It performs a radix-tree lookup for
contiguous dirty pages, and issues get_blocks() for the range as a
whole. It maintains
an mpageio structure to track intermediate mapping state, somewhat
like the DIO code.
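For reference, the prototypes involved look roughly like this
(get_blocks_t is the ranged interface that direct IO already uses,
get_block_t is what mpage_writepages used to take):

typedef int (get_block_t)(struct inode *inode, sector_t iblock,
                        struct buffer_head *bh_result, int create);

typedef int (get_blocks_t)(struct inode *inode, sector_t iblock,
                        unsigned long max_blocks,
                        struct buffer_head *bh_result, int create);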
It does need some more testing, especially with block_size < PAGE_SIZE.
The JFS workaround can be dropped if the JFS get_blocks fix from
Dave Kleikamp is integrated.
Review feedback would be welcome.
Mingming,
Let me know if you have a chance to try this out with your patch.
Regards
Suparna
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
diff -urp -X dontdiff linux-2.6.12-rc3/fs/buffer.c linux-2.6.12-rc3-getblocks/fs/buffer.c
--- linux-2.6.12-rc3/fs/buffer.c 2005-04-21 05:33:15.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/fs/buffer.c 2005-04-22 15:08:33.000000000 +0530
@@ -2514,53 +2514,10 @@ EXPORT_SYMBOL(nobh_commit_write);
* that it tries to operate without attaching bufferheads to
* the page.
*/
-int nobh_writepage(struct page *page, get_block_t *get_block,
- struct writeback_control *wbc)
+int nobh_writepage(struct page *page, get_blocks_t *get_blocks,
+ struct writeback_control *wbc, writepage_t bh_writepage_fn)
{
- struct inode * const inode = page->mapping->host;
- loff_t i_size = i_size_read(inode);
- const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
- unsigned offset;
- void *kaddr;
- int ret;
-
- /* Is the page fully inside i_size? */
- if (page->index < end_index)
- goto out;
-
- /* Is the page fully outside i_size? (truncate in progress) */
- offset = i_size & (PAGE_CACHE_SIZE-1);
- if (page->index >= end_index+1 || !offset) {
- /*
- * The page may have dirty, unmapped buffers. For example,
- * they may have been added in ext3_writepage(). Make them
- * freeable here, so the page does not leak.
- */
-#if 0
- /* Not really sure about this - do we need this ? */
- if (page->mapping->a_ops->invalidatepage)
- page->mapping->a_ops->invalidatepage(page, offset);
-#endif
- unlock_page(page);
- return 0; /* don't care */
- }
-
- /*
- * The page straddles i_size. It must be zeroed out on each and every
- * writepage invocation because it may be mmapped. "A file is mapped
- * in multiples of the page size. For a file that is not a multiple of
- * the page size, the remaining memory is zeroed when mapped, and
- * writes to that region are not written out to the file."
- */
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
-out:
- ret = mpage_writepage(page, get_block, wbc);
- if (ret == -EAGAIN)
- ret = __block_write_full_page(inode, page, get_block, wbc);
- return ret;
+ return mpage_writepage(page, get_blocks, wbc, bh_writepage_fn);
}
EXPORT_SYMBOL(nobh_writepage);
diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext2/inode.c linux-2.6.12-rc3-getblocks/fs/ext2/inode.c
--- linux-2.6.12-rc3/fs/ext2/inode.c 2005-04-21 05:33:15.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/fs/ext2/inode.c 2005-04-22 16:30:42.000000000 +0530
@@ -639,12 +639,6 @@ ext2_nobh_prepare_write(struct file *fil
return nobh_prepare_write(page,from,to,ext2_get_block);
}
-static int ext2_nobh_writepage(struct page *page,
- struct writeback_control *wbc)
-{
- return nobh_writepage(page, ext2_get_block, wbc);
-}
-
static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
{
return generic_block_bmap(mapping,block,ext2_get_block);
@@ -662,6 +656,12 @@ ext2_get_blocks(struct inode *inode, sec
return ret;
}
+static int ext2_nobh_writepage(struct page *page,
+ struct writeback_control *wbc)
+{
+ return nobh_writepage(page, ext2_get_blocks, wbc, ext2_writepage);
+}
+
static ssize_t
ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs)
@@ -676,7 +676,8 @@ ext2_direct_IO(int rw, struct kiocb *ioc
static int
ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
- return mpage_writepages(mapping, wbc, ext2_get_block);
+ return __mpage_writepages(mapping, wbc, ext2_get_blocks,
+ ext2_writepage);
}
struct address_space_operations ext2_aops = {
diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext3/inode.c linux-2.6.12-rc3-getblocks/fs/ext3/inode.c
--- linux-2.6.12-rc3/fs/ext3/inode.c 2005-04-21 05:33:15.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/fs/ext3/inode.c 2005-04-22 15:08:33.000000000 +0530
@@ -866,10 +866,10 @@ get_block:
return ret;
}
-static int ext3_writepages_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh, int create)
+static int ext3_writepages_get_blocks(struct inode *inode, sector_t iblock,
+ unsigned long max_blocks, struct buffer_head *bh, int create)
{
- return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create);
+ return ext3_direct_io_get_blocks(inode, iblock, max_blocks, bh, create);
}
/*
@@ -1369,11 +1369,11 @@ ext3_writeback_writepages(struct address
return ret;
}
- ret = __mpage_writepages(mapping, wbc, ext3_writepages_get_block,
+ ret = __mpage_writepages(mapping, wbc, ext3_writepages_get_blocks,
ext3_writeback_writepage_helper);
/*
- * Need to reaquire the handle since ext3_writepages_get_block()
+ * Need to reaquire the handle since ext3_writepages_get_blocks()
* can restart the handle
*/
handle = journal_current_handle();
@@ -1402,7 +1402,8 @@ static int ext3_writeback_writepage(stru
}
if (test_opt(inode->i_sb, NOBH))
- ret = nobh_writepage(page, ext3_get_block, wbc);
+ ret = nobh_writepage(page, ext3_writepages_get_blocks, wbc,
+ ext3_writeback_writepage_helper);
else
ret = block_write_full_page(page, ext3_get_block, wbc);
diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext3/super.c linux-2.6.12-rc3-getblocks/fs/ext3/super.c
--- linux-2.6.12-rc3/fs/ext3/super.c 2005-04-21 05:33:15.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/fs/ext3/super.c 2005-04-22 15:08:33.000000000 +0530
@@ -1321,6 +1321,7 @@ static int ext3_fill_super (struct super
sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
set_opt(sbi->s_mount_opt, RESERVATION);
+ set_opt(sbi->s_mount_opt, NOBH); /* temp: set nobh default */
if (!parse_options ((char *) data, sb, &journal_inum, NULL, 0))
goto failed_mount;
@@ -1567,6 +1568,7 @@ static int ext3_fill_super (struct super
printk(KERN_ERR "EXT3-fs: Journal does not support "
"requested data journaling mode\n");
goto failed_mount3;
+ set_opt(sbi->s_mount_opt, NOBH); /* temp: set nobh default */
}
default:
break;
@@ -1584,6 +1586,7 @@ static int ext3_fill_super (struct super
"its supported only with writeback mode\n");
clear_opt(sbi->s_mount_opt, NOBH);
}
+ printk("NOBH option set\n");
}
/*
* The journal_load will have done any necessary log recovery,
diff -urp -X dontdiff linux-2.6.12-rc3/fs/hfs/inode.c linux-2.6.12-rc3-getblocks/fs/hfs/inode.c
--- linux-2.6.12-rc3/fs/hfs/inode.c 2005-04-21 05:33:15.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/fs/hfs/inode.c 2005-04-22 15:08:33.000000000 +0530
@@ -124,7 +124,7 @@ static ssize_t hfs_direct_IO(int rw, str
static int hfs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- return mpage_writepages(mapping, wbc, hfs_get_block);
+ return mpage_writepages(mapping, wbc, hfs_get_blocks);
}
struct address_space_operations hfs_btree_aops = {
diff -urp -X dontdiff linux-2.6.12-rc3/fs/hfsplus/inode.c linux-2.6.12-rc3-getblocks/fs/hfsplus/inode.c
--- linux-2.6.12-rc3/fs/hfsplus/inode.c 2005-04-21 05:33:15.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/fs/hfsplus/inode.c 2005-04-22 15:08:33.000000000 +0530
@@ -121,7 +121,7 @@ static ssize_t hfsplus_direct_IO(int rw,
static int hfsplus_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- return mpage_writepages(mapping, wbc, hfsplus_get_block);
+ return mpage_writepages(mapping, wbc, hfsplus_get_blocks);
}
struct address_space_operations hfsplus_btree_aops = {
diff -urp -X dontdiff linux-2.6.12-rc3/fs/jfs/inode.c linux-2.6.12-rc3-getblocks/fs/jfs/inode.c
--- linux-2.6.12-rc3/fs/jfs/inode.c 2005-04-21 05:33:15.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/fs/jfs/inode.c 2005-04-22 16:27:19.000000000 +0530
@@ -267,21 +267,41 @@ jfs_get_blocks(struct inode *ip, sector_
return rc;
}
+static int
+jfs_mpage_get_blocks(struct inode *ip, sector_t lblock, unsigned long
+ max_blocks, struct buffer_head *bh_result, int create)
+{
+ /*
+ * fixme: temporary workaround: return one block at a time until
+ * we figure out why we see exposures with truncate on
+ * allocating multiple blocks in one shot.
+ */
+ return jfs_get_blocks(ip, lblock, 1, bh_result, create);
+}
+
static int jfs_get_block(struct inode *ip, sector_t lblock,
struct buffer_head *bh_result, int create)
{
return jfs_get_blocks(ip, lblock, 1, bh_result, create);
}
+static int jfs_bh_writepage(struct page *page,
+ struct writeback_control *wbc)
+{
+ return block_write_full_page(page, jfs_get_block, wbc);
+}
+
+
static int jfs_writepage(struct page *page, struct writeback_control *wbc)
{
- return nobh_writepage(page, jfs_get_block, wbc);
+ return nobh_writepage(page, jfs_mpage_get_blocks, wbc, jfs_bh_writepage);
}
static int jfs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- return mpage_writepages(mapping, wbc, jfs_get_block);
+ return __mpage_writepages(mapping, wbc, jfs_mpage_get_blocks,
+ jfs_bh_writepage);
}
static int jfs_readpage(struct file *file, struct page *page)
diff -urp -X dontdiff linux-2.6.12-rc3/fs/mpage.c linux-2.6.12-rc3-getblocks/fs/mpage.c
--- linux-2.6.12-rc3/fs/mpage.c 2005-04-21 05:33:15.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/fs/mpage.c 2005-04-22 16:19:14.000000000 +0530
@@ -370,6 +370,67 @@ int mpage_readpage(struct page *page, ge
}
EXPORT_SYMBOL(mpage_readpage);
+struct mpageio {
+ struct bio *bio;
+ struct buffer_head map_bh;
+ unsigned long block_in_file;
+ unsigned long final_block_in_request;
+ sector_t long block_in_bio;
+ int boundary;
+ sector_t boundary_block;
+ struct block_device *boundary_bdev;
+};
+
+/*
+ * Maps as many contiguous disk blocks as it can within the range of
+ * the request, and returns the total number of contiguous mapped
+ * blocks in the mpageio.
+ */
+static unsigned long mpage_get_more_blocks(struct mpageio *mio,
+ struct inode *inode, get_blocks_t get_blocks)
+{
+ struct buffer_head map_bh = {.b_state = 0};
+ unsigned long mio_nblocks = mio->map_bh.b_size >> inode->i_blkbits;
+ unsigned long first_unmapped = mio->block_in_file + mio_nblocks;
+ unsigned long next_contig_block = mio->map_bh.b_blocknr + mio_nblocks;
+
+ while ((first_unmapped < mio->final_block_in_request) &&
+ (mio->map_bh.b_size < PAGE_SIZE)) {
+
+ if (get_blocks(inode, first_unmapped,
+ mio->final_block_in_request - first_unmapped,
+ &map_bh, 1))
+ break;
+ if (mio_nblocks && ((map_bh.b_blocknr != next_contig_block) ||
+ map_bh.b_bdev != mio->map_bh.b_bdev))
+ break;
+
+ if (buffer_new(&map_bh)) {
+ int i = 0;
+ for (; i < map_bh.b_size >> inode->i_blkbits; i++)
+ unmap_underlying_metadata(map_bh.b_bdev,
+ map_bh.b_blocknr + i);
+ }
+
+ if (buffer_boundary(&map_bh)) {
+ mio->boundary = 1;
+ mio->boundary_block = map_bh.b_blocknr;
+ mio->boundary_bdev = map_bh.b_bdev;
+ }
+ if (mio_nblocks == 0) {
+ mio->map_bh.b_bdev = map_bh.b_bdev;
+ mio->map_bh.b_blocknr = map_bh.b_blocknr;
+ }
+
+ mio_nblocks += map_bh.b_size >> inode->i_blkbits;
+ first_unmapped = mio->block_in_file + mio_nblocks;
+ next_contig_block = mio->map_bh.b_blocknr + mio_nblocks;
+ mio->map_bh.b_size += map_bh.b_size;
+ }
+
+ return mio_nblocks;
+}
+
/*
* Writing is not so simple.
*
@@ -386,9 +447,9 @@ EXPORT_SYMBOL(mpage_readpage);
* written, so it can intelligently allocate a suitably-sized BIO. For now,
* just allocate full-size (16-page) BIOs.
*/
-static struct bio *
-__mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
- sector_t *last_block_in_bio, int *ret, struct writeback_control *wbc,
+static int
+__mpage_writepage(struct mpageio *mio, struct page *page,
+ get_blocks_t get_blocks, struct writeback_control *wbc,
writepage_t writepage_fn)
{
struct address_space *mapping = page->mapping;
@@ -396,9 +457,8 @@ __mpage_writepage(struct bio *bio, struc
const unsigned blkbits = inode->i_blkbits;
unsigned long end_index;
const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
- sector_t last_block;
+ sector_t last_block, blocks_to_skip;
sector_t block_in_file;
- sector_t blocks[MAX_BUF_PER_PAGE];
unsigned page_block;
unsigned first_unmapped = blocks_per_page;
struct block_device *bdev = NULL;
@@ -406,8 +466,10 @@ __mpage_writepage(struct bio *bio, struc
sector_t boundary_block = 0;
struct block_device *boundary_bdev = NULL;
int length;
- struct buffer_head map_bh;
loff_t i_size = i_size_read(inode);
+ struct buffer_head *map_bh = &mio->map_bh;
+ struct bio *bio = mio->bio;
+ int ret = 0;
if (page_has_buffers(page)) {
struct buffer_head *head = page_buffers(page);
@@ -435,10 +497,13 @@ __mpage_writepage(struct bio *bio, struc
if (!buffer_dirty(bh) || !buffer_uptodate(bh))
goto confused;
if (page_block) {
- if (bh->b_blocknr != blocks[page_block-1] + 1)
+ if (bh->b_blocknr != map_bh->b_blocknr
+ + page_block)
goto confused;
+ } else {
+ map_bh->b_blocknr = bh->b_blocknr;
+ map_bh->b_size = PAGE_SIZE;
}
- blocks[page_block++] = bh->b_blocknr;
boundary = buffer_boundary(bh);
if (boundary) {
boundary_block = bh->b_blocknr;
@@ -465,33 +530,30 @@ __mpage_writepage(struct bio *bio, struc
BUG_ON(!PageUptodate(page));
block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
last_block = (i_size - 1) >> blkbits;
- map_bh.b_page = page;
- for (page_block = 0; page_block < blocks_per_page; ) {
-
- map_bh.b_state = 0;
- if (get_block(inode, block_in_file, &map_bh, 1))
- goto confused;
- if (buffer_new(&map_bh))
- unmap_underlying_metadata(map_bh.b_bdev,
- map_bh.b_blocknr);
- if (buffer_boundary(&map_bh)) {
- boundary_block = map_bh.b_blocknr;
- boundary_bdev = map_bh.b_bdev;
- }
- if (page_block) {
- if (map_bh.b_blocknr != blocks[page_block-1] + 1)
- goto confused;
- }
- blocks[page_block++] = map_bh.b_blocknr;
- boundary = buffer_boundary(&map_bh);
- bdev = map_bh.b_bdev;
- if (block_in_file == last_block)
- break;
- block_in_file++;
+ blocks_to_skip = block_in_file - mio->block_in_file;
+ mio->block_in_file = block_in_file;
+ if (blocks_to_skip < (map_bh->b_size >> blkbits)) {
+ map_bh->b_blocknr += blocks_to_skip;
+ map_bh->b_size -= blocks_to_skip << blkbits;
+ } else {
+ map_bh->b_state = 0;
+ map_bh->b_size = 0;
+ if (mio->final_block_in_request > last_block)
+ mio->final_block_in_request = last_block;
+ mpage_get_more_blocks(mio, inode, get_blocks);
}
- BUG_ON(page_block == 0);
+ if (map_bh->b_size < PAGE_SIZE)
+ goto confused;
- first_unmapped = page_block;
+ if (mio->boundary && (mio->boundary_block < map_bh->b_blocknr
+ + blocks_per_page)) {
+ boundary = 1;
+ boundary_block = mio->boundary_block;
+ boundary_bdev = mio->boundary_bdev;
+ }
+
+ bdev = map_bh->b_bdev;
+ first_unmapped = blocks_per_page;
page_is_mapped:
end_index = i_size >> PAGE_CACHE_SHIFT;
@@ -518,12 +580,16 @@ page_is_mapped:
/*
* This page will go to BIO. Do we need to send this BIO off first?
*/
- if (bio && *last_block_in_bio != blocks[0] - 1)
+ if (bio && mio->block_in_bio != map_bh->b_blocknr - 1)
bio = mpage_bio_submit(WRITE, bio);
alloc_new:
if (bio == NULL) {
- bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+ /*
+ * Fixme: bio size can be limited to final_block - block, or
+ * even mio->map_bh.b_size
+ */
+ bio = mpage_alloc(bdev, map_bh->b_blocknr << (blkbits - 9),
bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
if (bio == NULL)
goto confused;
@@ -539,6 +605,9 @@ alloc_new:
bio = mpage_bio_submit(WRITE, bio);
goto alloc_new;
}
+ map_bh->b_blocknr += blocks_per_page;
+ map_bh->b_size -= PAGE_SIZE;
+ mio->block_in_file += blocks_per_page;
/*
* OK, we have our BIO, so we can now mark the buffers clean. Make
@@ -575,7 +644,8 @@ alloc_new:
boundary_block, 1 << blkbits);
}
} else {
- *last_block_in_bio = blocks[blocks_per_page - 1];
+ /* we can pack more pages into the bio, don't submit yet */
+ mio->block_in_bio = map_bh->b_blocknr - 1;
}
goto out;
@@ -584,22 +654,23 @@ confused:
bio = mpage_bio_submit(WRITE, bio);
if (writepage_fn) {
- *ret = (*writepage_fn)(page, wbc);
+ ret = (*writepage_fn)(page, wbc);
} else {
- *ret = -EAGAIN;
+ ret = -EAGAIN;
goto out;
}
/*
* The caller has a ref on the inode, so *mapping is stable
*/
- if (*ret) {
- if (*ret == -ENOSPC)
+ if (ret) {
+ if (ret == -ENOSPC)
set_bit(AS_ENOSPC, &mapping->flags);
else
set_bit(AS_EIO, &mapping->flags);
}
out:
- return bio;
+ mio->bio = bio;
+ return ret;
}
/**
@@ -625,20 +696,21 @@ out:
*/
int
mpage_writepages(struct address_space *mapping,
- struct writeback_control *wbc, get_block_t get_block)
+ struct writeback_control *wbc, get_blocks_t get_blocks)
{
- return __mpage_writepages(mapping, wbc, get_block,
+ return __mpage_writepages(mapping, wbc, get_blocks,
mapping->a_ops->writepage);
}
int
__mpage_writepages(struct address_space *mapping,
- struct writeback_control *wbc, get_block_t get_block,
+ struct writeback_control *wbc, get_blocks_t get_blocks,
writepage_t writepage_fn)
{
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct inode *inode = mapping->host;
+ const unsigned blkbits = inode->i_blkbits;
struct bio *bio = NULL;
- sector_t last_block_in_bio = 0;
int ret = 0;
int done = 0;
int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -648,6 +720,9 @@ __mpage_writepages(struct address_space
pgoff_t end = -1; /* Inclusive */
int scanned = 0;
int is_range = 0;
+ struct mpageio mio = {
+ .bio = NULL
+ };
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
@@ -655,7 +730,7 @@ __mpage_writepages(struct address_space
}
writepage = NULL;
- if (get_block == NULL)
+ if (get_blocks == NULL)
writepage = mapping->a_ops->writepage;
pagevec_init(&pvec, 0);
@@ -672,12 +747,15 @@ __mpage_writepages(struct address_space
scanned = 1;
}
retry:
+ down_read(&inode->i_alloc_sem);
while (!done && (index <= end) &&
- (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
- PAGECACHE_TAG_DIRTY,
+ (nr_pages = pagevec_contig_lookup_tag(&pvec, mapping,
+ &index, PAGECACHE_TAG_DIRTY,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
+ mio.final_block_in_request = min(index, end) <<
+ (PAGE_CACHE_SHIFT - blkbits);
scanned = 1;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
@@ -702,7 +780,7 @@ retry:
unlock_page(page);
continue;
}
-
+
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);
@@ -723,9 +801,9 @@ retry:
&mapping->flags);
}
} else {
- bio = __mpage_writepage(bio, page, get_block,
- &last_block_in_bio, &ret, wbc,
- writepage_fn);
+ ret = __mpage_writepage(&mio, page, get_blocks,
+ wbc, writepage_fn);
+ bio = mio.bio;
}
if (ret || (--(wbc->nr_to_write) <= 0))
done = 1;
@@ -737,6 +815,9 @@ retry:
pagevec_release(&pvec);
cond_resched();
}
+
+ up_read(&inode->i_alloc_sem);
+
if (!scanned && !done) {
/*
* We hit the last page and there is more work to be done: wrap
@@ -755,17 +836,23 @@ retry:
EXPORT_SYMBOL(mpage_writepages);
EXPORT_SYMBOL(__mpage_writepages);
-int mpage_writepage(struct page *page, get_block_t get_block,
- struct writeback_control *wbc)
+int mpage_writepage(struct page *page, get_blocks_t get_blocks,
+ struct writeback_control *wbc, writepage_t writepage_fn)
{
int ret = 0;
- struct bio *bio;
- sector_t last_block_in_bio = 0;
-
- bio = __mpage_writepage(NULL, page, get_block,
- &last_block_in_bio, &ret, wbc, NULL);
- if (bio)
- mpage_bio_submit(WRITE, bio);
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ const unsigned blkbits = inode->i_blkbits;
+ struct mpageio mio = {
+ .final_block_in_request = (page->index + 1) << (PAGE_CACHE_SHIFT
+ - blkbits)
+ };
+
+ dump_stack();
+ ret = __mpage_writepage(&mio, page, get_blocks,
+ wbc, writepage_fn);
+ if (mio.bio)
+ mpage_bio_submit(WRITE, mio.bio);
return ret;
}
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/buffer_head.h linux-2.6.12-rc3-getblocks/include/linux/buffer_head.h
--- linux-2.6.12-rc3/include/linux/buffer_head.h 2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/buffer_head.h 2005-04-22 15:08:33.000000000 +0530
@@ -203,8 +203,8 @@ int file_fsync(struct file *, struct den
int nobh_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
int nobh_commit_write(struct file *, struct page *, unsigned, unsigned);
int nobh_truncate_page(struct address_space *, loff_t);
-int nobh_writepage(struct page *page, get_block_t *get_block,
- struct writeback_control *wbc);
+int nobh_writepage(struct page *page, get_blocks_t *get_blocks,
+ struct writeback_control *wbc, writepage_t bh_writepage);
/*
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/fs.h linux-2.6.12-rc3-getblocks/include/linux/fs.h
--- linux-2.6.12-rc3/include/linux/fs.h 2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/fs.h 2005-04-22 15:08:33.000000000 +0530
@@ -304,6 +304,8 @@ struct address_space;
struct writeback_control;
struct kiocb;
+typedef int (writepage_t)(struct page *page, struct writeback_control *wbc);
+
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/mpage.h linux-2.6.12-rc3-getblocks/include/linux/mpage.h
--- linux-2.6.12-rc3/include/linux/mpage.h 2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/mpage.h 2005-04-22 15:08:33.000000000 +0530
@@ -11,17 +11,16 @@
*/
struct writeback_control;
-typedef int (writepage_t)(struct page *page, struct writeback_control *wbc);
int mpage_readpages(struct address_space *mapping, struct list_head *pages,
unsigned nr_pages, get_block_t get_block);
int mpage_readpage(struct page *page, get_block_t get_block);
int mpage_writepages(struct address_space *mapping,
- struct writeback_control *wbc, get_block_t get_block);
-int mpage_writepage(struct page *page, get_block_t *get_block,
- struct writeback_control *wbc);
+ struct writeback_control *wbc, get_blocks_t get_blocks);
+int mpage_writepage(struct page *page, get_blocks_t *get_blocks,
+ struct writeback_control *wbc, writepage_t writepage);
int __mpage_writepages(struct address_space *mapping,
- struct writeback_control *wbc, get_block_t get_block,
+ struct writeback_control *wbc, get_blocks_t get_blocks,
writepage_t writepage);
static inline int
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/pagemap.h linux-2.6.12-rc3-getblocks/include/linux/pagemap.h
--- linux-2.6.12-rc3/include/linux/pagemap.h 2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/pagemap.h 2005-04-22 15:08:33.000000000 +0530
@@ -73,7 +73,8 @@ extern struct page * find_or_create_page
unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
- int tag, unsigned int nr_pages, struct page **pages);
+ int tag, unsigned int nr_pages, struct page **pages,
+ int contig);
/*
* Returns locked page at given index in given cache, creating it if needed.
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/pagevec.h linux-2.6.12-rc3-getblocks/include/linux/pagevec.h
--- linux-2.6.12-rc3/include/linux/pagevec.h 2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/pagevec.h 2005-04-22 15:08:33.000000000 +0530
@@ -28,6 +28,9 @@ unsigned pagevec_lookup(struct pagevec *
unsigned pagevec_lookup_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, int tag,
unsigned nr_pages);
+unsigned pagevec_contig_lookup_tag(struct pagevec *pvec,
+ struct address_space *mapping, pgoff_t *index, int tag,
+ unsigned nr_pages);
static inline void pagevec_init(struct pagevec *pvec, int cold)
{
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/radix-tree.h linux-2.6.12-rc3-getblocks/include/linux/radix-tree.h
--- linux-2.6.12-rc3/include/linux/radix-tree.h 2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/radix-tree.h 2005-04-22 15:08:33.000000000 +0530
@@ -59,8 +59,18 @@ void *radix_tree_tag_clear(struct radix_
int radix_tree_tag_get(struct radix_tree_root *root,
unsigned long index, int tag);
unsigned int
-radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
- unsigned long first_index, unsigned int max_items, int tag);
+__radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
+ unsigned long first_index, unsigned int max_items, int tag,
+ int contig);
+
+static inline unsigned int radix_tree_gang_lookup_tag(struct radix_tree_root
+ *root, void **results, unsigned long first_index,
+ unsigned int max_items, int tag)
+{
+ return __radix_tree_gang_lookup_tag(root, results, first_index,
+ max_items, tag, 0);
+}
+
int radix_tree_tagged(struct radix_tree_root *root, int tag);
static inline void radix_tree_preload_end(void)
diff -urp -X dontdiff linux-2.6.12-rc3/lib/radix-tree.c linux-2.6.12-rc3-getblocks/lib/radix-tree.c
--- linux-2.6.12-rc3/lib/radix-tree.c 2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/lib/radix-tree.c 2005-04-22 16:34:29.000000000 +0530
@@ -557,12 +557,13 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
*/
static unsigned int
__lookup_tag(struct radix_tree_root *root, void **results, unsigned long index,
- unsigned int max_items, unsigned long *next_index, int tag)
+ unsigned int max_items, unsigned long *next_index, int tag, int contig)
{
unsigned int nr_found = 0;
unsigned int shift;
unsigned int height = root->height;
struct radix_tree_node *slot;
+ unsigned long cindex = (contig && (*next_index)) ? *next_index : -1;
shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
slot = root->rnode;
@@ -575,6 +576,11 @@ __lookup_tag(struct radix_tree_root *roo
BUG_ON(slot->slots[i] == NULL);
break;
}
+ if (contig && index >= cindex) {
+ /* break in contiguity */
+ index = 0;
+ goto out;
+ }
index &= ~((1UL << shift) - 1);
index += 1UL << shift;
if (index == 0)
@@ -593,6 +599,10 @@ __lookup_tag(struct radix_tree_root *roo
results[nr_found++] = slot->slots[j];
if (nr_found == max_items)
goto out;
+ } else if (contig && nr_found) {
+ /* break in contiguity */
+ index = 0;
+ goto out;
}
}
}
@@ -618,29 +628,32 @@ out:
* returns the number of items which were placed at *@results.
*/
unsigned int
-radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
- unsigned long first_index, unsigned int max_items, int tag)
+__radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
+ unsigned long first_index, unsigned int max_items, int tag,
+ int contig)
{
const unsigned long max_index = radix_tree_maxindex(root->height);
unsigned long cur_index = first_index;
+ unsigned long next_index = 0; /* Index of next contiguous search */
unsigned int ret = 0;
while (ret < max_items) {
unsigned int nr_found;
- unsigned long next_index; /* Index of next search */
if (cur_index > max_index)
break;
nr_found = __lookup_tag(root, results + ret, cur_index,
- max_items - ret, &next_index, tag);
+ max_items - ret, &next_index, tag, contig);
ret += nr_found;
if (next_index == 0)
break;
cur_index = next_index;
+ if (!nr_found)
+ next_index = 0;
}
return ret;
}
-EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
+EXPORT_SYMBOL(__radix_tree_gang_lookup_tag);
/**
* radix_tree_delete - delete an item from a radix tree
diff -urp -X dontdiff linux-2.6.12-rc3/mm/filemap.c linux-2.6.12-rc3-getblocks/mm/filemap.c
--- linux-2.6.12-rc3/mm/filemap.c 2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/mm/filemap.c 2005-04-22 16:20:30.000000000 +0530
@@ -635,16 +635,19 @@ unsigned find_get_pages(struct address_s
/*
* Like find_get_pages, except we only return pages which are tagged with
* `tag'. We update *index to index the next page for the traversal.
+ * If 'contig' is 1, then we return only pages which are contiguous in the
+ * file.
*/
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
- int tag, unsigned int nr_pages, struct page **pages)
+ int tag, unsigned int nr_pages, struct page **pages,
+ int contig)
{
unsigned int i;
unsigned int ret;
read_lock_irq(&mapping->tree_lock);
- ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
- (void **)pages, *index, nr_pages, tag);
+ ret = __radix_tree_gang_lookup_tag(&mapping->page_tree,
+ (void **)pages, *index, nr_pages, tag, contig);
for (i = 0; i < ret; i++)
page_cache_get(pages[i]);
if (ret)
diff -urp -X dontdiff linux-2.6.12-rc3/mm/swap.c linux-2.6.12-rc3-getblocks/mm/swap.c
--- linux-2.6.12-rc3/mm/swap.c 2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/mm/swap.c 2005-04-22 15:08:33.000000000 +0530
@@ -384,7 +384,16 @@ unsigned pagevec_lookup_tag(struct pagev
pgoff_t *index, int tag, unsigned nr_pages)
{
pvec->nr = find_get_pages_tag(mapping, index, tag,
- nr_pages, pvec->pages);
+ nr_pages, pvec->pages, 0);
+ return pagevec_count(pvec);
+}
+
+unsigned int
+pagevec_contig_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
+ pgoff_t *index, int tag, unsigned nr_pages)
+{
+ pvec->nr = find_get_pages_tag(mapping, index, tag,
+ nr_pages, pvec->pages, 1);
return pagevec_count(pvec);
}
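As a rough illustration of the "contig" lookup behaviour this patch adds
(pagevec_contig_lookup_tag() gathers dirty pages but stops at the first
break in page-index contiguity), here is a small self-contained userspace
model. The array standing in for the radix tree and the function names
below are made up for illustration; this is not kernel code.

#include <stdio.h>

typedef unsigned long pgoff_t;

/* gather up to nr_pages "dirty" indices >= *index, stopping at a gap */
static unsigned contig_lookup_tag(const pgoff_t *dirty, unsigned ndirty,
				  pgoff_t *index, pgoff_t *out,
				  unsigned nr_pages)
{
	unsigned found = 0;
	unsigned i;

	for (i = 0; i < ndirty && found < nr_pages; i++) {
		if (dirty[i] < *index)
			continue;
		if (found && dirty[i] != out[found - 1] + 1)
			break;			/* break in contiguity */
		out[found++] = dirty[i];
	}
	if (found)
		*index = out[found - 1] + 1;	/* resume point for next call */
	return found;
}

int main(void)
{
	const pgoff_t dirty[] = { 3, 4, 5, 9, 10 };	/* dirty page indices */
	pgoff_t index = 0, out[16];
	unsigned n;

	while ((n = contig_lookup_tag(dirty, 5, &index, out, 16)) != 0)
		printf("run of %u page(s) starting at index %lu\n", n, out[0]);
	return 0;
}

Each call returns one contiguous run (3-5, then 9-10), which is exactly
the property __mpage_writepage relies on when it maps a whole range with
a single get_blocks() call.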
On Thu, 2005-04-28 at 11:34 -0700, Mingming Cao wrote:
> On Thu, 2005-04-28 at 12:12 -0400, Lee Revell wrote:
> > On Thu, 2005-04-28 at 00:37 -0700, Mingming Cao wrote:
> > > On Wed, 2005-04-27 at 23:45 -0400, Lee Revell wrote:
> > > > On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote:
> > > > > Please review. I have tested on fsx on SMP box. Lee, if you got time,
> > > > > please try this patch.
> > > >
> > > > I have tested and this does fix the problem. I ran my tests and no ext3
> > > > code paths showed up on the latency tracer at all, it never went above
> > > > 33 usecs.
> > > >
> > > Thanks, Lee.
> > >
> > > The patch survived on many fsx test over 20 hours on a 2cpu machine.
> > > Tested the patch on the same machine with tiobench (1-64 threads), and
> > > untar a kernel tree test, no regression there.
> > > However I see about 5-7% throughput drop on dbench with 16 threads. It
> > > probably due to the cpu cost that we have discussed.
> >
> > Hmm, I guess someone needs to test it on a bigger system. AFAICT this
> > should improve SMP scalability quite a bit. Maybe that lock is rarely
> > contended.
>
> Probably. I will try it on a relatively full filesystem(where bitmap
> scan to find a free block takes time), and on a 4 cpu box with many
> threads allocating at the same time.
I ran 64-thread dbench on a 4-way box on a 90% full filesystem; the
patched kernel is slightly faster. I think part of the reason is that on
a 90% full filesystem the block bitmap scan takes a long time, so
dropping the lock before the bitmap scan helps, even though there is
some overhead from dropping/re-grabbing the lock and the extra window
search/link/unlink before we find a window with a free block. On a
freshly created filesystem, where the regression is seen, the bitmap
scan takes much less time, so that overhead probably stands out.
I also ran 1000 dd tests in parallel, each writing to a different 2MB
file. No gain and no regression there.
Anyway, we should not hold the global reservation lock while scanning
the bitmap; that is both a scalability issue and a latency issue. To
drop the lock during the bitmap scan we need to do some extra work, and
I don't think that is avoidable. Andrew, could you add it to the -mm
tree? I will re-send the patch with the changelog.
Thanks,
Mingming
Currently, in the ext3 block reservation code, the global filesystem
reservation tree lock (rsv_lock) is held while searching for space to
make a new reservation window, including while scanning the block
bitmap to verify that the available window has a free block. Holding
the lock during the bitmap scan is unnecessary and can cause
scalability and latency issues.
This patch addresses that by dropping the lock before scanning the
bitmap. Before doing so we need to reserve the open window, in case
someone else is targeting the same window. The question was whether we
should reserve the whole free reservable space or just the window size
we need. Reserving the whole free reservable space could force other
threads that intended to allocate blocks nearby to move to another
block group (causing bad layout). In this patch, we reserve only the
desired size before dropping the lock and scanning the block bitmap.
This fixes the ext3 reservation latency issue seen on a CVS checkout
test.
The patch has been tested with many fsx runs, tiobench, dbench and a
kernel-untar test.
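As a rough sketch of the locking pattern described above -- book a
tentative window while holding the reservation lock, drop the lock for
the slow bitmap scan, and discard the window if the scan finds nothing --
here is a minimal self-contained userspace model. The pthread mutex, the
in-memory bitmap and the helper names are stand-ins, not the actual ext3
code.

#include <pthread.h>
#include <string.h>

#define NBLOCKS 1024

static pthread_mutex_t rsv_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char bitmap[NBLOCKS];		/* 0 = free, 1 = in use */

struct window { int start, end; };

static void book_window(struct window *w, int goal, int size)
{
	pthread_mutex_lock(&rsv_lock);
	/* the real code walks the red-black tree of windows here */
	w->start = goal;
	w->end = goal + size - 1;
	pthread_mutex_unlock(&rsv_lock);	/* drop the lock before the scan */
}

static int alloc_in_window(struct window *w)
{
	int b;

	/* slow part: scan the bitmap with no lock held */
	for (b = w->start; b <= w->end && b < NBLOCKS; b++)
		if (!bitmap[b])
			return b;

	/* nothing free: discard the stale window under the lock again */
	pthread_mutex_lock(&rsv_lock);
	w->start = w->end = -1;
	pthread_mutex_unlock(&rsv_lock);
	return -1;
}

int main(void)
{
	struct window w;

	memset(bitmap, 1, 8);		/* blocks 0-7 already in use */
	book_window(&w, 4, 8);		/* tentative window [4, 11] */
	return alloc_in_window(&w) == 8 ? 0 : 1;	/* block 8 is free */
}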
Signed-Off-By: Mingming Cao <[email protected]>
---
linux-2.6.11-ming/fs/ext3/balloc.c | 135 +++++++++++++++++--------------------
linux-2.6.11-ming/fs/ext3/file.c | 4 +
2 files changed, 68 insertions(+), 71 deletions(-)
diff -puN fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency fs/ext3/balloc.c
--- linux-2.6.11/fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency 2005-04-22 10:49:04.000000000 -0700
+++ linux-2.6.11-ming/fs/ext3/balloc.c 2005-04-27 22:34:19.491331405 -0700
@@ -749,24 +749,24 @@ fail_access:
* to find a free region that is of my size and has not
* been reserved.
*
- * on succeed, it returns the reservation window to be appended to.
- * failed, return NULL.
*/
-static struct ext3_reserve_window_node *find_next_reservable_window(
+static int find_next_reservable_window(
struct ext3_reserve_window_node *search_head,
- unsigned long size, int *start_block,
+ struct ext3_reserve_window_node *my_rsv,
+ struct super_block * sb, int start_block,
int last_block)
{
struct rb_node *next;
struct ext3_reserve_window_node *rsv, *prev;
int cur;
+ int size = my_rsv->rsv_goal_size;
/* TODO: make the start of the reservation window byte-aligned */
/* cur = *start_block & ~7;*/
- cur = *start_block;
+ cur = start_block;
rsv = search_head;
if (!rsv)
- return NULL;
+ return -1;
while (1) {
if (cur <= rsv->rsv_end)
@@ -782,7 +782,7 @@ static struct ext3_reserve_window_node *
* space with expected-size (or more)...
*/
if (cur > last_block)
- return NULL; /* fail */
+ return -1; /* fail */
prev = rsv;
next = rb_next(&rsv->rsv_node);
@@ -813,8 +813,26 @@ static struct ext3_reserve_window_node *
* return the reservation window that we could append to.
* succeed.
*/
- *start_block = cur;
- return prev;
+
+ if ((prev != my_rsv) && (!rsv_is_empty(&my_rsv->rsv_window)))
+ rsv_window_remove(sb, my_rsv);
+
+	/* let's book the whole available window for now.
+	 * We will check the
+	 * disk bitmap later and then, if there is a free block,
+	 * we adjust the window size if it's
+	 * larger than requested.
+	 * Otherwise, we will remove this node from the tree the next
+	 * time find_next_reservable_window() is called.
+	 */
+ my_rsv->rsv_start = cur;
+ my_rsv->rsv_end = cur + size - 1;
+ my_rsv->rsv_alloc_hit = 0;
+
+ if (prev != my_rsv)
+ ext3_rsv_window_add(sb, my_rsv);
+
+ return 0;
}
/**
@@ -852,6 +870,7 @@ static struct ext3_reserve_window_node *
* @sb: the super block
* @group: the group we are trying to allocate in
* @bitmap_bh: the block group block bitmap
+ *
*/
static int alloc_new_reservation(struct ext3_reserve_window_node *my_rsv,
int goal, struct super_block *sb,
@@ -860,10 +879,10 @@ static int alloc_new_reservation(struct
struct ext3_reserve_window_node *search_head;
int group_first_block, group_end_block, start_block;
int first_free_block;
- int reservable_space_start;
- struct ext3_reserve_window_node *prev_rsv;
struct rb_root *fs_rsv_root = &EXT3_SB(sb)->s_rsv_window_root;
unsigned long size;
+ int ret;
+ spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
group * EXT3_BLOCKS_PER_GROUP(sb);
@@ -875,6 +894,7 @@ static int alloc_new_reservation(struct
start_block = goal + group_first_block;
size = my_rsv->rsv_goal_size;
+
if (!rsv_is_empty(&my_rsv->rsv_window)) {
/*
* if the old reservation is cross group boundary
@@ -908,6 +928,8 @@ static int alloc_new_reservation(struct
my_rsv->rsv_goal_size= size;
}
}
+
+ spin_lock(rsv_lock);
/*
* shift the search start to the window near the goal block
*/
@@ -921,11 +943,16 @@ static int alloc_new_reservation(struct
* need to check the bitmap after we found a reservable window.
*/
retry:
- prev_rsv = find_next_reservable_window(search_head, size,
- &start_block, group_end_block);
- if (prev_rsv == NULL)
- goto failed;
- reservable_space_start = start_block;
+ ret = find_next_reservable_window(search_head, my_rsv, sb,
+ start_block, group_end_block);
+
+ if (ret == -1) {
+ if (!rsv_is_empty(&my_rsv->rsv_window))
+ rsv_window_remove(sb, my_rsv);
+ spin_unlock(rsv_lock);
+ return -1;
+ }
+
/*
* On success, find_next_reservable_window() returns the
* reservation window where there is a reservable space after it.
@@ -937,8 +964,9 @@ retry:
* block. Search start from the start block of the reservable space
* we just found.
*/
+ spin_unlock(rsv_lock);
first_free_block = bitmap_search_next_usable_block(
- reservable_space_start - group_first_block,
+ my_rsv->rsv_start - group_first_block,
bitmap_bh, group_end_block - group_first_block + 1);
if (first_free_block < 0) {
@@ -946,54 +974,30 @@ retry:
* no free block left on the bitmap, no point
* to reserve the space. return failed.
*/
- goto failed;
+ spin_lock(rsv_lock);
+ if (!rsv_is_empty(&my_rsv->rsv_window))
+ rsv_window_remove(sb, my_rsv);
+ spin_unlock(rsv_lock);
+ return -1; /* failed */
}
+
start_block = first_free_block + group_first_block;
/*
* check if the first free block is within the
- * free space we just found
+ * free space we just reserved
*/
- if ((start_block >= reservable_space_start) &&
- (start_block < reservable_space_start + size))
- goto found_rsv_window;
+ if ((start_block >= my_rsv->rsv_start) &&
+ (start_block < my_rsv->rsv_end))
+ return 0; /* succeed */
/*
* if the first free bit we found is out of the reservable space
- * this means there is no free block on the reservable space
- * we should continue search for next reservable space,
+ * continue search for next reservable space,
* start from where the free block is,
* we also shift the list head to where we stopped last time
*/
- search_head = prev_rsv;
+ search_head = my_rsv;
+ spin_lock(rsv_lock);
goto retry;
-
-found_rsv_window:
- /*
- * great! the reservable space contains some free blocks.
- * if the search returns that we should add the new
- * window just next to where the old window, we don't
- * need to remove the old window first then add it to the
- * same place, just update the new start and new end.
- */
- if (my_rsv != prev_rsv) {
- if (!rsv_is_empty(&my_rsv->rsv_window))
- rsv_window_remove(sb, my_rsv);
- }
- my_rsv->rsv_start = reservable_space_start;
- my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
- my_rsv->rsv_alloc_hit = 0;
- if (my_rsv != prev_rsv) {
- ext3_rsv_window_add(sb, my_rsv);
- }
- return 0; /* succeed */
-failed:
- /*
- * failed to find a new reservation window in the current
- * group, remove the current(stale) reservation window
- * if there is any
- */
- if (!rsv_is_empty(&my_rsv->rsv_window))
- rsv_window_remove(sb, my_rsv);
- return -1; /* failed */
}
/*
@@ -1023,7 +1027,6 @@ ext3_try_to_allocate_with_rsv(struct sup
int goal, struct ext3_reserve_window_node * my_rsv,
int *errp)
{
- spinlock_t *rsv_lock;
unsigned long group_first_block;
int ret = 0;
int fatal;
@@ -1052,7 +1055,6 @@ ext3_try_to_allocate_with_rsv(struct sup
ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
goto out;
}
- rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
/*
* goal is a group relative block number (if there is a goal)
* 0 < goal < EXT3_BLOCKS_PER_GROUP(sb)
@@ -1078,30 +1080,21 @@ ext3_try_to_allocate_with_rsv(struct sup
* then we could go to allocate from the reservation window directly.
*/
while (1) {
- struct ext3_reserve_window rsv_copy;
-
- rsv_copy._rsv_start = my_rsv->rsv_start;
- rsv_copy._rsv_end = my_rsv->rsv_end;
-
- if (rsv_is_empty(&rsv_copy) || (ret < 0) ||
- !goal_in_my_reservation(&rsv_copy, goal, group, sb)) {
- spin_lock(rsv_lock);
+ if (rsv_is_empty(&my_rsv->rsv_window) || (ret < 0) ||
+ !goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb)) {
ret = alloc_new_reservation(my_rsv, goal, sb,
group, bitmap_bh);
- rsv_copy._rsv_start = my_rsv->rsv_start;
- rsv_copy._rsv_end = my_rsv->rsv_end;
- spin_unlock(rsv_lock);
if (ret < 0)
break; /* failed */
- if (!goal_in_my_reservation(&rsv_copy, goal, group, sb))
+ if (!goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb))
goal = -1;
}
- if ((rsv_copy._rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb))
- || (rsv_copy._rsv_end < group_first_block))
+ if ((my_rsv->rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb))
+ || (my_rsv->rsv_end < group_first_block))
BUG();
ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
- &rsv_copy);
+ &my_rsv->rsv_window);
if (ret >= 0) {
my_rsv->rsv_alloc_hit++;
break; /* succeed */
diff -puN fs/ext3/file.c~ext3-reduce-reservation-lock-latency fs/ext3/file.c
--- linux-2.6.11/fs/ext3/file.c~ext3-reduce-reservation-lock-latency 2005-04-22 10:49:04.000000000 -0700
+++ linux-2.6.11-ming/fs/ext3/file.c 2005-04-26 11:09:56.000000000 -0700
@@ -36,7 +36,11 @@ static int ext3_release_file (struct ino
/* if we are the last writer on the inode, drop the block reservation */
if ((filp->f_mode & FMODE_WRITE) &&
(atomic_read(&inode->i_writecount) == 1))
+ {
+ down(&EXT3_I(inode)->truncate_sem);
ext3_discard_reservation(inode);
+ up(&EXT3_I(inode)->truncate_sem);
+ }
if (is_dx(inode) && filp->private_data)
ext3_htree_free_dir_info(filp->private_data);
_
On Fri, 2005-04-29 at 19:22 +0530, Suparna Bhattacharya wrote:
> On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote:
> > Currently ext3_get_block()/ext3_new_block() only allocate one block at a
> > time. To allocate multiple blocks, the caller, for example, ext3 direct
> > IO routine, has to invoke ext3_get_block() many times. This is quite
> > inefficient for sequential IO workload.
> >
> > The benefit of a real get_blocks() include
> > 1) increase the possibility to get contiguous blocks, reduce possibility
> > of fragmentation due to interleaved allocations from other threads.
> > (should good for non reservation case)
> > 2) Reduces CPU cycles spent in repeated get_block() calls
> > 3) Batch meta data update and journaling in one short
> > 4) Could possibly speed up future get_blocks() look up by cache the last
> > mapped blocks in inode.
> >
>
> And here is the patch to make mpage_writepages use get_blocks() for
> multiple block lookup/allocation. It performs a radix-tree contiguous
> pages lookup, and issues a get_blocks for the range together. It maintains
> an mpageio structure to track intermediate mapping state, somewhat
> like the DIO code.
>
> It does need some more testing, especially block_size < PAGE_SIZE.
> The JFS workaround can be dropped if the JFS get_blocks fix from
> Dave Kleikamp is integrated.
>
> Review feedback would be welcome.
>
> Mingming,
> Let me know if you have a chance to try this out with your patch.
Sure, Suparna, I will try your patch soon!
In my patch, I have modified the ext3 direct IO code to make use of
ext3_get_blocks(). Tested with a simple file write with O_DIRECT, it
seems to work fine! Allocating blocks for a 120k file only invokes
ext3_get_blocks() 4 times (the ideal is 1; before, it took 30 calls to
ext3_get_block). Of those 4 calls, 2 are because we reach a metadata
block boundary (direct -> indirect), and the other 2 because we reach
the end of the reservation window. For the latter 2, we could avoid
that by extending the reservation window before calling
ext3_new_blocks() if the window size is less than the number of blocks
to allocate.
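As a tiny sketch of that window-extension idea (all identifiers below
are made up; this is not from any posted patch), the check would look
roughly like:

#include <assert.h>

struct window { unsigned long start, end; };

static unsigned long window_size(const struct window *w)
{
	return w->end - w->start + 1;
}

/* grow the window so a multi-block allocation is not cut short */
static void extend_window(struct window *w, unsigned long want)
{
	if (window_size(w) < want)
		w->end = w->start + want - 1;
}

int main(void)
{
	struct window w = { .start = 100, .end = 107 };	/* 8-block window */

	extend_window(&w, 30);		/* about to allocate 30 blocks */
	assert(window_size(&w) == 30);
	return 0;
}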
But if we try to allocate blocks in a hole (with direct IO), blocks are
still allocated one by one. I am looking at that right now.
Thanks,
Mingming
Suparna Bhattacharya wrote:
> On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote:
>
>>Currently ext3_get_block()/ext3_new_block() only allocate one block at a
>>time. To allocate multiple blocks, the caller, for example, ext3 direct
>>IO routine, has to invoke ext3_get_block() many times. This is quite
>>inefficient for sequential IO workload.
>>
>>The benefit of a real get_blocks() include
>>1) increase the possibility to get contiguous blocks, reduce possibility
>>of fragmentation due to interleaved allocations from other threads.
>>(should good for non reservation case)
>>2) Reduces CPU cycles spent in repeated get_block() calls
>>3) Batch meta data update and journaling in one short
>>4) Could possibly speed up future get_blocks() look up by cache the last
>>mapped blocks in inode.
>>
>
>
> And here is the patch to make mpage_writepages use get_blocks() for
> multiple block lookup/allocation. It performs a radix-tree contiguous
> pages lookup, and issues a get_blocks for the range together. It maintains
> an mpageio structure to track intermediate mapping state, somewhat
> like the DIO code.
>
> It does need some more testing, especially block_size < PAGE_SIZE.
> The JFS workaround can be dropped if the JFS get_blocks fix from
> Dave Kleikamp is integrated.
>
> Review feedback would be welcome.
>
> Mingming,
> Let me know if you have a chance to try this out with your patch.
>
> Regards
> Suparna
>
Suparna,
I touch-tested your patch earlier and it seems to work fine. Let's
integrate Mingming's getblocks() patches with this and see if we get
any benefit from the whole effort.
BTW, is the plan to remove repeated calls to getblocks() and work with
whatever getblocks() returned? Or do you find it better to repeatedly
call getblocks() in a loop until you satisfy all the pages in the list
(as long as the blocks are contiguous)?
Thanks,
Badari
On Fri, 2005-04-29 at 10:10 -0700, Mingming Cao wrote:
> But if it try to allocating blocks in the hole (with direct IO), blocks
> are allocated one by one. I am looking at it right now.
>
Hi Andrew, Badari,
If we do a direct write (block allocation) to a hole, I found that the
"create" flag passed to ext3_direct_io_get_blocks() is 0 when we are
trying to _write_ to a file hole. Is this expected?
This is what happens on mainline 2.6.12-rc2 (and with my patch). To simplify, here is the problem description on mainline:
If we write 30 blocks to a new file at offset 800k, fine, the create flag is 1 every time.
Then if we seek back to offset 400k and write another 30 blocks, the create flag is 0:
-bash-2.05b# mount -t ext3 /dev/ubdc /mnt/ext3
-bash-2.05b# cd /mnt/ext3
-bash-2.05b# touch a
-bash-2.05b# /root/filetst -o 819200 -b 122880 -c 1 -w -d -f a
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 30, iblock = 200, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 29, iblock = 201, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 28, iblock = 202, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 27, iblock = 203, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 26, iblock = 204, create = 1
...................
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 5, iblock = 225, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 4, iblock = 226, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 3, iblock = 227, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 2, iblock = 228, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 1, iblock = 229, create = 1
-bash-2.05b# /root/filetst -o 409600 -b 122880 -c 1 -w -d -f a
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 30, iblock = 100, create = 0
Because the create flag is 0, ext3_get_block() will not do block
allocation and returns immediately after the lookup fails.
ext3_get_block_handle() is then called from some path other than
ext3_direct_io_get_blocks() (I am not sure where) to allocate the
desired 30 blocks (thus, when the ext3_get_blocks patch is applied,
ext3_get_blocks() is not called).
Could you clarify?
Thanks,
Mingming
Mingming Cao <[email protected]> wrote:
>
> If we do direct write(block allocation) to a hole, I found that the
> "create" flag passed to ext3_direct_io_get_blocks() is 0 if we are
> trying to _write_ to a file hole. Is this expected?
Nope. The code in get_more_blocks() is pretty explicit.
> But if it try to allocating blocks in the hole (with direct IO), blocks
> are allocated one by one. I am looking at it right now.
>
Please see the comment over get_more_blocks().
On Fri, 2005-04-29 at 13:57 -0700, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > If we do direct write(block allocation) to a hole, I found that the
> > "create" flag passed to ext3_direct_io_get_blocks() is 0 if we are
> > trying to _write_ to a file hole. Is this expected?
>
> Nope. The code in get_more_blocks() is pretty explicit.
>
> > But if it try to allocating blocks in the hole (with direct IO), blocks
> > are allocated one by one. I am looking at it right now.
> >
>
> Please see the comment over get_more_blocks().
>
I went to look at get_more_blocks(); the create flag is cleared if the
file offset is less than i_size (see the code below).
static int get_more_blocks(struct dio *dio)
{
..........
.........
create = dio->rw == WRITE;
if (dio->lock_type == DIO_LOCKING) {
if (dio->block_in_file < (i_size_read(dio->inode) >>
dio->blkbits))
create = 0;
	} else if (dio->lock_type == DIO_NO_LOCKING) {
		create = 0;
	}
	ret = (*dio->get_blocks)(dio->inode, fs_startblk, fs_count,
					map_bh, create);
When we try to do a direct write into a hole, the create flag is
cleared so that the underlying filesystem's get_blocks() function will
only do a block lookup (the lookup fails and no blocks are allocated).
The call chain is do_direct_IO -> get_more_blocks ->
ext3_direct_io_get_blocks. do_direct_IO() handles the write-to-hole
situation by simply returning an ENOTBLK error. direct_io_worker()
detects this error and simply reports 0 written bytes. The real write
is handled by buffered I/O: in __generic_file_aio_write_nolock(),
generic_file_buffered_write() is called when the number of bytes
written by generic_file_direct_write() is 0.
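A rough userspace model of that fallback decision (assumptions only, not
kernel source): a "direct" write that refuses to fill a hole reports 0
bytes written, and the caller then takes the "buffered" path, mirroring
generic_file_buffered_write() being called when
generic_file_direct_write() wrote nothing.

#include <stdio.h>

static int writing_into_hole = 1;	/* pretend the target offset is a hole */

static long direct_write(long count)
{
	if (writing_into_hole)
		return 0;	/* models ENOTBLK being turned into "0 bytes written" */
	return count;
}

static long buffered_write(long count)
{
	return count;		/* the buffered path can always fill the hole */
}

static long file_write(long count)
{
	long written = direct_write(count);

	if (written == 0)	/* fall back, like the generic write path does */
		written = buffered_write(count);
	return written;
}

int main(void)
{
	printf("wrote %ld bytes (via the buffered fallback)\n", file_write(122880));
	return 0;
}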
Could you confirm my observation above? I hope I understand this right:
right now a direct IO write into a hole is actually handled by buffered
IO. If so, there must be some valid reason for this that I just cannot
see right now.
Thanks,
Mingming
Mingming Cao <[email protected]> wrote:
>
> right now direct io write to a hole is actually handled by buffered IO.
> If so, there must be some valid reason that I could not see now.
Oh yes, that's right. It's all to do with the fixes which went in a year
ago to prevent concurrent readers from peeking at uninitialised disk
blocks.
On Fri, 2005-04-29 at 11:45 -0700, Badari Pulavarty wrote:
> I touch tested your patch earlier and seems to work fine. Lets integrate
> Mingming's getblocks() patches with this and see if we get any benifit
> from the whole effort.
>
I tried Suparna's mpage_writepages_getblocks patch with my
ext3_get_blocks patch; it seems to work fine, except that still only
one block is allocated at a time. I got a little confused....
I did not see any delayed allocation code in your patch. I assume you
would have to update ext3_prepare_write to not call ext3_get_block, so
that block allocation is deferred until ext3_writepages time. So
without the delayed allocation part, the get_blocks in mpage_writepages
is only doing multiple-block lookups, right?
Mingming
On Fri, 2005-04-29 at 19:22 +0530, Suparna Bhattacharya wrote:
> On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote:
> > Currently ext3_get_block()/ext3_new_block() only allocate one block at a
> > time. To allocate multiple blocks, the caller, for example, ext3 direct
> > IO routine, has to invoke ext3_get_block() many times. This is quite
> > inefficient for sequential IO workload.
> >
> > The benefit of a real get_blocks() include
> > 1) increase the possibility to get contiguous blocks, reduce possibility
> > of fragmentation due to interleaved allocations from other threads.
> > (should good for non reservation case)
> > 2) Reduces CPU cycles spent in repeated get_block() calls
> > 3) Batch meta data update and journaling in one short
> > 4) Could possibly speed up future get_blocks() look up by cache the last
> > mapped blocks in inode.
> >
>
> And here is the patch to make mpage_writepages use get_blocks() for
> multiple block lookup/allocation. It performs a radix-tree contiguous
> pages lookup, and issues a get_blocks for the range together. It maintains
> an mpageio structure to track intermediate mapping state, somewhat
> like the DIO code.
>
> It does need some more testing, especially block_size < PAGE_SIZE.
> The JFS workaround can be dropped if the JFS get_blocks fix from
> Dave Kleikamp is integrated.
>
> Review feedback would be welcome.
>
> Mingming,
> Let me know if you have a chance to try this out with your patch.
>
> Regards
> Suparna
>
> --
> Suparna Bhattacharya ([email protected])
> Linux Technology Center
> IBM Software Lab, India
>
>
> diff -urp -X dontdiff linux-2.6.12-rc3/fs/buffer.c linux-2.6.12-rc3-getblocks/fs/buffer.c
> --- linux-2.6.12-rc3/fs/buffer.c 2005-04-21 05:33:15.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/fs/buffer.c 2005-04-22 15:08:33.000000000 +0530
> @@ -2514,53 +2514,10 @@ EXPORT_SYMBOL(nobh_commit_write);
> * that it tries to operate without attaching bufferheads to
> * the page.
> */
> -int nobh_writepage(struct page *page, get_block_t *get_block,
> - struct writeback_control *wbc)
> +int nobh_writepage(struct page *page, get_blocks_t *get_blocks,
> + struct writeback_control *wbc, writepage_t bh_writepage_fn)
> {
> - struct inode * const inode = page->mapping->host;
> - loff_t i_size = i_size_read(inode);
> - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
> - unsigned offset;
> - void *kaddr;
> - int ret;
> -
> - /* Is the page fully inside i_size? */
> - if (page->index < end_index)
> - goto out;
> -
> - /* Is the page fully outside i_size? (truncate in progress) */
> - offset = i_size & (PAGE_CACHE_SIZE-1);
> - if (page->index >= end_index+1 || !offset) {
> - /*
> - * The page may have dirty, unmapped buffers. For example,
> - * they may have been added in ext3_writepage(). Make them
> - * freeable here, so the page does not leak.
> - */
> -#if 0
> - /* Not really sure about this - do we need this ? */
> - if (page->mapping->a_ops->invalidatepage)
> - page->mapping->a_ops->invalidatepage(page, offset);
> -#endif
> - unlock_page(page);
> - return 0; /* don't care */
> - }
> -
> - /*
> - * The page straddles i_size. It must be zeroed out on each and every
> - * writepage invocation because it may be mmapped. "A file is mapped
> - * in multiples of the page size. For a file that is not a multiple of
> - * the page size, the remaining memory is zeroed when mapped, and
> - * writes to that region are not written out to the file."
> - */
> - kaddr = kmap_atomic(page, KM_USER0);
> - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
> - flush_dcache_page(page);
> - kunmap_atomic(kaddr, KM_USER0);
> -out:
> - ret = mpage_writepage(page, get_block, wbc);
> - if (ret == -EAGAIN)
> - ret = __block_write_full_page(inode, page, get_block, wbc);
> - return ret;
> + return mpage_writepage(page, get_blocks, wbc, bh_writepage_fn);
> }
> EXPORT_SYMBOL(nobh_writepage);
>
> diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext2/inode.c linux-2.6.12-rc3-getblocks/fs/ext2/inode.c
> --- linux-2.6.12-rc3/fs/ext2/inode.c 2005-04-21 05:33:15.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/fs/ext2/inode.c 2005-04-22 16:30:42.000000000 +0530
> @@ -639,12 +639,6 @@ ext2_nobh_prepare_write(struct file *fil
> return nobh_prepare_write(page,from,to,ext2_get_block);
> }
>
> -static int ext2_nobh_writepage(struct page *page,
> - struct writeback_control *wbc)
> -{
> - return nobh_writepage(page, ext2_get_block, wbc);
> -}
> -
> static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
> {
> return generic_block_bmap(mapping,block,ext2_get_block);
> @@ -662,6 +656,12 @@ ext2_get_blocks(struct inode *inode, sec
> return ret;
> }
>
> +static int ext2_nobh_writepage(struct page *page,
> + struct writeback_control *wbc)
> +{
> + return nobh_writepage(page, ext2_get_blocks, wbc, ext2_writepage);
> +}
> +
> static ssize_t
> ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> loff_t offset, unsigned long nr_segs)
> @@ -676,7 +676,8 @@ ext2_direct_IO(int rw, struct kiocb *ioc
> static int
> ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
> {
> - return mpage_writepages(mapping, wbc, ext2_get_block);
> + return __mpage_writepages(mapping, wbc, ext2_get_blocks,
> + ext2_writepage);
> }
>
> struct address_space_operations ext2_aops = {
> diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext3/inode.c linux-2.6.12-rc3-getblocks/fs/ext3/inode.c
> --- linux-2.6.12-rc3/fs/ext3/inode.c 2005-04-21 05:33:15.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/fs/ext3/inode.c 2005-04-22 15:08:33.000000000 +0530
> @@ -866,10 +866,10 @@ get_block:
> return ret;
> }
>
> -static int ext3_writepages_get_block(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh, int create)
> +static int ext3_writepages_get_blocks(struct inode *inode, sector_t iblock,
> + unsigned long max_blocks, struct buffer_head *bh, int create)
> {
> - return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create);
> + return ext3_direct_io_get_blocks(inode, iblock, max_blocks, bh, create);
> }
>
> /*
> @@ -1369,11 +1369,11 @@ ext3_writeback_writepages(struct address
> return ret;
> }
>
> - ret = __mpage_writepages(mapping, wbc, ext3_writepages_get_block,
> + ret = __mpage_writepages(mapping, wbc, ext3_writepages_get_blocks,
> ext3_writeback_writepage_helper);
>
> /*
> - * Need to reaquire the handle since ext3_writepages_get_block()
> + * Need to reaquire the handle since ext3_writepages_get_blocks()
> * can restart the handle
> */
> handle = journal_current_handle();
> @@ -1402,7 +1402,8 @@ static int ext3_writeback_writepage(stru
> }
>
> if (test_opt(inode->i_sb, NOBH))
> - ret = nobh_writepage(page, ext3_get_block, wbc);
> + ret = nobh_writepage(page, ext3_writepages_get_blocks, wbc,
> + ext3_writeback_writepage_helper);
> else
> ret = block_write_full_page(page, ext3_get_block, wbc);
>
> diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext3/super.c linux-2.6.12-rc3-getblocks/fs/ext3/super.c
> --- linux-2.6.12-rc3/fs/ext3/super.c 2005-04-21 05:33:15.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/fs/ext3/super.c 2005-04-22 15:08:33.000000000 +0530
> @@ -1321,6 +1321,7 @@ static int ext3_fill_super (struct super
> sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
>
> set_opt(sbi->s_mount_opt, RESERVATION);
> + set_opt(sbi->s_mount_opt, NOBH); /* temp: set nobh default */
>
> if (!parse_options ((char *) data, sb, &journal_inum, NULL, 0))
> goto failed_mount;
> @@ -1567,6 +1568,7 @@ static int ext3_fill_super (struct super
> printk(KERN_ERR "EXT3-fs: Journal does not support "
> "requested data journaling mode\n");
> goto failed_mount3;
> + set_opt(sbi->s_mount_opt, NOBH); /* temp: set nobh default */
> }
> default:
> break;
> @@ -1584,6 +1586,7 @@ static int ext3_fill_super (struct super
> "its supported only with writeback mode\n");
> clear_opt(sbi->s_mount_opt, NOBH);
> }
> + printk("NOBH option set\n");
> }
> /*
> * The journal_load will have done any necessary log recovery,
> diff -urp -X dontdiff linux-2.6.12-rc3/fs/hfs/inode.c linux-2.6.12-rc3-getblocks/fs/hfs/inode.c
> --- linux-2.6.12-rc3/fs/hfs/inode.c 2005-04-21 05:33:15.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/fs/hfs/inode.c 2005-04-22 15:08:33.000000000 +0530
> @@ -124,7 +124,7 @@ static ssize_t hfs_direct_IO(int rw, str
> static int hfs_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> {
> - return mpage_writepages(mapping, wbc, hfs_get_block);
> + return mpage_writepages(mapping, wbc, hfs_get_blocks);
> }
>
> struct address_space_operations hfs_btree_aops = {
> diff -urp -X dontdiff linux-2.6.12-rc3/fs/hfsplus/inode.c linux-2.6.12-rc3-getblocks/fs/hfsplus/inode.c
> --- linux-2.6.12-rc3/fs/hfsplus/inode.c 2005-04-21 05:33:15.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/fs/hfsplus/inode.c 2005-04-22 15:08:33.000000000 +0530
> @@ -121,7 +121,7 @@ static ssize_t hfsplus_direct_IO(int rw,
> static int hfsplus_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> {
> - return mpage_writepages(mapping, wbc, hfsplus_get_block);
> + return mpage_writepages(mapping, wbc, hfsplus_get_blocks);
> }
>
> struct address_space_operations hfsplus_btree_aops = {
> diff -urp -X dontdiff linux-2.6.12-rc3/fs/jfs/inode.c linux-2.6.12-rc3-getblocks/fs/jfs/inode.c
> --- linux-2.6.12-rc3/fs/jfs/inode.c 2005-04-21 05:33:15.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/fs/jfs/inode.c 2005-04-22 16:27:19.000000000 +0530
> @@ -267,21 +267,41 @@ jfs_get_blocks(struct inode *ip, sector_
> return rc;
> }
>
> +static int
> +jfs_mpage_get_blocks(struct inode *ip, sector_t lblock, unsigned long
> + max_blocks, struct buffer_head *bh_result, int create)
> +{
> + /*
> + * fixme: temporary workaround: return one block at a time until
> + * we figure out why we see exposures with truncate on
> + * allocating multiple blocks in one shot.
> + */
> + return jfs_get_blocks(ip, lblock, 1, bh_result, create);
> +}
> +
> static int jfs_get_block(struct inode *ip, sector_t lblock,
> struct buffer_head *bh_result, int create)
> {
> return jfs_get_blocks(ip, lblock, 1, bh_result, create);
> }
>
> +static int jfs_bh_writepage(struct page *page,
> + struct writeback_control *wbc)
> +{
> + return block_write_full_page(page, jfs_get_block, wbc);
> +}
> +
> +
> static int jfs_writepage(struct page *page, struct writeback_control *wbc)
> {
> - return nobh_writepage(page, jfs_get_block, wbc);
> + return nobh_writepage(page, jfs_mpage_get_blocks, wbc, jfs_bh_writepage);
> }
>
> static int jfs_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> {
> - return mpage_writepages(mapping, wbc, jfs_get_block);
> + return __mpage_writepages(mapping, wbc, jfs_mpage_get_blocks,
> + jfs_bh_writepage);
> }
>
> static int jfs_readpage(struct file *file, struct page *page)
> diff -urp -X dontdiff linux-2.6.12-rc3/fs/mpage.c linux-2.6.12-rc3-getblocks/fs/mpage.c
> --- linux-2.6.12-rc3/fs/mpage.c 2005-04-21 05:33:15.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/fs/mpage.c 2005-04-22 16:19:14.000000000 +0530
> @@ -370,6 +370,67 @@ int mpage_readpage(struct page *page, ge
> }
> EXPORT_SYMBOL(mpage_readpage);
>
> +struct mpageio {
> + struct bio *bio;
> + struct buffer_head map_bh;
> + unsigned long block_in_file;
> + unsigned long final_block_in_request;
> + sector_t long block_in_bio;
> + int boundary;
> + sector_t boundary_block;
> + struct block_device *boundary_bdev;
> +};
> +
> +/*
> + * Maps as many contiguous disk blocks as it can within the range of
> + * the request, and returns the total number of contiguous mapped
> + * blocks in the mpageio.
> + */
> +static unsigned long mpage_get_more_blocks(struct mpageio *mio,
> + struct inode *inode, get_blocks_t get_blocks)
> +{
> + struct buffer_head map_bh = {.b_state = 0};
> + unsigned long mio_nblocks = mio->map_bh.b_size >> inode->i_blkbits;
> + unsigned long first_unmapped = mio->block_in_file + mio_nblocks;
> + unsigned long next_contig_block = mio->map_bh.b_blocknr + mio_nblocks;
> +
> + while ((first_unmapped < mio->final_block_in_request) &&
> + (mio->map_bh.b_size < PAGE_SIZE)) {
> +
> + if (get_blocks(inode, first_unmapped,
> + mio->final_block_in_request - first_unmapped,
> + &map_bh, 1))
> + break;
> + if (mio_nblocks && ((map_bh.b_blocknr != next_contig_block) ||
> + map_bh.b_bdev != mio->map_bh.b_bdev))
> + break;
> +
> + if (buffer_new(&map_bh)) {
> + int i = 0;
> + for (; i < map_bh.b_size >> inode->i_blkbits; i++)
> + unmap_underlying_metadata(map_bh.b_bdev,
> + map_bh.b_blocknr + i);
> + }
> +
> + if (buffer_boundary(&map_bh)) {
> + mio->boundary = 1;
> + mio->boundary_block = map_bh.b_blocknr;
> + mio->boundary_bdev = map_bh.b_bdev;
> + }
> + if (mio_nblocks == 0) {
> + mio->map_bh.b_bdev = map_bh.b_bdev;
> + mio->map_bh.b_blocknr = map_bh.b_blocknr;
> + }
> +
> + mio_nblocks += map_bh.b_size >> inode->i_blkbits;
> + first_unmapped = mio->block_in_file + mio_nblocks;
> + next_contig_block = mio->map_bh.b_blocknr + mio_nblocks;
> + mio->map_bh.b_size += map_bh.b_size;
> + }
> +
> + return mio_nblocks;
> +}
> +
> /*
> * Writing is not so simple.
> *
> @@ -386,9 +447,9 @@ EXPORT_SYMBOL(mpage_readpage);
> * written, so it can intelligently allocate a suitably-sized BIO. For now,
> * just allocate full-size (16-page) BIOs.
> */
> -static struct bio *
> -__mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
> - sector_t *last_block_in_bio, int *ret, struct writeback_control *wbc,
> +static int
> +__mpage_writepage(struct mpageio *mio, struct page *page,
> + get_blocks_t get_blocks, struct writeback_control *wbc,
> writepage_t writepage_fn)
> {
> struct address_space *mapping = page->mapping;
> @@ -396,9 +457,8 @@ __mpage_writepage(struct bio *bio, struc
> const unsigned blkbits = inode->i_blkbits;
> unsigned long end_index;
> const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
> - sector_t last_block;
> + sector_t last_block, blocks_to_skip;
> sector_t block_in_file;
> - sector_t blocks[MAX_BUF_PER_PAGE];
> unsigned page_block;
> unsigned first_unmapped = blocks_per_page;
> struct block_device *bdev = NULL;
> @@ -406,8 +466,10 @@ __mpage_writepage(struct bio *bio, struc
> sector_t boundary_block = 0;
> struct block_device *boundary_bdev = NULL;
> int length;
> - struct buffer_head map_bh;
> loff_t i_size = i_size_read(inode);
> + struct buffer_head *map_bh = &mio->map_bh;
> + struct bio *bio = mio->bio;
> + int ret = 0;
>
> if (page_has_buffers(page)) {
> struct buffer_head *head = page_buffers(page);
> @@ -435,10 +497,13 @@ __mpage_writepage(struct bio *bio, struc
> if (!buffer_dirty(bh) || !buffer_uptodate(bh))
> goto confused;
> if (page_block) {
> - if (bh->b_blocknr != blocks[page_block-1] + 1)
> + if (bh->b_blocknr != map_bh->b_blocknr
> + + page_block)
> goto confused;
> + } else {
> + map_bh->b_blocknr = bh->b_blocknr;
> + map_bh->b_size = PAGE_SIZE;
> }
> - blocks[page_block++] = bh->b_blocknr;
> boundary = buffer_boundary(bh);
> if (boundary) {
> boundary_block = bh->b_blocknr;
> @@ -465,33 +530,30 @@ __mpage_writepage(struct bio *bio, struc
> BUG_ON(!PageUptodate(page));
> block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
> last_block = (i_size - 1) >> blkbits;
> - map_bh.b_page = page;
> - for (page_block = 0; page_block < blocks_per_page; ) {
> -
> - map_bh.b_state = 0;
> - if (get_block(inode, block_in_file, &map_bh, 1))
> - goto confused;
> - if (buffer_new(&map_bh))
> - unmap_underlying_metadata(map_bh.b_bdev,
> - map_bh.b_blocknr);
> - if (buffer_boundary(&map_bh)) {
> - boundary_block = map_bh.b_blocknr;
> - boundary_bdev = map_bh.b_bdev;
> - }
> - if (page_block) {
> - if (map_bh.b_blocknr != blocks[page_block-1] + 1)
> - goto confused;
> - }
> - blocks[page_block++] = map_bh.b_blocknr;
> - boundary = buffer_boundary(&map_bh);
> - bdev = map_bh.b_bdev;
> - if (block_in_file == last_block)
> - break;
> - block_in_file++;
> + blocks_to_skip = block_in_file - mio->block_in_file;
> + mio->block_in_file = block_in_file;
> + if (blocks_to_skip < (map_bh->b_size >> blkbits)) {
> + map_bh->b_blocknr += blocks_to_skip;
> + map_bh->b_size -= blocks_to_skip << blkbits;
> + } else {
> + map_bh->b_state = 0;
> + map_bh->b_size = 0;
> + if (mio->final_block_in_request > last_block)
> + mio->final_block_in_request = last_block;
> + mpage_get_more_blocks(mio, inode, get_blocks);
> }
> - BUG_ON(page_block == 0);
> + if (map_bh->b_size < PAGE_SIZE)
> + goto confused;
>
> - first_unmapped = page_block;
> + if (mio->boundary && (mio->boundary_block < map_bh->b_blocknr
> + + blocks_per_page)) {
> + boundary = 1;
> + boundary_block = mio->boundary_block;
> + boundary_bdev = mio->boundary_bdev;
> + }
> +
> + bdev = map_bh->b_bdev;
> + first_unmapped = blocks_per_page;
>
> page_is_mapped:
> end_index = i_size >> PAGE_CACHE_SHIFT;
> @@ -518,12 +580,16 @@ page_is_mapped:
> /*
> * This page will go to BIO. Do we need to send this BIO off first?
> */
> - if (bio && *last_block_in_bio != blocks[0] - 1)
> + if (bio && mio->block_in_bio != map_bh->b_blocknr - 1)
> bio = mpage_bio_submit(WRITE, bio);
>
> alloc_new:
> if (bio == NULL) {
> - bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> + /*
> + * Fixme: bio size can be limited to final_block - block, or
> + * even mio->map_bh.b_size
> + */
> + bio = mpage_alloc(bdev, map_bh->b_blocknr << (blkbits - 9),
> bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
> if (bio == NULL)
> goto confused;
> @@ -539,6 +605,9 @@ alloc_new:
> bio = mpage_bio_submit(WRITE, bio);
> goto alloc_new;
> }
> + map_bh->b_blocknr += blocks_per_page;
> + map_bh->b_size -= PAGE_SIZE;
> + mio->block_in_file += blocks_per_page;
>
> /*
> * OK, we have our BIO, so we can now mark the buffers clean. Make
> @@ -575,7 +644,8 @@ alloc_new:
> boundary_block, 1 << blkbits);
> }
> } else {
> - *last_block_in_bio = blocks[blocks_per_page - 1];
> + /* we can pack more pages into the bio, don't submit yet */
> + mio->block_in_bio = map_bh->b_blocknr - 1;
> }
> goto out;
>
> @@ -584,22 +654,23 @@ confused:
> bio = mpage_bio_submit(WRITE, bio);
>
> if (writepage_fn) {
> - *ret = (*writepage_fn)(page, wbc);
> + ret = (*writepage_fn)(page, wbc);
> } else {
> - *ret = -EAGAIN;
> + ret = -EAGAIN;
> goto out;
> }
> /*
> * The caller has a ref on the inode, so *mapping is stable
> */
> - if (*ret) {
> - if (*ret == -ENOSPC)
> + if (ret) {
> + if (ret == -ENOSPC)
> set_bit(AS_ENOSPC, &mapping->flags);
> else
> set_bit(AS_EIO, &mapping->flags);
> }
> out:
> - return bio;
> + mio->bio = bio;
> + return ret;
> }
>
> /**
> @@ -625,20 +696,21 @@ out:
> */
> int
> mpage_writepages(struct address_space *mapping,
> - struct writeback_control *wbc, get_block_t get_block)
> + struct writeback_control *wbc, get_blocks_t get_blocks)
> {
> - return __mpage_writepages(mapping, wbc, get_block,
> + return __mpage_writepages(mapping, wbc, get_blocks,
> mapping->a_ops->writepage);
> }
>
> int
> __mpage_writepages(struct address_space *mapping,
> - struct writeback_control *wbc, get_block_t get_block,
> + struct writeback_control *wbc, get_blocks_t get_blocks,
> writepage_t writepage_fn)
> {
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> + struct inode *inode = mapping->host;
> + const unsigned blkbits = inode->i_blkbits;
> struct bio *bio = NULL;
> - sector_t last_block_in_bio = 0;
> int ret = 0;
> int done = 0;
> int (*writepage)(struct page *page, struct writeback_control *wbc);
> @@ -648,6 +720,9 @@ __mpage_writepages(struct address_space
> pgoff_t end = -1; /* Inclusive */
> int scanned = 0;
> int is_range = 0;
> + struct mpageio mio = {
> + .bio = NULL
> + };
>
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> wbc->encountered_congestion = 1;
> @@ -655,7 +730,7 @@ __mpage_writepages(struct address_space
> }
>
> writepage = NULL;
> - if (get_block == NULL)
> + if (get_blocks == NULL)
> writepage = mapping->a_ops->writepage;
>
> pagevec_init(&pvec, 0);
> @@ -672,12 +747,15 @@ __mpage_writepages(struct address_space
> scanned = 1;
> }
> retry:
> + down_read(&inode->i_alloc_sem);
> while (!done && (index <= end) &&
> - (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> - PAGECACHE_TAG_DIRTY,
> + (nr_pages = pagevec_contig_lookup_tag(&pvec, mapping,
> + &index, PAGECACHE_TAG_DIRTY,
> min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
> unsigned i;
>
> + mio.final_block_in_request = min(index, end) <<
> + (PAGE_CACHE_SHIFT - blkbits);
> scanned = 1;
> for (i = 0; i < nr_pages; i++) {
> struct page *page = pvec.pages[i];
> @@ -702,7 +780,7 @@ retry:
> unlock_page(page);
> continue;
> }
> -
> +
> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
>
> @@ -723,9 +801,9 @@ retry:
> &mapping->flags);
> }
> } else {
> - bio = __mpage_writepage(bio, page, get_block,
> - &last_block_in_bio, &ret, wbc,
> - writepage_fn);
> + ret = __mpage_writepage(&mio, page, get_blocks,
> + wbc, writepage_fn);
> + bio = mio.bio;
> }
> if (ret || (--(wbc->nr_to_write) <= 0))
> done = 1;
> @@ -737,6 +815,9 @@ retry:
> pagevec_release(&pvec);
> cond_resched();
> }
> +
> + up_read(&inode->i_alloc_sem);
> +
> if (!scanned && !done) {
> /*
> * We hit the last page and there is more work to be done: wrap
> @@ -755,17 +836,23 @@ retry:
> EXPORT_SYMBOL(mpage_writepages);
> EXPORT_SYMBOL(__mpage_writepages);
>
> -int mpage_writepage(struct page *page, get_block_t get_block,
> - struct writeback_control *wbc)
> +int mpage_writepage(struct page *page, get_blocks_t get_blocks,
> + struct writeback_control *wbc, writepage_t writepage_fn)
> {
> int ret = 0;
> - struct bio *bio;
> - sector_t last_block_in_bio = 0;
> -
> - bio = __mpage_writepage(NULL, page, get_block,
> - &last_block_in_bio, &ret, wbc, NULL);
> - if (bio)
> - mpage_bio_submit(WRITE, bio);
> + struct address_space *mapping = page->mapping;
> + struct inode *inode = mapping->host;
> + const unsigned blkbits = inode->i_blkbits;
> + struct mpageio mio = {
> + .final_block_in_request = (page->index + 1) << (PAGE_CACHE_SHIFT
> + - blkbits)
> + };
> +
> + dump_stack();
> + ret = __mpage_writepage(&mio, page, get_blocks,
> + wbc, writepage_fn);
> + if (mio.bio)
> + mpage_bio_submit(WRITE, mio.bio);
>
> return ret;
> }
> diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/buffer_head.h linux-2.6.12-rc3-getblocks/include/linux/buffer_head.h
> --- linux-2.6.12-rc3/include/linux/buffer_head.h 2005-04-21 05:33:16.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/include/linux/buffer_head.h 2005-04-22 15:08:33.000000000 +0530
> @@ -203,8 +203,8 @@ int file_fsync(struct file *, struct den
> int nobh_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
> int nobh_commit_write(struct file *, struct page *, unsigned, unsigned);
> int nobh_truncate_page(struct address_space *, loff_t);
> -int nobh_writepage(struct page *page, get_block_t *get_block,
> - struct writeback_control *wbc);
> +int nobh_writepage(struct page *page, get_blocks_t *get_blocks,
> + struct writeback_control *wbc, writepage_t bh_writepage);
>
>
> /*
> diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/fs.h linux-2.6.12-rc3-getblocks/include/linux/fs.h
> --- linux-2.6.12-rc3/include/linux/fs.h 2005-04-21 05:33:16.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/include/linux/fs.h 2005-04-22 15:08:33.000000000 +0530
> @@ -304,6 +304,8 @@ struct address_space;
> struct writeback_control;
> struct kiocb;
>
> +typedef int (writepage_t)(struct page *page, struct writeback_control *wbc);
> +
> struct address_space_operations {
> int (*writepage)(struct page *page, struct writeback_control *wbc);
> int (*readpage)(struct file *, struct page *);
> diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/mpage.h linux-2.6.12-rc3-getblocks/include/linux/mpage.h
> --- linux-2.6.12-rc3/include/linux/mpage.h 2005-04-21 05:33:16.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/include/linux/mpage.h 2005-04-22 15:08:33.000000000 +0530
> @@ -11,17 +11,16 @@
> */
>
> struct writeback_control;
> -typedef int (writepage_t)(struct page *page, struct writeback_control *wbc);
>
> int mpage_readpages(struct address_space *mapping, struct list_head *pages,
> unsigned nr_pages, get_block_t get_block);
> int mpage_readpage(struct page *page, get_block_t get_block);
> int mpage_writepages(struct address_space *mapping,
> - struct writeback_control *wbc, get_block_t get_block);
> -int mpage_writepage(struct page *page, get_block_t *get_block,
> - struct writeback_control *wbc);
> + struct writeback_control *wbc, get_blocks_t get_blocks);
> +int mpage_writepage(struct page *page, get_blocks_t *get_blocks,
> + struct writeback_control *wbc, writepage_t writepage);
> int __mpage_writepages(struct address_space *mapping,
> - struct writeback_control *wbc, get_block_t get_block,
> + struct writeback_control *wbc, get_blocks_t get_blocks,
> writepage_t writepage);
>
> static inline int
> diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/pagemap.h linux-2.6.12-rc3-getblocks/include/linux/pagemap.h
> --- linux-2.6.12-rc3/include/linux/pagemap.h 2005-04-21 05:33:16.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/include/linux/pagemap.h 2005-04-22 15:08:33.000000000 +0530
> @@ -73,7 +73,8 @@ extern struct page * find_or_create_page
> unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
> unsigned int nr_pages, struct page **pages);
> unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
> - int tag, unsigned int nr_pages, struct page **pages);
> + int tag, unsigned int nr_pages, struct page **pages,
> + int contig);
>
> /*
> * Returns locked page at given index in given cache, creating it if needed.
> diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/pagevec.h linux-2.6.12-rc3-getblocks/include/linux/pagevec.h
> --- linux-2.6.12-rc3/include/linux/pagevec.h 2005-04-21 05:33:16.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/include/linux/pagevec.h 2005-04-22 15:08:33.000000000 +0530
> @@ -28,6 +28,9 @@ unsigned pagevec_lookup(struct pagevec *
> unsigned pagevec_lookup_tag(struct pagevec *pvec,
> struct address_space *mapping, pgoff_t *index, int tag,
> unsigned nr_pages);
> +unsigned pagevec_contig_lookup_tag(struct pagevec *pvec,
> + struct address_space *mapping, pgoff_t *index, int tag,
> + unsigned nr_pages);
>
> static inline void pagevec_init(struct pagevec *pvec, int cold)
> {
> diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/radix-tree.h linux-2.6.12-rc3-getblocks/include/linux/radix-tree.h
> --- linux-2.6.12-rc3/include/linux/radix-tree.h 2005-04-21 05:33:16.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/include/linux/radix-tree.h 2005-04-22 15:08:33.000000000 +0530
> @@ -59,8 +59,18 @@ void *radix_tree_tag_clear(struct radix_
> int radix_tree_tag_get(struct radix_tree_root *root,
> unsigned long index, int tag);
> unsigned int
> -radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
> - unsigned long first_index, unsigned int max_items, int tag);
> +__radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
> + unsigned long first_index, unsigned int max_items, int tag,
> + int contig);
> +
> +static inline unsigned int radix_tree_gang_lookup_tag(struct radix_tree_root
> + *root, void **results, unsigned long first_index,
> + unsigned int max_items, int tag)
> +{
> + return __radix_tree_gang_lookup_tag(root, results, first_index,
> + max_items, tag, 0);
> +}
> +
> int radix_tree_tagged(struct radix_tree_root *root, int tag);
>
> static inline void radix_tree_preload_end(void)
> diff -urp -X dontdiff linux-2.6.12-rc3/lib/radix-tree.c linux-2.6.12-rc3-getblocks/lib/radix-tree.c
> --- linux-2.6.12-rc3/lib/radix-tree.c 2005-04-21 05:33:16.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/lib/radix-tree.c 2005-04-22 16:34:29.000000000 +0530
> @@ -557,12 +557,13 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
> */
> static unsigned int
> __lookup_tag(struct radix_tree_root *root, void **results, unsigned long index,
> - unsigned int max_items, unsigned long *next_index, int tag)
> + unsigned int max_items, unsigned long *next_index, int tag, int contig)
> {
> unsigned int nr_found = 0;
> unsigned int shift;
> unsigned int height = root->height;
> struct radix_tree_node *slot;
> + unsigned long cindex = (contig && (*next_index)) ? *next_index : -1;
>
> shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> slot = root->rnode;
> @@ -575,6 +576,11 @@ __lookup_tag(struct radix_tree_root *roo
> BUG_ON(slot->slots[i] == NULL);
> break;
> }
> + if (contig && index >= cindex) {
> + /* break in contiguity */
> + index = 0;
> + goto out;
> + }
> index &= ~((1UL << shift) - 1);
> index += 1UL << shift;
> if (index == 0)
> @@ -593,6 +599,10 @@ __lookup_tag(struct radix_tree_root *roo
> results[nr_found++] = slot->slots[j];
> if (nr_found == max_items)
> goto out;
> + } else if (contig && nr_found) {
> + /* break in contiguity */
> + index = 0;
> + goto out;
> }
> }
> }
> @@ -618,29 +628,32 @@ out:
> * returns the number of items which were placed at *@results.
> */
> unsigned int
> -radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
> - unsigned long first_index, unsigned int max_items, int tag)
> +__radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
> + unsigned long first_index, unsigned int max_items, int tag,
> + int contig)
> {
> const unsigned long max_index = radix_tree_maxindex(root->height);
> unsigned long cur_index = first_index;
> + unsigned long next_index = 0; /* Index of next contiguous search */
> unsigned int ret = 0;
>
> while (ret < max_items) {
> unsigned int nr_found;
> - unsigned long next_index; /* Index of next search */
>
> if (cur_index > max_index)
> break;
> nr_found = __lookup_tag(root, results + ret, cur_index,
> - max_items - ret, &next_index, tag);
> + max_items - ret, &next_index, tag, contig);
> ret += nr_found;
> if (next_index == 0)
> break;
> cur_index = next_index;
> + if (!nr_found)
> + next_index = 0;
> }
> return ret;
> }
> -EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
> +EXPORT_SYMBOL(__radix_tree_gang_lookup_tag);
>
> /**
> * radix_tree_delete - delete an item from a radix tree
> diff -urp -X dontdiff linux-2.6.12-rc3/mm/filemap.c linux-2.6.12-rc3-getblocks/mm/filemap.c
> --- linux-2.6.12-rc3/mm/filemap.c 2005-04-21 05:33:16.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/mm/filemap.c 2005-04-22 16:20:30.000000000 +0530
> @@ -635,16 +635,19 @@ unsigned find_get_pages(struct address_s
> /*
> * Like find_get_pages, except we only return pages which are tagged with
> * `tag'. We update *index to index the next page for the traversal.
> + * If 'contig' is 1, then we return only pages which are contiguous in the
> + * file.
> */
> unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
> - int tag, unsigned int nr_pages, struct page **pages)
> + int tag, unsigned int nr_pages, struct page **pages,
> + int contig)
> {
> unsigned int i;
> unsigned int ret;
>
> read_lock_irq(&mapping->tree_lock);
> - ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
> - (void **)pages, *index, nr_pages, tag);
> + ret = __radix_tree_gang_lookup_tag(&mapping->page_tree,
> + (void **)pages, *index, nr_pages, tag, contig);
> for (i = 0; i < ret; i++)
> page_cache_get(pages[i]);
> if (ret)
> diff -urp -X dontdiff linux-2.6.12-rc3/mm/swap.c linux-2.6.12-rc3-getblocks/mm/swap.c
> --- linux-2.6.12-rc3/mm/swap.c 2005-04-21 05:33:16.000000000 +0530
> +++ linux-2.6.12-rc3-getblocks/mm/swap.c 2005-04-22 15:08:33.000000000 +0530
> @@ -384,7 +384,16 @@ unsigned pagevec_lookup_tag(struct pagev
> pgoff_t *index, int tag, unsigned nr_pages)
> {
> pvec->nr = find_get_pages_tag(mapping, index, tag,
> - nr_pages, pvec->pages);
> + nr_pages, pvec->pages, 0);
> + return pagevec_count(pvec);
> +}
> +
> +unsigned int
> +pagevec_contig_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
> + pgoff_t *index, int tag, unsigned nr_pages)
> +{
> + pvec->nr = find_get_pages_tag(mapping, index, tag,
> + nr_pages, pvec->pages, 1);
> return pagevec_count(pvec);
> }
>
Oops, sorry about the empty message ...
> -static int ext3_writepages_get_block(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh, int create)
> +static int ext3_writepages_get_blocks(struct inode *inode, sector_t iblock,
> + unsigned long max_blocks, struct buffer_head *bh, int create)
> {
> - return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create);
> + return ext3_direct_io_get_blocks(inode, iblock, max_blocks, bh, create);
> }
>
I have a question here: ext3_direct_io_get_blocks uses DIO_CREDITS
(EXT3_RESERVE_TRANS_BLOCKS + 32) to reserve space for journalling, but that
seems to be based on the assumption of updating one data block at a time.
Is it sufficient to re-use that routine for multiple block allocation
here? Don't we need something like ext3_writepage_trans_blocks() here?
Thanks,
Mingming
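
As a rough illustration of why a credit reservation sized for a single-block
update may fall short once get_blocks() allocates many blocks in one
transaction, the sketch below estimates credits as a function of the block
count. The helper name and all constants are assumptions made for the
example, not values taken from the ext3 code (which is what helpers in the
style of ext3_writepage_trans_blocks() actually compute):

#include <stdio.h>

/*
 * Illustrative credit estimate for an N-block allocation. The constants
 * below are assumed figures for the sketch, not kernel values.
 */
static int estimate_journal_credits(unsigned long nblocks, int journal_data)
{
	int addrs_per_indirect = 1024;   /* assumed: 4K blocks, 4-byte block numbers */
	int blocks_per_group   = 32768;  /* assumed blocks per block group           */

	/* indirect blocks the range may touch, plus one per tree level */
	int indirects = nblocks / addrs_per_indirect + 3;
	/* each block group touched costs a bitmap plus a group descriptor */
	int groups    = nblocks / blocks_per_group + 1;
	int credits   = indirects + 2 * groups + 2;   /* + inode + superblock */

	if (journal_data)
		credits += nblocks;   /* data blocks go through the journal too */
	return credits;
}

int main(void)
{
	printf("1 block, ordered     : ~%d credits\n", estimate_journal_credits(1, 0));
	printf("2048 blocks, ordered : ~%d credits\n", estimate_journal_credits(2048, 0));
	printf("2048 blocks, journal : ~%d credits\n", estimate_journal_credits(2048, 1));
	return 0;
}

The point of the toy numbers is only that the estimate grows once the range
spans several indirect blocks or block groups, or when data journalling is
on, which is why a fixed single-block reservation looks questionable for
the multi-block path.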
On Fri, Apr 29, 2005 at 10:10:08AM -0700, Mingming Cao wrote:
> On Fri, 2005-04-29 at 19:22 +0530, Suparna Bhattacharya wrote:
> > On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote:
> > > Currently ext3_get_block()/ext3_new_block() only allocate one block at a
> > > time. To allocate multiple blocks, the caller, for example, ext3 direct
> > > IO routine, has to invoke ext3_get_block() many times. This is quite
> > > inefficient for sequential IO workload.
> > >
> > > The benefit of a real get_blocks() include
> > > 1) increase the possibility to get contiguous blocks, reduce possibility
> > > of fragmentation due to interleaved allocations from other threads.
> > > (should good for non reservation case)
> > > 2) Reduces CPU cycles spent in repeated get_block() calls
> > > 3) Batch meta data update and journaling in one short
> > > 4) Could possibly speed up future get_blocks() look up by cache the last
> > > mapped blocks in inode.
> > >
> >
> > And here is the patch to make mpage_writepages use get_blocks() for
> > multiple block lookup/allocation. It performs a radix-tree contiguous
> > pages lookup, and issues a get_blocks for the range together. It maintains
> > an mpageio structure to track intermediate mapping state, somewhat
> > like the DIO code.
> >
> > It does need some more testing, especially block_size < PAGE_SIZE.
> > The JFS workaround can be dropped if the JFS get_blocks fix from
> > Dave Kleikamp is integrated.
> >
> > Review feedback would be welcome.
> >
> > Mingming,
> > Let me know if you have a chance to try this out with your patch.
>
>
> Sure, Suparna, I will try your patch soon!
>
> In my patch, I have modified ext3 directo io code to make use of
> ext3_get_blocks(). Tested with a simple file write with O_DIRECT, seems
> work fine! Allocating blocks for a 120k file only invokes
> ext3_get_blocks() for 4 times(perfect is 1, but before is 30 times call
> to ext3_get_block). Among the 4 calls to ext3_get_blocks, 2 because of
> reach the meta data block boundary(direct ->indirect), another 2 because
> of reach the end of the reservation window. For the later 2, we could
> avoid that by extend the reservation window before calling
> ext3_new_blocks() if the window size is less than the number of blocks
> to allocate.
>
> But if it try to allocating blocks in the hole (with direct IO), blocks
> are allocated one by one. I am looking at it right now.
That is expected, because DIO falls back to using buffered IO for
overwriting holes. This is to avoid some tricky races between DIO
and buffered IO that could otherwise have led to exposure of
stale data.
Regards
Suparna
>
> Thanks,
> Mingming
>
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Fri, Apr 29, 2005 at 04:22:59PM -0700, Mingming Cao wrote:
> On Fri, 2005-04-29 at 11:45 -0700, Badari Pulavarty wrote:
> > I touch tested your patch earlier and seems to work fine. Lets integrate
> > Mingming's getblocks() patches with this and see if we get any benifit
> > from the whole effort.
> >
>
> I tried Suparna's mpage_writepages_getblocks patch with my
> ext3_get_blocks patch, seems to work fine, except that still only one
> block is allocated at a time. I got a little confused....
>
> I did not see any delayed allocation code in your patch, I assume you
> have to update ext3_prepare_write to not call ext3_get_block, so that
> block allocation will be defered at ext3_writepages time. So without the
> delayed allocation part, the get_blocks in mpage_writepages is doing
> multiple blocks look up only, right?
That's right - so you could try it with mmapped writes (which don't
go through a prepare write) or with Badari's patch to not call
ext3_get_block in prepare write.
Regards
Suparna
>
>
> Mingming
>
>
>
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Fri, Apr 29, 2005 at 11:45:21AM -0700, Badari Pulavarty wrote:
> Suparna Bhattacharya wrote:
> >On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote:
> >
> >>Currently ext3_get_block()/ext3_new_block() only allocate one block at a
> >>time. To allocate multiple blocks, the caller, for example, ext3 direct
> >>IO routine, has to invoke ext3_get_block() many times. This is quite
> >>inefficient for sequential IO workload.
> >>
> >>The benefit of a real get_blocks() include
> >>1) increase the possibility to get contiguous blocks, reduce possibility
> >>of fragmentation due to interleaved allocations from other threads.
> >>(should good for non reservation case)
> >>2) Reduces CPU cycles spent in repeated get_block() calls
> >>3) Batch meta data update and journaling in one short
> >>4) Could possibly speed up future get_blocks() look up by cache the last
> >>mapped blocks in inode.
> >>
> >
> >
> >And here is the patch to make mpage_writepages use get_blocks() for
> >multiple block lookup/allocation. It performs a radix-tree contiguous
> >pages lookup, and issues a get_blocks for the range together. It maintains
> >an mpageio structure to track intermediate mapping state, somewhat
> >like the DIO code.
> >
> >It does need some more testing, especially block_size < PAGE_SIZE.
> >The JFS workaround can be dropped if the JFS get_blocks fix from
> >Dave Kleikamp is integrated.
> >
> >Review feedback would be welcome.
> >
> >Mingming,
> >Let me know if you have a chance to try this out with your patch.
> >
> >Regards
> >Suparna
> >
>
> Suparna,
>
> I touch tested your patch earlier and seems to work fine. Lets integrate
> Mingming's getblocks() patches with this and see if we get any benifit
> from the whole effort.
>
> BTW, is it the plan to remove repeated calls to getblocks() and work
> with whatever getblocks() returned ? Or do you find the its better to
> repeatedly do getblocks() in a loop till you satisfy all the pages
> in the list (as long as blocks are contiguous) ?
The patch only loops through repeated get_blocks calls up to PAGE_SIZE, just
like the earlier mpage_writepage code - in this case, if blocks aren't
contiguous at least up to PAGE_SIZE, it falls back to the confused case
(since that means multiple I/Os for the same page), just as before.
What is new is the numblocks argument specified to get_blocks, so when it
gets back more contiguous blocks it can remember that mapping and avoid
get_blocks calls for subsequent pages.
Does that clarify?
Regards
Suparna
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
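
To make the extent-caching idea concrete, here is a small userspace sketch
under stated assumptions: fake_get_blocks() is a stand-in mapper that hands
out contiguity in runs of 8 blocks, and the loop only re-issues a lookup
when a page's first block falls outside the cached extent. It illustrates
the idea described above; it is not the mpage code itself.

#include <stdio.h>

/* Cached result of the last get_blocks()-style lookup. */
struct extent_cache {
	unsigned long start;   /* first file block of the cached extent */
	unsigned long len;     /* number of contiguous blocks cached    */
};

/* Stand-in mapper: pretend the filesystem allocates in contiguous runs of 8. */
static unsigned long fake_get_blocks(unsigned long block, unsigned long max,
				     unsigned long *calls)
{
	unsigned long run = 8 - (block % 8);

	(*calls)++;
	return run < max ? run : max;
}

static void map_range(unsigned long first, unsigned long last,
		      unsigned long blocks_per_page)
{
	struct extent_cache ec = { 0, 0 };
	unsigned long calls = 0;
	unsigned long block;

	for (block = first; block < last; block += blocks_per_page) {
		if (block < ec.start || block >= ec.start + ec.len) {
			/* cache miss: ask for the whole remaining request */
			ec.start = block;
			ec.len = fake_get_blocks(block, last - block, &calls);
		}
		/* blocks for this page now come from ec without a new call
		 * (simplified: assumes the extent covers the full page)    */
	}
	printf("mapped %lu blocks with %lu lookup calls\n", last - first, calls);
}

int main(void)
{
	map_range(0, 64, 4);   /* e.g. 16 pages of 4 blocks each */
	return 0;
}

With a one-block-at-a-time get_block() the same range would take one call
per block; the saved extent is the piece of state the patch's mpageio
structure carries from page to page.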
On Fri, Apr 29, 2005 at 05:44:26PM -0700, Mingming Cao wrote:
> Oops, sorry about the empty message ...
>
> > -static int ext3_writepages_get_block(struct inode *inode, sector_t iblock,
> > - struct buffer_head *bh, int create)
> > +static int ext3_writepages_get_blocks(struct inode *inode, sector_t iblock,
> > + unsigned long max_blocks, struct buffer_head *bh, int create)
> > {
> > - return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create);
> > + return ext3_direct_io_get_blocks(inode, iblock, max_blocks, bh, create);
> > }
> >
>
> I have a question here, ext3_direct_io_get_blocks use DIO_CREDITS
> (EXT3_RESERVE_TRANS_BLOCKS + 32 = ) to reserve the space for
> journalling, but it seems based on assumption of one data block update
> once a time. Is it sufficent to re-use that routine for multiple block
> allocation here? Don't we need something like
> ext3_writepage_trans_blocks() here?
Quite likely - with your patch, as get_blocks actually allocates
multiple blocks at a time, the min credits estimate would
change for ext3_direct_io_get_blocks/ext3_writepages_get_blocks.
Regards
Suparna
>
> Thanks,
> Mingming
>
>
>
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Sat, Apr 30, 2005 at 09:40:43PM +0530, Suparna Bhattacharya wrote:
> On Fri, Apr 29, 2005 at 04:22:59PM -0700, Mingming Cao wrote:
> > On Fri, 2005-04-29 at 11:45 -0700, Badari Pulavarty wrote:
> > > I touch tested your patch earlier and seems to work fine. Lets integrate
> > > Mingming's getblocks() patches with this and see if we get any benifit
> > > from the whole effort.
> > >
> >
> > I tried Suparna's mpage_writepages_getblocks patch with my
> > ext3_get_blocks patch, seems to work fine, except that still only one
> > block is allocated at a time. I got a little confused....
> >
> > I did not see any delayed allocation code in your patch, I assume you
> > have to update ext3_prepare_write to not call ext3_get_block, so that
> > block allocation will be defered at ext3_writepages time. So without the
> > delayed allocation part, the get_blocks in mpage_writepages is doing
> > multiple blocks look up only, right?
>
> That's right - so you could try it with mmapped writes (which don't
> go through a prepare write) or with Badari's patch to not call
> ext3_get_block in prepare write.
BTW, I hope you are running with NOBH.
Regards
Suparna
>
> Regards
> Suparna
> >
> >
> > Mingming
> >
> >
> >
>
> --
> Suparna Bhattacharya ([email protected])
> Linux Technology Center
> IBM Software Lab, India
>
>
>
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Sat, 2005-04-30 at 22:41 +0530, Suparna Bhattacharya wrote:
> On Sat, Apr 30, 2005 at 09:40:43PM +0530, Suparna Bhattacharya wrote:
> > On Fri, Apr 29, 2005 at 04:22:59PM -0700, Mingming Cao wrote:
> > > On Fri, 2005-04-29 at 11:45 -0700, Badari Pulavarty wrote:
> > > > I touch tested your patch earlier and seems to work fine. Lets integrate
> > > > Mingming's getblocks() patches with this and see if we get any benifit
> > > > from the whole effort.
> > > >
> > >
> > > I tried Suparna's mpage_writepages_getblocks patch with my
> > > ext3_get_blocks patch, seems to work fine, except that still only one
> > > block is allocated at a time. I got a little confused....
> > >
> > > I did not see any delayed allocation code in your patch, I assume you
> > > have to update ext3_prepare_write to not call ext3_get_block, so that
> > > block allocation will be defered at ext3_writepages time. So without the
> > > delayed allocation part, the get_blocks in mpage_writepages is doing
> > > multiple blocks look up only, right?
> >
> > That's right - so you could try it with mmapped writes (which don't
> > go through a prepare write) or with Badari's patch to not call
> > ext3_get_block in prepare write.
>
> BTW, I hope you are running with NOBH.
No, it was not run with NOBH. Why do we need to turn on nobh here?
There are some issues running fsx tests with both patches, without the
direct IO option. I will spend more time on this next week.
thanks,
Mingming
On Sat, Apr 30, 2005 at 11:07:53AM -0700, Mingming Cao wrote:
> On Sat, 2005-04-30 at 22:41 +0530, Suparna Bhattacharya wrote:
> > On Sat, Apr 30, 2005 at 09:40:43PM +0530, Suparna Bhattacharya wrote:
> > > On Fri, Apr 29, 2005 at 04:22:59PM -0700, Mingming Cao wrote:
> > > > On Fri, 2005-04-29 at 11:45 -0700, Badari Pulavarty wrote:
> > > > > I touch tested your patch earlier and seems to work fine. Lets integrate
> > > > > Mingming's getblocks() patches with this and see if we get any benifit
> > > > > from the whole effort.
> > > > >
> > > >
> > > > I tried Suparna's mpage_writepages_getblocks patch with my
> > > > ext3_get_blocks patch, seems to work fine, except that still only one
> > > > block is allocated at a time. I got a little confused....
> > > >
> > > > I did not see any delayed allocation code in your patch, I assume you
> > > > have to update ext3_prepare_write to not call ext3_get_block, so that
> > > > block allocation will be defered at ext3_writepages time. So without the
> > > > delayed allocation part, the get_blocks in mpage_writepages is doing
> > > > multiple blocks look up only, right?
> > >
> > > That's right - so you could try it with mmapped writes (which don't
> > > go through a prepare write) or with Badari's patch to not call
> > > ext3_get_block in prepare write.
> >
> > BTW, I hope you are running with NOBH.
>
> No, it was not run with NOBH. Why we need to turn on nobh here?
If the page has buffers, then get_blocks won't be invoked -- either it
finds all the buffers in the page mapped and contiguous, in which case it
can issue the IO directly, or it enters the confused case where it goes
through block_write_full_page.
>
> There are some issue running fsx tests with both patches, w/o direct io
> option. I will spend more time on this next week.
OK, I can take a look at the fsx logs, just in case it is similar to
something I've seen before.
Regards
Suparna
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
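
A minimal sketch of that three-way dispatch, with stand-in predicates for
page_has_buffers() and the mapped/contiguous checks; this only illustrates
the decision described above, not the mpage code:

#include <stdio.h>
#include <stdbool.h>

/* Stand-in for the writepage dispatch described above. */
static const char *writepage_path(bool has_buffers, bool mapped_contig)
{
	if (has_buffers) {
		if (mapped_contig)
			return "build the bio directly from the buffer_heads";
		return "confused case: fall back to block_write_full_page()";
	}
	/* nobh: no buffer_heads attached, so the range is mapped via get_blocks() */
	return "map the range with get_blocks()";
}

int main(void)
{
	printf("buffers, contiguous: %s\n", writepage_path(true, true));
	printf("buffers, fragmented: %s\n", writepage_path(true, false));
	printf("no buffers (nobh)  : %s\n", writepage_path(false, false));
	return 0;
}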
Currently ext3_get_block() only maps or allocates one block at a time.
This is quite inefficient for sequential IO workloads.
I posted an early implementation of simple multiple block mapping and
allocation for current ext3. The basic idea is to allocate the first
block in the existing way, then attempt to allocate the next adjacent
blocks on a best-effort basis. A more detailed description of the
implementation can be found here:
http://marc.theaimsgroup.com/?l=ext2-devel&m=112162230003522&w=2
The following is the latest version of the patch: I broke the original
patch into 5 patches, reworked some of the logic, and fixed some bugs.
The breakdown is:
[patch 1] Adding map multiple blocks at a time in ext3_get_blocks()
[patch 2] Extend ext3_get_blocks() to support multiple block allocation
[patch 3] Implement multiple block allocation in ext3-try-to-allocate
(called via ext3_new_block()).
[patch 4] Proper accounting updates in ext3_new_blocks()
[patch 5] Adjust reservation window size properly (by the given number
of blocks to allocate) before block allocation to increase the
possibility of allocating multiple blocks in a single call.
Tests done so far include fsx, tiobench and dbench. The following
numbers, collected from Direct IO tests (1G file creation/read), show
that system time is greatly reduced (by more than 50% on my 8-CPU
system) with the patches.
1G file DIO write:
2.6.15 2.6.15+patches
real 0m31.275s 0m31.161s
user 0m0.000s 0m0.000s
sys 0m3.384s 0m0.564s
1G file DIO read:
2.6.15 2.6.15+patches
real 0m30.733s 0m30.624s
user 0m0.000s 0m0.004s
sys 0m0.748s 0m0.380s
Some previous tests we did on buffered IO, using multiple block
allocation together with delayed allocation, showed a noticeable
improvement in throughput and system time.
The patches are against 2.6.15. Please consider adding them to the -mm tree.
Thanks!
Mingming
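
The idea behind patch 5 above can be pictured with a toy sketch: if the
reservation window is smaller than the number of blocks the caller wants,
grow it before allocating so the request has a chance to stay contiguous.
The structure and helper below are simplified stand-ins, not the ext3
reservation code:

#include <stdio.h>

/* Simplified stand-in for an ext3-style reservation window. */
struct rsv_window {
	unsigned long start;
	unsigned long end;   /* inclusive */
};

static unsigned long rsv_size(const struct rsv_window *w)
{
	return w->end - w->start + 1;
}

/* Grow the window (best effort) so it can cover the requested allocation. */
static void extend_window_for(struct rsv_window *w, unsigned long nblocks)
{
	if (rsv_size(w) < nblocks)
		w->end = w->start + nblocks - 1;
}

int main(void)
{
	struct rsv_window w = { .start = 1000, .end = 1007 };   /* 8 blocks */

	extend_window_for(&w, 32);   /* caller wants 32 contiguous blocks */
	printf("window now covers %lu blocks (%lu-%lu)\n",
	       rsv_size(&w), w.start, w.end);
	return 0;
}

A real implementation would also have to respect neighbouring reservation
windows; the sketch only shows where the resize happens relative to the
allocation call.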
Mingming Cao <[email protected]> wrote:
>
> Tests done so far includes fsx,tiobench and dbench. The following
> numbers collected from Direct IO tests (1G file creation/read) shows
> the system time have been greatly reduced (more than 50% on my 8 cpu
> system) with the patches.
>
> 1G file DIO write:
> 2.6.15 2.6.15+patches
> real 0m31.275s 0m31.161s
> user 0m0.000s 0m0.000s
> sys 0m3.384s 0m0.564s
>
>
> 1G file DIO read:
> 2.6.15 2.6.15+patches
> real 0m30.733s 0m30.624s
> user 0m0.000s 0m0.004s
> sys 0m0.748s 0m0.380s
>
> Some previous test we did on buffered IO with using multiple blocks
> allocation and delayed allocation shows noticeable improvement on
> throughput and system time.
I'd be interested in seeing benchmark results for the common
allocate-one-block case - just normal old buffered IO without any
additional multiblock patches. Would they show any regression?
On Tue, 2006-01-10 at 21:25 -0800, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > Tests done so far includes fsx,tiobench and dbench. The following
> > numbers collected from Direct IO tests (1G file creation/read) shows
> > the system time have been greatly reduced (more than 50% on my 8 cpu
> > system) with the patches.
> >
> > 1G file DIO write:
> > 2.6.15 2.6.15+patches
> > real 0m31.275s 0m31.161s
> > user 0m0.000s 0m0.000s
> > sys 0m3.384s 0m0.564s
> >
> >
> > 1G file DIO read:
> > 2.6.15 2.6.15+patches
> > real 0m30.733s 0m30.624s
> > user 0m0.000s 0m0.004s
> > sys 0m0.748s 0m0.380s
> >
> > Some previous test we did on buffered IO with using multiple blocks
> > allocation and delayed allocation shows noticeable improvement on
> > throughput and system time.
>
> I'd be interested in seeing benchmark results for the common
> allocate-one-block case - just normal old buffered IO without any
> additional multiblock patches. Would they show any regression?
>
Hi Andrew,
One thing I want to clarify: for buffered IO, even with the multiple
block patches, ext3 currently still allocates one block at a time (we
will need delayed allocation to make use of the multiple block
allocation).
I did the same test on buffered IO, with and without the patches. There
is very little regression (less than 1%, which could be considered
noise) comparing the patched 2.6.15 kernel against the plain 2.6.15 kernel:
buffered IO write: (no sync)
# time ./filetst -b 1048576 -w -f /mnt/a
2.6.15 2.6.15+patches
real 0m25.773s 0m26.102s
user 0m0.004s 0m0.000s
sys 0m15.065s 0m16.053s
buffered IO read (after umount/remount)
# time ./filetst -b 1048576 -r -f /mnt/a
2.6.15 2.6.15+patches
real 0m29.257s 0m29.257s
user 0m0.000s 0m0.000s
sys 0m6.996s 0m6.980s
But I do notice a regression between the vanilla 2.6.14 kernel and the
vanilla 2.6.15 kernel on buffered IO (about 18%).
# time ./filetst -b 1048576 -w -f /mnt/a
2.6.14 2.6.15
real 0m21.710s 0m25.773s
user 0m0.012s 0m0.004s
sys 0m14.569s 0m15.065s
I also found that tiobench (sequential write test) and dbench show a
similar regression between 2.6.14 and 2.6.15. In fact, 2.6.15-rc2
already has the regression. Is this a known issue? Anyway, I will
continue looking into it...
Thanks,
Mingming
Mingming Cao <[email protected]> wrote:
>
> # time ./filetst -b 1048576 -w -f /mnt/a
> 2.6.14 2.6.15
> real 0m21.710s 0m25.773s
> user 0m0.012s 0m0.004s
> sys 0m14.569s 0m15.065s
That's a big drop.
Was it doing I/O, or was it all from pagecache?
> I also found tiobench(sequential write test) and dbench has similar
> regression between 2.6.14 and 2.6.15. Actually I found 2.6.15 rc2
> already has the regression. Is this a known issue?
No, it is not known.
> Anyway I will continue looking at the issue...
Thanks.
On Wed, 2006-01-11 at 11:43 -0800, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > # time ./filetst -b 1048576 -w -f /mnt/a
> > 2.6.14 2.6.15
> > real 0m21.710s 0m25.773s
> > user 0m0.012s 0m0.004s
> > sys 0m14.569s 0m15.065s
>
> That's a big drop.
>
> Was it doing I/O, or was it all from pagecache?
>
It's doing IO on a freshly created filesystem, running with the AS I/O
scheduler. DIO seems to have no problem, and ext2 shows no regression
either... Will keep you posted.
Mingming
On Wed, 2006-01-11 at 11:43 -0800, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > # time ./filetst -b 1048576 -w -f /mnt/a
> > 2.6.14 2.6.15
> > real 0m21.710s 0m25.773s
> > user 0m0.012s 0m0.004s
> > sys 0m14.569s 0m15.065s
>
> That's a big drop.
>
> Was it doing I/O, or was it all from pagecache?
>
> > I also found tiobench(sequential write test) and dbench has similar
> > regression between 2.6.14 and 2.6.15. Actually I found 2.6.15 rc2
> > already has the regression. Is this a known issue?
>
> No, it is not known.
>
> > Anyway I will continue looking at the issue...
>
> Thanks.
Hi, Andrew,
I did some tracing, and it turns out there is no regression between
2.6.14 and 2.6.15, and no problem in the ext3 filesystem. I was
comparing apples to oranges: the tests were run with two different I/O
schedulers. That explains the bogus throughput difference I reported to
you earlier this week.
I gave the same boot option "elevator=as" to both 2.6.14 and 2.6.15-rc2
(this has worked for me for a long time to enable the anticipatory
scheduler), but the I/O schedulers actually selected on the two kernels
are different (see elevator_setup_default()). On 2.6.14, the fallback
I/O scheduler (used if the chosen I/O scheduler is not found) is the
default I/O scheduler (anticipatory, in this case), but since
2.6.15-rc1 the semantics have changed to fall back to noop.
Is there any reason to fall back to noop instead of as? Anticipatory
seems much better than noop for ext3 with large sequential write tests
(e.g., a 1G dd test)...
Thanks,
Mingming
Mingming Cao <[email protected]> wrote:
>
> On 2.6.14, the
> fall back io scheduler (if the chosen io scheduler is not found) is set
> to the default io scheduler (anticipatory, in this case), but since
> 2.6.15-rc1, this semanistic is changed to fall back to noop.
OK. And I assume that AS wasn't compiled, so that's why it fell back?
I actually thought that elevator= got removed, now we have
/sys/block/sda/queue/scheduler. But I guess that's not very useful with
CONFIG_SYSFS=n.
> Is there any reason to fall back to noop instead of as? It seems
> anticipatory is much better than noop for ext3 with large sequential
> write tests (i.e, 1G dd test) ...
I suspect that was an accident. Jens?
On Fri, Jan 13, 2006 at 05:49:14PM -0800, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > On 2.6.14, the
> > fall back io scheduler (if the chosen io scheduler is not found) is set
> > to the default io scheduler (anticipatory, in this case), but since
> > 2.6.15-rc1, this semanistic is changed to fall back to noop.
>
> OK. And I assume that AS wasn't compiled, so that's why it fell back?
>
> I actually thought that elevator= got removed, now we have
> /sys/block/sda/queue/scheduler. But I guess that's not very useful with
> CONFIG_SYSFS=n.
It's also a lifesaver if the default scheduler happens to trigger some
breakage preventing boot, and you can tell users to work around it with
a boot parameter until the real problem is fixed (which has bitten us
twice now, as Fedora, like several other distros, chooses CFQ by default).
Dave
In-Reply-To: <[email protected]>
On Fri, 13 Jan 2006, Andrew Morton wrote:
> OK. And I assume that AS wasn't compiled, so that's why it fell back?
As of 2.6.15 you need to use "anticipatory" instead of "as".
Maybe this patch would help?
Signed-off-by: Chuck Ebbert <[email protected]>
--- 2.6.15a.orig/block/elevator.c
+++ 2.6.15a/block/elevator.c
@@ -150,6 +150,13 @@ static void elevator_setup_default(void)
if (!chosen_elevator[0])
strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
+ /*
+ * Be backwards-compatible with previous kernels, so users
+ * won't get the wrong elevator.
+ */
+ if (!strcmp(chosen_elevator, "as"))
+ strcpy(chosen_elevator, "anticipatory");
+
/*
* If the given scheduler is not available, fall back to no-op.
*/
--
Chuck
Currently reading: _Olympos_ by Dan Simmons
On Fri, Jan 13 2006, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > On 2.6.14, the
> > fall back io scheduler (if the chosen io scheduler is not found) is set
> > to the default io scheduler (anticipatory, in this case), but since
> > 2.6.15-rc1, this semanistic is changed to fall back to noop.
>
> OK. And I assume that AS wasn't compiled, so that's why it fell back?
>
> I actually thought that elevator= got removed, now we have
> /sys/block/sda/queue/scheduler. But I guess that's not very useful with
> CONFIG_SYSFS=n.
>
> > Is there any reason to fall back to noop instead of as? It seems
> > anticipatory is much better than noop for ext3 with large sequential
> > write tests (i.e, 1G dd test) ...
>
> I suspect that was an accident. Jens?
It is, it makes more sense to fallback to the default of course.
--
Jens Axboe
On Sat, Jan 14 2006, Chuck Ebbert wrote:
> In-Reply-To: <[email protected]>
>
> On Fri, 13 Jan 2006, Andrew Morton wrote:
>
> > OK. And I assume that AS wasn't compiled, so that's why it fell back?
>
> As of 2.6.15 you need to use "anticipatory" instead of "as".
>
> Maybe this patch would help?
>
> Signed-off-by: Chuck Ebbert <[email protected]>
>
> --- 2.6.15a.orig/block/elevator.c
> +++ 2.6.15a/block/elevator.c
> @@ -150,6 +150,13 @@ static void elevator_setup_default(void)
> if (!chosen_elevator[0])
> strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
>
> + /*
> + * Be backwards-compatible with previous kernels, so users
> + * won't get the wrong elevator.
> + */
> + if (!strcmp(chosen_elevator, "as"))
> + strcpy(chosen_elevator, "anticipatory");
> +
> /*
> * If the given scheduler is not available, fall back to no-op.
> */
We probably should apply this, since it used to be 'as'.
--
Jens Axboe
On Fri, 2006-01-13 at 17:49 -0800, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > On 2.6.14, the
> > fall back io scheduler (if the chosen io scheduler is not found) is set
> > to the default io scheduler (anticipatory, in this case), but since
> > 2.6.15-rc1, this semanistic is changed to fall back to noop.
>
> OK. And I assume that AS wasn't compiled, so that's why it fell back?
>
AS was compiled, and AS is set as the default scheduler. But since
2.6.15 doesn't recognize "elevator=as" (we need to say
"elevator=anticipatory"), elevator_setup_default() explicitly falls back
to the "noop" scheduler.
2.6.14 doesn't recognize "elevator=as" either, but it falls back to the
default scheduler instead, which makes more sense, as Jens pointed out.
> I actually thought that elevator= got removed, now we have
> /sys/block/sda/queue/scheduler. But I guess that's not very useful with
> CONFIG_SYSFS=n.
>
> > Is there any reason to fall back to noop instead of as? It seems
> > anticipatory is much better than noop for ext3 with large sequential
> > write tests (i.e, 1G dd test) ...
>
> I suspect that was an accident. Jens?
>
>
On Mon, 2006-01-16 at 09:43 +0100, Jens Axboe wrote:
> On Fri, Jan 13 2006, Andrew Morton wrote:
> > Mingming Cao <[email protected]> wrote:
> > >
> > > On 2.6.14, the
> > > fall back io scheduler (if the chosen io scheduler is not found) is set
> > > to the default io scheduler (anticipatory, in this case), but since
> > > 2.6.15-rc1, this semanistic is changed to fall back to noop.
> >
> > OK. And I assume that AS wasn't compiled, so that's why it fell back?
> >
> > I actually thought that elevator= got removed, now we have
> > /sys/block/sda/queue/scheduler. But I guess that's not very useful with
> > CONFIG_SYSFS=n.
> >
> > > Is there any reason to fall back to noop instead of as? It seems
> > > anticipatory is much better than noop for ext3 with large sequential
> > > write tests (i.e, 1G dd test) ...
> >
> > I suspect that was an accident. Jens?
>
> It is, it makes more sense to fallback to the default of course.
>
How about this one?
In the case where the chosen I/O scheduler (elevator=xxx) is not on the
registered list, falling back to the default scheduler makes more sense.
Signed-off-by: Mingming Cao <[email protected]>
---

 linux-2.6.15-ming/block/elevator.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff -puN block/elevator.c~default-fall-back-scheduler-fix block/elevator.c
--- linux-2.6.15/block/elevator.c~default-fall-back-scheduler-fix  2006-01-16 11:42:04.000446588 -0800
+++ linux-2.6.15-ming/block/elevator.c  2006-01-16 11:42:41.145131615 -0800
@@ -151,12 +151,12 @@ static void elevator_setup_default(void)
 		strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
 
 	/*
-	 * If the given scheduler is not available, fall back to no-op.
+	 * If the given scheduler is not available, fall back to default.
 	 */
 	if ((e = elevator_find(chosen_elevator)))
 		elevator_put(e);
 	else
-		strcpy(chosen_elevator, "noop");
+		strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
 }
 
 static int __init elevator_setup(char *str)
_
On Mon, Jan 16 2006, Mingming Cao wrote:
> On Mon, 2006-01-16 at 09:43 +0100, Jens Axboe wrote:
> > On Fri, Jan 13 2006, Andrew Morton wrote:
> > > Mingming Cao <[email protected]> wrote:
> > > >
> > > > On 2.6.14, the
> > > > fall back io scheduler (if the chosen io scheduler is not found) is set
> > > > to the default io scheduler (anticipatory, in this case), but since
> > > > 2.6.15-rc1, this semanistic is changed to fall back to noop.
> > >
> > > OK. And I assume that AS wasn't compiled, so that's why it fell back?
> > >
> > > I actually thought that elevator= got removed, now we have
> > > /sys/block/sda/queue/scheduler. But I guess that's not very useful with
> > > CONFIG_SYSFS=n.
> > >
> > > > Is there any reason to fall back to noop instead of as? It seems
> > > > anticipatory is much better than noop for ext3 with large sequential
> > > > write tests (i.e, 1G dd test) ...
> > >
> > > I suspect that was an accident. Jens?
> >
> > It is, it makes more sense to fallback to the default of course.
> >
>
> How about this one?
>
>
> In the case the chosen io scheduler (elevator = "xxx") is not on the
> registered list, fall back to default scheduler makes more sense.
>
> Signed-off-by: Mingming Cao <[email protected]>
I think you just recreated an identical patch to the one I added this
morning:
http://brick.kernel.dk/git/?p=linux-2.6-block.git;a=commitdiff;h=b7bfcf7cbd58d2a64aa46f3b4bec921e346e604f
:-)
--
Jens Axboe
On Mon, 2006-01-16 at 20:49 +0100, Jens Axboe wrote:
> On Mon, Jan 16 2006, Mingming Cao wrote:
> > On Mon, 2006-01-16 at 09:43 +0100, Jens Axboe wrote:
> > > On Fri, Jan 13 2006, Andrew Morton wrote:
> > > > Mingming Cao <[email protected]> wrote:
> > > > >
> > > > > On 2.6.14, the
> > > > > fall back io scheduler (if the chosen io scheduler is not found) is set
> > > > > to the default io scheduler (anticipatory, in this case), but since
> > > > > 2.6.15-rc1, this semanistic is changed to fall back to noop.
> > > >
> > > > OK. And I assume that AS wasn't compiled, so that's why it fell back?
> > > >
> > > > I actually thought that elevator= got removed, now we have
> > > > /sys/block/sda/queue/scheduler. But I guess that's not very useful with
> > > > CONFIG_SYSFS=n.
> > > >
> > > > > Is there any reason to fall back to noop instead of as? It seems
> > > > > anticipatory is much better than noop for ext3 with large sequential
> > > > > write tests (i.e, 1G dd test) ...
> > > >
> > > > I suspect that was an accident. Jens?
> > >
> > > It is, it makes more sense to fallback to the default of course.
> > >
> >
> > How about this one?
> >
> >
> > In the case the chosen io scheduler (elevator = "xxx") is not on the
> > registered list, fall back to default scheduler makes more sense.
> >
> > Signed-off-by: Mingming Cao <[email protected]>
>
> I think you just recreated an identical patch to the one I added this
> morning:
>
> http://brick.kernel.dk/git/?p=linux-2.6-block.git;a=commitdiff;h=b7bfcf7cbd58d2a64aa46f3b4bec921e346e604f
>
> :-)
>
Hehe, please ignore mine:-)
On 1/16/06, Jens Axboe <[email protected]> wrote:
> On Fri, Jan 13 2006, Andrew Morton wrote:
> > Mingming Cao <[email protected]> wrote:
> > >
> > > On 2.6.14, the
> > > fall back io scheduler (if the chosen io scheduler is not found) is set
> > > to the default io scheduler (anticipatory, in this case), but since
> > > 2.6.15-rc1, this semanistic is changed to fall back to noop.
> >
> > OK. And I assume that AS wasn't compiled, so that's why it fell back?
> >
> > I actually thought that elevator= got removed, now we have
> > /sys/block/sda/queue/scheduler. But I guess that's not very useful with
> > CONFIG_SYSFS=n.
> >
> > > Is there any reason to fall back to noop instead of as? It seems
> > > anticipatory is much better than noop for ext3 with large sequential
> > > write tests (i.e, 1G dd test) ...
> >
> > I suspect that was an accident. Jens?
>
> It is, it makes more sense to fallback to the default of course.
Not an accident at all, actually, because the original patch I
submitted allowed you to select a scheduler as 'default' even if it
were compiled as a module in Kconfig. Since noop is guaranteed to be
present in any system, it is the obvious choice if the chosen or
default scheduler is not loaded.
If you change it to fall back to the default, it will oops if the
default is not available.
NATE
On 1/16/06, Jens Axboe <[email protected]> wrote:
> On Sat, Jan 14 2006, Chuck Ebbert wrote:
> > In-Reply-To: <[email protected]>
> >
> > On Fri, 13 Jan 2006, Andrew Morton wrote:
> >
> > > OK. And I assume that AS wasn't compiled, so that's why it fell back?
> >
> > As of 2.6.15 you need to use "anticipatory" instead of "as".
> >
> > Maybe this patch would help?
> >
> > Signed-off-by: Chuck Ebbert <[email protected]>
> >
> > --- 2.6.15a.orig/block/elevator.c
> > +++ 2.6.15a/block/elevator.c
> > @@ -150,6 +150,13 @@ static void elevator_setup_default(void)
> > if (!chosen_elevator[0])
> > strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
> >
> > + /*
> > + * Be backwards-compatible with previous kernels, so users
> > + * won't get the wrong elevator.
> > + */
> > + if (!strcmp(chosen_elevator, "as"))
> > + strcpy(chosen_elevator, "anticipatory");
> > +
> > /*
> > * If the given scheduler is not available, fall back to no-op.
> > */
>
> We probably should apply this, since it used to be 'as'.
I agree
NATE
On Thu, Jan 19 2006, Nate Diller wrote:
> On 1/16/06, Jens Axboe <[email protected]> wrote:
> > On Fri, Jan 13 2006, Andrew Morton wrote:
> > > Mingming Cao <[email protected]> wrote:
> > > >
> > > > On 2.6.14, the
> > > > fall back io scheduler (if the chosen io scheduler is not found) is set
> > > > to the default io scheduler (anticipatory, in this case), but since
> > > > 2.6.15-rc1, this semanistic is changed to fall back to noop.
> > >
> > > OK. And I assume that AS wasn't compiled, so that's why it fell back?
> > >
> > > I actually thought that elevator= got removed, now we have
> > > /sys/block/sda/queue/scheduler. But I guess that's not very useful with
> > > CONFIG_SYSFS=n.
> > >
> > > > Is there any reason to fall back to noop instead of as? It seems
> > > > anticipatory is much better than noop for ext3 with large sequential
> > > > write tests (i.e, 1G dd test) ...
> > >
> > > I suspect that was an accident. Jens?
> >
> > It is, it makes more sense to fallback to the default of course.
>
> Not an accident at all, actually, because the original patch i
> submitted allowed you to select a scheduler as 'default' even if it
> were compiled as a module in kconfig. Since noop is guaranteed to be
> present in any system, it is the obvious choice if the chosen or
> default scheduler is not loaded.
Yes, and that was a bug in that patch. The default scheduler must be
built in, that's a given. The Kconfig rules should make selecting a
modular scheduler as the default illegal. And they do; they have since
been fixed.
> If you change it to fall back to the default, it will oops if the
> default is not available.
It must be.
--
Jens Axboe
Jens Axboe wrote:
> On Sat, Jan 14 2006, Chuck Ebbert wrote:
>
>>In-Reply-To: <[email protected]>
>>
>>On Fri, 13 Jan 2006, Andrew Morton wrote:
>>
>>
>>>OK. And I assume that AS wasn't compiled, so that's why it fell back?
>>
>>As of 2.6.15 you need to use "anticipatory" instead of "as".
>>
>>Maybe this patch would help?
>>
>>Signed-off-by: Chuck Ebbert <[email protected]>
>>
>>--- 2.6.15a.orig/block/elevator.c
>>+++ 2.6.15a/block/elevator.c
>>@@ -150,6 +150,13 @@ static void elevator_setup_default(void)
>> if (!chosen_elevator[0])
>> strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
>>
>>+ /*
>>+ * Be backwards-compatible with previous kernels, so users
>>+ * won't get the wrong elevator.
>>+ */
>>+ if (!strcmp(chosen_elevator, "as"))
>>+ strcpy(chosen_elevator, "anticipatory");
>>+
>> /*
>> * If the given scheduler is not available, fall back to no-op.
>> */
>
>
> We probably should apply this, since it used to be 'as'.
>
Just out of curiosity, why did 'as' get renamed to 'anticipatory'?
--
tejun
On Sat, Jan 21 2006, Tejun Heo wrote:
> Jens Axboe wrote:
> >On Sat, Jan 14 2006, Chuck Ebbert wrote:
> >
> >>In-Reply-To: <[email protected]>
> >>
> >>On Fri, 13 Jan 2006, Andrew Morton wrote:
> >>
> >>
> >>>OK. And I assume that AS wasn't compiled, so that's why it fell back?
> >>
> >>As of 2.6.15 you need to use "anticipatory" instead of "as".
> >>
> >>Maybe this patch would help?
> >>
> >>Signed-off-by: Chuck Ebbert <[email protected]>
> >>
> >>--- 2.6.15a.orig/block/elevator.c
> >>+++ 2.6.15a/block/elevator.c
> >>@@ -150,6 +150,13 @@ static void elevator_setup_default(void)
> >> if (!chosen_elevator[0])
> >> strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
> >>
> >>+ /*
> >>+ * Be backwards-compatible with previous kernels, so users
> >>+ * won't get the wrong elevator.
> >>+ */
> >>+ if (!strcmp(chosen_elevator, "as"))
> >>+ strcpy(chosen_elevator, "anticipatory");
> >>+
> >> /*
> >> * If the given scheduler is not available, fall back to no-op.
> >> */
> >
> >
> >We probably should apply this, since it used to be 'as'.
> >
>
> Just out of curiousity, why did 'as' get renamed to 'anticipatory'?
Side effect really, not intentional. 'as' always registered itself with
the elevator core as "anticipatory"; the logic that matched the
elevator=foo string to a scheduler used to be completely separate,
prior to the addition of online switchable elevators. So when those two
were tied together, the "as" name disappeared. I guess the right thing
to do at that time would have been to rename "anticipatory", but that
didn't happen...
--
Jens Axboe
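
The renaming side effect is easy to picture: once the elevator= string is
matched only against the names schedulers register, a scheduler registered
as "anticipatory" can no longer be selected by "as". The structures below
are simplified stand-ins for illustration, not the block layer's
elevator_type:

#include <stdio.h>
#include <string.h>

/* Simplified stand-in for a registered I/O scheduler. */
struct sched {
	const char *name;
};

static const struct sched registered[] = {
	{ "noop" }, { "anticipatory" }, { "deadline" }, { "cfq" },
};

/* Match the boot parameter against the registered names. */
static const struct sched *find_sched(const char *chosen)
{
	size_t i;

	for (i = 0; i < sizeof(registered) / sizeof(registered[0]); i++)
		if (strcmp(registered[i].name, chosen) == 0)
			return &registered[i];
	return NULL;   /* the caller then decides the fallback: default vs. noop */
}

int main(void)
{
	printf("elevator=as           -> %s\n",
	       find_sched("as") ? "found" : "not found");
	printf("elevator=anticipatory -> %s\n",
	       find_sched("anticipatory") ? "found" : "not found");
	return 0;
}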