Hi,
In ext3-reservations-window-allocation-fix.patch from -mm, we try to
make sure that we always search for a new reservation from the goal
forwards, not just from the previous window forwards. I'm assuming this
is done to optimise random writes.
I'm still not convinced we get it right. In alloc_new_reservation(), we
do:
if ((my_rsv->rsv_start <= group_end_block) &&
(my_rsv->rsv_end > group_end_block))
return -1;
We get into alloc_new_reservation in the first place either when the
goal is outside the window, or we could not allocate inside the window.
Now, in the latter case, the check is correct --- if the window spans
two block groups and we couldn't allocate in the first block group, then
we should continue in the next one.
But what if the goal was in the current block group, but was *prior* to
the window? The goal is outside the window, yet the above check may
still be true, and we'll incorrectly decide to avoid the current block
group entirely.
I think we need an "&& start_block >= my_rsv->rsv_start" to deal with
this.
If we get past that test --- the reservation window is all entirely
inside one group --- then we have the following chunk in
xt3-reservations-window-allocation-fix.patch:
- /* remember where we are before we discard the old one */
- if (my_rsv->rsv_end + 1 > start_block)
- start_block = my_rsv->rsv_end + 1;
- search_head = my_rsv;
+ search_head = search_reserve_window(&my_rsv->rsv_node, start_block);
which I'm assuming is trying to pin the search start to the goal block.
But that's wrong --- search_reserve_window does a downwards-traversing
tree search, and needs to be given the root of the tree in order to find
a given block. Giving it the current reservation node as the search
start point is not going to allow it to find the right node in all
cases.
Have I misunderstood something?
Fortunately, none of the above should affect the normal hot path of
sequential allocation, but it may well penalise random writes.
Cheers,
Stephen
On Fri, 2004-10-15 at 06:27, Stephen C. Tweedie wrote:
> Hi,
>
> In ext3-reservations-window-allocation-fix.patch from -mm, we try to
> make sure that we always search for a new reservation from the goal
> forwards, not just from the previous window forwards. I'm assuming this
> is done to optimise random writes.
>
> I'm still not convinced we get it right. In alloc_new_reservation(), we
> do:
>
> if ((my_rsv->rsv_start <= group_end_block) &&
> (my_rsv->rsv_end > group_end_block))
> return -1;
>
> We get into alloc_new_reservation in the first place either when the
> goal is outside the window, or we could not allocate inside the window.
>
> Now, in the latter case, the check is correct --- if the window spans
> two block groups and we couldn't allocate in the first block group, then
> we should continue in the next one.
>
> But what if the goal was in the current block group, but was *prior* to
> the window? The goal is outside the window, yet the above check may
> still be true, and we'll incorrectly decide to avoid the current block
> group entirely.
>
> I think we need an "&& start_block >= my_rsv->rsv_start" to deal with
> this.
>
You are right. I think at the beginning we were decide the honor the the
reservation window if the goal is out of the window. Considering the
goal is selected with good reason, we agreed that we should try honor
the goal block all the time.But we missed this case apparently.
> If we get past that test --- the reservation window is all entirely
> inside one group --- then we have the following chunk in
> xt3-reservations-window-allocation-fix.patch:
>
> - /* remember where we are before we discard the old one */
> - if (my_rsv->rsv_end + 1 > start_block)
> - start_block = my_rsv->rsv_end + 1;
> - search_head = my_rsv;
> + search_head = search_reserve_window(&my_rsv->rsv_node, start_block);
>
> which I'm assuming is trying to pin the search start to the goal block.
> But that's wrong --- search_reserve_window does a downwards-traversing
> tree search, and needs to be given the root of the tree in order to find
> a given block. Giving it the current reservation node as the search
> start point is not going to allow it to find the right node in all
> cases.
>
> Have I misunderstood something?
>
You are correct, again:) We should do a search_reserve_window() from the
root.
I will post a fix for these two soon.
Mingming
Hi,
On Fri, 2004-10-15 at 17:01, mingming cao wrote:
> > Have I misunderstood something?
> >
> You are correct, again:) We should do a search_reserve_window() from the
> root.
>
> I will post a fix for these two soon.
Thanks. I'll be away for a few days so I probably won't be able to look
at the fix until Wednesday next week.
Cheers,
Stephen
On Fri, 2004-10-15 at 09:40, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2004-10-15 at 17:01, mingming cao wrote:
>
> > > Have I misunderstood something?
> > >
> > You are correct, again:) We should do a search_reserve_window() from the
> > root.
> >
> > I will post a fix for these two soon.
>
> Thanks. I'll be away for a few days so I probably won't be able to look
> at the fix until Wednesday next week.
>
How about this? Haven't test it, will do it shortly.:)
Thanks,
Mingming
Hi,
On Fri, 2004-10-15 at 21:29, mingming cao wrote:
> How about this? Haven't test it, will do it shortly.:)
if ((my_rsv->rsv_start <= group_end_block) &&
- (my_rsv->rsv_end > group_end_block))
+ (my_rsv->rsv_end > group_end_block) &&
+ (start_block <= my_rsv->rsv_start))
return -1;
That new "<=" should be a ">=", shouldn't it?
Otherwise, looks OK.
--Stephen
On Fri, 2004-10-15 at 15:20, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2004-10-15 at 21:29, mingming cao wrote:
>
> > How about this? Haven't test it, will do it shortly.:)
>
> if ((my_rsv->rsv_start <= group_end_block) &&
> - (my_rsv->rsv_end > group_end_block))
> + (my_rsv->rsv_end > group_end_block) &&
> + (start_block <= my_rsv->rsv_start))
> return -1;
>
> That new "<=" should be a ">=", shouldn't it?
My bad! It should be ">=":)
Updated patch attached.
Thanks!
Mingming
Allow user shut down reservation-based allocation(using ioctl) on a specific file(e.g. for seeky random write).
---
linux-2.6.9-rc4-mm1-ming/fs/ext3/balloc.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)
diff -puN fs/ext3/balloc.c~ext3_shutdown_reservation_per-file fs/ext3/balloc.c
--- linux-2.6.9-rc4-mm1/fs/ext3/balloc.c~ext3_shutdown_reservation_per-file 2004-10-18 22:27:06.333698488 -0700
+++ linux-2.6.9-rc4-mm1-ming/fs/ext3/balloc.c 2004-10-18 22:34:52.825780912 -0700
@@ -1153,6 +1153,7 @@ int ext3_new_block(handle_t *handle, str
struct ext3_super_block *es;
struct ext3_sb_info *sbi;
struct reserve_window_node *my_rsv = NULL;
+ struct reserve_window_node *rsv = &EXT3_I(inode)->i_rsv_window;
#ifdef EXT3FS_DEBUG
static int goal_hits, goal_attempts;
#endif
@@ -1176,8 +1177,18 @@ int ext3_new_block(handle_t *handle, str
sbi = EXT3_SB(sb);
es = EXT3_SB(sb)->s_es;
ext3_debug("goal=%lu.\n", goal);
- if (test_opt(sb, RESERVATION) && S_ISREG(inode->i_mode))
- my_rsv = &EXT3_I(inode)->i_rsv_window;
+ /*
+ * Allocate a block from reservation only when
+ * filesystem is mounted with reservation(default,-o reservation), and
+ * it's a regular file, and
+ * the desired window size is greater than 0 (One could use ioctl
+ * command EXT3_IOC_SETRSVSZ to set the window size to 0 to turn off
+ * reservation on that particular file)
+ */
+ if (test_opt(sb, RESERVATION) &&
+ S_ISREG(inode->i_mode) &&
+ (atomic_read(&rsv->rsv_goal_size) > 0))
+ my_rsv = rsv;
if (!ext3_has_free_blocks(sbi)) {
*errp = -ENOSPC;
goto out;
_
Mingming Cao <[email protected]> wrote:
>
> Allow user shut down reservation-based allocation(using ioctl) on a specific file(e.g. for seeky random write).
Applications currently pass a seeky-access hint into the kernel via
posix_fadvise(POSIX_FADV_RANDOM). It would be nice to hook into that
rather than adding an ext3-specific ioctl. Maybe just peeking at
file->f_ra.ra_pages would suffice.
On Mon, 2004-10-18 at 16:42, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > Allow user shut down reservation-based allocation(using ioctl) on a specific file(e.g. for seeky random write).
>
> Applications currently pass a seeky-access hint into the kernel via
> posix_fadvise(POSIX_FADV_RANDOM). It would be nice to hook into that
> rather than adding an ext3-specific ioctl. Maybe just peeking at
> file->f_ra.ra_pages would suffice.
>
>
Ha, I think I did not make this clear in the description: The patch did
not add a new ext3-specific ioctl. We added two ioctl before for ext3
reservation code to allow user to get and set reservation window size,
so application could set it's desired reservation window size. It allows
the window size to be 0, however the current reservation code just
simply set the window size value to 0, but still try to allocate a size
0 reservation window. We should skip doing reservation-based allocation
at all if the desired window size is 0.
Just thought seeky random write application could use the existing ioctl
to let the kernel know it does not need reservation at all. Isn't that
more straightforward?
On Mon, 2004-10-18 at 18:01 -0700, Mingming Cao wrote:
> On Mon, 2004-10-18 at 16:42, Andrew Morton wrote:
> > Applications currently pass a seeky-access hint into the kernel via
> > posix_fadvise(POSIX_FADV_RANDOM). It would be nice to hook into that
[...]
> Just thought seeky random write application could use the existing ioctl
> to let the kernel know it does not need reservation at all. Isn't that
> more straightforward?
Going the ioctl route seems to imply that userspace would have to do a
posix_fadvise() call and the ioctl, as opposed to just the fadvise. No?
I'm betting the fadvise call is a little more portable as well.
Ray
On Wed, 2004-10-20 at 10:55, Ray Lee wrote:
> On Mon, 2004-10-18 at 18:01 -0700, Mingming Cao wrote:
> > On Mon, 2004-10-18 at 16:42, Andrew Morton wrote:
> > > Applications currently pass a seeky-access hint into the kernel via
> > > posix_fadvise(POSIX_FADV_RANDOM). It would be nice to hook into that
>
> [...]
>
> > Just thought seeky random write application could use the existing ioctl
> > to let the kernel know it does not need reservation at all. Isn't that
> > more straightforward?
>
> Going the ioctl route seems to imply that userspace would have to do a
> posix_fadvise() call and the ioctl, as opposed to just the fadvise. No?
> I'm betting the fadvise call is a little more portable as well.
Agreed. How about this: add a check in ext3_prepare_write(), if user passed seeky-access hint through posix_fadvise(via check for file->f_ra.ra_pages == 0), if so, close the reservation window. But we still need previous patch to handle window size 0(no reservation) in reservation code.
Currently the readahead is turned off if the userspace passed a seeky access hint to kernel by POSIX_FADVISE. It would be nice to turn off the block allocation reservation as well for seeky random write.
---
linux-2.6.9-rc4-mm1-ming/fs/ext3/inode.c | 9 +++++++++
1 files changed, 9 insertions(+)
diff -puN fs/ext3/inode.c~ext3_turn_off_reservation_by_fadvise fs/ext3/inode.c
--- linux-2.6.9-rc4-mm1/fs/ext3/inode.c~ext3_turn_off_reservation_by_fadvise 2004-10-25 18:25:44.135751800 -0700
+++ linux-2.6.9-rc4-mm1-ming/fs/ext3/inode.c 2004-10-25 18:34:39.501363856 -0700
@@ -1008,6 +1008,15 @@ retry:
ret = PTR_ERR(handle);
goto out;
}
+
+ /*
+ * if user passed a seeky-access hint to kernel,
+ * through POSIX_FADV_RANDOM,(file->r_ra.ra_pages is cleared)
+ * turn off reservation for block allocation correspondingly.
+ */
+ if (!file->f_ra.ra_pages)
+ atomic_set(&EXT3_I(inode)->i_rsv_window.rsv_goal_size, 0);
+
ret = block_prepare_write(page, from, to, ext3_get_block);
if (ret)
goto prepare_write_failed;
_
On Mon, 2004-10-25 at 13:33, Mingming Cao wrote:
> On Wed, 2004-10-20 at 10:55, Ray Lee wrote:
> > On Mon, 2004-10-18 at 18:01 -0700, Mingming Cao wrote:
> > > On Mon, 2004-10-18 at 16:42, Andrew Morton wrote:
> > > > Applications currently pass a seeky-access hint into the kernel via
> > > > posix_fadvise(POSIX_FADV_RANDOM). It would be nice to hook into that
> >
> > [...]
> >
> > > Just thought seeky random write application could use the existing ioctl
> > > to let the kernel know it does not need reservation at all. Isn't that
> > > more straightforward?
> >
> > Going the ioctl route seems to imply that userspace would have to do a
> > posix_fadvise() call and the ioctl, as opposed to just the fadvise. No?
> > I'm betting the fadvise call is a little more portable as well.
>
> Agreed. How about this: add a check in ext3_prepare_write(), if user passed seeky-access hint through posix_fadvise(via check for file->f_ra.ra_pages == 0), if so, close the reservation window. But we still need previous patch to handle window size 0(no reservation) in reservation code.
The previous patch could re-open the reservation in the case user then
switch back to POSIX_FADV_NORMAL or POSIZ_FADV_SEQUENTAIL. Updated
version included. We set the window size to -1 if the user pass the
POSIX_FADV_RANDOM, instead of 0, to differentiate the case when user
really want to close the window by ioctl. We could re-open the window
only when user pass the POSIX_FADV_SEQUENTAIL (or POSIX_FADV_NORMAL
hint) and the window was closed because of previous seeky access hint.
Currently the readahead is turned off if the userspace passed a seeky access hint to kernel by POSIX_FADVISE. It would be nice to turn off the block allocation reservation as well for seeky random write.
---
linux-2.6.9-rc4-mm1-ming/fs/ext3/balloc.c | 2 +-
linux-2.6.9-rc4-mm1-ming/fs/ext3/inode.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+), 1 deletion(-)
diff -puN fs/ext3/inode.c~ext3_turn_off_reservation_by_fadvise fs/ext3/inode.c
--- linux-2.6.9-rc4-mm1/fs/ext3/inode.c~ext3_turn_off_reservation_by_fadvise 2004-10-25 18:25:44.135751800 -0700
+++ linux-2.6.9-rc4-mm1-ming/fs/ext3/inode.c 2004-10-25 21:43:16.194964928 -0700
@@ -1001,6 +1001,7 @@ static int ext3_prepare_write(struct fil
int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
handle_t *handle;
int retries = 0;
+ int windowsz = 0;
retry:
handle = ext3_journal_start(inode, needed_blocks);
@@ -1008,6 +1009,22 @@ retry:
ret = PTR_ERR(handle);
goto out;
}
+
+ /*
+ * if user passed a seeky-access hint to kernel,
+ * through POSIX_FADV_RANDOM,(file->r_ra.ra_pages is cleared)
+ * turn off reservation for block allocation correspondingly.
+ *
+ * Otherwise, if user switch back to POSIX_FADV_SEQUENTIAL or
+ * POSIX_FADV_NORMAL, re-open the reservation window.
+ */
+ windowsz = atomic_read(&EXT3_I(inode)->i_rsv_window.rsv_goal_size);
+ if ((windowsz > 0) && (!file->f_ra.ra_pages))
+ atomic_set(&EXT3_I(inode)->i_rsv_window.rsv_goal_size, -1);
+ if ((windowsz == -1) && file->f_ra.ra_pages)
+ atomic_set(&EXT3_I(inode)->i_rsv_window.rsv_goal_size,
+ EXT3_DEFAULT_RESERVE_BLOCKS);
+
ret = block_prepare_write(page, from, to, ext3_get_block);
if (ret)
goto prepare_write_failed;
diff -puN fs/ext3/balloc.c~ext3_turn_off_reservation_by_fadvise fs/ext3/balloc.c
--- linux-2.6.9-rc4-mm1/fs/ext3/balloc.c~ext3_turn_off_reservation_by_fadvise 2004-10-25 21:23:05.152071432 -0700
+++ linux-2.6.9-rc4-mm1-ming/fs/ext3/balloc.c 2004-10-25 21:24:48.136415432 -0700
@@ -1154,7 +1154,7 @@ int ext3_new_block(handle_t *handle, str
struct ext3_sb_info *sbi;
struct reserve_window_node *my_rsv = NULL;
struct reserve_window_node *rsv = &EXT3_I(inode)->i_rsv_window;
- unsigned short windowsz = 0;
+ int windowsz = 0;
#ifdef EXT3FS_DEBUG
static int goal_hits, goal_attempts;
#endif
_
Mingming Cao <[email protected]> wrote:
>
> /*
> + * if user passed a seeky-access hint to kernel,
> + * through POSIX_FADV_RANDOM,(file->r_ra.ra_pages is cleared)
> + * turn off reservation for block allocation correspondingly.
> + *
> + * Otherwise, if user switch back to POSIX_FADV_SEQUENTIAL or
> + * POSIX_FADV_NORMAL, re-open the reservation window.
> + */
> + windowsz = atomic_read(&EXT3_I(inode)->i_rsv_window.rsv_goal_size);
> + if ((windowsz > 0) && (!file->f_ra.ra_pages))
> + atomic_set(&EXT3_I(inode)->i_rsv_window.rsv_goal_size, -1);
> + if ((windowsz == -1) && file->f_ra.ra_pages)
> + atomic_set(&EXT3_I(inode)->i_rsv_window.rsv_goal_size,
> + EXT3_DEFAULT_RESERVE_BLOCKS);
> +
It's pretty sad that we add this extra code into ext3_prepare_write() -
it's almost never actually executed.
I wonder how important this optimisation really is? I bet no applications
are using posix_fadvise(POSIX_FADV_RANDOM) anyway.
It we really see that benefits are available from this approach we probably
need to bite the bullet and add file_operations.fadvise()
On Mon, 2004-10-25 at 16:45, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > /*
> > + * if user passed a seeky-access hint to kernel,
> > + * through POSIX_FADV_RANDOM,(file->r_ra.ra_pages is cleared)
> > + * turn off reservation for block allocation correspondingly.
> > + *
> > + * Otherwise, if user switch back to POSIX_FADV_SEQUENTIAL or
> > + * POSIX_FADV_NORMAL, re-open the reservation window.
> > + */
> > + windowsz = atomic_read(&EXT3_I(inode)->i_rsv_window.rsv_goal_size);
> > + if ((windowsz > 0) && (!file->f_ra.ra_pages))
> > + atomic_set(&EXT3_I(inode)->i_rsv_window.rsv_goal_size, -1);
> > + if ((windowsz == -1) && file->f_ra.ra_pages)
> > + atomic_set(&EXT3_I(inode)->i_rsv_window.rsv_goal_size,
> > + EXT3_DEFAULT_RESERVE_BLOCKS);
> > +
>
> It's pretty sad that we add this extra code into ext3_prepare_write() -
> it's almost never actually executed.
>
:(
> I wonder how important this optimisation really is? I bet no applications
> are using posix_fadvise(POSIX_FADV_RANDOM) anyway.
>
I don't know if there is application using the POSIX_FADV_RANDOM. No? If
this is the truth, I think we don't need this optimization at present.
Logically reservation does not benefit seeky random write, but there is
no benchmark showing performance issue so far. We have already provided
ways for applications turn off reservation through the existing ioctl
for specified file and -o noreservation mount option for the whole
filesystem.
> It we really see that benefits are available from this approach we probably
> need to bite the bullet and add file_operations.fadvise()
>
I agree.
Mingming
Mingming Cao <[email protected]> wrote:
>
> > I wonder how important this optimisation really is? I bet no applications
> > are using posix_fadvise(POSIX_FADV_RANDOM) anyway.
> >
> I don't know if there is application using the POSIX_FADV_RANDOM. No? If
> this is the truth, I think we don't need this optimization at present.
> Logically reservation does not benefit seeky random write, but there is
> no benchmark showing performance issue so far. We have already provided
> ways for applications turn off reservation through the existing ioctl
> for specified file and -o noreservation mount option for the whole
> filesystem.
Well we definitely don't want to be encouraging application developers to be
adding ext3-specific ioctls. So we need to work out if any applications
can get significant benefit from manually disabling reservations and if
so, wire up fadvise() into filesystems and do it that way.
Do you know if disabling reservations helps any workloads?
On Tue, 2004-10-26 at 13:18, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > > I wonder how important this optimisation really is? I bet no applications
> > > are using posix_fadvise(POSIX_FADV_RANDOM) anyway.
> > >
> > I don't know if there is application using the POSIX_FADV_RANDOM. No? If
> > this is the truth, I think we don't need this optimization at present.
> > Logically reservation does not benefit seeky random write, but there is
> > no benchmark showing performance issue so far. We have already provided
> > ways for applications turn off reservation through the existing ioctl
> > for specified file and -o noreservation mount option for the whole
> > filesystem.
>
> Well we definitely don't want to be encouraging application developers to be
> adding ext3-specific ioctls. So we need to work out if any applications
> can get significant benefit from manually disabling reservations and if
> so, wire up fadvise() into filesystems and do it that way.
>
Okey. Also, if there is such a need to disable reservation, maybe we
could think about closing the reservation window dynamically based on
past I/O pattern, when user application did not give any hint.
> Do you know if disabling reservations helps any workloads?
>
Not that I aware of. We have done several microbenchmark on sequential
and random workload, haven't seen regression with reservations so far.
But our testing is certainly not comprehensive.
Mingming