2013-09-10 02:03:28

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH] ext4: fix dirty pages writback regression.

From: "Yan, Zheng" <[email protected]>

Our Linux Kernel Performance project found that commit 4e7ea81db5
(ext4: restructure writeback path) indroduced regression. After
the commit, ext4 does not merge adjacent mapped dirty pages during
writeback. The "!buffer_delay(bh) && !buffer_unwritten(bh)" check
in mpage_add_bh_to_extent() prevents the merging.

Signed-off-by: Yan, Zheng <[email protected]>
---
fs/ext4/inode.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c79fd7d..bfeb8b2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1944,8 +1944,7 @@ static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
struct ext4_map_blocks *map = &mpd->map;

/* Buffer that doesn't need mapping for writeback? */
- if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
- (!buffer_delay(bh) && !buffer_unwritten(bh))) {
+ if (!buffer_dirty(bh) || !buffer_mapped(bh)) {
/* So far no extent to map => we write the buffer right away */
if (map->m_len == 0)
return true;
--
1.8.1.4



2013-09-10 09:00:47

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix dirty pages writback regression.

On Tue 10-09-13 10:02:58, Yan, Zheng wrote:
> From: "Yan, Zheng" <[email protected]>
>
> Our Linux Kernel Performance project found that commit 4e7ea81db5
> (ext4: restructure writeback path) indroduced regression. After
> the commit, ext4 does not merge adjacent mapped dirty pages during
> writeback. The "!buffer_delay(bh) && !buffer_unwritten(bh)" check
> in mpage_add_bh_to_extent() prevents the merging.
>
> Signed-off-by: Yan, Zheng <[email protected]>
> ---
> fs/ext4/inode.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c79fd7d..bfeb8b2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1944,8 +1944,7 @@ static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
> struct ext4_map_blocks *map = &mpd->map;
>
> /* Buffer that doesn't need mapping for writeback? */
> - if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
> - (!buffer_delay(bh) && !buffer_unwritten(bh))) {
> + if (!buffer_dirty(bh) || !buffer_mapped(bh)) {
Sadly it isn't that easy. The condition is there for a reason... The
reason is that we are looking for an extent to map. When we already have
some buffer to map and then there is buffer which doesn't need mapping we
cannot just add it to the extent because then we would allocate too many
blocks. Also the transaction credits we have reserved are just for
allocation of one extent and its possible conversion from unwritten to
written extent. So that's another reason why you cannot arbitrarily merge
allocated and unallocated buffers or written and unwritten buffers.

Now also I'm somewhat surprised that this condition is causing a regression
because it was also present in the previous version of the code although it
was there in a different place and in a slightly different form. I'll try to
reproduce results using your fio script and will have a look at what is
causing the problem.

Honza

> /* So far no extent to map => we write the buffer right away */
> if (map->m_len == 0)
> return true;
> --
> 1.8.1.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-09-10 09:10:15

by Yan, Zheng

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix dirty pages writback regression.

On 09/10/2013 05:00 PM, Jan Kara wrote:
> On Tue 10-09-13 10:02:58, Yan, Zheng wrote:
>> From: "Yan, Zheng" <[email protected]>
>>
>> Our Linux Kernel Performance project found that commit 4e7ea81db5
>> (ext4: restructure writeback path) indroduced regression. After
>> the commit, ext4 does not merge adjacent mapped dirty pages during
>> writeback. The "!buffer_delay(bh) && !buffer_unwritten(bh)" check
>> in mpage_add_bh_to_extent() prevents the merging.
>>
>> Signed-off-by: Yan, Zheng <[email protected]>
>> ---
>> fs/ext4/inode.c | 3 +--
>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index c79fd7d..bfeb8b2 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1944,8 +1944,7 @@ static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
>> struct ext4_map_blocks *map = &mpd->map;
>>
>> /* Buffer that doesn't need mapping for writeback? */
>> - if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
>> - (!buffer_delay(bh) && !buffer_unwritten(bh))) {
>> + if (!buffer_dirty(bh) || !buffer_mapped(bh)) {
> Sadly it isn't that easy. The condition is there for a reason... The
> reason is that we are looking for an extent to map. When we already have
> some buffer to map and then there is buffer which doesn't need mapping we
> cannot just add it to the extent because then we would allocate too many
> blocks.

the "(b_state & BH_FLAGS) == map->m_flags)" check in mpage_add_bh_to_extent() should
prevent delayed and non-delayed dirty pages from merging. What am I missing here?

Regards
Yan, Zheng

> Also the transaction credits we have reserved are just for
> allocation of one extent and its possible conversion from unwritten to
> written extent. So that's another reason why you cannot arbitrarily merge
> allocated and unallocated buffers or written and unwritten buffers.
>
> Now also I'm somewhat surprised that this condition is causing a regression
> because it was also present in the previous version of the code although it
> was there in a different place and in a slightly different form. I'll try to
> reproduce results using your fio script and will have a look at what is
> causing the problem.
>
> Honza
>
>> /* So far no extent to map => we write the buffer right away */
>> if (map->m_len == 0)
>> return true;
>> --
>> 1.8.1.4
>>


2013-09-10 09:17:17

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix dirty pages writback regression.

On Tue 10-09-13 17:10:13, Yan, Zheng wrote:
> On 09/10/2013 05:00 PM, Jan Kara wrote:
> > On Tue 10-09-13 10:02:58, Yan, Zheng wrote:
> >> From: "Yan, Zheng" <[email protected]>
> >>
> >> Our Linux Kernel Performance project found that commit 4e7ea81db5
> >> (ext4: restructure writeback path) indroduced regression. After
> >> the commit, ext4 does not merge adjacent mapped dirty pages during
> >> writeback. The "!buffer_delay(bh) && !buffer_unwritten(bh)" check
> >> in mpage_add_bh_to_extent() prevents the merging.
> >>
> >> Signed-off-by: Yan, Zheng <[email protected]>
> >> ---
> >> fs/ext4/inode.c | 3 +--
> >> 1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index c79fd7d..bfeb8b2 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -1944,8 +1944,7 @@ static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
> >> struct ext4_map_blocks *map = &mpd->map;
> >>
> >> /* Buffer that doesn't need mapping for writeback? */
> >> - if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
> >> - (!buffer_delay(bh) && !buffer_unwritten(bh))) {
> >> + if (!buffer_dirty(bh) || !buffer_mapped(bh)) {
> > Sadly it isn't that easy. The condition is there for a reason... The
> > reason is that we are looking for an extent to map. When we already have
> > some buffer to map and then there is buffer which doesn't need mapping we
> > cannot just add it to the extent because then we would allocate too many
> > blocks.
>
> the "(b_state & BH_FLAGS) == map->m_flags)" check in
> mpage_add_bh_to_extent() should prevent delayed and non-delayed dirty
> pages from merging. What am I missing here?
Yes, that is true. Sorry, I didn't realize this originally. But what
difference would then your patch make?

Honza

> > Also the transaction credits we have reserved are just for
> > allocation of one extent and its possible conversion from unwritten to
> > written extent. So that's another reason why you cannot arbitrarily merge
> > allocated and unallocated buffers or written and unwritten buffers.
> >
> > Now also I'm somewhat surprised that this condition is causing a regression
> > because it was also present in the previous version of the code although it
> > was there in a different place and in a slightly different form. I'll try to
> > reproduce results using your fio script and will have a look at what is
> > causing the problem.
> >
> > Honza
> >
> >> /* So far no extent to map => we write the buffer right away */
> >> if (map->m_len == 0)
> >> return true;
> >> --
> >> 1.8.1.4
> >>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-09-10 11:01:18

by Yan, Zheng

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix dirty pages writback regression.

On Tue, Sep 10, 2013 at 5:17 PM, Jan Kara <[email protected]> wrote:
> On Tue 10-09-13 17:10:13, Yan, Zheng wrote:
>> On 09/10/2013 05:00 PM, Jan Kara wrote:
>> > On Tue 10-09-13 10:02:58, Yan, Zheng wrote:
>> >> From: "Yan, Zheng" <[email protected]>
>> >>
>> >> Our Linux Kernel Performance project found that commit 4e7ea81db5
>> >> (ext4: restructure writeback path) indroduced regression. After
>> >> the commit, ext4 does not merge adjacent mapped dirty pages during
>> >> writeback. The "!buffer_delay(bh) && !buffer_unwritten(bh)" check
>> >> in mpage_add_bh_to_extent() prevents the merging.
>> >>
>> >> Signed-off-by: Yan, Zheng <[email protected]>
>> >> ---
>> >> fs/ext4/inode.c | 3 +--
>> >> 1 file changed, 1 insertion(+), 2 deletions(-)
>> >>
>> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> >> index c79fd7d..bfeb8b2 100644
>> >> --- a/fs/ext4/inode.c
>> >> +++ b/fs/ext4/inode.c
>> >> @@ -1944,8 +1944,7 @@ static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
>> >> struct ext4_map_blocks *map = &mpd->map;
>> >>
>> >> /* Buffer that doesn't need mapping for writeback? */
>> >> - if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
>> >> - (!buffer_delay(bh) && !buffer_unwritten(bh))) {
>> >> + if (!buffer_dirty(bh) || !buffer_mapped(bh)) {
>> > Sadly it isn't that easy. The condition is there for a reason... The
>> > reason is that we are looking for an extent to map. When we already have
>> > some buffer to map and then there is buffer which doesn't need mapping we
>> > cannot just add it to the extent because then we would allocate too many
>> > blocks.
>>
>> the "(b_state & BH_FLAGS) == map->m_flags)" check in
>> mpage_add_bh_to_extent() should prevent delayed and non-delayed dirty
>> pages from merging. What am I missing here?
> Yes, that is true. Sorry, I didn't realize this originally. But what
> difference would then your patch make?
>

Continuous dirty pages that are "!buffer_delay(bh) && !buffer_unwritten(bh)"
can be merged during writeback. I think the change will reduce number of
bio for workload that re-writes existing data.

Regards
Yan, Zheng

> Honza
>
>> > Also the transaction credits we have reserved are just for
>> > allocation of one extent and its possible conversion from unwritten to
>> > written extent. So that's another reason why you cannot arbitrarily merge
>> > allocated and unallocated buffers or written and unwritten buffers.
>> >
>> > Now also I'm somewhat surprised that this condition is causing a regression
>> > because it was also present in the previous version of the code although it
>> > was there in a different place and in a slightly different form. I'll try to
>> > reproduce results using your fio script and will have a look at what is
>> > causing the problem.
>> >
>> > Honza
>> >
>> >> /* So far no extent to map => we write the buffer right away */
>> >> if (map->m_len == 0)
>> >> return true;
>> >> --
>> >> 1.8.1.4
>> >>
>>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-09-10 11:15:21

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix dirty pages writback regression.

On Tue 10-09-13 19:01:16, Yan, Zheng wrote:
> On Tue, Sep 10, 2013 at 5:17 PM, Jan Kara <[email protected]> wrote:
> > On Tue 10-09-13 17:10:13, Yan, Zheng wrote:
> >> On 09/10/2013 05:00 PM, Jan Kara wrote:
> >> > On Tue 10-09-13 10:02:58, Yan, Zheng wrote:
> >> >> From: "Yan, Zheng" <[email protected]>
> >> >>
> >> >> Our Linux Kernel Performance project found that commit 4e7ea81db5
> >> >> (ext4: restructure writeback path) indroduced regression. After
> >> >> the commit, ext4 does not merge adjacent mapped dirty pages during
> >> >> writeback. The "!buffer_delay(bh) && !buffer_unwritten(bh)" check
> >> >> in mpage_add_bh_to_extent() prevents the merging.
> >> >>
> >> >> Signed-off-by: Yan, Zheng <[email protected]>
> >> >> ---
> >> >> fs/ext4/inode.c | 3 +--
> >> >> 1 file changed, 1 insertion(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> >> index c79fd7d..bfeb8b2 100644
> >> >> --- a/fs/ext4/inode.c
> >> >> +++ b/fs/ext4/inode.c
> >> >> @@ -1944,8 +1944,7 @@ static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
> >> >> struct ext4_map_blocks *map = &mpd->map;
> >> >>
> >> >> /* Buffer that doesn't need mapping for writeback? */
> >> >> - if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
> >> >> - (!buffer_delay(bh) && !buffer_unwritten(bh))) {
> >> >> + if (!buffer_dirty(bh) || !buffer_mapped(bh)) {
> >> > Sadly it isn't that easy. The condition is there for a reason... The
> >> > reason is that we are looking for an extent to map. When we already have
> >> > some buffer to map and then there is buffer which doesn't need mapping we
> >> > cannot just add it to the extent because then we would allocate too many
> >> > blocks.
> >>
> >> the "(b_state & BH_FLAGS) == map->m_flags)" check in
> >> mpage_add_bh_to_extent() should prevent delayed and non-delayed dirty
> >> pages from merging. What am I missing here?
> > Yes, that is true. Sorry, I didn't realize this originally. But what
> > difference would then your patch make?
> >
>
> Continuous dirty pages that are "!buffer_delay(bh) && !buffer_unwritten(bh)"
> can be merged during writeback. I think the change will reduce number of
> bio for workload that re-writes existing data.
I see. The code is actually supposed to achieve that as well - when we
have a sequence of mapped and dirty buffers (pages), we keep map->m_len ==
0 and just always return true from mpage_add_bh_to_extent(). This way
the caller keep adding pages to current bio in ext4_bio_write_page() while
they are contiguous.

However I agree there is something broken somewhere in this logic because I
can reproduce the regression with that commit as well and the request sizes
are somewhat smaller after the patch (not sure if that thing alone can be
the reason for rather big throughput drop). I'm investigating it now.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-09-10 12:53:55

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix dirty pages writback regression.

On Tue 10-09-13 13:15:19, Jan Kara wrote:
> On Tue 10-09-13 19:01:16, Yan, Zheng wrote:
> > On Tue, Sep 10, 2013 at 5:17 PM, Jan Kara <[email protected]> wrote:
> > > On Tue 10-09-13 17:10:13, Yan, Zheng wrote:
> > >> On 09/10/2013 05:00 PM, Jan Kara wrote:
> > >> > On Tue 10-09-13 10:02:58, Yan, Zheng wrote:
> > >> >> From: "Yan, Zheng" <[email protected]>
> > >> >>
> > >> >> Our Linux Kernel Performance project found that commit 4e7ea81db5
> > >> >> (ext4: restructure writeback path) indroduced regression. After
> > >> >> the commit, ext4 does not merge adjacent mapped dirty pages during
> > >> >> writeback. The "!buffer_delay(bh) && !buffer_unwritten(bh)" check
> > >> >> in mpage_add_bh_to_extent() prevents the merging.
> > >> >>
> > >> >> Signed-off-by: Yan, Zheng <[email protected]>
> > >> >> ---
> > >> >> fs/ext4/inode.c | 3 +--
> > >> >> 1 file changed, 1 insertion(+), 2 deletions(-)
> > >> >>
> > >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > >> >> index c79fd7d..bfeb8b2 100644
> > >> >> --- a/fs/ext4/inode.c
> > >> >> +++ b/fs/ext4/inode.c
> > >> >> @@ -1944,8 +1944,7 @@ static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
> > >> >> struct ext4_map_blocks *map = &mpd->map;
> > >> >>
> > >> >> /* Buffer that doesn't need mapping for writeback? */
> > >> >> - if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
> > >> >> - (!buffer_delay(bh) && !buffer_unwritten(bh))) {
> > >> >> + if (!buffer_dirty(bh) || !buffer_mapped(bh)) {
> > >> > Sadly it isn't that easy. The condition is there for a reason... The
> > >> > reason is that we are looking for an extent to map. When we already have
> > >> > some buffer to map and then there is buffer which doesn't need mapping we
> > >> > cannot just add it to the extent because then we would allocate too many
> > >> > blocks.
> > >>
> > >> the "(b_state & BH_FLAGS) == map->m_flags)" check in
> > >> mpage_add_bh_to_extent() should prevent delayed and non-delayed dirty
> > >> pages from merging. What am I missing here?
> > > Yes, that is true. Sorry, I didn't realize this originally. But what
> > > difference would then your patch make?
> > >
> >
> > Continuous dirty pages that are "!buffer_delay(bh) && !buffer_unwritten(bh)"
> > can be merged during writeback. I think the change will reduce number of
> > bio for workload that re-writes existing data.
> I see. The code is actually supposed to achieve that as well - when we
> have a sequence of mapped and dirty buffers (pages), we keep map->m_len ==
> 0 and just always return true from mpage_add_bh_to_extent(). This way
> the caller keep adding pages to current bio in ext4_bio_write_page() while
> they are contiguous.
>
> However I agree there is something broken somewhere in this logic because I
> can reproduce the regression with that commit as well and the request sizes
> are somewhat smaller after the patch (not sure if that thing alone can be
> the reason for rather big throughput drop). I'm investigating it now.
I've tracked down the problem. It is that we were too aggressive writing
back pages and thus we were finding less consecutive dirty pages (with
random IO workloads the longer you wait with writeback the better IO
pattern you get). And the writeback was too aggressive because we weren't
properly terminating the writeback when nr_to_write dropped to zero. After
I fixed that condition I'm getting about 25% better throughput then before
code reorg... I'll be sending the patch with numbers etc. later this
afternoon.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR