LinuxLists.cc - Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

2015-01-28 20:17:51

Subject: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

Hi
I am working on a 64 bit Android device and have been trying to improve
performance for stream based data download (for example an ftp)
The device has 3GB of ram and the dirty_ratio and dirty_background_ratio
are set to 5 and 1 respectively.

Kernel 3.10 , Highmem is not enabled and the backing device is a emmc
and checksumming is not enabled

I noticed when profiling writes that if we dont use streamed IO (ie. use
write of whatever size data was read on the tcp stream) there are some
writes that seem to get blocked on
wait_for_stable_page.

If I force the writes to be buffered in the userspace and ensure writing
4k chunks the writes never seem to stall.

I noticed there was earlier discussion on this and idea were proposed to
use snapshotting of the pages to avoid stalls...
For example: https://lwn.net/Articles/546658/

But this seems to only snapshot ext3 ... (unless i misunderstood what
the patch is doing)

Is there a similar patch to snapshot the buffers to not stall the writes
for ext4?

Please let me know.

I would really appreciate any help you can give me.

--
Thanks
Nikhilesh Reddy

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2015-01-28 20:42:52

by Nikhilesh Reddy

[permalink] [raw]

Subject: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

2015-01-29 01:32:17

by Darrick J. Wong

[permalink] [raw]

Subject: Re: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

On Wed, Jan 28, 2015 at 11:27:13AM -0800, Nikhilesh Reddy wrote:
> Hi
> I am working on a 64 bit Android device and have been trying to
> improve performance for stream based data download (for example an
> ftp)
> The device has 3GB of ram and the dirty_ratio and
> dirty_background_ratio are set to 5 and 1 respectively.
>
> Kernel 3.10 , Highmem is not enabled and the backing device is a
> emmc and checksumming is not enabled

Ok, 3.10 kernel is new enough that stable page writes only apply to
devices that demand it, and apparently your eMMC demands it.

> I noticed when profiling writes that if we dont use streamed IO (ie.
> use write of whatever size data was read on the tcp stream) there
> are some writes that seem to get blocked on
> wait_for_stable_page.
>
> If I force the writes to be buffered in the userspace and ensure
> writing 4k chunks the writes never seem to stall.

That's consistent with a page being partially dirtied, written out,
and partially dirtied again before write-out finishes. If you buffer
the incoming data such that a page is only dirtied once, you'll never
notice wait_for_stable_page.

Are you explicitly forcing writeout (i.e. fsync()) after every chunk
arrives? Or, is the rate of incoming data high enough such that we
hit either dirty*ratio limit? It isn't too hard to hit 30MB these
days. Why are you lowering the ratios from their defaults?

> I noticed there was earlier discussion on this and idea were
> proposed to use snapshotting of the pages to avoid stalls...
> For example: https://lwn.net/Articles/546658/
>
> But this seems to only snapshot ext3 ... (unless i misunderstood
> what the patch is doing)
>
> Is there a similar patch to snapshot the buffers to not stall the
> writes for ext4?

No, there is not -- the problem with the snapshot solution is that it
requires page allocations when the FS is (potentially) trying to
reclaim memory by writing out dirty pages.

--D

> Please let me know.
>
> I would really appreciate any help you can give me.
>
>
> --
> Thanks
> Nikhilesh Reddy
>
> Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-01-29 01:52:53

by Nikhilesh Reddy

[permalink] [raw]

Subject: Re: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

Hi Darrick
Thanks so much for your reply.
I apologize the stable page wait seems to be an odd outlier in that
run ... I ran multiple traces again.

Please find my answers inline below.

On 01/28/2015 01:39 PM, Darrick J. Wong wrote:
> On Wed, Jan 28, 2015 at 11:27:13AM -0800, Nikhilesh Reddy wrote:
>> Hi
>> I am working on a 64 bit Android device and have been trying to
>> improve performance for stream based data download (for example an
>> ftp)
>> The device has 3GB of ram and the dirty_ratio and
>> dirty_background_ratio are set to 5 and 1 respectively.
>>
>> Kernel 3.10 , Highmem is not enabled and the backing device is a
>> emmc and checksumming is not enabled
> Ok, 3.10 kernel is new enough that stable page writes only apply to
> devices that demand it, and apparently your eMMC demands it.
I tried checking my logs again .. and yes the stable pages dont apply
... that seems to be an odd instance where it took a while.
/My apologies.

So On running many more trace runs it appears that the wait is actually
on in the function

wait_on_page_writeback(page) ext4_da_write_begin

Snippet below

/*
* grab_cache_page_write_begin() can take a long time if the
* system is thrashing due to memory pressure, or if the page
* is being written back. So grab it first before we start
* the transaction handle. This also allows us to allocate
* the page (if needed) without using GFP_NOFS.
*/
retry_grab:
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
unlock_page(page);

retry_journal:
handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
if (IS_ERR(handle)) {
page_cache_release(page);
return PTR_ERR(handle);
}

lock_page(page);
if (page->mapping != mapping) {
/* The page got truncated from under us */
unlock_page(page);
page_cache_release(page);
ext4_journal_stop(handle);
goto retry_grab;
}
wait_on_page_writeback(page); <<<<<<<<<<<<<<<<<<This is where the wait is!

The writeback just seems to take long.. but only when we write to the
same page over again...
The grab_cache_page_write_begin gives the same page until we write
enough...

Is there a wait to not wait for the writeback?

>
>> I noticed when profiling writes that if we dont use streamed IO (ie.
>> use write of whatever size data was read on the tcp stream) there
>> are some writes that seem to get blocked on
>> wait_for_stable_page.
>>
>> If I force the writes to be buffered in the userspace and ensure
>> writing 4k chunks the writes never seem to stall.
> That's consistent with a page being partially dirtied, written out,
> and partially dirtied again before write-out finishes. If you buffer
> the incoming data such that a page is only dirtied once, you'll never
> notice wait_for_stable_page.
>
> Are you explicitly forcing writeout (i.e. fsync()) after every chunk
> arrives? Or, is the rate of incoming data high enough such that we
> hit either dirty*ratio limit? It isn't too hard to hit 30MB these
> days. Why are you lowering the ratios from their defaults?
Using the higher values seems to cause longer stall ...when we do stall...
Smaller values are cause the write out to happen often but each one
takes less time.

Setting it to the defaults causes a longer stall which seems to be
impacting TCP due to the delay in reading.
Additional info : dirty_expire_centisecs was set to 200 and the data
rate is close to 300Mbps. so yea 30MB should be less than a second.

Am I missing something obvious here?

Please do let me know.

Thanks so much for you help

>
>> I noticed there was earlier discussion on this and idea were
>> proposed to use snapshotting of the pages to avoid stalls...
>> For example: https://lwn.net/Articles/546658/
>>
>> But this seems to only snapshot ext3 ... (unless i misunderstood
>> what the patch is doing)
>>
>> Is there a similar patch to snapshot the buffers to not stall the
>> writes for ext4?
> No, there is not -- the problem with the snapshot solution is that it
> requires page allocations when the FS is (potentially) trying to
> reclaim memory by writing out dirty pages.
>
> --D
>
>> Please let me know.
>>
>> I would really appreciate any help you can give me.
>>
>>
>> --
>> Thanks
>> Nikhilesh Reddy
>>
>> Qualcomm Innovation Center, Inc.
>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
>> a Linux Foundation Collaborative Project.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Thanks
Nikhilesh Reddy

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2015-01-29 02:02:53

by Nikhilesh Reddy

[permalink] [raw]

Subject: Re: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

Also can someone please confirm if the following patch help address
this issue That I described below?

https://kernel.googlesource.com/pub/scm/linux/kernel/git/tytso/ext4/+/7afe5aa59ed3da7b6161617e7f157c7c680dc41e

Quick Summary:

in the function
wait_on_page_writeback(page) inside ext4_da_write_begin seems to take
long and the process is stuck waiting.

The writeback just seems to take long.. but only when we write to the
same page over again...
The grab_cache_page_write_begin gives the same page until we write
enough...
Is there a wait to not wait for the writeback?

On 01/28/2015 03:23 PM, Nikhilesh Reddy wrote:
>
> Hi Darrick
> Thanks so much for your reply.
> I apologize the stable page wait seems to be an odd outlier in that
> run ... I ran multiple traces again.
>
> Please find my answers inline below.
>
> On 01/28/2015 01:39 PM, Darrick J. Wong wrote:
>> On Wed, Jan 28, 2015 at 11:27:13AM -0800, Nikhilesh Reddy wrote:
>>> Hi
>>> I am working on a 64 bit Android device and have been trying to
>>> improve performance for stream based data download (for example an
>>> ftp)
>>> The device has 3GB of ram and the dirty_ratio and
>>> dirty_background_ratio are set to 5 and 1 respectively.
>>>
>>> Kernel 3.10 , Highmem is not enabled and the backing device is a
>>> emmc and checksumming is not enabled
>> Ok, 3.10 kernel is new enough that stable page writes only apply to
>> devices that demand it, and apparently your eMMC demands it.
> I tried checking my logs again .. and yes the stable pages dont apply
> ... that seems to be an odd instance where it took a while.
> /My apologies.
>
> So On running many more trace runs it appears that the wait is actually
> on in the function
>
> wait_on_page_writeback(page) ext4_da_write_begin
>
> Snippet below
>
> /*
> * grab_cache_page_write_begin() can take a long time if the
> * system is thrashing due to memory pressure, or if the page
> * is being written back. So grab it first before we start
> * the transaction handle. This also allows us to allocate
> * the page (if needed) without using GFP_NOFS.
> */
> retry_grab:
> page = grab_cache_page_write_begin(mapping, index, flags);
> if (!page)
> return -ENOMEM;
> unlock_page(page);
>
> retry_journal:
> handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
> if (IS_ERR(handle)) {
> page_cache_release(page);
> return PTR_ERR(handle);
> }
>
> lock_page(page);
> if (page->mapping != mapping) {
> /* The page got truncated from under us */
> unlock_page(page);
> page_cache_release(page);
> ext4_journal_stop(handle);
> goto retry_grab;
> }
> wait_on_page_writeback(page); <<<<<<<<<<<<<<<<<<This is where the
> wait is!
>
>
>
> The writeback just seems to take long.. but only when we write to the
> same page over again...
> The grab_cache_page_write_begin gives the same page until we write
> enough...
>
> Is there a wait to not wait for the writeback?
>
>
>>
>>> I noticed when profiling writes that if we dont use streamed IO (ie.
>>> use write of whatever size data was read on the tcp stream) there
>>> are some writes that seem to get blocked on
>>> wait_for_stable_page.
>>>
>>> If I force the writes to be buffered in the userspace and ensure
>>> writing 4k chunks the writes never seem to stall.
>> That's consistent with a page being partially dirtied, written out,
>> and partially dirtied again before write-out finishes. If you buffer
>> the incoming data such that a page is only dirtied once, you'll never
>> notice wait_for_stable_page.
>>
>> Are you explicitly forcing writeout (i.e. fsync()) after every chunk
>> arrives? Or, is the rate of incoming data high enough such that we
>> hit either dirty*ratio limit? It isn't too hard to hit 30MB these
>> days. Why are you lowering the ratios from their defaults?
> Using the higher values seems to cause longer stall ...when we do stall...
> Smaller values are cause the write out to happen often but each one
> takes less time.
>
> Setting it to the defaults causes a longer stall which seems to be
> impacting TCP due to the delay in reading.
> Additional info : dirty_expire_centisecs was set to 200 and the data
> rate is close to 300Mbps. so yea 30MB should be less than a second.
>
> Am I missing something obvious here?
>
> Please do let me know.
>
> Thanks so much for you help
>
>
>>
>>> I noticed there was earlier discussion on this and idea were
>>> proposed to use snapshotting of the pages to avoid stalls...
>>> For example: https://lwn.net/Articles/546658/
>>>
>>> But this seems to only snapshot ext3 ... (unless i misunderstood
>>> what the patch is doing)
>>>
>>> Is there a similar patch to snapshot the buffers to not stall the
>>> writes for ext4?
>> No, there is not -- the problem with the snapshot solution is that it
>> requires page allocations when the FS is (potentially) trying to
>> reclaim memory by writing out dirty pages.
>>
>> --D
>>
>>> Please let me know.
>>>
>>> I would really appreciate any help you can give me.
>>>
>>>
>>> --
>>> Thanks
>>> Nikhilesh Reddy
>>>
>>> Qualcomm Innovation Center, Inc.
>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
>>> Forum,
>>> a Linux Foundation Collaborative Project.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

--
Thanks
Nikhilesh Reddy

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2015-01-29 02:03:22

by Darrick J. Wong

[permalink] [raw]

Subject: Re: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

On Wed, Jan 28, 2015 at 03:36:02PM -0800, Nikhilesh Reddy wrote:
> Also can someone please confirm if the following patch help address
> this issue That I described below?
>
> https://kernel.googlesource.com/pub/scm/linux/kernel/git/tytso/ext4/+/7afe5aa59ed3da7b6161617e7f157c7c680dc41e
>
> Quick Summary:
>
> in the function
> wait_on_page_writeback(page) inside ext4_da_write_begin seems to
> take long and the process is stuck waiting.
>
> The writeback just seems to take long.. but only when we write to the
> same page over again...
> The grab_cache_page_write_begin gives the same page until we write
> enough...
> Is there a wait to not wait for the writeback?

Oh, right. <smacks head>

Yes, that wait_for_page_writeback() was replaced with a call to
wait_for_stable_page() upstream, so it'll probably work for your 3.10
kernel.

(I also wonder why not jump to a newer LTS kernel, but that's neither
here nor there.)

<more below>

> On 01/28/2015 03:23 PM, Nikhilesh Reddy wrote:
> >
> >Hi Darrick
> >Thanks so much for your reply.
> >I apologize the stable page wait seems to be an odd outlier in that
> >run ... I ran multiple traces again.
> >
> >Please find my answers inline below.
> >
> >On 01/28/2015 01:39 PM, Darrick J. Wong wrote:
> >>On Wed, Jan 28, 2015 at 11:27:13AM -0800, Nikhilesh Reddy wrote:
> >>>Hi
> >>>I am working on a 64 bit Android device and have been trying to
> >>>improve performance for stream based data download (for example an
> >>>ftp)
> >>>The device has 3GB of ram and the dirty_ratio and
> >>>dirty_background_ratio are set to 5 and 1 respectively.
> >>>
> >>>Kernel 3.10 , Highmem is not enabled and the backing device is a
> >>>emmc and checksumming is not enabled
> >>Ok, 3.10 kernel is new enough that stable page writes only apply to
> >>devices that demand it, and apparently your eMMC demands it.
> >I tried checking my logs again .. and yes the stable pages dont apply
> >... that seems to be an odd instance where it took a while.
> >/My apologies.
> >
> >So On running many more trace runs it appears that the wait is actually
> >on in the function
> >
> >wait_on_page_writeback(page) ext4_da_write_begin
> >
> >Snippet below
> >
> > /*
> > * grab_cache_page_write_begin() can take a long time if the
> > * system is thrashing due to memory pressure, or if the page
> > * is being written back. So grab it first before we start
> > * the transaction handle. This also allows us to allocate
> > * the page (if needed) without using GFP_NOFS.
> > */
> >retry_grab:
> > page = grab_cache_page_write_begin(mapping, index, flags);
> > if (!page)
> > return -ENOMEM;
> > unlock_page(page);
> >
> >retry_journal:
> > handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
> > if (IS_ERR(handle)) {
> > page_cache_release(page);
> > return PTR_ERR(handle);
> > }
> >
> > lock_page(page);
> > if (page->mapping != mapping) {
> > /* The page got truncated from under us */
> > unlock_page(page);
> > page_cache_release(page);
> > ext4_journal_stop(handle);
> > goto retry_grab;
> > }
> > wait_on_page_writeback(page); <<<<<<<<<<<<<<<<<<This is where the
> >wait is!
> >
> >
> >
> >The writeback just seems to take long.. but only when we write to the
> >same page over again...
> >The grab_cache_page_write_begin gives the same page until we write
> >enough...
> >
> >Is there a wait to not wait for the writeback?
> >
> >
> >>
> >>>I noticed when profiling writes that if we dont use streamed IO (ie.
> >>>use write of whatever size data was read on the tcp stream) there
> >>>are some writes that seem to get blocked on
> >>>wait_for_stable_page.
> >>>
> >>>If I force the writes to be buffered in the userspace and ensure
> >>>writing 4k chunks the writes never seem to stall.
> >>That's consistent with a page being partially dirtied, written out,
> >>and partially dirtied again before write-out finishes. If you buffer
> >>the incoming data such that a page is only dirtied once, you'll never
> >>notice wait_for_stable_page.
> >>
> >>Are you explicitly forcing writeout (i.e. fsync()) after every chunk
> >>arrives? Or, is the rate of incoming data high enough such that we
> >>hit either dirty*ratio limit? It isn't too hard to hit 30MB these
> >>days. Why are you lowering the ratios from their defaults?
>
> >Using the higher values seems to cause longer stall ...when we do stall...
> >Smaller values are cause the write out to happen often but each one
> >takes less time.
> >
> >Setting it to the defaults causes a longer stall which seems to be
> >impacting TCP due to the delay in reading.
> >Additional info : dirty_expire_centisecs was set to 200 and the data
> >rate is close to 300Mbps. so yea 30MB should be less than a second.

Ah, you're concerned about the writeout latency, and it's important
that incoming data end up on flash as quickly as is convenient. Hm.

That patch you found will probably get rid of the symptoms; I'd also
suggest fsync()ing after each page boundary instead of lowering the
writeback ratios since that increases the write load systemwide,
which will make the writeouts for your app take more time and wear out
the flash sooner.

--D

> >
> >Am I missing something obvious here?
> >
> >Please do let me know.
> >
> >Thanks so much for you help
> >
> >
> >>
> >>>I noticed there was earlier discussion on this and idea were
> >>>proposed to use snapshotting of the pages to avoid stalls...
> >>>For example: https://lwn.net/Articles/546658/
> >>>
> >>>But this seems to only snapshot ext3 ... (unless i misunderstood
> >>>what the patch is doing)
> >>>
> >>>Is there a similar patch to snapshot the buffers to not stall the
> >>>writes for ext4?
> >>No, there is not -- the problem with the snapshot solution is that it
> >>requires page allocations when the FS is (potentially) trying to
> >>reclaim memory by writing out dirty pages.
> >>
> >>--D
> >>
> >>>Please let me know.
> >>>
> >>>I would really appreciate any help you can give me.
> >>>
> >>>
> >>>--
> >>>Thanks
> >>>Nikhilesh Reddy
> >>>
> >>>Qualcomm Innovation Center, Inc.
> >>>The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
> >>>Forum,
> >>>a Linux Foundation Collaborative Project.
> >>>--
> >>>To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> >>>the body of a message to [email protected]
> >>>More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> >
>
>
> --
> Thanks
> Nikhilesh Reddy
>
> Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project.

2015-01-30 21:26:01

by Nikhilesh Reddy

[permalink] [raw]

Subject: Re: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

Thanks so much Darrick for all your help.
That patch helped.

I had a bit of a query with respect to stable pages though...

From what I understand only backing devices that need some sort of
checksumming seem to need stable pages... AS in regular emmc shouldnt
need it correct?

If not then the below patch would now cause a pass-through rather than
wait for writeback and I was wondering if that could cause corruptions
when the same page is dirtied again with writeback in progress.

Also with respect to the flush... since the applications are prettymuch
third party I cant really do much there...

But I was wondering now that there is no contention with the pages after
the patch ... if reducing the cache may actually be bad.

Still running metrics so wont know until they are done.

On 01/28/2015 03:57 PM, Darrick J. Wong wrote:
> On Wed, Jan 28, 2015 at 03:36:02PM -0800, Nikhilesh Reddy wrote:
>> Also can someone please confirm if the following patch help address
>> this issue That I described below?
>>
>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/tytso/ext4/+/7afe5aa59ed3da7b6161617e7f157c7c680dc41e
>>
>> Quick Summary:
>>
>> in the function
>> wait_on_page_writeback(page) inside ext4_da_write_begin seems to
>> take long and the process is stuck waiting.
>>
>> The writeback just seems to take long.. but only when we write to the
>> same page over again...
>> The grab_cache_page_write_begin gives the same page until we write
>> enough...
>> Is there a wait to not wait for the writeback?
>
> Oh, right. <smacks head>
>
> Yes, that wait_for_page_writeback() was replaced with a call to
> wait_for_stable_page() upstream, so it'll probably work for your 3.10
> kernel.
>
> (I also wonder why not jump to a newer LTS kernel, but that's neither
> here nor there.)
>
> <more below>
>
>> On 01/28/2015 03:23 PM, Nikhilesh Reddy wrote:
>>>
>>> Hi Darrick
>>> Thanks so much for your reply.
>>> I apologize the stable page wait seems to be an odd outlier in that
>>> run ... I ran multiple traces again.
>>>
>>> Please find my answers inline below.
>>>
>>> On 01/28/2015 01:39 PM, Darrick J. Wong wrote:
>>>> On Wed, Jan 28, 2015 at 11:27:13AM -0800, Nikhilesh Reddy wrote:
>>>>> Hi
>>>>> I am working on a 64 bit Android device and have been trying to
>>>>> improve performance for stream based data download (for example an
>>>>> ftp)
>>>>> The device has 3GB of ram and the dirty_ratio and
>>>>> dirty_background_ratio are set to 5 and 1 respectively.
>>>>>
>>>>> Kernel 3.10 , Highmem is not enabled and the backing device is a
>>>>> emmc and checksumming is not enabled
>>>> Ok, 3.10 kernel is new enough that stable page writes only apply to
>>>> devices that demand it, and apparently your eMMC demands it.
>>> I tried checking my logs again .. and yes the stable pages dont apply
>>> ... that seems to be an odd instance where it took a while.
>>> /My apologies.
>>>
>>> So On running many more trace runs it appears that the wait is actually
>>> on in the function
>>>
>>> wait_on_page_writeback(page) ext4_da_write_begin
>>>
>>> Snippet below
>>>
>>> /*
>>> * grab_cache_page_write_begin() can take a long time if the
>>> * system is thrashing due to memory pressure, or if the page
>>> * is being written back. So grab it first before we start
>>> * the transaction handle. This also allows us to allocate
>>> * the page (if needed) without using GFP_NOFS.
>>> */
>>> retry_grab:
>>> page = grab_cache_page_write_begin(mapping, index, flags);
>>> if (!page)
>>> return -ENOMEM;
>>> unlock_page(page);
>>>
>>> retry_journal:
>>> handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
>>> if (IS_ERR(handle)) {
>>> page_cache_release(page);
>>> return PTR_ERR(handle);
>>> }
>>>
>>> lock_page(page);
>>> if (page->mapping != mapping) {
>>> /* The page got truncated from under us */
>>> unlock_page(page);
>>> page_cache_release(page);
>>> ext4_journal_stop(handle);
>>> goto retry_grab;
>>> }
>>> wait_on_page_writeback(page); <<<<<<<<<<<<<<<<<<This is where the
>>> wait is!
>>>
>>>
>>>
>>> The writeback just seems to take long.. but only when we write to the
>>> same page over again...
>>> The grab_cache_page_write_begin gives the same page until we write
>>> enough...
>>>
>>> Is there a wait to not wait for the writeback?
>>>
>>>
>>>>
>>>>> I noticed when profiling writes that if we dont use streamed IO (ie.
>>>>> use write of whatever size data was read on the tcp stream) there
>>>>> are some writes that seem to get blocked on
>>>>> wait_for_stable_page.
>>>>>
>>>>> If I force the writes to be buffered in the userspace and ensure
>>>>> writing 4k chunks the writes never seem to stall.
>>>> That's consistent with a page being partially dirtied, written out,
>>>> and partially dirtied again before write-out finishes. If you buffer
>>>> the incoming data such that a page is only dirtied once, you'll never
>>>> notice wait_for_stable_page.
>>>>
>>>> Are you explicitly forcing writeout (i.e. fsync()) after every chunk
>>>> arrives? Or, is the rate of incoming data high enough such that we
>>>> hit either dirty*ratio limit? It isn't too hard to hit 30MB these
>>>> days. Why are you lowering the ratios from their defaults?
>>
>>> Using the higher values seems to cause longer stall ...when we do stall...
>>> Smaller values are cause the write out to happen often but each one
>>> takes less time.
>>>
>>> Setting it to the defaults causes a longer stall which seems to be
>>> impacting TCP due to the delay in reading.
>>> Additional info : dirty_expire_centisecs was set to 200 and the data
>>> rate is close to 300Mbps. so yea 30MB should be less than a second.
>
> Ah, you're concerned about the writeout latency, and it's important
> that incoming data end up on flash as quickly as is convenient. Hm.
>
> That patch you found will probably get rid of the symptoms; I'd also
> suggest fsync()ing after each page boundary instead of lowering the
> writeback ratios since that increases the write load systemwide,
> which will make the writeouts for your app take more time and wear out
> the flash sooner.
>
> --D
>
>>>
>>> Am I missing something obvious here?
>>>
>>> Please do let me know.
>>>
>>> Thanks so much for you help
>>>
>>>
>>>>
>>>>> I noticed there was earlier discussion on this and idea were
>>>>> proposed to use snapshotting of the pages to avoid stalls...
>>>>> For example: https://lwn.net/Articles/546658/
>>>>>
>>>>> But this seems to only snapshot ext3 ... (unless i misunderstood
>>>>> what the patch is doing)
>>>>>
>>>>> Is there a similar patch to snapshot the buffers to not stall the
>>>>> writes for ext4?
>>>> No, there is not -- the problem with the snapshot solution is that it
>>>> requires page allocations when the FS is (potentially) trying to
>>>> reclaim memory by writing out dirty pages.
>>>>
>>>> --D
>>>>
>>>>> Please let me know.
>>>>>
>>>>> I would really appreciate any help you can give me.
>>>>>
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Nikhilesh Reddy
>>>>>
>>>>> Qualcomm Innovation Center, Inc.
>>>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
>>>>> Forum,
>>>>> a Linux Foundation Collaborative Project.
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>>>> the body of a message to [email protected]
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>
>>
>> --
>> Thanks
>> Nikhilesh Reddy
>>
>> Qualcomm Innovation Center, Inc.
>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
>> a Linux Foundation Collaborative Project.

--
Thanks
Nikhilesh Reddy

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2015-01-30 21:53:38

by Darrick J. Wong

[permalink] [raw]

Subject: Re: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

On Fri, Jan 30, 2015 at 01:25:58PM -0800, Nikhilesh Reddy wrote:
> Thanks so much Darrick for all your help.
> That patch helped.
>
> I had a bit of a query with respect to stable pages though...
>
> From what I understand only backing devices that need some sort of
> checksumming seem to need stable pages... AS in regular emmc
> shouldnt need it correct?

Right.

> If not then the below patch would now cause a pass-through rather than
> wait for writeback and I was wondering if that could cause corruptions
> when the same page is dirtied again with writeback in progress.

It looked ok to me and several other reviewers. On a non-stable-pages device,
dirtying a page during writeback should be fine since the dirty bit will be
set and the page will eventually be written back again.

> Also with respect to the flush... since the applications are
> prettymuch third party I cant really do much there...

There's eatmydata, but that's evil... :)

> But I was wondering now that there is no contention with the pages
> after the patch ... if reducing the cache may actually be bad.
>
> Still running metrics so wont know until they are done.

--D

>
>
> On 01/28/2015 03:57 PM, Darrick J. Wong wrote:
> >On Wed, Jan 28, 2015 at 03:36:02PM -0800, Nikhilesh Reddy wrote:
> >>Also can someone please confirm if the following patch help address
> >>this issue That I described below?
> >>
> >>https://kernel.googlesource.com/pub/scm/linux/kernel/git/tytso/ext4/+/7afe5aa59ed3da7b6161617e7f157c7c680dc41e
> >>
> >>Quick Summary:
> >>
> >>in the function
> >>wait_on_page_writeback(page) inside ext4_da_write_begin seems to
> >>take long and the process is stuck waiting.
> >>
> >>The writeback just seems to take long.. but only when we write to the
> >>same page over again...
> >>The grab_cache_page_write_begin gives the same page until we write
> >>enough...
> >>Is there a wait to not wait for the writeback?
> >
> >Oh, right. <smacks head>
> >
> >Yes, that wait_for_page_writeback() was replaced with a call to
> >wait_for_stable_page() upstream, so it'll probably work for your 3.10
> >kernel.
> >
> >(I also wonder why not jump to a newer LTS kernel, but that's neither
> >here nor there.)
> >
> ><more below>
> >
> >>On 01/28/2015 03:23 PM, Nikhilesh Reddy wrote:
> >>>
> >>>Hi Darrick
> >>>Thanks so much for your reply.
> >>>I apologize the stable page wait seems to be an odd outlier in that
> >>>run ... I ran multiple traces again.
> >>>
> >>>Please find my answers inline below.
> >>>
> >>>On 01/28/2015 01:39 PM, Darrick J. Wong wrote:
> >>>>On Wed, Jan 28, 2015 at 11:27:13AM -0800, Nikhilesh Reddy wrote:
> >>>>>Hi
> >>>>>I am working on a 64 bit Android device and have been trying to
> >>>>>improve performance for stream based data download (for example an
> >>>>>ftp)
> >>>>>The device has 3GB of ram and the dirty_ratio and
> >>>>>dirty_background_ratio are set to 5 and 1 respectively.
> >>>>>
> >>>>>Kernel 3.10 , Highmem is not enabled and the backing device is a
> >>>>>emmc and checksumming is not enabled
> >>>>Ok, 3.10 kernel is new enough that stable page writes only apply to
> >>>>devices that demand it, and apparently your eMMC demands it.
> >>>I tried checking my logs again .. and yes the stable pages dont apply
> >>>... that seems to be an odd instance where it took a while.
> >>>/My apologies.
> >>>
> >>>So On running many more trace runs it appears that the wait is actually
> >>>on in the function
> >>>
> >>>wait_on_page_writeback(page) ext4_da_write_begin
> >>>
> >>>Snippet below
> >>>
> >>> /*
> >>> * grab_cache_page_write_begin() can take a long time if the
> >>> * system is thrashing due to memory pressure, or if the page
> >>> * is being written back. So grab it first before we start
> >>> * the transaction handle. This also allows us to allocate
> >>> * the page (if needed) without using GFP_NOFS.
> >>> */
> >>>retry_grab:
> >>> page = grab_cache_page_write_begin(mapping, index, flags);
> >>> if (!page)
> >>> return -ENOMEM;
> >>> unlock_page(page);
> >>>
> >>>retry_journal:
> >>> handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
> >>> if (IS_ERR(handle)) {
> >>> page_cache_release(page);
> >>> return PTR_ERR(handle);
> >>> }
> >>>
> >>> lock_page(page);
> >>> if (page->mapping != mapping) {
> >>> /* The page got truncated from under us */
> >>> unlock_page(page);
> >>> page_cache_release(page);
> >>> ext4_journal_stop(handle);
> >>> goto retry_grab;
> >>> }
> >>> wait_on_page_writeback(page); <<<<<<<<<<<<<<<<<<This is where the
> >>>wait is!
> >>>
> >>>
> >>>
> >>>The writeback just seems to take long.. but only when we write to the
> >>>same page over again...
> >>>The grab_cache_page_write_begin gives the same page until we write
> >>>enough...
> >>>
> >>>Is there a wait to not wait for the writeback?
> >>>
> >>>
> >>>>
> >>>>>I noticed when profiling writes that if we dont use streamed IO (ie.
> >>>>>use write of whatever size data was read on the tcp stream) there
> >>>>>are some writes that seem to get blocked on
> >>>>>wait_for_stable_page.
> >>>>>
> >>>>>If I force the writes to be buffered in the userspace and ensure
> >>>>>writing 4k chunks the writes never seem to stall.
> >>>>That's consistent with a page being partially dirtied, written out,
> >>>>and partially dirtied again before write-out finishes. If you buffer
> >>>>the incoming data such that a page is only dirtied once, you'll never
> >>>>notice wait_for_stable_page.
> >>>>
> >>>>Are you explicitly forcing writeout (i.e. fsync()) after every chunk
> >>>>arrives? Or, is the rate of incoming data high enough such that we
> >>>>hit either dirty*ratio limit? It isn't too hard to hit 30MB these
> >>>>days. Why are you lowering the ratios from their defaults?
> >>
> >>>Using the higher values seems to cause longer stall ...when we do stall...
> >>>Smaller values are cause the write out to happen often but each one
> >>>takes less time.
> >>>
> >>>Setting it to the defaults causes a longer stall which seems to be
> >>>impacting TCP due to the delay in reading.
> >>>Additional info : dirty_expire_centisecs was set to 200 and the data
> >>>rate is close to 300Mbps. so yea 30MB should be less than a second.
> >
> >Ah, you're concerned about the writeout latency, and it's important
> >that incoming data end up on flash as quickly as is convenient. Hm.
> >
> >That patch you found will probably get rid of the symptoms; I'd also
> >suggest fsync()ing after each page boundary instead of lowering the
> >writeback ratios since that increases the write load systemwide,
> >which will make the writeouts for your app take more time and wear out
> >the flash sooner.
> >
> >--D
> >
> >>>
> >>>Am I missing something obvious here?
> >>>
> >>>Please do let me know.
> >>>
> >>>Thanks so much for you help
> >>>
> >>>
> >>>>
> >>>>>I noticed there was earlier discussion on this and idea were
> >>>>>proposed to use snapshotting of the pages to avoid stalls...
> >>>>>For example: https://lwn.net/Articles/546658/
> >>>>>
> >>>>>But this seems to only snapshot ext3 ... (unless i misunderstood
> >>>>>what the patch is doing)
> >>>>>
> >>>>>Is there a similar patch to snapshot the buffers to not stall the
> >>>>>writes for ext4?
> >>>>No, there is not -- the problem with the snapshot solution is that it
> >>>>requires page allocations when the FS is (potentially) trying to
> >>>>reclaim memory by writing out dirty pages.
> >>>>
> >>>>--D
> >>>>
> >>>>>Please let me know.
> >>>>>
> >>>>>I would really appreciate any help you can give me.
> >>>>>
> >>>>>
> >>>>>--
> >>>>>Thanks
> >>>>>Nikhilesh Reddy
> >>>>>
> >>>>>Qualcomm Innovation Center, Inc.
> >>>>>The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
> >>>>>Forum,
> >>>>>a Linux Foundation Collaborative Project.
> >>>>>--
> >>>>>To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> >>>>>the body of a message to [email protected]
> >>>>>More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>>
> >>>
> >>
> >>
> >>--
> >>Thanks
> >>Nikhilesh Reddy
> >>
> >>Qualcomm Innovation Center, Inc.
> >>The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> >>a Linux Foundation Collaborative Project.
>
>
> --
> Thanks
> Nikhilesh Reddy
>
> Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project.

2015-02-01 02:37:29

by Theodore Ts'o

[permalink] [raw]

Subject: Re: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

On Wed, Jan 28, 2015 at 03:57:05PM -0800, Darrick J. Wong wrote:
> Yes, that wait_for_page_writeback() was replaced with a call to
> wait_for_stable_page() upstream, so it'll probably work for your 3.10
> kernel.
>
> (I also wonder why not jump to a newer LTS kernel, but that's neither
> here nor there.)

Nikhilesh said that he was using a 64-bit android device. I'm going
to guess it is based off of the AOSP kernel for the Nexus 9, which is
a 64-bit android device that uses 3.10 as a base. It's not *really* a
3.10 kernel, though, becausre it looks like bits and pieces of the VFS
was forward ported to a somewhat newer version so they could backport
F2FS to the Nexus 9 kernel. But I'm guessing he's using it because
there are device drivers he needs that might not be available without
a lot of forward-porting work on a newer kernel.

If you are interested in trying to use a somewhat newer version of
ext4, I just *happen* to have a backport of the 3.18 version of ext4
on top of a stock version of 3.10. You can find it at the
backport-to-3.10 branch of the ext4.git tree on git.kernel.org.

If you are starting from the AOSP version of the Nexus 9 kernel, the
patch series from that branch won't apply 100% cleanly, because of the
changes resulting from the F2FS backport. It's relatively
straight-forward though to use the backport-to-3.10 patch series as a
model to get a version of 3.18 ext4 on top of the AOSP kernel, though;
it's mostly dropping or reworking patches that were no longer
necessary thanks to the F2FS backport.

If there is interest, I can look into what might be involved in making
a git repo of an AOSP kernel with the 3.18 ext4 code backported to it
available.

I won't give any warrantees, of course, since the AOSP kernel doesn't
build on x86 and so I can't easily run regression tests on it. So if
it breaks, you will get to keep both pieces. I will try to at least
look at bug reports, though, and I can say that 3.18 backport on the
stock 3.10 kernel survives xfstests much better than the 3.10 version
of ext4, since a modern xfstests very quickly caused the stock 3.10
kernel to panic. :-)

- Ted

2015-02-03 23:51:43

by Nikhilesh Reddy

[permalink] [raw]

Subject: Re: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

Thanks so much Darrick and Ted!
You guys have been a lot of help. I really appreciate it.

On Sat 31 Jan 2015 06:37:23 PM PST, Theodore Ts'o wrote:
> On Wed, Jan 28, 2015 at 03:57:05PM -0800, Darrick J. Wong wrote:
>> Yes, that wait_for_page_writeback() was replaced with a call to
>> wait_for_stable_page() upstream, so it'll probably work for your 3.10
>> kernel.
>>
>> (I also wonder why not jump to a newer LTS kernel, but that's neither
>> here nor there.)
>
> Nikhilesh said that he was using a 64-bit android device. I'm going
> to guess it is based off of the AOSP kernel for the Nexus 9, which is
> a 64-bit android device that uses 3.10 as a base. It's not *really* a
> 3.10 kernel, though, becausre it looks like bits and pieces of the VFS
> was forward ported to a somewhat newer version so they could backport
> F2FS to the Nexus 9 kernel. But I'm guessing he's using it because
> there are device drivers he needs that might not be available without
> a lot of forward-porting work on a newer kernel.
>

Yes we cant really move to a newer kernel just yet. :)

> If you are interested in trying to use a somewhat newer version of
> ext4, I just *happen* to have a backport of the 3.18 version of ext4
> on top of a stock version of 3.10. You can find it at the
> backport-to-3.10 branch of the ext4.git tree on git.kernel.org.

Thanks that is awesome... yes...Will look into them.Thanks!

>
> If you are starting from the AOSP version of the Nexus 9 kernel, the
> patch series from that branch won't apply 100% cleanly, because of the
> changes resulting from the F2FS backport. It's relatively
> straight-forward though to use the backport-to-3.10 patch series as a
> model to get a version of 3.18 ext4 on top of the AOSP kernel, though;
> it's mostly dropping or reworking patches that were no longer
> necessary thanks to the F2FS backport.
>
> If there is interest, I can look into what might be involved in making
> a git repo of an AOSP kernel with the 3.18 ext4 code backported to it
> available.
>
> I won't give any warrantees, of course, since the AOSP kernel doesn't
> build on x86 and so I can't easily run regression tests on it. So if
> it breaks, you will get to keep both pieces. I will try to at least
> look at bug reports, though, and I can say that 3.18 backport on the
> stock 3.10 kernel survives xfstests much better than the 3.10 version
> of ext4, since a modern xfstests very quickly caused the stock 3.10
> kernel to panic. :-)
>
> - Ted
>

Yes I can run testing on my end to make sure everything is good.

Thanks So much once again.

--
Thanks
Nikhilesh Reddy

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2015-02-04 17:02:57

by Theodore Ts'o

[permalink] [raw]

Subject: Re: Writes blocked on wait_for_stable_page (Writes of less than page size sometimes take too long)

On Tue, Feb 03, 2015 at 03:51:41PM -0800, Nikhilesh Reddy wrote:
> >
> >I won't give any warrantees, of course, since the AOSP kernel doesn't
> >build on x86 and so I can't easily run regression tests on it. So if
> >it breaks, you will get to keep both pieces. I will try to at least
> >look at bug reports, though, and I can say that 3.18 backport on the
> >stock 3.10 kernel survives xfstests much better than the 3.10 version
> >of ext4, since a modern xfstests very quickly caused the stock 3.10
> >kernel to panic. :-)

Note: I managed to patch/bludgeon the AOSP kernel so it will build on
x86 (apparently some Nvidia engineers wasn't very careful about adding
Tegra support without breaking x86 builds, boo hiss), and so I was
able to run a quick smoke test using kvm-xfstests. "kvm-xfstests
smoke" runs most of the tests successfully, modulo a few expected
failures, but then the kernel locks up after executing "echo 3 >
/proc/sys/vm/drop_caches; cat /proc/slabinfo" in the test framework,
which happens just before it halts the VM.

I suspect there is a bug in my commit (f9f56f19dc) which lets ext4 use
the older shrinker API which is used by the 3.10 kernel. It's on my
todo list to look at, but I'm in Mountain View this week and my time
has been almost toally sucked up by meetings, so I haven't gotten to
it yet.

So in case you end up doing some testing of the 3.10 backport branch,
I thought I should give you a quick heads up.

Cheers,

- Ted