2011-12-12 02:27:52

by Andrea Arcangeli

[permalink] [raw]
Subject: ext4_da_write_end() and copied == 0

Hi,

I've been triggering ext4 troubles out of some experimental autonuma
code if I set knumad frequency rates very high.

The first patch looks fairly safe, but the second patch I've no idea
if it's correct and if damages stuff instead of fixing it, but clearly
it fixes the -EINVAL problem for me and it looks like to work fine now.

[PATCH 1/2] ext4: avoid hangs in ext4_da_should_update_i_disksize()
[PATCH 2/2] ext4: ext4_discard_partial_page_buffers_no_lock() wrong parameters

If you make alternate fixes for this let me know and I can test
them. I can reproduce both issues instantly.

Thanks,
Andrea


2011-12-12 02:27:52

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 1/2] ext4: avoid hangs in ext4_da_should_update_i_disksize()

If the pte mapping in generic_perform_write() is unmapped between
iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the
"copied" parameter to ->end_write can be zero. ext4 couldn't cope with
it with delayed allocations enabled. This skips the i_disksize
enlargement logic if copied is zero and no new data was appeneded to
the inode.

gdb> bt
#0 0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x1\
08000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
#1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
#2 0xffffffff810d97f1 in generic_perform_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value o\
ptimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2440
#3 generic_file_buffered_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, p\
os=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2482
#4 0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0\
xffff88001e26be40) at mm/filemap.c:2600
#5 0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=<value optimi\
zed out>, pos=<value optimized out>) at mm/filemap.c:2632
#6 0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) a\
t fs/ext4/file.c:136
#7 0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=<value optimized out>, len=<value optimized out>, \
ppos=0xffff88001e26bf48) at fs/read_write.c:406
#8 0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4\
000, pos=0xffff88001e26bf48) at fs/read_write.c:435
#9 0xffffffff8113816c in sys_write (fd=<value optimized out>, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x\
4000) at fs/read_write.c:487
#10 <signal handler called>
#11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
#12 0x0000000000000000 in ?? ()
gdb> print offset
$22 = 0xffffffffffffffff
gdb> print idx
$23 = 0xffffffff
gdb> print inode->i_blkbits
$24 = 0xc
gdb> up
#1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
2512 if (ext4_da_should_update_i_disksize(page, end)) {
gdb> print start
$25 = 0x0
gdb> print end
$26 = 0xffffffffffffffff
gdb> print pos
$27 = 0x108000
gdb> print new_i_size
$28 = 0x108000
gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
$29 = 0xd9000
gdb> down
2467 for (i = 0; i < idx; i++)
gdb> print i
$30 = 0xd44acbee

This is 100% reproducible with some autonuma development code tuned in
a very aggressive manner (not normal way even for knumad) which does
"exotic" changes to the ptes. It wouldn't normally trigger but I don't
see why it can't happen normally if the page is added to swap cache in
between the two faults leading to "copied" being zero (which then
hangs in ext4). So it should be fixed. Especially possible with lumpy
reclaim (albeit disabled if compaction is enabled) as that would
ignore the young bits in the ptes.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/ext4/inode.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fffec40..63f9541 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2508,7 +2508,7 @@ static int ext4_da_write_end(struct file *file,
*/

new_i_size = pos + copied;
- if (new_i_size > EXT4_I(inode)->i_disksize) {
+ if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
if (ext4_da_should_update_i_disksize(page, end)) {
down_write(&EXT4_I(inode)->i_data_sem);
if (new_i_size > EXT4_I(inode)->i_disksize) {

2011-12-12 02:27:52

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 2/2] ext4: ext4_discard_partial_page_buffers_no_lock() wrong parameters

If "copied" is zero (it can happen if the pte is unmapped before the
atomic copy_user that does the data copy runs) the "from" passed to
ext4_discard_partial_page_buffers_no_lock() points to pos-1, which
would correspond to a logical page index before the page->index
leading to ext4_discard_partial_page_buffers_no_lock() returning
-EINVAL (because index != page->index). In such a case write() returns
-EINVAL and userland gets a failure and filemap.c doesn't retry the
copy_user anymore.

I'm not certain of why exactly
ext4_discard_partial_page_buffers_no_lock() is run here, so it's hard
to tell if this is the correct fix. But it that functions clears data
starting from the "from" parameter, so regardless of the -EINVAL
retval, the right "from" to start clearing data should be pos+copied,
not pos+copied-1. If this assumption is correct, it could mean that
this bug in addition to the -EINVAL error, could also zero out 1 byte
by mistake. I'm not sure what the implications for that are (not sure
if data corruption is possible in some circumstances because of
that). I guess normally this functions runs on unmapped buffers and
the EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED makes it a noop on those.

After fixing the hang in ext4_da_should_update_i_disksize, this write
-EINVAL error becomes trivially reproducible with experimental knumad
autonuma code running at heavy frequency (not the normal case). But
like for the ext4_da_should_update_i_disksize hang, it should not be
impossible to reproduce it with legacy swapping behavior.

After dropping all caches a md5sum is successful and I can't find
errors anymore with this patch, the -EINVAL stops, but it's not
conclusive (and I haven't run e2fsck -f yet but I doubt this affects
metadata coherency, seems more like a delayalloc data issue).

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/ext4/inode.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 63f9541..528c4c5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2534,11 +2534,11 @@ static int ext4_da_write_end(struct file *file,
page, fsdata);

page_len = PAGE_CACHE_SIZE -
- ((pos + copied - 1) & (PAGE_CACHE_SIZE - 1));
+ ((pos + copied) & (PAGE_CACHE_SIZE - 1));

- if (page_len > 0) {
+ if (page_len < PAGE_CACHE_SIZE) {
ret = ext4_discard_partial_page_buffers_no_lock(handle,
- inode, page, pos + copied - 1, page_len,
+ inode, page, pos + copied, page_len,
EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED);
}


2011-12-12 03:17:55

by Yongqiang Yang

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: ext4_discard_partial_page_buffers_no_lock() wrong parameters

Hi Andrea,

The code you are testing are removed in recent patches, the patches
have not been merged.

Please try following patches:
[PATCH 1/2] ext4: let mpage_submit_io works well when blocksize < pagesize
[PATCH 2/2] ext4: let ext4_discard_partial_buffers handle pages
without buffers correctly

and

[PATCH 1/2] ext4: remove a wrong BUG_ON in ext4_ext_convert_to_initialized
[PATCH 2/2] ext4: let ext4_bio_write_page handle EOF correctly

Yongqiang.

On Mon, Dec 12, 2011 at 10:27 AM, Andrea Arcangeli <[email protected]> wrote:
> If "copied" is zero (it can happen if the pte is unmapped before the
> atomic copy_user that does the data copy runs) the "from" passed to
> ext4_discard_partial_page_buffers_no_lock() points to pos-1, which
> would correspond to a logical page index before the page->index
> leading to ext4_discard_partial_page_buffers_no_lock() returning
> -EINVAL (because index != page->index). In such a case write() returns
> -EINVAL and userland gets a failure and filemap.c doesn't retry the
> copy_user anymore.
>
> I'm not certain of why exactly
> ext4_discard_partial_page_buffers_no_lock() is run here, so it's hard
> to tell if this is the correct fix. But it that functions clears data
> starting from the "from" parameter, so regardless of the -EINVAL
> retval, the right "from" to start clearing data should be pos+copied,
> not pos+copied-1. If this assumption is correct, it could mean that
> this bug in addition to the -EINVAL error, could also zero out 1 byte
> by mistake. I'm not sure what the implications for that are (not sure
> if data corruption is possible in some circumstances because of
> that). I guess normally this functions runs on unmapped buffers and
> the EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED makes it a noop on those.
>
> After fixing the hang in ext4_da_should_update_i_disksize, this write
> -EINVAL error becomes trivially reproducible with experimental knumad
> autonuma code running at heavy frequency (not the normal case). But
> like for the ext4_da_should_update_i_disksize hang, it should not be
> impossible to reproduce it with legacy swapping behavior.
>
> After dropping all caches a md5sum is successful and I can't find
> errors anymore with this patch, the -EINVAL stops, but it's not
> conclusive (and I haven't run e2fsck -f yet but I doubt this affects
> metadata coherency, seems more like a delayalloc data issue).
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> ?fs/ext4/inode.c | ? ?6 +++---
> ?1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 63f9541..528c4c5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2534,11 +2534,11 @@ static int ext4_da_write_end(struct file *file,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?page, fsdata);
>
> ? ? ? ?page_len = PAGE_CACHE_SIZE -
> - ? ? ? ? ? ? ? ? ? ? ? ((pos + copied - 1) & (PAGE_CACHE_SIZE - 1));
> + ? ? ? ? ? ? ? ? ? ? ? ((pos + copied) & (PAGE_CACHE_SIZE - 1));
>
> - ? ? ? if (page_len > 0) {
> + ? ? ? if (page_len < PAGE_CACHE_SIZE) {
> ? ? ? ? ? ? ? ?ret = ext4_discard_partial_page_buffers_no_lock(handle,
> - ? ? ? ? ? ? ? ? ? ? ? inode, page, pos + copied - 1, page_len,
> + ? ? ? ? ? ? ? ? ? ? ? inode, page, pos + copied, page_len,
> ? ? ? ? ? ? ? ? ? ? ? ?EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED);
> ? ? ? ?}
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html



--
Best Wishes
Yongqiang Yang

2011-12-12 03:28:22

by Yongqiang Yang

[permalink] [raw]
Subject: Re: [PATCH 1/2] ext4: avoid hangs in ext4_da_should_update_i_disksize()

Hi Andrea,

I can not figure out why ext4 hangs. Could you explain more on hanging?

Thanks,
Yongqiang.

On Mon, Dec 12, 2011 at 10:27 AM, Andrea Arcangeli <[email protected]> wrote:
> If the pte mapping in generic_perform_write() is unmapped between
> iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the
> "copied" parameter to ->end_write can be zero. ext4 couldn't cope with
> it with delayed allocations enabled. This skips the i_disksize
> enlargement logic if copied is zero and no new data was appeneded to
> the inode.
>
> ?gdb> bt
> ?#0 ?0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x1\
> ?08000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
> ?#1 ?ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
> ?xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
> ?#2 ?0xffffffff810d97f1 in generic_perform_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value o\
> ?ptimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2440
> ?#3 ?generic_file_buffered_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, p\
> ?os=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2482
> ?#4 ?0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0\
> ?xffff88001e26be40) at mm/filemap.c:2600
> ?#5 ?0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=<value optimi\
> ?zed out>, pos=<value optimized out>) at mm/filemap.c:2632
> ?#6 ?0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) a\
> ?t fs/ext4/file.c:136
> ?#7 ?0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=<value optimized out>, len=<value optimized out>, \
> ?ppos=0xffff88001e26bf48) at fs/read_write.c:406
> ?#8 ?0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4\
> ?000, pos=0xffff88001e26bf48) at fs/read_write.c:435
> ?#9 ?0xffffffff8113816c in sys_write (fd=<value optimized out>, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x\
> ?4000) at fs/read_write.c:487
> ?#10 <signal handler called>
> ?#11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
> ?#12 0x0000000000000000 in ?? ()
> ?gdb> print offset
> ?$22 = 0xffffffffffffffff
> ?gdb> print idx
> ?$23 = 0xffffffff
> ?gdb> print inode->i_blkbits
> ?$24 = 0xc
> ?gdb> up
> ?#1 ?ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
> ?xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
> ?2512 ? ? ? ? ? ? ? ? ? ?if (ext4_da_should_update_i_disksize(page, end)) {
> ?gdb> print start
> ?$25 = 0x0
> ?gdb> print end
> ?$26 = 0xffffffffffffffff
> ?gdb> print pos
> ?$27 = 0x108000
> ?gdb> print new_i_size
> ?$28 = 0x108000
> ?gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
> ?$29 = 0xd9000
> ?gdb> down
> ?2467 ? ? ? ? ? ?for (i = 0; i < idx; i++)
> ?gdb> print i
> ?$30 = 0xd44acbee
>
> This is 100% reproducible with some autonuma development code tuned in
> a very aggressive manner (not normal way even for knumad) which does
> "exotic" changes to the ptes. It wouldn't normally trigger but I don't
> see why it can't happen normally if the page is added to swap cache in
> between the two faults leading to "copied" being zero (which then
> hangs in ext4). So it should be fixed. Especially possible with lumpy
> reclaim (albeit disabled if compaction is enabled) as that would
> ignore the young bits in the ptes.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> ?fs/ext4/inode.c | ? ?2 +-
> ?1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index fffec40..63f9541 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2508,7 +2508,7 @@ static int ext4_da_write_end(struct file *file,
> ? ? ? ? */
>
> ? ? ? ?new_i_size = pos + copied;
> - ? ? ? if (new_i_size > EXT4_I(inode)->i_disksize) {
> + ? ? ? if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
> ? ? ? ? ? ? ? ?if (ext4_da_should_update_i_disksize(page, end)) {
> ? ? ? ? ? ? ? ? ? ? ? ?down_write(&EXT4_I(inode)->i_data_sem);
> ? ? ? ? ? ? ? ? ? ? ? ?if (new_i_size > EXT4_I(inode)->i_disksize) {
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html



--
Best Wishes
Yongqiang Yang

2011-12-12 16:55:47

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: ext4_discard_partial_page_buffers_no_lock() wrong parameters

On Mon, Dec 12, 2011 at 11:17:54AM +0800, Yongqiang Yang wrote:
> Hi Andrea,
>
> The code you are testing are removed in recent patches, the patches
> have not been merged.

I checked out the git ext4 tree and the two patches applies clean to
it. So I'm unsure why you say it was removed. origin/dev origin/master
are equal too.

>
> Please try following patches:
> [PATCH 1/2] ext4: let mpage_submit_io works well when blocksize < pagesize
> [PATCH 2/2] ext4: let ext4_discard_partial_buffers handle pages
> without buffers correctly
>
> and
>
> [PATCH 1/2] ext4: remove a wrong BUG_ON in ext4_ext_convert_to_initialized
> [PATCH 2/2] ext4: let ext4_bio_write_page handle EOF correctly

Is there a git tree with these patches? I'm not subscribed to the list.

Thanks,
Andrea

2011-12-12 16:58:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 1/2] ext4: avoid hangs in ext4_da_should_update_i_disksize()

On Mon, Dec 12, 2011 at 11:28:21AM +0800, Yongqiang Yang wrote:
> Hi Andrea,
>
> I can not figure out why ext4 hangs. Could you explain more on hanging?

Well I tried to explain it in the changeset comment.

If "copied" is zero, end/offset/idx becomes -1, and this loops:

for (i = 0; i < idx; i++)

definitely a bug and checking ext4 git I don't see anything fixing it.

2011-12-12 20:08:12

by Allison Henderson

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: ext4_discard_partial_page_buffers_no_lock() wrong parameters

On 12/12/2011 09:55 AM, Andrea Arcangeli wrote:
> On Mon, Dec 12, 2011 at 11:17:54AM +0800, Yongqiang Yang wrote:
>> Hi Andrea,
>>
>> The code you are testing are removed in recent patches, the patches
>> have not been merged.
>
> I checked out the git ext4 tree and the two patches applies clean to
> it. So I'm unsure why you say it was removed. origin/dev origin/master
> are equal too.
>
>>
>> Please try following patches:
>> [PATCH 1/2] ext4: let mpage_submit_io works well when blocksize< pagesize
>> [PATCH 2/2] ext4: let ext4_discard_partial_buffers handle pages
>> without buffers correctly
>>
>> and
>>
>> [PATCH 1/2] ext4: remove a wrong BUG_ON in ext4_ext_convert_to_initialized
>> [PATCH 2/2] ext4: let ext4_bio_write_page handle EOF correctly
>
> Is there a git tree with these patches? I'm not subscribed to the list.
>
> Thanks,
> Andrea

Hi Andrea,

I think what Yongqiang means is that the code you are modifying is being
removed in the patches listed above and replaced with a different
solution. The patches have not yet been merged so you will not see them
in a git tree yet, so you will need to apply them yourself. If you are
not subscribed to the list, you can still find the patches here:

http://www.spinics.net/lists/linux-ext4/msg29109.html
http://www.spinics.net/lists/linux-ext4/msg29110.html

http://www.spinics.net/lists/linux-ext4/msg29375.html
http://www.spinics.net/lists/linux-ext4/msg29376.html

We want to make sure the solution works for everyone, so please apply
these patches and let us know if it corrects the issues you are seeing.
Thx!

Allison Henderson

> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2011-12-12 20:27:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/2] ext4: avoid hangs in ext4_da_should_update_i_disksize()

On Mon 12-12-11 03:27:07, Andrea Arcangeli wrote:
> If the pte mapping in generic_perform_write() is unmapped between
> iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the
> "copied" parameter to ->end_write can be zero. ext4 couldn't cope with
> it with delayed allocations enabled.
I'd expand this sentence with: because ext4_da_should_update_i_disksize()
is passed -1 as an offset and loops (almost) infinitely.
I know it's also in the gdb dump below but it takes a while to notice it
if you don't know what you are looking for.

> This skips the i_disksize
> enlargement logic if copied is zero and no new data was appeneded to
> the inode.
>
> gdb> bt
> #0 0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x1\
> 08000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
> #1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
> xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
> #2 0xffffffff810d97f1 in generic_perform_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value o\
> ptimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2440
> #3 generic_file_buffered_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, p\
> os=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2482
> #4 0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0\
> xffff88001e26be40) at mm/filemap.c:2600
> #5 0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=<value optimi\
> zed out>, pos=<value optimized out>) at mm/filemap.c:2632
> #6 0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) a\
> t fs/ext4/file.c:136
> #7 0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=<value optimized out>, len=<value optimized out>, \
> ppos=0xffff88001e26bf48) at fs/read_write.c:406
> #8 0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4\
> 000, pos=0xffff88001e26bf48) at fs/read_write.c:435
> #9 0xffffffff8113816c in sys_write (fd=<value optimized out>, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x\
> 4000) at fs/read_write.c:487
> #10 <signal handler called>
> #11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
> #12 0x0000000000000000 in ?? ()
> gdb> print offset
> $22 = 0xffffffffffffffff
> gdb> print idx
> $23 = 0xffffffff
> gdb> print inode->i_blkbits
> $24 = 0xc
> gdb> up
> #1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
> xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
> 2512 if (ext4_da_should_update_i_disksize(page, end)) {
> gdb> print start
> $25 = 0x0
> gdb> print end
> $26 = 0xffffffffffffffff
> gdb> print pos
> $27 = 0x108000
> gdb> print new_i_size
> $28 = 0x108000
> gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
> $29 = 0xd9000
> gdb> down
> 2467 for (i = 0; i < idx; i++)
> gdb> print i
> $30 = 0xd44acbee
>
> This is 100% reproducible with some autonuma development code tuned in
> a very aggressive manner (not normal way even for knumad) which does
> "exotic" changes to the ptes. It wouldn't normally trigger but I don't
> see why it can't happen normally if the page is added to swap cache in
> between the two faults leading to "copied" being zero (which then
> hangs in ext4). So it should be fixed. Especially possible with lumpy
> reclaim (albeit disabled if compaction is enabled) as that would
> ignore the young bits in the ptes.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
You can add:
Reviewed-by: Jan Kara <[email protected]>

Honza
> ---
> fs/ext4/inode.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index fffec40..63f9541 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2508,7 +2508,7 @@ static int ext4_da_write_end(struct file *file,
> */
>
> new_i_size = pos + copied;
> - if (new_i_size > EXT4_I(inode)->i_disksize) {
> + if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
> if (ext4_da_should_update_i_disksize(page, end)) {
> down_write(&EXT4_I(inode)->i_data_sem);
> if (new_i_size > EXT4_I(inode)->i_disksize) {
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-12-13 00:41:16

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: ext4_discard_partial_page_buffers_no_lock() wrong parameters

Hi Allison,

On Mon, Dec 12, 2011 at 01:07:47PM -0700, Allison Henderson wrote:
> Hi Andrea,
>
> I think what Yongqiang means is that the code you are modifying is being
> removed in the patches listed above and replaced with a different
> solution. The patches have not yet been merged so you will not see them
> in a git tree yet, so you will need to apply them yourself. If you are
> not subscribed to the list, you can still find the patches here:
>
> http://www.spinics.net/lists/linux-ext4/msg29109.html
> http://www.spinics.net/lists/linux-ext4/msg29110.html
>
> http://www.spinics.net/lists/linux-ext4/msg29375.html
> http://www.spinics.net/lists/linux-ext4/msg29376.html

Thanks for the links. Applied all 4.

> We want to make sure the solution works for everyone, so please apply
> these patches and let us know if it corrects the issues you are seeing.
> Thx!

With all 4 patches applied, my 1/2 patch is still needed to avoid the
hang and writes hangs all the time (but now they don't return -EINVAL
anymore).

My patch 2/2 is not needed with the above 4 patches applied so that
can be discared if you apply the above 4 patches (but I suggest you
check my patch 2/2 too because if it happens to be correct it could
mean there's the potential of 1 byte data corruption in current
upstream kernels in some circumstances, in addition to the -EINVAL
failure in write()).

Let me know if you need me to test me anything else in relation to the
copied=0 case.

Thanks,
Andrea

2011-12-13 01:05:47

by Yongqiang Yang

[permalink] [raw]
Subject: Re: [PATCH 1/2] ext4: avoid hangs in ext4_da_should_update_i_disksize()

On Tue, Dec 13, 2011 at 12:57 AM, Andrea Arcangeli <[email protected]> wrote:
> On Mon, Dec 12, 2011 at 11:28:21AM +0800, Yongqiang Yang wrote:
>> Hi Andrea,
>>
>> I can not figure out why ext4 hangs. ?Could you explain more on hanging?
>
> Well I tried to explain it in the changeset comment.
>
> If "copied" is zero, end/offset/idx becomes -1, and this loops:
>
> ? ? ? ?for (i = 0; i < idx; i++)
>
> definitely a bug and checking ext4 git I don't see anything fixing it.
Got it. Thanks.

Looks good to me.


--
Best Wishes
Yongqiang Yang

2011-12-13 01:16:25

by Yongqiang Yang

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: ext4_discard_partial_page_buffers_no_lock() wrong parameters

On Tue, Dec 13, 2011 at 8:40 AM, Andrea Arcangeli <[email protected]> wrote:
> Hi Allison,
>
> On Mon, Dec 12, 2011 at 01:07:47PM -0700, Allison Henderson wrote:
>> Hi Andrea,
>>
>> I think what Yongqiang means is that the code you are modifying is being
>> removed in the patches listed above and replaced with a different
>> solution. ?The patches have not yet been merged so you will not see them
>> in a git tree yet, so you will need to apply them yourself. ?If you are
>> not subscribed to the list, you can still find the patches here:
>>
>> http://www.spinics.net/lists/linux-ext4/msg29109.html
>> http://www.spinics.net/lists/linux-ext4/msg29110.html
>>
>> http://www.spinics.net/lists/linux-ext4/msg29375.html
>> http://www.spinics.net/lists/linux-ext4/msg29376.html
>
> Thanks for the links. Applied all 4.
>
>> We want to make sure the solution works for everyone, so please apply
>> these patches and let us know if it corrects the issues you are seeing.
>> ? Thx!
>
> With all 4 patches applied, my 1/2 patch is still needed to avoid the
> hang and writes hangs all the time (but now they don't return -EINVAL
> anymore).
>
> My patch 2/2 is not needed with the above 4 patches applied so that
> can be discared if you apply the above 4 patches (but I suggest you
> check my patch 2/2 too because if it happens to be correct it could
> mean there's the potential of 1 byte data corruption in current
> upstream kernels in some circumstances, in addition to the -EINVAL
> failure in write()).
page_len is removed in the 2nd patch, so there is no problem related
to it any more.

Hi Andy and Hugh,

Andrea found the reason of EINVAL, and the 2nd patch we have tested
can fix it. Many thanks to Andrea.


Yongqiang.
>
> Let me know if you need me to test me anything else in relation to the
> copied=0 case.
>
> Thanks,
> Andrea



--
Best Wishes
Yongqiang Yang

2011-12-14 02:44:40

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/2] ext4: avoid hangs in ext4_da_should_update_i_disksize()

On Mon, Dec 12, 2011 at 03:27:07AM +0100, Andrea Arcangeli wrote:
> If the pte mapping in generic_perform_write() is unmapped between
> iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the
> "copied" parameter to ->end_write can be zero. ext4 couldn't cope with
> it with delayed allocations enabled. This skips the i_disksize
> enlargement logic if copied is zero and no new data was appeneded to
> the inode.
...
>
> This is 100% reproducible with some autonuma development code tuned in
> a very aggressive manner (not normal way even for knumad) which does
> "exotic" changes to the ptes. It wouldn't normally trigger but I don't
> see why it can't happen normally if the page is added to swap cache in
> between the two faults leading to "copied" being zero (which then
> hangs in ext4). So it should be fixed. Especially possible with lumpy
> reclaim (albeit disabled if compaction is enabled) as that would
> ignore the young bits in the ptes.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>

Thanks, applied.

- Ted