2011-08-26 02:53:27

by Yongqiang Yang

Subject: question about punch hole

Hi Allison,

Currently, punch hole flushes all pages to disk, releases the pages in
the page cache, and then calls ext4_ext_map_blocks.

Suppose a new page in the punched range is mapped after the pages are
released but before down_write(i_data_sem). ext4_ext_map_blocks will
then remove that page's mapping from the extent tree. However, the
upper layers do not know this, and they still think the page is mapped.

I cannot find how punch hole handles the situation above. Could you
shed some light on it?
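
The window described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not ext4 code: ToyInode, release_pages, remove_extents and stale_pages are hypothetical names, the extent tree is a plain set of block numbers, and the page cache is a dict. It shows a write landing between the page release and the extent removal, which leaves a page marked mapped with no backing extent.

```python
class ToyInode:
    """Toy model (hypothetical): extent tree and page cache keyed by block number."""

    def __init__(self, mapped_blocks):
        self.extent_tree = set(mapped_blocks)  # blocks that have an on-disk mapping
        self.page_cache = {}                   # block -> True if the cached page is marked mapped

    def write(self, block):
        # A write allocates the block and caches a page that is marked mapped.
        self.extent_tree.add(block)
        self.page_cache[block] = True

    def release_pages(self, start, end):
        # Step 1 of punch hole: flush and drop cached pages in [start, end).
        for block in range(start, end):
            self.page_cache.pop(block, None)

    def remove_extents(self, start, end):
        # Step 2 of punch hole: in the real code this runs under i_data_sem.
        self.extent_tree -= set(range(start, end))

    def stale_pages(self):
        # Pages that claim to be mapped although the extent tree has no mapping.
        return sorted(b for b, mapped in self.page_cache.items()
                      if mapped and b not in self.extent_tree)


def punch_hole(inode, start, end, racing_write=None):
    inode.release_pages(start, end)
    if racing_write is not None:
        inode.write(racing_write)  # lands in the window before i_data_sem is taken
    inode.remove_extents(start, end)
    return inode.stale_pages()


print(punch_hole(ToyInode(range(8)), 2, 6))                  # prints []
print(punch_hole(ToyInode(range(8)), 2, 6, racing_write=3))  # prints [3]
```

In the second call, block 3 ends up cached as mapped while the extent tree no longer contains it, which is the inconsistency in question.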


--
Best Wishes
Yongqiang Yang


2011-08-26 22:35:34

by Allison Henderson

Subject: Re: question about punch hole

Hi Yongqiang

This is a really good question, and at the moment I'm still looking into
it. :) The calling sequence in punch hole was modeled after truncate,
which also only locks i_data_sem when modifying the extent tree.
ext4_ext_map_blocks, when called with the punch hole flag, only releases
blocks in the extent tree, using the same routines truncate does; it
does not modify the state of the pages. That still does not prevent
the race condition you describe, so I am still investigating it.
I've found that I can catch a lot of race conditions by simply running
the stress test overnight, and so far I haven't had anything like this
come up, but that certainly doesn't mean it's not there. I will let you
know what I find. Thx!

Allison Henderson

2011-08-27 09:04:51

by Yongqiang Yang

Subject: Re: question about punch hole

Hi Allison,

I had a look at the truncate code. Truncates and writes are serialized
by inode->i_mutex in the VFS layer, but fallocate does not take i_mutex,
so I think we need to take i_mutex when punching a hole as well.
Ordinary fallocate behaves differently from hole punching, so it is safe
without taking i_mutex.
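
As a rough illustration of the serialization suggested above, here is a toy sketch in which writes and hole punching both take one lock standing in for inode->i_mutex. It is a model under assumptions, not kernel code: LockedToyInode, stress and stale_pages are hypothetical names, and threading.Lock only mimics the mutual exclusion. Because each operation updates the page cache and the extent tree inside the same critical section, no page can be left marked mapped without an extent.

```python
import threading


class LockedToyInode:
    """Toy model (hypothetical): one lock serializes writes and hole punching."""

    def __init__(self, nblocks):
        self.i_mutex = threading.Lock()         # stand-in for inode->i_mutex
        self.extent_tree = set(range(nblocks))  # blocks that have a mapping
        self.page_cache = {}                    # block -> True if page is marked mapped

    def write(self, block):
        with self.i_mutex:
            self.extent_tree.add(block)
            self.page_cache[block] = True

    def punch_hole(self, start, end):
        # Page release and extent removal happen inside one critical section,
        # so a write cannot slip in between the two steps.
        with self.i_mutex:
            for block in range(start, end):
                self.page_cache.pop(block, None)
            self.extent_tree -= set(range(start, end))

    def stale_pages(self):
        with self.i_mutex:
            return sorted(b for b, mapped in self.page_cache.items()
                          if mapped and b not in self.extent_tree)


def stress(rounds=200):
    inode = LockedToyInode(8)

    def writer():
        for _ in range(rounds):
            inode.write(3)

    def puncher():
        for _ in range(rounds):
            inode.punch_hole(2, 6)

    threads = [threading.Thread(target=writer), threading.Thread(target=puncher)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return inode.stale_pages()


print(stress())  # prints []
```

Whichever thread finishes last, the page cache and the extent tree agree, since every update to both is made atomically with respect to the other operation.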


What's your opinion?

Yongqiang.
>
> Allison Henderson
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



--
Best Wishes
Yongqiang Yang

2011-08-27 09:33:52

by Yongqiang Yang

Subject: Re: question about punch hole

It seems that a race exists between reads and hole punching as well. If
a read comes in after the pages are released and before
down_write(i_data_sem), a page will be mapped; if that page is written
later, it will introduce an error. Truncate avoids this situation by
setting the file size before truncating pages.
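
The difference between the two orderings can be sketched with another toy model. The names here (ToyFile, truncate_with_racing_read, punch_with_racing_read) are hypothetical, and the model assumes that a read at or beyond i_size never instantiates a page, which is a simplification of the real read path. Truncate shrinks i_size first, so the racing read is rejected; punch hole leaves i_size unchanged, so the same read caches a page that is marked mapped.

```python
class ToyFile:
    """Toy model (hypothetical): i_size, extent tree and page cache in blocks."""

    def __init__(self, size_blocks):
        self.i_size = size_blocks
        self.extent_tree = set(range(size_blocks))
        self.page_cache = {}  # block -> True if the cached page is marked mapped

    def read(self, block):
        # Assumption of this model: reads at or past i_size cache nothing.
        if block >= self.i_size:
            return False
        self.page_cache[block] = block in self.extent_tree
        return True

    def drop_pages(self, blocks):
        for block in blocks:
            self.page_cache.pop(block, None)

    def stale_pages(self):
        return sorted(b for b, mapped in self.page_cache.items()
                      if mapped and b not in self.extent_tree)


def truncate_with_racing_read(f, new_size, racing_block):
    old_size = f.i_size
    f.i_size = new_size                              # 1. shrink the size first
    f.drop_pages(range(new_size, old_size))          # 2. truncate the pages
    f.read(racing_block)                             # racing read is rejected by the size check
    f.extent_tree -= set(range(new_size, old_size))  # 3. free the blocks
    return f.stale_pages()


def punch_with_racing_read(f, start, end, racing_block):
    f.drop_pages(range(start, end))                  # 1. release the pages
    f.read(racing_block)                             # racing read maps a page again
    f.extent_tree -= set(range(start, end))          # 2. remove the extents
    return f.stale_pages()


print(truncate_with_racing_read(ToyFile(8), 4, 5))  # prints []
print(punch_with_racing_read(ToyFile(8), 2, 6, 3))  # prints [3]
```

Since a punched hole sits inside the file and the size stays the same, the size check gives no protection there, so some other form of exclusion is needed in that path.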

Yongqiang.

--
Best Wishes
Yongqiang Yang

2011-08-28 01:09:54

by Allison Henderson

Subject: Re: question about punch hole

Hi Yongqiang,

Alrighty, I found the code for truncate that you are referring to, and
what you are saying makes a lot of sense, so I will add a fix for it in
the punch hole patch set I am working on at the moment. Thx for finding
this one for me :)

Allison Henderson
