2011-07-25 15:58:38

by Jan Kara

Subject: Checks in ext4_ext_fiemap_cb() broken

Hello,

I just had a look at the code checking delayed allocated buffers in
ext4_ext_fiemap_cb(). I believe the checks there could use some elimination
of common patterns but that's just a minor thing. The main problem is that
the code can easily crash the kernel when it races with page reclaim. You
just cannot access most of the page contents (and this is especially true
for buffers) without locking the page. Getting a reference via
find_get_pages_tag() guarantees that the page structure cannot go away, but
mm is still free to detach the page from the mapping at any moment. So you
must always lock a page and check that it still belongs to the desired
mapping before you check 'page_has_buffers()'.
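
To illustrate the ordering described above, here is a minimal userspace
sketch in C. It is not the ext4 code: struct page, struct buffer_head,
lock_page(), unlock_page() and page_has_buffers() below are simplified,
single-threaded stand-ins for the kernel objects of the same names, and
page_has_delayed_buffer() is a hypothetical helper name. It only shows the
order of the checks: lock the page, verify the mapping, and only then look
at the buffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: simplified stand-ins for kernel structures. */
struct address_space { int unused; };

struct buffer_head {
    bool delay;                     /* models buffer_delay(bh) */
    struct buffer_head *next;       /* NULL-terminated here; the kernel list is circular */
};

struct page {
    bool locked;                    /* models the page lock bit */
    struct address_space *mapping;  /* NULL once the page is detached */
    struct buffer_head *buffers;    /* NULL if the page has no buffers */
};

static void lock_page(struct page *p)
{
    assert(!p->locked);             /* a real lock would sleep here instead */
    p->locked = true;
}

static void unlock_page(struct page *p)
{
    assert(p->locked);
    p->locked = false;
}

static bool page_has_buffers(const struct page *p)
{
    return p->buffers != NULL;
}

/*
 * Hypothetical helper: returns true only if the page still belongs to
 * 'mapping' and carries at least one delayed buffer. The buffers are
 * inspected only while the page is locked and after the mapping check.
 */
static bool page_has_delayed_buffer(struct page *page,
                                    struct address_space *mapping)
{
    bool found = false;

    lock_page(page);
    if (page->mapping != mapping)
        goto out;                   /* detached meanwhile: do not touch buffers */
    if (!page_has_buffers(page))
        goto out;
    for (struct buffer_head *bh = page->buffers; bh; bh = bh->next) {
        if (bh->delay) {
            found = true;
            break;
        }
    }
out:
    unlock_page(page);
    return found;
}
```

In this model a page whose mapping no longer matches is simply skipped,
which is the behaviour the check is meant to guarantee.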

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2011-07-26 01:20:28

by Yongqiang Yang

Subject: Re: Checks in ext4_ext_fiemap_cb() broken

Hi Jan,

I have been thinking for a while about whether we can handle fiemap in a
much simpler way. The current code is very ugly due to the page cache
lookup. I have an idea for simplifying this code. The reason we have to
look up the page cache is that delayed extents are not in the extent tree.
I think we can add an in-memory delayed extents list to the inode, and
delete entries from the list after we allocate blocks for them. There is
no limit on the length of extents in the list, so an entry can contain as
many blocks as are logically contiguous.
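
A rough userspace sketch of such a list, in C, might look like the
following. Everything here is an assumption made for illustration: the
names struct delayed_extent, de_add(), de_remove() and de_is_delayed() are
hypothetical, ranges are half-open [lo, hi) in logical blocks, and a real
in-kernel version would need locking and probably a tree rather than a
linked list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical structure: one logically contiguous delayed range. */
struct delayed_extent {
    uint64_t start;                 /* first logical block */
    uint64_t end;                   /* one past the last block; no length limit */
    struct delayed_extent *next;    /* list is sorted and non-overlapping */
};

/* Record [lo, hi) as delayed, merging with overlapping or adjacent entries. */
static int de_add(struct delayed_extent **head, uint64_t lo, uint64_t hi)
{
    assert(lo < hi);
    struct delayed_extent *n = malloc(sizeof(*n));
    if (!n)
        return -1;

    struct delayed_extent **pp = head;
    /* Skip entries that end strictly before the new range. */
    while (*pp && (*pp)->end < lo)
        pp = &(*pp)->next;
    /* Absorb every entry that overlaps or touches the new range. */
    while (*pp && (*pp)->start <= hi) {
        struct delayed_extent *e = *pp;
        if (e->start < lo)
            lo = e->start;
        if (e->end > hi)
            hi = e->end;
        *pp = e->next;
        free(e);
    }
    n->start = lo;
    n->end = hi;
    n->next = *pp;
    *pp = n;
    return 0;
}

/* Drop [lo, hi) from the list, e.g. after blocks were allocated for it. */
static int de_remove(struct delayed_extent **head, uint64_t lo, uint64_t hi)
{
    struct delayed_extent **pp = head;

    while (*pp && (*pp)->start < hi) {
        struct delayed_extent *e = *pp;
        if (e->end <= lo) {                     /* entirely before the range */
            pp = &e->next;
        } else if (e->start < lo && e->end > hi) {
            /* Range lies strictly inside e: split e in two. */
            struct delayed_extent *tail = malloc(sizeof(*tail));
            if (!tail)
                return -1;
            tail->start = hi;
            tail->end = e->end;
            tail->next = e->next;
            e->end = lo;
            e->next = tail;
            return 0;
        } else if (e->start < lo) {             /* trim the tail of e */
            e->end = lo;
            pp = &e->next;
        } else if (e->end > hi) {               /* trim the head of e */
            e->start = hi;
            return 0;
        } else {                                /* e is fully covered */
            *pp = e->next;
            free(e);
        }
    }
    return 0;
}

/* Is logical block 'blk' currently covered by a delayed extent? */
static bool de_is_delayed(const struct delayed_extent *e, uint64_t blk)
{
    for (; e && e->start <= blk; e = e->next)
        if (blk < e->end)
            return true;
    return false;
}
```

With something like this, fiemap could walk the list instead of the page
cache, and writeback would call the removal routine once blocks are
allocated.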

What's your opinion?

Yongqiang.




--
Best Wishes
Yongqiang Yang

2011-07-26 12:12:32

by Jan Kara

Subject: Re: Checks in ext4_ext_fiemap_cb() broken

Hi Yongqiang,

On Tue 26-07-11 09:20:28, Yongqiang Yang wrote:
> I have been thinking if we can handle fiemap much simpler for a while.
> Current code is very ugly due to page cache look up. I have a
> thought on simplifying these code. The reason leading us to looking
> up page cache is that delayed extents are not in extents tree. I
> think we can add an in-memory delayed extents list in inode, and we
> can delete entries in the list after we allocate blocks for them.
> There is no limit on length of extents in the list, this way can an
> entry contain as many blocks as they are contiguous logically.
>
> What's your opinion?
Yes, that should be doable and shouldn't have too much overhead. It's just
stupid that we'd do all this only for the fiemap call, which is relatively
rare.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-07-26 12:48:24

by Yongqiang Yang

Subject: Re: Checks in ext4_ext_fiemap_cb() broken

On Tue, Jul 26, 2011 at 8:12 PM, Jan Kara <[email protected]> wrote:
> Hi Yongqiang,
>
> On Tue 26-07-11 09:20:28, Yongqiang Yang wrote:
>> I have been thinking if we can handle fiemap much simpler for a while.
>> Current code is very ugly due to page cache look up. I have a
>> thought on simplifying these code. The reason leading us to looking
>> up page cache is that delayed extents are not in extents tree. I
>> think we can add an in-memory delayed extents list in inode, and we
>> can delete entries in the list after we allocate blocks for them.
>> There is no limit on length of extents in the list, this way can an
>> entry contain as many blocks as they are contiguous logically.
>>
>> What's your opinion?
> Yes, that should be doable and shouldn't have too big overhead. It's just
> stupid we'll do all this stuff only for fiemap call which is relatively
> rare.

I guess there are other places where delayed extents have to be handled
by looking up the page cache.

SEEK_HOLE and SEEK_DATA also need to look up the page cache to handle
delayed extents.

Hi Allison,

If a delayed extents list were added to the inode, could the punch hole
code be simpler?


Yongqiang.



--
Best Wishes
Yongqiang Yang

2011-07-26 16:30:54

by Allison Henderson

Subject: Re: Checks in ext4_ext_fiemap_cb() broken

On 07/26/2011 05:48 AM, Yongqiang Yang wrote:
> On Tue, Jul 26, 2011 at 8:12 PM, Jan Kara<[email protected]> wrote:
>> Hi Yongqiang,
>>
>> On Tue 26-07-11 09:20:28, Yongqiang Yang wrote:
>>> I have been thinking if we can handle fiemap much simpler for a while.
>>> Current code is very ugly due to page cache look up. I have a
>>> thought on simplifying these code. The reason leading us to looking
>>> up page cache is that delayed extents are not in extents tree. I
>>> think we can add an in-memory delayed extents list in inode, and we
>>> can delete entries in the list after we allocate blocks for them.
>>> There is no limit on length of extents in the list, this way can an
>>> entry contain as many blocks as they are contiguous logically.
>>>
>>> What's your opinion?
>> Yes, that should be doable and shouldn't have too big overhead. It's just
>> stupid we'll do all this stuff only for fiemap call which is relatively
>> rare.
>
> I guess there are other places where delayed extents should be handled
> by looking up page cache.
>
> SEEK_HOLE and SEEK_DATA also need to lookup page cache to handle
> delayed extents.
>
> Hi Allison,
>
> If a delayed extents list added in the inode, could punch hole code be simpler?
>
>
> Yongqiang.

Hi there,

Well, I think we may be able to make it more efficient if we had the
delayed extent list.

The earlier versions of punch hole were complex because of the different
mechanisms needed to identify when extents were mapped, delayed or a
hole. Later we decided that this was too complex, and the pages that
covered the hole need to be sync'd anyway, which eliminated the need to
detect the delayed extents, but it is a wasteful operation if the
extents in the hole were just unwritten. If we had the delayed extent
list, I think we may just be able to sync extents as needed instead of
syncing the entire hole.

Allison Henderson



2011-07-26 16:44:29

by Andreas Dilger

Subject: Re: Checks in ext4_ext_fiemap_cb() broken

On Tue 26-07-11 09:20:28, Yongqiang Yang wrote:
> I have been thinking if we can handle fiemap much simpler for a while.
> Current code is very ugly due to page cache look up. I have a
> thought on simplifying these code. The reason leading us to looking
> up page cache is that delayed extents are not in extents tree. I
> think we can add an in-memory delayed extents list in inode, and we
> can delete entries in the list after we allocate blocks for them.
> There is no limit on length of extents in the list, this way can an
> entry contain as many blocks as they are contiguous logically.
>
> What's your opinion?

It may also be useful to have an extent list for submitting large contiguous writeouts to disk, instead of having to look them up.

The main question is whether the added overhead of maintaining the list is worthwhile if we can get equivalent functionality using a tag lookup in the page cache.

Cheers, Andreas






2011-07-26 17:07:09

by Theodore Ts'o

Subject: Re: Checks in ext4_ext_fiemap_cb() broken

On Tue, Jul 26, 2011 at 08:48:21PM +0800, Yongqiang Yang wrote:
> I guess there are other places where delayed extents should be handled
> by looking up page cache.
>
> SEEK_HOLE and SEEK_DATA also need to lookup page cache to handle
> delayed extents.

Another place where we're testing the page cache for delalloc
extents is in the bigalloc patches. See ext4_find_delalloc_range in
Aditya's patch:

http://permalink.gmane.org/gmane.comp.file-systems.ext4/26619

- Ted


2011-07-26 18:49:07

by Aditya Kali

Subject: Re: Checks in ext4_ext_fiemap_cb() broken

On Tue, Jul 26, 2011 at 5:12 AM, Jan Kara <[email protected]> wrote:
>  Hi Yongqiang,
>
> On Tue 26-07-11 09:20:28, Yongqiang Yang wrote:
>> I have been thinking if we can handle fiemap much simpler for a while.
>>  Current code is very ugly due to page cache look up.  I have a
>> thought on simplifying these code.  The reason leading us to looking
>> up page cache is that delayed extents are not in extents tree.  I
>> think we can add an in-memory delayed extents list in inode, and we
>> can delete entries in the list after we allocate blocks for them.
>> There is no limit on length of extents in the list, this way can an
>> entry contain as many blocks as they are contiguous logically.
>>
>> What's your opinion?
>  Yes, that should be doable and shouldn't have too big overhead. It's just
> stupid we'll do all this stuff only for fiemap call which is relatively
> rare.
>
Delayed extents lookup will also help resolve another race that we
currently have in the bigalloc code path. There, we need to figure out
whether a cluster is already under delayed allocation or not (to determine
whether we need to reserve quota for this cluster). But determining
this races against the writeback of delayed allocated pages. The
ext4_find_delalloc_range() function has a comment about this race. If
there is a delayed extents list and the extents are removed from this
list when they are actually mapped, then ext4_find_delalloc_range()
can simply check against this list.
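
As a rough illustration of what "checking against the list" could mean
for a cluster, here is a small userspace sketch in C. It is an assumption
throughout, not the bigalloc patch: the delayed extents are modelled as a
sorted array of half-open [start, end) logical block ranges, cluster_bits
is taken to be log2 of the blocks per cluster, and range_has_delalloc()
and cluster_has_delalloc() are hypothetical names. Locking against
writeback, which is the actual race, is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: blocks [start, end) are under delayed allocation. */
struct delayed_extent {
    uint64_t start;
    uint64_t end;
};

/*
 * Does any delayed extent intersect [lo, hi)?
 * 'ext' must be sorted by start and non-overlapping.
 */
static bool range_has_delalloc(const struct delayed_extent *ext, size_t n,
                               uint64_t lo, uint64_t hi)
{
    size_t left = 0, right = n;

    /* Binary search for the first extent that ends after 'lo'. */
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        if (ext[mid].end <= lo)
            left = mid + 1;
        else
            right = mid;
    }
    /* That extent overlaps the range only if it starts before 'hi'. */
    return left < n && ext[left].start < hi;
}

/* Is any block of the cluster containing 'lblk' under delayed allocation? */
static bool cluster_has_delalloc(const struct delayed_extent *ext, size_t n,
                                 uint64_t lblk, unsigned cluster_bits)
{
    uint64_t lo = (lblk >> cluster_bits) << cluster_bits;  /* cluster start */
    uint64_t hi = lo + ((uint64_t)1 << cluster_bits);      /* cluster end */

    return range_has_delalloc(ext, n, lo, hi);
}
```

If the cluster already has a delayed block, no new quota reservation would
be needed for it; otherwise the caller would reserve one cluster.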

