2009-06-04 06:25:32

by Nai Xia

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <[email protected]> wrote:
> On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
>> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
>
> [snip]
>
>> >
>> > BTW. I don't know if you are checking for PG_writeback often enough?
>> > You can't remove a PG_writeback page from pagecache. The normal
>> > pattern is lock_page(page); wait_on_page_writeback(page); which I
>>
>> So pages can be in writeback without being locked? I still
>> wasn't able to find such a case (in fact unless I'm misreading
>> the code badly the writeback bit is only used by NFS and a few
>> obscure cases)
>
> Yes the writeback page is typically not locked. Only read IO requires
> to be exclusive. Read IO is in fact page *writer*, while writeback IO
> is page *reader* :-)

Sorry for maybe somewhat a little bit off topic,
I am trying to get a good understanding of PG_writeback & PG_locked ;)

So you are saying PG_writeback & PG_locked are acting like a read/write lock?
I notice wait_on_page_writeback(page) seems always called with page locked --
that is the semantics of a writer waiting to get the lock while it's
acquired by
some reader:The caller(e.g. truncate_inode_pages_range() and
invalidate_inode_pages2_range()) are the writers waiting for
writeback readers (as you clarified ) to finish their job, right ?

So do you think the idea is sane to group the two bits together
to form a real read/write lock, which does not care about the _number_
of readers ?


>
> The writeback bit is _widely_ used. ?test_set_page_writeback() is
> directly used by NFS/AFS etc. But its main user is in fact
> set_page_writeback(), which is called in 26 places.
>
>> > think would be safest
>>
>> Okay. I'll just add it after the page lock.
>>
>> > (then you never have to bother with the writeback bit again)
>>
>> Until Fengguang does something fancy with it.
>
> Yes I'm going to do it without wait_on_page_writeback().
>
> The reason truncate_inode_pages_range() has to wait on writeback page
> is to ensure data integrity. Otherwise if there comes two events:
> ? ? ? ?truncate page A at offset X
> ? ? ? ?populate page B at offset X
> If A and B are all writeback pages, then B can hit disk first and then
> be overwritten by A. Which corrupts the data at offset X from user's POV.
>
> But for hwpoison, there are no such worries. If A is poisoned, we do
> our best to isolate it as well as intercepting its IO. If the interception
> fails, it will trigger another machine check before hitting the disk.
>
> After all, poisoned A means the data at offset X is already corrupted.
> It doesn't matter if there comes another B page.
>
> Thanks,
> Fengguang
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at ?http://www.tux.org/lkml/
>


2009-06-07 16:02:43

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <[email protected]> wrote:
> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> >
> > [snip]
> >
> >> >
> >> > BTW. I don't know if you are checking for PG_writeback often enough?
> >> > You can't remove a PG_writeback page from pagecache. The normal
> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
> >>
> >> So pages can be in writeback without being locked? I still
> >> wasn't able to find such a case (in fact unless I'm misreading
> >> the code badly the writeback bit is only used by NFS and a few
> >> obscure cases)
> >
> > Yes the writeback page is typically not locked. Only read IO requires
> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
> > is page *reader* :-)
>
> Sorry for maybe somewhat a little bit off topic,
> I am trying to get a good understanding of PG_writeback & PG_locked ;)
>
> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
> I notice wait_on_page_writeback(page) seems always called with page locked --

No. Note that pages are not locked in wait_on_page_writeback_range().

> that is the semantics of a writer waiting to get the lock while it's
> acquired by
> some reader:The caller(e.g. truncate_inode_pages_range() and
> invalidate_inode_pages2_range()) are the writers waiting for
> writeback readers (as you clarified ) to finish their job, right ?

Sorry if my metaphor confused you. But they are not typical
reader/writer problems, but more about data integrities.

Pages have to be "not under writeback" when truncated.
Otherwise data lost is possible:

1) create a file with one page (page A)
2) truncate page A that is under writeback
3) write to file, which creates page B
4) sync file, which sends page B to disk quickly

Now if page B reaches disk before A, the new data will be overwritten
by truncated old data, which corrupts the file.

> So do you think the idea is sane to group the two bits together
> to form a real read/write lock, which does not care about the _number_
> of readers ?

We don't care number of readers here. So please forget about it.

Thanks,
Fengguang

> > The writeback bit is _widely_ used.  test_set_page_writeback() is
> > directly used by NFS/AFS etc. But its main user is in fact
> > set_page_writeback(), which is called in 26 places.
> >
> >> > think would be safest
> >>
> >> Okay. I'll just add it after the page lock.
> >>
> >> > (then you never have to bother with the writeback bit again)
> >>
> >> Until Fengguang does something fancy with it.
> >
> > Yes I'm going to do it without wait_on_page_writeback().
> >
> > The reason truncate_inode_pages_range() has to wait on writeback page
> > is to ensure data integrity. Otherwise if there comes two events:
> >        truncate page A at offset X
> >        populate page B at offset X
> > If A and B are all writeback pages, then B can hit disk first and then
> > be overwritten by A. Which corrupts the data at offset X from user's POV.
> >
> > But for hwpoison, there are no such worries. If A is poisoned, we do
> > our best to isolate it as well as intercepting its IO. If the interception
> > fails, it will trigger another machine check before hitting the disk.
> >
> > After all, poisoned A means the data at offset X is already corrupted.
> > It doesn't matter if there comes another B page.
> >
> > Thanks,
> > Fengguang
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >

2009-06-08 11:06:20

by Nai Xia

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang<[email protected]> wrote:
> On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
>> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <[email protected]> wrote:
>> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
>> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
>> >
>> > [snip]
>> >
>> >> >
>> >> > BTW. I don't know if you are checking for PG_writeback often enough?
>> >> > You can't remove a PG_writeback page from pagecache. The normal
>> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
>> >>
>> >> So pages can be in writeback without being locked? I still
>> >> wasn't able to find such a case (in fact unless I'm misreading
>> >> the code badly the writeback bit is only used by NFS and a few
>> >> obscure cases)
>> >
>> > Yes the writeback page is typically not locked. Only read IO requires
>> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
>> > is page *reader* :-)
>>
>> Sorry for maybe somewhat a little bit off topic,
>> I am trying to get a good understanding of PG_writeback & PG_locked ;)
>>
>> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
>> I notice wait_on_page_writeback(page) seems always called with page locked --
>
> No. Note that pages are not locked in wait_on_page_writeback_range().

I see. This function seems mostly called on the sync path,
it just waits for data being synchronized to disk.
No writers from the pages' POV, so no lock.
I missed this case, but my argument about the role of read/write lock.
seems still consistent. :)

>
>> that is the semantics of a writer waiting to get the lock while it's
>> acquired by
>> some reader:The caller(e.g. truncate_inode_pages_range() ?and
>> invalidate_inode_pages2_range()) are the writers waiting for
>> writeback readers (as you clarified ) to finish their job, right ?
>
> Sorry if my metaphor confused you. But they are not typical
> reader/writer problems, but more about data integrities.

No, you didn't :)
Actually, you make me clear about the mixed roles for
those bits.

>
> Pages have to be "not under writeback" when truncated.
> Otherwise data lost is possible:
>
> 1) create a file with one page (page A)
> 2) truncate page A that is under writeback
> 3) write to file, which creates page B
> 4) sync file, which sends page B to disk quickly
>
> Now if page B reaches disk before A, the new data will be overwritten
> by truncated old data, which corrupts the file.

I fully understand this scenario which you had already clarified in a
previous message. :)

1. someone make index1-> page A
2. Path P1 is acting as a *reader* to a cache page at index1 by
setting PG_writeback on, while at the same time as a *writer* to
the corresponding file blocks.
3. Another path P2 comes in and truncate page A, he is the writer
to the same cache page.
4. Yet another path P3 comes as the writer to the cache page
making it points to page B: index1--> page B.
5. Path P4 comes writing back the cache page(and set PG_writeback).
He is the reader of the cache page and the writer to the file blocks.

The corrupts occur because P1 & P4 races when writing file blocks.
But the _root_ of this racing is because nothing is used to serialize
them on the side of writing the file blocks and above stream reading was
inconsistent because of the writers(P2 & P3) to cache page at index1.

Note that the "sync file" is somewhat irrelevant, even without "sync file",
the racing still may exists. I know you must want to show me that this could
make the corruption more easy to occur.

So I think the simple logic is:
1) if you want to truncate/change the mapping from a index to a struct *page,
test writeback bit because the writebacker to the file blocks is the reader
of this mapping.
2) if a writebacker want to start a read of this mapping with
test_set_page_writeback()
or set_page_writeback(), he'd be sure this page is locked to keep out the
writers to this mapping of index-->struct *page.

This is really behavior of a read/write lock, right ?

wait_on_page_writeback_range() looks different only because "sync"
operates on "struct page", it's not sensitive to index-->struct *page mapping.
It does care about if pages returned by pagevec_lookup_tag() are
still maintains the mapping when wait_on_page_writeback(page).
Here, PG_writeback is only a status flag for "struct page" not a lock bit for
index->struct *page mapping.

>
>> So do you think the idea is sane to group the two bits together
>> to form a real read/write lock, which does not care about the _number_
>> of readers ?
>
> We don't care number of readers here. So please forget about it.
Yeah, I meant number of readers is not important.

I still hold that these two bits in some way act like a _sparse_
read/write lock.
But I am going to drop the idea of making them a pure lock, since PG_writeback
does has other meaning -- the page is being writing back: for sync
path, it's only
a status flag.
Making a pure read/write lock definitely will lose that or at least distort it.


Hoping I've made my words understandable, correct me if wrong, and
many thanks for your time and patience. :-)


Nai Xia

>
> Thanks,
> Fengguang
>
>> > The writeback bit is _widely_ used. ?test_set_page_writeback() is
>> > directly used by NFS/AFS etc. But its main user is in fact
>> > set_page_writeback(), which is called in 26 places.
>> >
>> >> > think would be safest
>> >>
>> >> Okay. I'll just add it after the page lock.
>> >>
>> >> > (then you never have to bother with the writeback bit again)
>> >>
>> >> Until Fengguang does something fancy with it.
>> >
>> > Yes I'm going to do it without wait_on_page_writeback().
>> >
>> > The reason truncate_inode_pages_range() has to wait on writeback page
>> > is to ensure data integrity. Otherwise if there comes two events:
>> > ? ? ? ?truncate page A at offset X
>> > ? ? ? ?populate page B at offset X
>> > If A and B are all writeback pages, then B can hit disk first and then
>> > be overwritten by A. Which corrupts the data at offset X from user's POV.
>> >
>> > But for hwpoison, there are no such worries. If A is poisoned, we do
>> > our best to isolate it as well as intercepting its IO. If the interception
>> > fails, it will trigger another machine check before hitting the disk.
>> >
>> > After all, poisoned A means the data at offset X is already corrupted.
>> > It doesn't matter if there comes another B page.
>> >
>> > Thanks,
>> > Fengguang
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> > the body of a message to [email protected]
>> > More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>> > Please read the FAQ at ?http://www.tux.org/lkml/
>> >
>

2009-06-08 12:47:25

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Mon, Jun 08, 2009 at 07:06:12PM +0800, Nai Xia wrote:
> On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang<[email protected]> wrote:
> > On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
> >> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <[email protected]> wrote:
> >> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
> >> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> >> >
> >> > [snip]
> >> >
> >> >> >
> >> >> > BTW. I don't know if you are checking for PG_writeback often enough?
> >> >> > You can't remove a PG_writeback page from pagecache. The normal
> >> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
> >> >>
> >> >> So pages can be in writeback without being locked? I still
> >> >> wasn't able to find such a case (in fact unless I'm misreading
> >> >> the code badly the writeback bit is only used by NFS and a few
> >> >> obscure cases)
> >> >
> >> > Yes the writeback page is typically not locked. Only read IO requires
> >> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
> >> > is page *reader* :-)
> >>
> >> Sorry for maybe somewhat a little bit off topic,
> >> I am trying to get a good understanding of PG_writeback & PG_locked ;)
> >>
> >> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
> >> I notice wait_on_page_writeback(page) seems always called with page locked --
> >
> > No. Note that pages are not locked in wait_on_page_writeback_range().
>
> I see. This function seems mostly called on the sync path,
> it just waits for data being synchronized to disk.
> No writers from the pages' POV, so no lock.
> I missed this case, but my argument about the role of read/write lock.
> seems still consistent. :)

It's more constrained. Normal read/write locks allow concurrent readers,
however fsync() must wait for previous IO to finish before starting
its own IO.

> >
> >> that is the semantics of a writer waiting to get the lock while it's
> >> acquired by
> >> some reader:The caller(e.g. truncate_inode_pages_range()  and
> >> invalidate_inode_pages2_range()) are the writers waiting for
> >> writeback readers (as you clarified ) to finish their job, right ?
> >
> > Sorry if my metaphor confused you. But they are not typical
> > reader/writer problems, but more about data integrities.
>
> No, you didn't :)
> Actually, you make me clear about the mixed roles for
> those bits.
>
> >
> > Pages have to be "not under writeback" when truncated.
> > Otherwise data lost is possible:
> >
> > 1) create a file with one page (page A)
> > 2) truncate page A that is under writeback
> > 3) write to file, which creates page B
> > 4) sync file, which sends page B to disk quickly
> >
> > Now if page B reaches disk before A, the new data will be overwritten
> > by truncated old data, which corrupts the file.
>
> I fully understand this scenario which you had already clarified in a
> previous message. :)
>
> 1. someone make index1-> page A
> 2. Path P1 is acting as a *reader* to a cache page at index1 by
> setting PG_writeback on, while at the same time as a *writer* to
> the corresponding file blocks.
> 3. Another path P2 comes in and truncate page A, he is the writer
> to the same cache page.
> 4. Yet another path P3 comes as the writer to the cache page
> making it points to page B: index1--> page B.
> 5. Path P4 comes writing back the cache page(and set PG_writeback).
> He is the reader of the cache page and the writer to the file blocks.
>
> The corrupts occur because P1 & P4 races when writing file blocks.
> But the _root_ of this racing is because nothing is used to serialize
> them on the side of writing the file blocks and above stream reading was
> inconsistent because of the writers(P2 & P3) to cache page at index1.
>
> Note that the "sync file" is somewhat irrelevant, even without "sync file",
> the racing still may exists. I know you must want to show me that this could
> make the corruption more easy to occur.
>
> So I think the simple logic is:
> 1) if you want to truncate/change the mapping from a index to a struct *page,
> test writeback bit because the writebacker to the file blocks is the reader
> of this mapping.
> 2) if a writebacker want to start a read of this mapping with
> test_set_page_writeback()
> or set_page_writeback(), he'd be sure this page is locked to keep out the
> writers to this mapping of index-->struct *page.
>
> This is really behavior of a read/write lock, right ?

Please, that's a dangerous idea. A page can be written to at any time
when writeback to disk is under way. Does PG_writeback (your reader
lock) prevents page data writers? NO.

Thanks,
Fengguang

> wait_on_page_writeback_range() looks different only because "sync"
> operates on "struct page", it's not sensitive to index-->struct *page mapping.
> It does care about if pages returned by pagevec_lookup_tag() are
> still maintains the mapping when wait_on_page_writeback(page).
> Here, PG_writeback is only a status flag for "struct page" not a lock bit for
> index->struct *page mapping.
>
> >
> >> So do you think the idea is sane to group the two bits together
> >> to form a real read/write lock, which does not care about the _number_
> >> of readers ?
> >
> > We don't care number of readers here. So please forget about it.
> Yeah, I meant number of readers is not important.
>
> I still hold that these two bits in some way act like a _sparse_
> read/write lock.
> But I am going to drop the idea of making them a pure lock, since PG_writeback
> does has other meaning -- the page is being writing back: for sync
> path, it's only
> a status flag.
> Making a pure read/write lock definitely will lose that or at least distort it.
>
>
> Hoping I've made my words understandable, correct me if wrong, and
> many thanks for your time and patience. :-)
>
>
> Nai Xia
>
> >
> > Thanks,
> > Fengguang
> >
> >> > The writeback bit is _widely_ used.  test_set_page_writeback() is
> >> > directly used by NFS/AFS etc. But its main user is in fact
> >> > set_page_writeback(), which is called in 26 places.
> >> >
> >> >> > think would be safest
> >> >>
> >> >> Okay. I'll just add it after the page lock.
> >> >>
> >> >> > (then you never have to bother with the writeback bit again)
> >> >>
> >> >> Until Fengguang does something fancy with it.
> >> >
> >> > Yes I'm going to do it without wait_on_page_writeback().
> >> >
> >> > The reason truncate_inode_pages_range() has to wait on writeback page
> >> > is to ensure data integrity. Otherwise if there comes two events:
> >> >        truncate page A at offset X
> >> >        populate page B at offset X
> >> > If A and B are all writeback pages, then B can hit disk first and then
> >> > be overwritten by A. Which corrupts the data at offset X from user's POV.
> >> >
> >> > But for hwpoison, there are no such worries. If A is poisoned, we do
> >> > our best to isolate it as well as intercepting its IO. If the interception
> >> > fails, it will trigger another machine check before hitting the disk.
> >> >
> >> > After all, poisoned A means the data at offset X is already corrupted.
> >> > It doesn't matter if there comes another B page.
> >> >
> >> > Thanks,
> >> > Fengguang
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> > the body of a message to [email protected]
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> > Please read the FAQ at  http://www.tux.org/lkml/
> >> >
> >

2009-06-08 14:47:04

by Nai Xia

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Mon, Jun 8, 2009 at 8:31 PM, Wu Fengguang<[email protected]> wrote:
> On Mon, Jun 08, 2009 at 07:06:12PM +0800, Nai Xia wrote:
>> On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang<[email protected]> wrote:
>> > On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
>> >> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <[email protected]> wrote:
>> >> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
>> >> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
>> >> >
>> >> > [snip]
>> >> >
>> >> >> >
>> >> >> > BTW. I don't know if you are checking for PG_writeback often enough?
>> >> >> > You can't remove a PG_writeback page from pagecache. The normal
>> >> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
>> >> >>
>> >> >> So pages can be in writeback without being locked? I still
>> >> >> wasn't able to find such a case (in fact unless I'm misreading
>> >> >> the code badly the writeback bit is only used by NFS and a few
>> >> >> obscure cases)
>> >> >
>> >> > Yes the writeback page is typically not locked. Only read IO requires
>> >> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
>> >> > is page *reader* :-)
>> >>
>> >> Sorry for maybe somewhat a little bit off topic,
>> >> I am trying to get a good understanding of PG_writeback & PG_locked ;)
>> >>
>> >> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
>> >> I notice wait_on_page_writeback(page) seems always called with page locked --
>> >
>> > No. Note that pages are not locked in wait_on_page_writeback_range().
>>
>> I see. This function seems mostly called ?on the sync path,
>> it just waits for data being synchronized to disk.
>> No writers from the pages' POV, so no lock.
>> I missed this case, but my argument about the role of read/write lock.
>> seems still consistent. :)
>
> It's more constrained. Normal read/write locks allow concurrent readers,
> however fsync() must wait for previous IO to finish before starting
> its own IO.

Oh, yes, this is what I called "mixed roles". One for lock, one for
status flag, twisted together in the same path, making the read lock
semantics totally broken.

>
>> >
>> >> that is the semantics of a writer waiting to get the lock while it's
>> >> acquired by
>> >> some reader:The caller(e.g. truncate_inode_pages_range() ?and
>> >> invalidate_inode_pages2_range()) are the writers waiting for
>> >> writeback readers (as you clarified ) to finish their job, right ?
>> >
>> > Sorry if my metaphor confused you. But they are not typical
>> > reader/writer problems, but more about data integrities.
>>
>> No, you didn't :)
>> Actually, you make me clear about the mixed roles for
>> those bits.
>>
>> >
>> > Pages have to be "not under writeback" when truncated.
>> > Otherwise data lost is possible:
>> >
>> > 1) create a file with one page (page A)
>> > 2) truncate page A that is under writeback
>> > 3) write to file, which creates page B
>> > 4) sync file, which sends page B to disk quickly
>> >
>> > Now if page B reaches disk before A, the new data will be overwritten
>> > by truncated old data, which corrupts the file.
>>
>> I fully understand this scenario which you had already clarified in a
>> previous message. :)
>>
>> 1. someone make index1-> page A
>> 2. Path P1 is acting as a *reader* to a cache page at index1 by
>> ? ? setting PG_writeback on, while at the same time as a *writer* to
>> ? ? the corresponding file blocks.
>> 3. Another path P2 comes in and ?truncate page A, he is the writer
>> ? ? to the same cache page.
>> 4. Yet another path P3 comes ?as the writer to the cache page
>> ? ? ?making it points to page B: index1--> page B.
>> 5. Path P4 comes writing back the cache page(and set PG_writeback).
>> ? ?He is the reader of the cache page and the writer to the file blocks.
>>
>> The corrupts occur because P1 & P4 races when writing file blocks.
>> But the _root_ of this racing is because nothing is used to serialize
>> them on the side of writing the file blocks and above stream reading was
>> inconsistent because of the writers(P2 & P3) to cache page at index1.
>>
>> Note that the "sync file" is somewhat irrelevant, even without "sync file",
>> the racing still may exists. I know you must want to show me that this could
>> make the corruption more easy to occur.
>>
>> So I think the simple logic is:
>> 1) if you want to truncate/change the mapping from a index to a struct *page,
>> test writeback bit because the writebacker to the file blocks is the reader
>> of this mapping.
>> 2) if a writebacker want to start a read of this mapping with
>> test_set_page_writeback()
>> or set_page_writeback(), he'd be sure this page is locked to keep out the
>> writers to this mapping of index-->struct *page.
>>
>> This is really behavior of a read/write lock, right ?
>
> Please, that's a dangerous idea. A page can be written to at any time
> when writeback to disk is under way. Does PG_writeback (your reader
> lock) prevents page data writers? ?NO.

I meant PG_writeback stops writers to index---->struct page mapping.

I think I should make my statements more concise and the "reader/writer"
less vague.

Here we care about the write/read operation for index---->struct page mapping.
Not for read/write operation for the page content.

Anyone who wants to change this mapping is a writer, he should take
page lock.
Anyone who wants to reference this mapping is a reader, writers should
wait for him. And when this reader wants to get ref, he should wait for
anyone one who is changing this mapping(e.g. page truncater).

When a path sets PG_writeback on a page, it need this index-->struct page
mapping be 100% valid right? (otherwise may leads to corruption.)
So writeback routines are readers of this index-->struct page mapping.
(oh, well if we can put the other role of PG_writeback aside)

Ok,Ok, since PG_locked does mean much more than just protecting
the per-page mapping which makes the lock abstraction even less clear.
so indeed, forget about it.

>
> Thanks,
> Fengguang
>
>> wait_on_page_writeback_range() looks different only because "sync"
>> operates on "struct page", it's not sensitive to index-->struct *page mapping.
>> It does care about if pages returned by pagevec_lookup_tag() are
>> still maintains the mapping when wait_on_page_writeback(page).
>> Here, PG_writeback is only a status flag for "struct page" not a lock bit for
>> index->struct *page mapping.
>>
>> >
>> >> So do you think the idea is sane to group the two bits together
>> >> to form a real read/write lock, which does not care about the _number_
>> >> of readers ?
>> >
>> > We don't care number of readers here. So please forget about it.
>> Yeah, I meant number of readers is not important.
>>
>> I still hold that these two bits in some way act like a _sparse_
>> read/write lock.
>> But I am going to drop the idea of making them a pure lock, since PG_writeback
>> does has other meaning -- the page is being writing back: for sync
>> path, it's only
>> a status flag.
>> Making a pure read/write lock definitely will lose that or at least distort it.
>>
>>
>> Hoping I've made my words understandable, correct me if wrong, and
>> many thanks for your time and patience. :-)
>>
>>
>> Nai Xia
>>
>> >
>> > Thanks,
>> > Fengguang
>> >
>> >> > The writeback bit is _widely_ used. ?test_set_page_writeback() is
>> >> > directly used by NFS/AFS etc. But its main user is in fact
>> >> > set_page_writeback(), which is called in 26 places.
>> >> >
>> >> >> > think would be safest
>> >> >>
>> >> >> Okay. I'll just add it after the page lock.
>> >> >>
>> >> >> > (then you never have to bother with the writeback bit again)
>> >> >>
>> >> >> Until Fengguang does something fancy with it.
>> >> >
>> >> > Yes I'm going to do it without wait_on_page_writeback().
>> >> >
>> >> > The reason truncate_inode_pages_range() has to wait on writeback page
>> >> > is to ensure data integrity. Otherwise if there comes two events:
>> >> > ? ? ? ?truncate page A at offset X
>> >> > ? ? ? ?populate page B at offset X
>> >> > If A and B are all writeback pages, then B can hit disk first and then
>> >> > be overwritten by A. Which corrupts the data at offset X from user's POV.
>> >> >
>> >> > But for hwpoison, there are no such worries. If A is poisoned, we do
>> >> > our best to isolate it as well as intercepting its IO. If the interception
>> >> > fails, it will trigger another machine check before hitting the disk.
>> >> >
>> >> > After all, poisoned A means the data at offset X is already corrupted.
>> >> > It doesn't matter if there comes another B page.
>> >> >
>> >> > Thanks,
>> >> > Fengguang
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> >> > the body of a message to [email protected]
>> >> > More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>> >> > Please read the FAQ at ?http://www.tux.org/lkml/
>> >> >
>> >
>

2009-06-09 06:49:07

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> On Mon, Jun 8, 2009 at 8:31 PM, Wu Fengguang<[email protected]> wrote:
> > On Mon, Jun 08, 2009 at 07:06:12PM +0800, Nai Xia wrote:
> >> On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang<[email protected]> wrote:
> >> > On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
> >> >> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <[email protected]> wrote:
> >> >> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
> >> >> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> >> >> >
> >> >> > [snip]
> >> >> >
> >> >> >> >
> >> >> >> > BTW. I don't know if you are checking for PG_writeback often enough?
> >> >> >> > You can't remove a PG_writeback page from pagecache. The normal
> >> >> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
> >> >> >>
> >> >> >> So pages can be in writeback without being locked? I still
> >> >> >> wasn't able to find such a case (in fact unless I'm misreading
> >> >> >> the code badly the writeback bit is only used by NFS and a few
> >> >> >> obscure cases)
> >> >> >
> >> >> > Yes the writeback page is typically not locked. Only read IO requires
> >> >> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
> >> >> > is page *reader* :-)
> >> >>
> >> >> Sorry for maybe somewhat a little bit off topic,
> >> >> I am trying to get a good understanding of PG_writeback & PG_locked ;)
> >> >>
> >> >> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
> >> >> I notice wait_on_page_writeback(page) seems always called with page locked --
> >> >
> >> > No. Note that pages are not locked in wait_on_page_writeback_range().
> >>
> >> I see. This function seems mostly called  on the sync path,
> >> it just waits for data being synchronized to disk.
> >> No writers from the pages' POV, so no lock.
> >> I missed this case, but my argument about the role of read/write lock.
> >> seems still consistent. :)
> >
> > It's more constrained. Normal read/write locks allow concurrent readers,
> > however fsync() must wait for previous IO to finish before starting
> > its own IO.
>
> Oh, yes, this is what I called "mixed roles". One for lock, one for
> status flag, twisted together in the same path, making the read lock
> semantics totally broken.
>
> >
> >> >
> >> >> that is the semantics of a writer waiting to get the lock while it's
> >> >> acquired by
> >> >> some reader:The caller(e.g. truncate_inode_pages_range()  and
> >> >> invalidate_inode_pages2_range()) are the writers waiting for
> >> >> writeback readers (as you clarified ) to finish their job, right ?
> >> >
> >> > Sorry if my metaphor confused you. But they are not typical
> >> > reader/writer problems, but more about data integrities.
> >>
> >> No, you didn't :)
> >> Actually, you make me clear about the mixed roles for
> >> those bits.
> >>
> >> >
> >> > Pages have to be "not under writeback" when truncated.
> >> > Otherwise data lost is possible:
> >> >
> >> > 1) create a file with one page (page A)
> >> > 2) truncate page A that is under writeback
> >> > 3) write to file, which creates page B
> >> > 4) sync file, which sends page B to disk quickly
> >> >
> >> > Now if page B reaches disk before A, the new data will be overwritten
> >> > by truncated old data, which corrupts the file.
> >>
> >> I fully understand this scenario which you had already clarified in a
> >> previous message. :)
> >>
> >> 1. someone make index1-> page A
> >> 2. Path P1 is acting as a *reader* to a cache page at index1 by
> >>     setting PG_writeback on, while at the same time as a *writer* to
> >>     the corresponding file blocks.
> >> 3. Another path P2 comes in and  truncate page A, he is the writer
> >>     to the same cache page.
> >> 4. Yet another path P3 comes  as the writer to the cache page
> >>      making it points to page B: index1--> page B.
> >> 5. Path P4 comes writing back the cache page(and set PG_writeback).
> >>    He is the reader of the cache page and the writer to the file blocks.
> >>
> >> The corrupts occur because P1 & P4 races when writing file blocks.
> >> But the _root_ of this racing is because nothing is used to serialize
> >> them on the side of writing the file blocks and above stream reading was
> >> inconsistent because of the writers(P2 & P3) to cache page at index1.
> >>
> >> Note that the "sync file" is somewhat irrelevant, even without "sync file",
> >> the racing still may exists. I know you must want to show me that this could
> >> make the corruption more easy to occur.
> >>
> >> So I think the simple logic is:
> >> 1) if you want to truncate/change the mapping from a index to a struct *page,
> >> test writeback bit because the writebacker to the file blocks is the reader
> >> of this mapping.
> >> 2) if a writebacker want to start a read of this mapping with
> >> test_set_page_writeback()
> >> or set_page_writeback(), he'd be sure this page is locked to keep out the
> >> writers to this mapping of index-->struct *page.
> >>
> >> This is really behavior of a read/write lock, right ?
> >
> > Please, that's a dangerous idea. A page can be written to at any time
> > when writeback to disk is under way. Does PG_writeback (your reader
> > lock) prevents page data writers?  NO.
>
> I meant PG_writeback stops writers to index---->struct page mapping.

It's protected by the radix tree RCU locks. Period.

If you are referring to the reverse mapping: page->mapping is procted
by PG_lock. No one should make assumption that it won't change under
page writeback.

Thanks,
Fengguang

> I think I should make my statements more concise and the "reader/writer"
> less vague.
>
> Here we care about the write/read operation for index---->struct page mapping.
> Not for read/write operation for the page content.
>
> Anyone who wants to change this mapping is a writer, he should take
> page lock.
> Anyone who wants to reference this mapping is a reader, writers should
> wait for him. And when this reader wants to get ref, he should wait for
> anyone one who is changing this mapping(e.g. page truncater).
>
> When a path sets PG_writeback on a page, it need this index-->struct page
> mapping be 100% valid right? (otherwise may leads to corruption.)
> So writeback routines are readers of this index-->struct page mapping.
> (oh, well if we can put the other role of PG_writeback aside)
>
> Ok,Ok, since PG_locked does mean much more than just protecting
> the per-page mapping which makes the lock abstraction even less clear.
> so indeed, forget about it.
>
> >
> > Thanks,
> > Fengguang
> >
> >> wait_on_page_writeback_range() looks different only because "sync"
> >> operates on "struct page", it's not sensitive to index-->struct *page mapping.
> >> It does care about if pages returned by pagevec_lookup_tag() are
> >> still maintains the mapping when wait_on_page_writeback(page).
> >> Here, PG_writeback is only a status flag for "struct page" not a lock bit for
> >> index->struct *page mapping.
> >>
> >> >
> >> >> So do you think the idea is sane to group the two bits together
> >> >> to form a real read/write lock, which does not care about the _number_
> >> >> of readers ?
> >> >
> >> > We don't care number of readers here. So please forget about it.
> >> Yeah, I meant number of readers is not important.
> >>
> >> I still hold that these two bits in some way act like a _sparse_
> >> read/write lock.
> >> But I am going to drop the idea of making them a pure lock, since PG_writeback
> >> does has other meaning -- the page is being writing back: for sync
> >> path, it's only
> >> a status flag.
> >> Making a pure read/write lock definitely will lose that or at least distort it.
> >>
> >>
> >> Hoping I've made my words understandable, correct me if wrong, and
> >> many thanks for your time and patience. :-)
> >>
> >>
> >> Nai Xia
> >>
> >> >
> >> > Thanks,
> >> > Fengguang
> >> >
> >> >> > The writeback bit is _widely_ used.  test_set_page_writeback() is
> >> >> > directly used by NFS/AFS etc. But its main user is in fact
> >> >> > set_page_writeback(), which is called in 26 places.
> >> >> >
> >> >> >> > think would be safest
> >> >> >>
> >> >> >> Okay. I'll just add it after the page lock.
> >> >> >>
> >> >> >> > (then you never have to bother with the writeback bit again)
> >> >> >>
> >> >> >> Until Fengguang does something fancy with it.
> >> >> >
> >> >> > Yes I'm going to do it without wait_on_page_writeback().
> >> >> >
> >> >> > The reason truncate_inode_pages_range() has to wait on writeback page
> >> >> > is to ensure data integrity. Otherwise if there comes two events:
> >> >> >        truncate page A at offset X
> >> >> >        populate page B at offset X
> >> >> > If A and B are all writeback pages, then B can hit disk first and then
> >> >> > be overwritten by A. Which corrupts the data at offset X from user's POV.
> >> >> >
> >> >> > But for hwpoison, there are no such worries. If A is poisoned, we do
> >> >> > our best to isolate it as well as intercepting its IO. If the interception
> >> >> > fails, it will trigger another machine check before hitting the disk.
> >> >> >
> >> >> > After all, poisoned A means the data at offset X is already corrupted.
> >> >> > It doesn't matter if there comes another B page.
> >> >> >
> >> >> > Thanks,
> >> >> > Fengguang
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> >> > the body of a message to [email protected]
> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> > Please read the FAQ at  http://www.tux.org/lkml/
> >> >> >
> >> >
> >

2009-06-09 10:49:21

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 09, 2009 at 02:48:55PM +0800, Wu Fengguang wrote:
> On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> > I meant PG_writeback stops writers to index---->struct page mapping.
>
> It's protected by the radix tree RCU locks. Period.
>
> If you are referring to the reverse mapping: page->mapping is procted
> by PG_lock. No one should make assumption that it won't change under
> page writeback.

Well... I think probably PG_writeback should be enough. Phrased another
way: I think it is a very bad idea to truncate PG_writeback pages out of
pagecache. Does anything actually do that?

2009-06-09 12:15:27

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 09, 2009 at 06:48:25PM +0800, Nick Piggin wrote:
> On Tue, Jun 09, 2009 at 02:48:55PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> > > I meant PG_writeback stops writers to index---->struct page mapping.
> >
> > It's protected by the radix tree RCU locks. Period.
> >
> > If you are referring to the reverse mapping: page->mapping is procted
> > by PG_lock. No one should make assumption that it won't change under
> > page writeback.
>
> Well... I think probably PG_writeback should be enough. Phrased another
> way: I think it is a very bad idea to truncate PG_writeback pages out of
> pagecache. Does anything actually do that?

There shall be no one. OK I will follow that convention..

But as I stated it is only safe do rely on the fact "no one truncates
PG_writeback pages" in end_writeback_io handlers. And I suspect if
there does exist such a handler, it could be trivially converted to
take the page lock.

Thanks,
Fengguang

2009-06-09 12:17:28

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 09, 2009 at 08:15:10PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 06:48:25PM +0800, Nick Piggin wrote:
> > On Tue, Jun 09, 2009 at 02:48:55PM +0800, Wu Fengguang wrote:
> > > On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> > > > I meant PG_writeback stops writers to index---->struct page mapping.
> > >
> > > It's protected by the radix tree RCU locks. Period.
> > >
> > > If you are referring to the reverse mapping: page->mapping is procted
> > > by PG_lock. No one should make assumption that it won't change under
> > > page writeback.
> >
> > Well... I think probably PG_writeback should be enough. Phrased another
> > way: I think it is a very bad idea to truncate PG_writeback pages out of
> > pagecache. Does anything actually do that?
>
> There shall be no one. OK I will follow that convention..
>
> But as I stated it is only safe do rely on the fact "no one truncates
> PG_writeback pages" in end_writeback_io handlers. And I suspect if
> there does exist such a handler, it could be trivially converted to
> take the page lock.

Well, the writeback submitter first sets writeback, then unlocks
the page. I don't think he wants a truncate coming in at that point.

2009-06-09 12:48:38

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 09, 2009 at 08:17:22PM +0800, Nick Piggin wrote:
> On Tue, Jun 09, 2009 at 08:15:10PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 06:48:25PM +0800, Nick Piggin wrote:
> > > On Tue, Jun 09, 2009 at 02:48:55PM +0800, Wu Fengguang wrote:
> > > > On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> > > > > I meant PG_writeback stops writers to index---->struct page mapping.
> > > >
> > > > It's protected by the radix tree RCU locks. Period.
> > > >
> > > > If you are referring to the reverse mapping: page->mapping is procted
> > > > by PG_lock. No one should make assumption that it won't change under
> > > > page writeback.
> > >
> > > Well... I think probably PG_writeback should be enough. Phrased another
> > > way: I think it is a very bad idea to truncate PG_writeback pages out of
> > > pagecache. Does anything actually do that?
> >
> > There shall be no one. OK I will follow that convention..
> >
> > But as I stated it is only safe do rely on the fact "no one truncates
> > PG_writeback pages" in end_writeback_io handlers. And I suspect if
> > there does exist such a handler, it could be trivially converted to
> > take the page lock.
>
> Well, the writeback submitter first sets writeback, then unlocks
> the page. I don't think he wants a truncate coming in at that point.

OK. I think we've mostly agreed on the consequences of PG_writeback vs
truncation. I'll follow the least surprise principle and stop here, hehe.

Thanks,
Fengguang

2009-06-09 13:36:30

by Nai Xia

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 9, 2009 at 8:47 PM, Wu Fengguang<[email protected]> wrote:
> On Tue, Jun 09, 2009 at 08:17:22PM +0800, Nick Piggin wrote:
>> On Tue, Jun 09, 2009 at 08:15:10PM +0800, Wu Fengguang wrote:
>> > On Tue, Jun 09, 2009 at 06:48:25PM +0800, Nick Piggin wrote:
>> > > On Tue, Jun 09, 2009 at 02:48:55PM +0800, Wu Fengguang wrote:
>> > > > On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
>> > > > > I meant PG_writeback stops writers to index---->struct page mapping.
>> > > >
>> > > > It's protected by the radix tree RCU locks. Period.
>> > > >
>> > > > If you are referring to the reverse mapping: page->mapping is procted
>> > > > by PG_lock. No one should make assumption that it won't change under
>> > > > page writeback.
>> > >
>> > > Well... I think probably PG_writeback should be enough. Phrased another
>> > > way: I think it is a very bad idea to truncate PG_writeback pages out of
>> > > pagecache. Does anything actually do that?
>> >
>> > There shall be no one. OK I will follow that convention..
>> >
>> > But as I stated it is only safe do rely on the fact "no one truncates
>> > PG_writeback pages" in end_writeback_io handlers. And I suspect if
>> > there does exist such a handler, it could be trivially converted to
>> > take the page lock.
>>
>> Well, the writeback submitter first sets writeback, then unlocks
>> the page. I don't think he wants a truncate coming in at that point.
>
> OK. I think we've mostly agreed on the consequences of PG_writeback vs
> truncation. I'll follow the least surprise principle and stop here, hehe.

And thank you both for your time & patience, :-)

Best Regards,
Nai Xia

>
> Thanks,
> Fengguang
>