This patch series applies to today's linux-next/akpm, commit
651e600ff3ecc4ac06747313a1c49d89a9ce988e.
Ed L. Cashin (1):
aoe: adjust ref of head for compound page tails
drivers/block/aoe/aoecmd.c | 17 +++++++----------
1 files changed, 7 insertions(+), 10 deletions(-)
As discussed previously, the fact that some users of the block
layer provide bios that point to pages with a zero _count means
that it is not OK for the network layer to do a put_page on the
skb frags during an skb_linearize, so the aoe driver gets a
reference to pages in bios and puts the reference before ending
the bio. And because it cannot use get_page on a page with a
zero _count, it manipulates the value directly.
It is not OK to increment the _count of a compound page tail,
though, since the VM layer will VM_BUG_ON a non-zero _count.
Block users that do direct I/O can result in the aoe driver
seeing compound page tails in bios. In that case, the same logic
works as long as the head of the compound page is used instead of
the tails. This patch handles compound pages and does not BUG.
It relies on the block layer user leaving the relationship
between the page tail and its head alone for the duration between
the submission of the bio and its completion, whether successful
or not.
Signed-off-by: Ed Cashin <[email protected]>
---
drivers/block/aoe/aoecmd.c | 17 +++++++----------
1 files changed, 7 insertions(+), 10 deletions(-)
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 99cb944..4d45dba 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -906,16 +906,10 @@ bio_pageinc(struct bio *bio)
int i;
bio_for_each_segment(bv, bio, i) {
- page = bv->bv_page;
/* Non-zero page count for non-head members of
- * compound pages is no longer allowed by the kernel,
- * but this has never been seen here.
+ * compound pages is no longer allowed by the kernel.
*/
- if (unlikely(PageCompound(page)))
- if (compound_trans_head(page) != page) {
- pr_crit("page tail used for block I/O\n");
- BUG();
- }
+ page = compound_trans_head(bv->bv_page);
atomic_inc(&page->_count);
}
}
@@ -924,10 +918,13 @@ static void
bio_pagedec(struct bio *bio)
{
struct bio_vec *bv;
+ struct page *page;
int i;
- bio_for_each_segment(bv, bio, i)
- atomic_dec(&bv->bv_page->_count);
+ bio_for_each_segment(bv, bio, i) {
+ page = compound_trans_head(bv->bv_page);
+ atomic_dec(&page->_count);
+ }
}
static void
--
1.7.1
On Thu, 1 Aug 2013 21:29:59 -0400 Ed Cashin <[email protected]> wrote:
> As discussed previously,
I think I missed that.
> the fact that some users of the block
> layer provide bios that point to pages with a zero _count means
> that it is not OK for the network layer to do a put_page on the
> skb frags during an skb_linearize, so the aoe driver gets a
> reference to pages in bios and puts the reference before ending
> the bio. And because it cannot use get_page on a page with a
> zero _count, it manipulates the value directly.
Eh? What code is putting count==0 pages into bios? That sounds very
weird and broken.
On Thu, 1 Aug 2013 21:29:59 -0400 Ed Cashin <[email protected]> wrote:
> As discussed previously, the fact that some users of the block
> layer provide bios that point to pages with a zero _count means
> that it is not OK for the network layer to do a put_page on the
> skb frags during an skb_linearize, so the aoe driver gets a
> reference to pages in bios and puts the reference before ending
> the bio. And because it cannot use get_page on a page with a
> zero _count, it manipulates the value directly.
>
> It is not OK to increment the _count of a compound page tail,
> though, since the VM layer will VM_BUG_ON a non-zero _count.
> Block users that do direct I/O can result in the aoe driver
> seeing compound page tails in bios. In that case, the same logic
> works as long as the head of the compound page is used instead of
> the tails. This patch handles compound pages and does not BUG.
>
> It relies on the block layer user leaving the relationship
> between the page tail and its head alone for the duration between
> the submission of the bio and its completion, whether successful
> or not.
What are the end-user-visible effects of the problem which is being
fixed? Please always include this info when fixing bugs so that others
can work out what kernel version(s) need the patch.
On Aug 7, 2013, at 4:58 PM, Andrew Morton wrote:
> On Thu, 1 Aug 2013 21:29:59 -0400 Ed Cashin <[email protected]> wrote:
>
>> As discussed previously,
>
> I think I missed that.
>
>> the fact that some users of the block
>> layer provide bios that point to pages with a zero _count means
>> that it is not OK for the network layer to do a put_page on the
>> skb frags during an skb_linearize, so the aoe driver gets a
>> reference to pages in bios and puts the reference before ending
>> the bio. And because it cannot use get_page on a page with a
>> zero _count, it manipulates the value directly.
>
> Eh? What code is putting count==0 pages into bios? That sounds very
> weird and broken.
I thought so in 2007 but couldn't solicit a clear "this is wrong" consensus from the discussion.
http://article.gmane.org/gmane.linux.kernel/499197
https://lkml.org/lkml/2007/1/19/56
https://lkml.org/lkml/2006/12/18/230
We were seeing zero-count pages in bios from XFS, but Christoph Hellwig pointed out that kmalloced pages can also come from ext3 when it's doing log recovery, and they'll have zero page counts.
I don't have a counter in the driver to track how often zero-count pages show up in bios, so I don't know how common it is these days. I would have been motivated to look if I had gotten the impression that there was any interest in identifying and eliminating cases where zero-count pages were being used in bios.
--
Ed Cashin
[email protected]
On Wed, 7 Aug 2013 17:12:36 -0400 Ed Cashin <[email protected]> wrote:
>
> On Aug 7, 2013, at 4:58 PM, Andrew Morton wrote:
>
> > On Thu, 1 Aug 2013 21:29:59 -0400 Ed Cashin <[email protected]> wrote:
> >
> >> As discussed previously,
> >
> > I think I missed that.
> >
> >> the fact that some users of the block
> >> layer provide bios that point to pages with a zero _count means
> >> that it is not OK for the network layer to do a put_page on the
> >> skb frags during an skb_linearize, so the aoe driver gets a
> >> reference to pages in bios and puts the reference before ending
> >> the bio. And because it cannot use get_page on a page with a
> >> zero _count, it manipulates the value directly.
> >
> > Eh? What code is putting count==0 pages into bios? That sounds very
> > weird and broken.
>
> I thought so in 2007 but couldn't solicit a clear "this is wrong" consensus from the discussion.
>
> http://article.gmane.org/gmane.linux.kernel/499197
> https://lkml.org/lkml/2007/1/19/56
> https://lkml.org/lkml/2006/12/18/230
>
> We were seeing zero-count pages in bios from XFS, but Christoph Hellwig pointed out that kmalloced pages can also come from ext3 when it's doing log recovery, and they'll have zero page counts.
aiiee!
It is (I suppose) reasonable to put kmalloced memory into a BIO's page
array. And it is perfectly reasonable for a user of that bio to do a
get_page/put_page against that page. It is utterly unreasonable for
the damn page to get freed as a result!
I'd claim that slab is broken. The page is in use, so it should have an
elevated refcount, full stop.
On Aug 7, 2013, at 5:00 PM, Andrew Morton wrote:
> On Thu, 1 Aug 2013 21:29:59 -0400 Ed Cashin <[email protected]> wrote:
>
>> As discussed previously, the fact that some users of the block
>> layer provide bios that point to pages with a zero _count means
>> that it is not OK for the network layer to do a put_page on the
>> skb frags during an skb_linearize, so the aoe driver gets a
>> reference to pages in bios and puts the reference before ending
>> the bio. And because it cannot use get_page on a page with a
>> zero _count, it manipulates the value directly.
>>
>> It is not OK to increment the _count of a compound page tail,
>> though, since the VM layer will VM_BUG_ON a non-zero _count.
>> Block users that do direct I/O can result in the aoe driver
>> seeing compound page tails in bios. In that case, the same logic
>> works as long as the head of the compound page is used instead of
>> the tails. This patch handles compound pages and does not BUG.
>>
>> It relies on the block layer user leaving the relationship
>> between the page tail and its head alone for the duration between
>> the submission of the bio and its completion, whether successful
>> or not.
>
> What are the end-user-visible effects of the problem which is being
> fixed? Please always include this info when fixing bugs so that others
> can work out what kernel version(s) need the patch.
Sorry. I tried to cover that but maybe should have separated it more clearly from the running text.
The patch changes the current behavior encountered with direct I/O and aoe:
aoe can BUG when direct I/O used
... to a new behavior:
aoe does not BUG when direct I/O is used
And the user will never know that instead aoe is manipulating the _count of the head page, even though that's what's happening with this patch applied.
(The user-visible effects that motivated the older patch that added the BUG to aoe when page tails are used for block I/O was a VM_BUG_ON that occurred when the aoe driver manipulated the _count of a compound page tail.)
--
Ed Cashin
[email protected]
On Wed, 7 Aug 2013 17:21:32 -0400 Ed Cashin <[email protected]> wrote:
> > What are the end-user-visible effects of the problem which is being
> > fixed? Please always include this info when fixing bugs so that others
> > can work out what kernel version(s) need the patch.
>
>
> Sorry. I tried to cover that but maybe should have separated it more clearly from the running text.
>
> The patch changes the current behavior encountered with direct I/O and aoe:
>
> aoe can BUG when direct I/O used
>
> ... to a new behavior:
>
> aoe does not BUG when direct I/O is used
>
> And the user will never know that instead aoe is manipulating the _count of the head page, even though that's what's happening with this patch applied.
>
> (The user-visible effects that motivated the older patch that added the BUG to aoe when page tails are used for block I/O was a VM_BUG_ON that occurred when the aoe driver manipulated the _count of a compound page tail.)
OK, I added "Fix a BUG which can trigger when direct-IO is used with
AOE". Which kernel versions are affected?
On Wed, 7 Aug 2013 14:18:35 -0700 Andrew Morton <[email protected]> wrote:
> On Wed, 7 Aug 2013 17:12:36 -0400 Ed Cashin <[email protected]> wrote:
>
> >
> > On Aug 7, 2013, at 4:58 PM, Andrew Morton wrote:
> >
> > > On Thu, 1 Aug 2013 21:29:59 -0400 Ed Cashin <[email protected]> wrote:
> > >
> > >> As discussed previously,
> > >
> > > I think I missed that.
> > >
> > >> the fact that some users of the block
> > >> layer provide bios that point to pages with a zero _count means
> > >> that it is not OK for the network layer to do a put_page on the
> > >> skb frags during an skb_linearize, so the aoe driver gets a
> > >> reference to pages in bios and puts the reference before ending
> > >> the bio. And because it cannot use get_page on a page with a
> > >> zero _count, it manipulates the value directly.
> > >
> > > Eh? What code is putting count==0 pages into bios? That sounds very
> > > weird and broken.
> >
> > I thought so in 2007 but couldn't solicit a clear "this is wrong" consensus from the discussion.
> >
> > http://article.gmane.org/gmane.linux.kernel/499197
> > https://lkml.org/lkml/2007/1/19/56
> > https://lkml.org/lkml/2006/12/18/230
> >
> > We were seeing zero-count pages in bios from XFS, but Christoph Hellwig pointed out that kmalloced pages can also come from ext3 when it's doing log recovery, and they'll have zero page counts.
>
> aiiee!
>
> It is (I suppose) reasonable to put kmalloced memory into a BIO's page
> array. And it is perfectly reasonable for a user of that bio to do a
> get_page/put_page against that page. It is utterly unreasonable for
> the damn page to get freed as a result!
>
> I'd claim that slab is broken. The page is in use, so it should have an
> elevated refcount, full stop.
>
err, no. slab.c uses alloc_pages(), so the underlying page indeed has
a proper refcount. I'm still not understanding how this situation comes
about.
On Aug 7, 2013, at 5:26 PM, Andrew Morton wrote:
> On Wed, 7 Aug 2013 17:21:32 -0400 Ed Cashin <[email protected]> wrote:
...
>> Sorry. I tried to cover that but maybe should have separated it more clearly from the running text.
>>
>> The patch changes the current behavior encountered with direct I/O and aoe:
>>
>> aoe can BUG when direct I/O used
>>
>> ... to a new behavior:
>>
>> aoe does not BUG when direct I/O is used
>>
>> And the user will never know that instead aoe is manipulating the _count of the head page, even though that's what's happening with this patch applied.
>>
>> (The user-visible effects that motivated the older patch that added the BUG to aoe when page tails are used for block I/O was a VM_BUG_ON that occurred when the aoe driver manipulated the _count of a compound page tail.)
>
> OK, I added "Fix a BUG which can trigger when direct-IO is used with
> AOE". Which kernel versions are affected?
That BUG was added to the aoe driver in a commit that first appears in v3.7. It persists until this patch adds the code that handles the situation that the BUG was intended to detect.
[ecashin@marino linux]$ git describe --contains --match 'v*' 69cf2d85de773d998798e47e3335b85e5645d157
v3.7-rc1~110^2~22
[ecashin@marino linux]$
--
Ed Cashin
[email protected]
On Wed, 7 Aug 2013 19:27:40 -0400 Ed Cashin <[email protected]> wrote:
> On Aug 7, 2013, at 5:26 PM, Andrew Morton wrote:
>
> > On Wed, 7 Aug 2013 17:21:32 -0400 Ed Cashin <[email protected]> wrote:
> ...
> >> Sorry. I tried to cover that but maybe should have separated it more clearly from the running text.
> >>
> >> The patch changes the current behavior encountered with direct I/O and aoe:
> >>
> >> aoe can BUG when direct I/O used
> >>
> >> ... to a new behavior:
> >>
> >> aoe does not BUG when direct I/O is used
> >>
> >> And the user will never know that instead aoe is manipulating the _count of the head page, even though that's what's happening with this patch applied.
> >>
> >> (The user-visible effects that motivated the older patch that added the BUG to aoe when page tails are used for block I/O was a VM_BUG_ON that occurred when the aoe driver manipulated the _count of a compound page tail.)
> >
> > OK, I added "Fix a BUG which can trigger when direct-IO is used with
> > AOE". Which kernel versions are affected?
>
>
> That BUG was added to the aoe driver in a commit that first appears in v3.7. It persists until this patch adds the code that handles the situation that the BUG was intended to detect.
>
> [ecashin@marino linux]$ git describe --contains --match 'v*' 69cf2d85de773d998798e47e3335b85e5645d157
> v3.7-rc1~110^2~22
> [ecashin@marino linux]$
>
ah, is it this?
+bio_pageinc(struct bio *bio)
+{
+ struct bio_vec *bv;
+ struct page *page;
+ int i;
+
+ bio_for_each_segment(bv, bio, i) {
+ page = bv->bv_page;
+ /* Non-zero page count for non-head members of
+ * compound pages is no longer allowed by the kernel,
+ * but this has never been seen here.
+ */
+ if (unlikely(PageCompound(page)))
+ if (compound_trans_head(page) != page) {
+ pr_crit("page tail used for block I/O\n");
+ BUG();
+ }
But get_page/put_page against a tail page of a compound page should
Just Work. The core MM will hunt down the head page and manipulate its
refcount.
Perhaps the problem here is that AOE is diving under the covers and is
using low-level page refcount alterations:
+ atomic_inc(&page->_count);
Why does AOE do this? It would be better if it were to use the
official published MM interfaces. If those interfaces need
alteration/augmentation then we can do that.
On Aug 7, 2013, at 5:27 PM, Andrew Morton wrote:
> On Wed, 7 Aug 2013 14:18:35 -0700 Andrew Morton <[email protected]> wrote:
>
>> On Wed, 7 Aug 2013 17:12:36 -0400 Ed Cashin <[email protected]> wrote:
>>
>>>
>>> On Aug 7, 2013, at 4:58 PM, Andrew Morton wrote:
>>>
>>>> On Thu, 1 Aug 2013 21:29:59 -0400 Ed Cashin <[email protected]> wrote:
>>>>
>>>>> As discussed previously,
>>>>
>>>> I think I missed that.
>>>>
>>>>> the fact that some users of the block
>>>>> layer provide bios that point to pages with a zero _count means
>>>>> that it is not OK for the network layer to do a put_page on the
>>>>> skb frags during an skb_linearize, so the aoe driver gets a
>>>>> reference to pages in bios and puts the reference before ending
>>>>> the bio. And because it cannot use get_page on a page with a
>>>>> zero _count, it manipulates the value directly.
>>>>
>>>> Eh? What code is putting count==0 pages into bios? That sounds very
>>>> weird and broken.
>>>
>>> I thought so in 2007 but couldn't solicit a clear "this is wrong" consensus from the discussion.
>>>
>>> http://article.gmane.org/gmane.linux.kernel/499197
>>> https://lkml.org/lkml/2007/1/19/56
>>> https://lkml.org/lkml/2006/12/18/230
>>>
>>> We were seeing zero-count pages in bios from XFS, but Christoph Hellwig pointed out that kmalloced pages can also come from ext3 when it's doing log recovery, and they'll have zero page counts.
>>
>> aiiee!
>>
>> It is (I suppose) reasonable to put kmalloced memory into a BIO's page
>> array. And it is perfectly reasonable for a user of that bio to do a
>> get_page/put_page against that page. It is utterly unreasonable for
>> the damn page to get freed as a result!
>>
>> I'd claim that slab is broken. The page is in use, so it should have an
>> elevated refcount, full stop.
>>
>
> err, no. slab.c uses alloc_pages(), so the underlying page indeed has
> a proper refcount. I'm still not understanding how this situation comes
> about.
It sounds like it's wrong to give block pages with a zero count, so why not just have aoe BUG_ON(compound_trans_head(bv->page->_count) == 0) until we're sure nobody does that anymore?
If that idea makes sense to you, I will submit a new patch to follow the one under discussion.
--
Ed Cashin
[email protected]
On Aug 7, 2013, at 7:41 PM, Ed Cashin wrote:
> It sounds like it's wrong to give block pages with a zero count, so why not just have aoe BUG_ON(compound_trans_head(bv->page->_count) == 0) until we're sure nobody does that anymore?
Ugh. I goofed the parens and such. I meant,
BUG_ON(compound_trans_head(bv->bv_page)->_count == 0)
... will catch the cases we think should not be occurring.
> If that idea makes sense to you, I will submit a new patch to follow the one under discussion.
--
Ed Cashin
[email protected]
On Wed, 7 Aug 2013 19:41:48 -0400 Ed Cashin <[email protected]> wrote:
> On Aug 7, 2013, at 5:27 PM, Andrew Morton wrote:
>
> >> elevated refcount, full stop.
> >>
> >
> > err, no. slab.c uses alloc_pages(), so the underlying page indeed has
> > a proper refcount. I'm still not understanding how this situation comes
> > about.
>
> It sounds like it's wrong to give block pages with a zero count,
Depends on your definition of "page". It should be OK to put a
_count==0 tail page into a BIO, because the MM knows that it's a tail
page and that its refcount actually lives in the head page.
> so why not just have aoe BUG_ON(compound_trans_head(bv->page->_count) == 0) until we're sure nobody does that anymore?
AOE shouldn't be touching ->_count at all. That's why it has the
leading underscore. If AOE can stick with the usual interfaces such as
page_count(), everything should work?
On Aug 7, 2013, at 7:35 PM, Andrew Morton wrote:
> On Wed, 7 Aug 2013 19:27:40 -0400 Ed Cashin <[email protected]> wrote:
>
>> On Aug 7, 2013, at 5:26 PM, Andrew Morton wrote:
>>
>>> On Wed, 7 Aug 2013 17:21:32 -0400 Ed Cashin <[email protected]> wrote:
>> ...
>>>> Sorry. I tried to cover that but maybe should have separated it more clearly from the running text.
>>>>
>>>> The patch changes the current behavior encountered with direct I/O and aoe:
>>>>
>>>> aoe can BUG when direct I/O used
>>>>
>>>> ... to a new behavior:
>>>>
>>>> aoe does not BUG when direct I/O is used
>>>>
>>>> And the user will never know that instead aoe is manipulating the _count of the head page, even though that's what's happening with this patch applied.
>>>>
>>>> (The user-visible effects that motivated the older patch that added the BUG to aoe when page tails are used for block I/O was a VM_BUG_ON that occurred when the aoe driver manipulated the _count of a compound page tail.)
>>>
>>> OK, I added "Fix a BUG which can trigger when direct-IO is used with
>>> AOE". Which kernel versions are affected?
>>
>>
>> That BUG was added to the aoe driver in a commit that first appears in v3.7. It persists until this patch adds the code that handles the situation that the BUG was intended to detect.
>>
>> [ecashin@marino linux]$ git describe --contains --match 'v*' 69cf2d85de773d998798e47e3335b85e5645d157
>> v3.7-rc1~110^2~22
>> [ecashin@marino linux]$
>>
>
> ah, is it this?
>
> +bio_pageinc(struct bio *bio)
> +{
> + struct bio_vec *bv;
> + struct page *page;
> + int i;
> +
> + bio_for_each_segment(bv, bio, i) {
> + page = bv->bv_page;
> + /* Non-zero page count for non-head members of
> + * compound pages is no longer allowed by the kernel,
> + * but this has never been seen here.
> + */
> + if (unlikely(PageCompound(page)))
> + if (compound_trans_head(page) != page) {
> + pr_crit("page tail used for block I/O\n");
> + BUG();
> + }
>
> But get_page/put_page against a tail page of a compound page should
> Just Work. The core MM will hunt down the head page and manipulate its
> refcount.
>
> Perhaps the problem here is that AOE is diving under the covers and is
> using low-level page refcount alterations:
>
> + atomic_inc(&page->_count);
>
> Why does AOE do this? It would be better if it were to use the
> official published MM interfaces. If those interfaces need
> alteration/augmentation then we can do that.
The get_page cannot be used when the _count is 0 (which is exactly when we need to increment it), because it will VM_BUG_ON.
/*
* Getting a normal page or the head of a compound page
* requires to already have an elevated page->_count.
*/
VM_BUG_ON(atomic_read(&page->_count) <= 0);
(And seeing the code reminds me that I forgot to include atomic_read in my last email's proposal. LKML + dinner isn't a good mix. ;)
--
Ed Cashin
[email protected]
On Wed, 7 Aug 2013 19:54:45 -0400 Ed Cashin <[email protected]> wrote:
> > +bio_pageinc(struct bio *bio)
> > +{
> > + struct bio_vec *bv;
> > + struct page *page;
> > + int i;
> > +
> > + bio_for_each_segment(bv, bio, i) {
> > + page = bv->bv_page;
> > + /* Non-zero page count for non-head members of
> > + * compound pages is no longer allowed by the kernel,
> > + * but this has never been seen here.
> > + */
> > + if (unlikely(PageCompound(page)))
> > + if (compound_trans_head(page) != page) {
> > + pr_crit("page tail used for block I/O\n");
> > + BUG();
> > + }
> >
> > But get_page/put_page against a tail page of a compound page should
> > Just Work. The core MM will hunt down the head page and manipulate its
> > refcount.
> >
> > Perhaps the problem here is that AOE is diving under the covers and is
> > using low-level page refcount alterations:
> >
> > + atomic_inc(&page->_count);
> >
> > Why does AOE do this? It would be better if it were to use the
> > official published MM interfaces. If those interfaces need
> > alteration/augmentation then we can do that.
>
> The get_page cannot be used when the _count is 0 (which is exactly when we need to increment it), because it will VM_BUG_ON.
>
> /*
> * Getting a normal page or the head of a compound page
> * requires to already have an elevated page->_count.
> */
> VM_BUG_ON(atomic_read(&page->_count) <= 0);
>
But we shouldn't get that far:
static inline void get_page(struct page *page)
{
if (unlikely(PageTail(page)))
if (likely(__get_page_tail(page)))
return;
/*
* Getting a normal page or the head of a compound page
* requires to already have an elevated page->_count.
*/
VM_BUG_ON(atomic_read(&page->_count) <= 0);
atomic_inc(&page->_count);
}
This is a tail page, so we should be using __get_page_tail().
On Aug 7, 2013, at 8:05 PM, Andrew Morton wrote:
> On Wed, 7 Aug 2013 19:54:45 -0400 Ed Cashin <[email protected]> wrote:
>
>>> +bio_pageinc(struct bio *bio)
>>> +{
>>> + struct bio_vec *bv;
>>> + struct page *page;
>>> + int i;
>>> +
>>> + bio_for_each_segment(bv, bio, i) {
>>> + page = bv->bv_page;
>>> + /* Non-zero page count for non-head members of
>>> + * compound pages is no longer allowed by the kernel,
>>> + * but this has never been seen here.
>>> + */
>>> + if (unlikely(PageCompound(page)))
>>> + if (compound_trans_head(page) != page) {
>>> + pr_crit("page tail used for block I/O\n");
>>> + BUG();
>>> + }
>>>
>>> But get_page/put_page against a tail page of a compound page should
>>> Just Work. The core MM will hunt down the head page and manipulate its
>>> refcount.
>>>
>>> Perhaps the problem here is that AOE is diving under the covers and is
>>> using low-level page refcount alterations:
>>>
>>> + atomic_inc(&page->_count);
>>>
>>> Why does AOE do this? It would be better if it were to use the
>>> official published MM interfaces. If those interfaces need
>>> alteration/augmentation then we can do that.
>>
>> The get_page cannot be used when the _count is 0 (which is exactly when we need to increment it), because it will VM_BUG_ON.
>>
>> /*
>> * Getting a normal page or the head of a compound page
>> * requires to already have an elevated page->_count.
>> */
>> VM_BUG_ON(atomic_read(&page->_count) <= 0);
>>
>
> But we shouldn't get that far:
>
> static inline void get_page(struct page *page)
> {
> if (unlikely(PageTail(page)))
> if (likely(__get_page_tail(page)))
> return;
> /*
> * Getting a normal page or the head of a compound page
> * requires to already have an elevated page->_count.
> */
> VM_BUG_ON(atomic_read(&page->_count) <= 0);
> atomic_inc(&page->_count);
> }
>
> This is a tail page, so we should be using __get_page_tail().
When the workaround was created, it was with the assumption that the zero-count pages are not always tail pages, and that seemed to be the case in 2007, but as I said, I don't have a mechanism for detecting that now, so I cannot say whether it really happens with today's kernel.
If it is never correct for normal pages or compound page heads to have a zero count while they are associated with a bio, then yes, I think get_page is a great solution. The VM_BUG_ON in get_page would identify any parts of the kernel that are supplying bios that have pages without references.
Just a note in response to "we shouldn't get that far": I believe the VM_BUG_ON line in get_page does get executed for tail pages when the __get_page_tail returns false because the compound page head has a zero count:
get_page -> __get_page_tail -> get_page_unless_zero returns false, so
__get_page_tail returns "got", which is still false, so
get_page executes the VM_BUG_ON, where the _count will be zero for the tail page.
--
Ed Cashin
[email protected]
On Wed, 7 Aug 2013 20:50:09 -0400 Ed Cashin <[email protected]> wrote:
> > But we shouldn't get that far:
> >
> > static inline void get_page(struct page *page)
> > {
> > if (unlikely(PageTail(page)))
> > if (likely(__get_page_tail(page)))
> > return;
> > /*
> > * Getting a normal page or the head of a compound page
> > * requires to already have an elevated page->_count.
> > */
> > VM_BUG_ON(atomic_read(&page->_count) <= 0);
> > atomic_inc(&page->_count);
> > }
> >
> > This is a tail page, so we should be using __get_page_tail().
>
>
> When the workaround was created, it was with the assumption that the zero-count pages are not always tail pages, and that seemed to be the case in 2007, but as I said, I don't have a mechanism for detecting that now, so I cannot say whether it really happens with today's kernel.
It sounds we should pull out all that code and retest. It shouldn't be
needed - if this results in some failure then I suspect core MM will
need changes.
Why don't you have a "mechanism for detecting that"? It's a matter of
pointing AOE at some hugetlb pages?
> If it is never correct for normal pages or compound page heads to have a zero count while they are associated with a bio, then yes, I think get_page is a great solution. The VM_BUG_ON in get_page would identify any parts of the kernel that are supplying bios that have pages without references.
>
> Just a note in response to "we shouldn't get that far": I believe the VM_BUG_ON line in get_page does get executed for tail pages when the __get_page_tail returns false because the compound page head has a zero count:
>
> get_page -> __get_page_tail -> get_page_unless_zero returns false, so
> __get_page_tail returns "got", which is still false, so
> get_page executes the VM_BUG_ON, where the _count will be zero for the tail page.
Yup. If we hit a zero-ref huge page here then we want that BUG_ON to trigger.
On Aug 8, 2013, at 3:56 PM, Andrew Morton wrote:
> On Wed, 7 Aug 2013 20:50:09 -0400 Ed Cashin <[email protected]> wrote:
...
>> When the workaround was created, it was with the assumption that the zero-count pages are not always tail pages, and that seemed to be the case in 2007, but as I said, I don't have a mechanism for detecting that now, so I cannot say whether it really happens with today's kernel.
>
> It sounds we should pull out all that code and retest. It shouldn't be
> needed - if this results in some failure then I suspect core MM will
> need changes.
OK. I'll look into that. It sure would be nice to get rid of it.
> Why don't you have a "mechanism for detecting that"? It's a matter of
> pointing AOE at some hugetlb pages?
No, I was testing with hugetlb pages already. But there are many use case combinations, and I meant that aoe has no mechanism that has been in the aoe code all these years to catch specific (possibly rare) use cases that result in zero-count pages. It would be something like this:
+static void
+check_page_counts(struct bio *bio)
+{
+ struct bio_vec *bv;
+ int n;
+ int i;
+
+ bio_for_each_segment(bv, bio, i) {
+ n = page_count(bv->bv_page);
+ WARN_ONCE(n <= 0,
+ "aoe: page %d in bio has non-positive count %d\n",
+ i, n);
+ }
+}
I'll add that when I try testing without the page count manipulation.
--
Ed Cashin
[email protected]