2013-08-21 13:48:26

by David Vrabel

[permalink] [raw]
Subject: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

All,

179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
and _PTE_PAT.

With a Xen PV guest, the use of the _PTE_PAT will result in the page
having unexpected cachability which will introduce a range of subtle
performance and correctness issues. Xen programs the entry 4 in the PAT
table with WC so a page that was previously WB will end up as WC.

The use of this bit also appears to preclude the use of (transparent)
huge pages by the application. It is not clear if there is something
else guaranteeing that that there will be no huge pages.

To fix this regression I suggest one or more of:

1. If no other changes are made, at a mimimum, MEM_SOFT_DIRTY must
require !XEN and possibly !TRANSPARENT_HUGEPAGE and !HUGETLBFS. This
would prevent this option being enabled on the majority of standard
Linux distributions.

2. Find a different PTE bit to (re)use.

3. Avoid clearing the soft dirty bit when repopulating a swapped out page.

4. Redesign the soft dirty tracking to not require the use of
architecture specific PTE bits. e.g., by using a shadow set of
structures for the soft dirty bit tracking.

David


2013-08-21 13:53:57

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use


On 8/21/2013 9:48 AM, David Vrabel wrote:
> All,
>
> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
> and _PTE_PAT.
>
> With a Xen PV guest, the use of the _PTE_PAT will result in the page
> having unexpected cachability which will introduce a range of subtle
> performance and correctness issues. Xen programs the entry 4 in the PAT
> table with WC so a page that was previously WB will end up as WC.
Especially with filesystems which would end up using those pages (as the
memory allocator
would recycle them) and with corruption in the filesystem. Took months
to figure
that out.

>
> The use of this bit also appears to preclude the use of (transparent)
> huge pages by the application. It is not clear if there is something
> else guaranteeing that that there will be no huge pages.
>
> To fix this regression I suggest one or more of:
>
> 1. If no other changes are made, at a mimimum, MEM_SOFT_DIRTY must
> require !XEN and possibly !TRANSPARENT_HUGEPAGE and !HUGETLBFS. This
> would prevent this option being enabled on the majority of standard
> Linux distributions.
>
> 2. Find a different PTE bit to (re)use.
>
> 3. Avoid clearing the soft dirty bit when repopulating a swapped out page.
>
> 4. Redesign the soft dirty tracking to not require the use of
> architecture specific PTE bits. e.g., by using a shadow set of
> structures for the soft dirty bit tracking.

Or revert this patch and in 3.12 fix it using one of the options above
or other ones.
>
> David

2013-08-21 14:12:05

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

Eep. This should be reverted, indeed. This isn't a manifest bug on !Xen but we have gotten requests for WT support which would mean adding in the PAT but again.

David Vrabel <[email protected]> wrote:
>All,
>
>179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
>PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
>and _PTE_PAT.
>
>With a Xen PV guest, the use of the _PTE_PAT will result in the page
>having unexpected cachability which will introduce a range of subtle
>performance and correctness issues. Xen programs the entry 4 in the
>PAT
>table with WC so a page that was previously WB will end up as WC.
>
>The use of this bit also appears to preclude the use of (transparent)
>huge pages by the application. It is not clear if there is something
>else guaranteeing that that there will be no huge pages.
>
>To fix this regression I suggest one or more of:
>
>1. If no other changes are made, at a mimimum, MEM_SOFT_DIRTY must
>require !XEN and possibly !TRANSPARENT_HUGEPAGE and !HUGETLBFS. This
>would prevent this option being enabled on the majority of standard
>Linux distributions.
>
>2. Find a different PTE bit to (re)use.
>
>3. Avoid clearing the soft dirty bit when repopulating a swapped out
>page.
>
>4. Redesign the soft dirty tracking to not require the use of
>architecture specific PTE bits. e.g., by using a shadow set of
>structures for the soft dirty bit tracking.
>
>David

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-21 14:12:28

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
> All,
>
> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
> and _PTE_PAT.
>
> With a Xen PV guest, the use of the _PTE_PAT will result in the page
> having unexpected cachability which will introduce a range of subtle
> performance and correctness issues. Xen programs the entry 4 in the PAT
> table with WC so a page that was previously WB will end up as WC.
>

David, could you please explain, Xen keeps and analyze _PTE_PAT bit
for ptes which are not present?

2013-08-21 14:19:30

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:
>
> Eep. This should be reverted, indeed. This isn't a manifest bug on !Xen but we
> have gotten requests for WT support which would mean adding in the PAT but again.

Please no, letme fix it. That's what I'm having in mind: don't use pse bit in swap
entry but always mark page read from swap as dirty one (it's better than having
no tracker at all and will be the case on old machines with 32bit ptes only).

2013-08-21 14:22:57

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

OK now I'm confused. I guess I shouldn't comment while on vacation and cache cold on everything.

Cyrill Gorcunov <[email protected]> wrote:
>On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:
>>
>> Eep. This should be reverted, indeed. This isn't a manifest bug on
>!Xen but we
>> have gotten requests for WT support which would mean adding in the
>PAT but again.
>
>Please no, letme fix it. That's what I'm having in mind: don't use pse
>bit in swap
>entry but always mark page read from swap as dirty one (it's better
>than having
>no tracker at all and will be the case on old machines with 32bit ptes
>only).

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-21 14:23:49

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

Good question...

Cyrill Gorcunov <[email protected]> wrote:
>On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
>> All,
>>
>> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
>> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as
>_PTE_PSE
>> and _PTE_PAT.
>>
>> With a Xen PV guest, the use of the _PTE_PAT will result in the page
>> having unexpected cachability which will introduce a range of subtle
>> performance and correctness issues. Xen programs the entry 4 in the
>PAT
>> table with WC so a page that was previously WB will end up as WC.
>>
>
>David, could you please explain, Xen keeps and analyze _PTE_PAT bit
>for ptes which are not present?

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-21 14:29:59

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 04:22:21PM +0200, H. Peter Anvin wrote:
> OK now I'm confused. I guess I shouldn't comment while on vacation
> and cache cold on everything.

I rather think I'm missing something, that's why I asked David how
this featue affects non present pte.

2013-08-21 14:53:43

by Jan Beulich

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

>>> On 21.08.13 at 16:12, Cyrill Gorcunov <[email protected]> wrote:
> On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
>> All,
>>
>> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
>> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
>> and _PTE_PAT.
>>
>> With a Xen PV guest, the use of the _PTE_PAT will result in the page
>> having unexpected cachability which will introduce a range of subtle
>> performance and correctness issues. Xen programs the entry 4 in the PAT
>> table with WC so a page that was previously WB will end up as WC.
>>
>
> David, could you please explain, Xen keeps and analyze _PTE_PAT bit
> for ptes which are not present?

No, the problem isn't with not-present PTEs (i.e. swap entries),
but with present ones - the same bit (7) is being used for both,
according to this comment:

/*
* Tracking soft dirty bit when a page goes to a swap is tricky.
* We need a bit which can be stored in pte _and_ not conflict
* with swap entry format. On x86 bits 6 and 7 are *not* involved
* into swap entry computation, but bit 6 is used for nonlinear
* file mapping, so we borrow bit 7 for soft dirty tracking.
*/

Or are you telling me that the comment is misleading (at least me),
and this applies only to not-present PTEs? And even then - where
would the value of the original PAT bit be stored while swapped
out (or is it impossible - now and forever - for WC pages to get
swapped)?

Jan

2013-08-21 14:59:11

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

Only WB pages should be swappable, but even so, the cacheability should be in the vma.

Jan Beulich <[email protected]> wrote:
>>>> On 21.08.13 at 16:12, Cyrill Gorcunov <[email protected]> wrote:
>> On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
>>> All,
>>>
>>> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a
>new
>>> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as
>_PTE_PSE
>>> and _PTE_PAT.
>>>
>>> With a Xen PV guest, the use of the _PTE_PAT will result in the page
>>> having unexpected cachability which will introduce a range of subtle
>>> performance and correctness issues. Xen programs the entry 4 in the
>PAT
>>> table with WC so a page that was previously WB will end up as WC.
>>>
>>
>> David, could you please explain, Xen keeps and analyze _PTE_PAT bit
>> for ptes which are not present?
>
>No, the problem isn't with not-present PTEs (i.e. swap entries),
>but with present ones - the same bit (7) is being used for both,
>according to this comment:
>
>/*
> * Tracking soft dirty bit when a page goes to a swap is tricky.
> * We need a bit which can be stored in pte _and_ not conflict
> * with swap entry format. On x86 bits 6 and 7 are *not* involved
> * into swap entry computation, but bit 6 is used for nonlinear
> * file mapping, so we borrow bit 7 for soft dirty tracking.
> */
>
>Or are you telling me that the comment is misleading (at least me),
>and this applies only to not-present PTEs? And even then - where
>would the value of the original PAT bit be stored while swapped
>out (or is it impossible - now and forever - for WC pages to get
>swapped)?
>
>Jan

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-21 15:42:42

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 03:53:36PM +0100, Jan Beulich wrote:
> >>> On 21.08.13 at 16:12, Cyrill Gorcunov <[email protected]> wrote:
> > On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
> >> All,
> >>
> >> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
> >> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
> >> and _PTE_PAT.
> >>
> >> With a Xen PV guest, the use of the _PTE_PAT will result in the page
> >> having unexpected cachability which will introduce a range of subtle
> >> performance and correctness issues. Xen programs the entry 4 in the PAT
> >> table with WC so a page that was previously WB will end up as WC.
> >>
> >
> > David, could you please explain, Xen keeps and analyze _PTE_PAT bit
> > for ptes which are not present?
>
> No, the problem isn't with not-present PTEs (i.e. swap entries),
> but with present ones - the same bit (7) is being used for both,
> according to this comment:
>
> /*
> * Tracking soft dirty bit when a page goes to a swap is tricky.
> * We need a bit which can be stored in pte _and_ not conflict
> * with swap entry format. On x86 bits 6 and 7 are *not* involved
> * into swap entry computation, but bit 6 is used for nonlinear
> * file mapping, so we borrow bit 7 for soft dirty tracking.
> */
>
> Or are you telling me that the comment is misleading (at least me),
> and this applies only to not-present PTEs? And even then - where
> would the value of the original PAT bit be stored while swapped
> out (or is it impossible - now and forever - for WC pages to get
> swapped)?

Only to non-present ptes, as far as I know.

do_swap_page
...
pte = mk_pte(page, vma->vm_page_prot);

/* new pte from vm_page_prot generated */
...
set_pte_at(mm, address, page_table, pte);
/* and assigned to old place */

with soft dirty in swap it is somehow more weirdy

pte = mk_pte(page, vma->vm_page_prot);
...
if (pte_swp_soft_dirty(orig_pte))
pte = pte_mksoft_dirty(pte);
set_pte_at(mm, address, page_table, pte);

orig_pte has pse bit set if page has been soft dirty
when it reached swap.

2013-08-21 16:03:19

by Jan Beulich

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

>>> On 21.08.13 at 17:42, Cyrill Gorcunov <[email protected]> wrote:
> On Wed, Aug 21, 2013 at 03:53:36PM +0100, Jan Beulich wrote:
>> >>> On 21.08.13 at 16:12, Cyrill Gorcunov <[email protected]> wrote:
>> > David, could you please explain, Xen keeps and analyze _PTE_PAT bit
>> > for ptes which are not present?
>>
>> No, the problem isn't with not-present PTEs (i.e. swap entries),
>> but with present ones - the same bit (7) is being used for both,
>> according to this comment:
>>
>> /*
>> * Tracking soft dirty bit when a page goes to a swap is tricky.
>> * We need a bit which can be stored in pte _and_ not conflict
>> * with swap entry format. On x86 bits 6 and 7 are *not* involved
>> * into swap entry computation, but bit 6 is used for nonlinear
>> * file mapping, so we borrow bit 7 for soft dirty tracking.
>> */
>>
>> Or are you telling me that the comment is misleading (at least me),
>> and this applies only to not-present PTEs? And even then - where
>> would the value of the original PAT bit be stored while swapped
>> out (or is it impossible - now and forever - for WC pages to get
>> swapped)?
>
> Only to non-present ptes, as far as I know.

That's not really any guarantee. And the accessor functions also
don't check that they'd be used on non-present PTEs only.

> do_swap_page
> ...
> pte = mk_pte(page, vma->vm_page_prot);
>
> /* new pte from vm_page_prot generated */
> ...
> set_pte_at(mm, address, page_table, pte);
> /* and assigned to old place */
>
> with soft dirty in swap it is somehow more weirdy
>
> pte = mk_pte(page, vma->vm_page_prot);
> ...
> if (pte_swp_soft_dirty(orig_pte))
> pte = pte_mksoft_dirty(pte);
> set_pte_at(mm, address, page_table, pte);
>
> orig_pte has pse bit set if page has been soft dirty
> when it reached swap.

"when it reached swap" to me again implies that it could come
from a live page table, with the present bit set. So that
explanation attempt of yours confuses me more than it
clarifies things for me. (And referring to this bit as PSE bit is
sort of wrong here too - there's no PSE bit for 4k PTEs, that
bit is the PAT one, and that's what the whole discussion
started from.)

Jan

2013-08-21 16:19:51

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
> >
> > Only to non-present ptes, as far as I know.
>
> That's not really any guarantee. And the accessor functions also
> don't check that they'd be used on non-present PTEs only.

Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
in only one place -- in try_to_unmap_one(). The PTE get non-present then
and consists of swap entry format. I don't see any accessor to such entry
without testing if it's swap entry or pte-none. What I'm missing?

> > orig_pte has pse bit set if page has been soft dirty
> > when it reached swap.
>
> "when it reached swap" to me again implies that it could come
> from a live page table, with the present bit set. So that
> explanation attempt of yours confuses me more than it
> clarifies things for me. (And referring to this bit as PSE bit is

When page swapped out it become non-present in pte entry.

> sort of wrong here too - there's no PSE bit for 4k PTEs, that
> bit is the PAT one, and that's what the whole discussion
> started from.)

And I asked David to point me how it happens, because I don't
understand at which point pse bit get analized when page is
not present.

2013-08-21 16:30:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 7:19 AM, Cyrill Gorcunov <[email protected]> wrote:
> On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:
>>
>> Eep. This should be reverted, indeed. This isn't a manifest bug on !Xen but we
>> have gotten requests for WT support which would mean adding in the PAT but again.
>
> Please no, letme fix it. That's what I'm having in mind: don't use pse bit in swap
> entry but always mark page read from swap as dirty one (it's better than having
> no tracker at all and will be the case on old machines with 32bit ptes only).

Quite frankly, unless I see a patch later today that is

(a) obvious
(b) explains what is going on
(c) tested

I will be reverting the whole soft-dirty mess. I thought the
bit-mapping games it played were already too complicated (the patch to
pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
came in very late, so I'm not positive about the whole soft-dirty mess
in the first place). I really am not at all inclined to want to play
games in this area any more. It's too damn late in the release window.

Linus

2013-08-21 16:43:09

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 09:30:03AM -0700, Linus Torvalds wrote:
> On Wed, Aug 21, 2013 at 7:19 AM, Cyrill Gorcunov <[email protected]> wrote:
> > On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:
> >>
> >> Eep. This should be reverted, indeed. This isn't a manifest bug on !Xen but we
> >> have gotten requests for WT support which would mean adding in the PAT but again.
> >
> > Please no, letme fix it. That's what I'm having in mind: don't use pse bit in swap
> > entry but always mark page read from swap as dirty one (it's better than having
> > no tracker at all and will be the case on old machines with 32bit ptes only).
>
> Quite frankly, unless I see a patch later today that is

I'll do my best to fix it, but I've not yet been shown where the problem
happens. Neither I'm convinced if it is here at all. I'm asking David for details.

>
> (a) obvious
> (b) explains what is going on
> (c) tested
>
> I will be reverting the whole soft-dirty mess. I thought the
> bit-mapping games it played were already too complicated (the patch to
> pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
> came in very late, so I'm not positive about the whole soft-dirty mess

This was not my invention on such encoding, I simply had no choise but
follow existing scheme. Still there are cleanup patches in -mm tree which
make code a way more readable (even former one, without soft-dirty bit).

> in the first place). I really am not at all inclined to want to play
> games in this area any more. It's too damn late in the release window.

2013-08-21 16:56:12

by David Vrabel

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On 21/08/13 17:19, Cyrill Gorcunov wrote:
> On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
>>>
>>> Only to non-present ptes, as far as I know.
>>
>> That's not really any guarantee. And the accessor functions also
>> don't check that they'd be used on non-present PTEs only.
>
> Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
> in only one place -- in try_to_unmap_one(). The PTE get non-present then
> and consists of swap entry format. I don't see any accessor to such entry
> without testing if it's swap entry or pte-none. What I'm missing?
>
>>> orig_pte has pse bit set if page has been soft dirty
>>> when it reached swap.
>>
>> "when it reached swap" to me again implies that it could come
>> from a live page table, with the present bit set. So that
>> explanation attempt of yours confuses me more than it
>> clarifies things for me. (And referring to this bit as PSE bit is
>
> When page swapped out it become non-present in pte entry.
>
>> sort of wrong here too - there's no PSE bit for 4k PTEs, that
>> bit is the PAT one, and that's what the whole discussion
>> started from.)
>
> And I asked David to point me how it happens, because I don't
> understand at which point pse bit get analized when page is
> not present.

As Jan said, we're concerned that the bit was being used on present PTEs
and not just non-present ones. From a more careful look at this code
this does not appear to be the case.

However, I do find the use of PTE bits in this way somewhat fragile.
What other potential corner cases might still remain that will require
further games with PTE bits?

FWIW, Xen uses a separate dirty log to track which pages have become
dirty since the log was last cleared. Such a dirty log seems more
efficient than having scan all the PTEs looking for the soft dirty bits
and then having to scan them all again to clear them (particularly if
you need multiple passes because the task is still running and
continuing to dirty pages).

David

2013-08-21 17:25:52

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 05:56:08PM +0100, David Vrabel wrote:
> >
> > And I asked David to point me how it happens, because I don't
> > understand at which point pse bit get analized when page is
> > not present.
>
> As Jan said, we're concerned that the bit was being used on present PTEs
> and not just non-present ones. From a more careful look at this code
> this does not appear to be the case.
>
> However, I do find the use of PTE bits in this way somewhat fragile.
> What other potential corner cases might still remain that will require
> further games with PTE bits?

OK, so this is not a bug finally. The problem is that 2 level pte is
quite small and 7th bit is the only one spare I can use for soft dirty
tracking when page get swapped out. And swap engine is very depending
on pte being non-present, so we are on a safe side.

> FWIW, Xen uses a separate dirty log to track which pages have become
> dirty since the log was last cleared. Such a dirty log seems more
> efficient than having scan all the PTEs looking for the soft dirty bits
> and then having to scan them all again to clear them (particularly if
> you need multiple passes because the task is still running and
> continuing to dirty pages).

2013-08-21 17:29:00

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 6:48 AM, David Vrabel <[email protected]> wrote:
> All,
>
> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
> and _PTE_PAT.
>
> With a Xen PV guest, the use of the _PTE_PAT will result in the page
> having unexpected cachability which will introduce a range of subtle
> performance and correctness issues. Xen programs the entry 4 in the PAT
> table with WC so a page that was previously WB will end up as WC.
>

Kind of off topic, but do you have a summary of how Xen uses the high
PAT bits? I'm the one who wants WT, and if there's already precedent
for using the high PAT bits, it'll be helpful.

(Also, Xen is a little weird, and I'd rather not break it again by
accident like I did with the vsyscall changes a couple years ago.)

--Andy

2013-08-21 18:17:37

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 09:25:47PM +0400, Cyrill Gorcunov wrote:
> >
> > However, I do find the use of PTE bits in this way somewhat fragile.
> > What other potential corner cases might still remain that will require
> > further games with PTE bits?
>
> OK, so this is not a bug finally. The problem is that 2 level pte is
> quite small and 7th bit is the only one spare I can use for soft dirty
> tracking when page get swapped out. And swap engine is very depending
> on pte being non-present, so we are on a safe side.

To make it clear: I'm working on patch that won't use pse bit for dirty
page tracking, just give some time to cook it and test.

2013-08-21 18:51:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

But is there a manifest bug or not? What is the deal with Xen?

Cyrill Gorcunov <[email protected]> wrote:
>On Wed, Aug 21, 2013 at 09:25:47PM +0400, Cyrill Gorcunov wrote:
>> >
>> > However, I do find the use of PTE bits in this way somewhat
>fragile.
>> > What other potential corner cases might still remain that will
>require
>> > further games with PTE bits?
>>
>> OK, so this is not a bug finally. The problem is that 2 level pte
>is
>> quite small and 7th bit is the only one spare I can use for soft
>dirty
>> tracking when page get swapped out. And swap engine is very depending
>> on pte being non-present, so we are on a safe side.
>
>To make it clear: I'm working on patch that won't use pse bit for dirty
>page tracking, just give some time to cook it and test.

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-21 19:03:12

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
> But is there a manifest bug or not? What is the deal with Xen?
>

I personally don't see bug here because

- this swapped page soft dirty bit is set for non-present entries only,
never for present ones, just at moment we form swap pte entry

- i don't find any code which would test for this bit directly without
is_swap_pte call

but the use of paw bit itself is confusing, so I'm working on patch which
won't use it. Again, if someone knows where exactly access to pse bit when
pte keeps swap entry may happen (for any purpose other than dirty page
tracking) please share.

2013-08-21 19:07:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <[email protected]> wrote:
> On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
>> But is there a manifest bug or not? What is the deal with Xen?
>>
>
> I personally don't see bug here because
>
> - this swapped page soft dirty bit is set for non-present entries only,
> never for present ones, just at moment we form swap pte entry
>
> - i don't find any code which would test for this bit directly without
> is_swap_pte call
>
> but the use of paw bit itself is confusing, so I'm working on patch which
> won't use it. Again, if someone knows where exactly access to pse bit when
> pte keeps swap entry may happen (for any purpose other than dirty page
> tracking) please share.

I doubt that there are cacheability issues here, since the swap type
already overlaps with PWT and PCD.

What happens if the page being swapped is a THP page? (I have no idea
how, or even if, this works, but presumably PSE is important.)

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2013-08-21 19:20:14

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 12:07:13PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <[email protected]> wrote:
> > On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
> >> But is there a manifest bug or not? What is the deal with Xen?
> >>
> >
> > I personally don't see bug here because
> >
> > - this swapped page soft dirty bit is set for non-present entries only,
> > never for present ones, just at moment we form swap pte entry
> >
> > - i don't find any code which would test for this bit directly without
> > is_swap_pte call
> >
> > but the use of paw bit itself is confusing, so I'm working on patch which
> > won't use it. Again, if someone knows where exactly access to pse bit when
> > pte keeps swap entry may happen (for any purpose other than dirty page
> > tracking) please share.
>
> I doubt that there are cacheability issues here, since the swap type
> already overlaps with PWT and PCD.
>
> What happens if the page being swapped is a THP page? (I have no idea
> how, or even if, this works, but presumably PSE is important.)

add_to_swap -> split_huge_page_to_list, so thp will be splitted.

2013-08-21 19:21:41

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On 08/21/2013 11:07 PM, Andy Lutomirski wrote:
> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <[email protected]> wrote:
>> On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
>>> But is there a manifest bug or not? What is the deal with Xen?
>>>
>>
>> I personally don't see bug here because
>>
>> - this swapped page soft dirty bit is set for non-present entries only,
>> never for present ones, just at moment we form swap pte entry
>>
>> - i don't find any code which would test for this bit directly without
>> is_swap_pte call
>>
>> but the use of paw bit itself is confusing, so I'm working on patch which
>> won't use it. Again, if someone knows where exactly access to pse bit when
>> pte keeps swap entry may happen (for any purpose other than dirty page
>> tracking) please share.
>
> I doubt that there are cacheability issues here, since the swap type
> already overlaps with PWT and PCD.
>
> What happens if the page being swapped is a THP page? (I have no idea
> how, or even if, this works, but presumably PSE is important.)

Huge-page is splitted into small ones, then the smaller ones get swapped out.

> --Andy
>

2013-08-21 23:04:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <[email protected]> wrote:
>
> I personally don't see bug here because
>
> - this swapped page soft dirty bit is set for non-present entries only,
> never for present ones, just at moment we form swap pte entry
>
> - i don't find any code which would test for this bit directly without
> is_swap_pte call

Ok, having gone through the places that use swp_*soft_dirty(), I have
to agree. Afaik, it's only ever used on a swap-entry that has (by
definition) the P bit clear. So with or without Xen, I don't see how
it can make any difference.

David/Konrad - did you actually see any issues, or was this just from
(mis)reading the code?

Linus

2013-08-21 23:06:03

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 09:30:03AM -0700, Linus Torvalds wrote:
> Quite frankly, unless I see a patch later today that is
>
> (a) obvious
> (b) explains what is going on
> (c) tested
>
> I will be reverting the whole soft-dirty mess. I thought the
> bit-mapping games it played were already too complicated (the patch to
> pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
> came in very late, so I'm not positive about the whole soft-dirty mess
> in the first place). I really am not at all inclined to want to play
> games in this area any more. It's too damn late in the release window.

Hi all, I worked on patch which would not touch PSE bit for dirty page
tracking and the result is not that good:

- 2level pages now always page dirty if page is swapped in and out, because
there is no space left in PTE (other than PSE bit)

- only 3level pages scheme uses high 32bits to keep offset of swap entry,
x86-64 shifts offset up to _PAGE_BIT_GLOBAL + 1 bit, thus I need some
different bit nonunified with anything else for no reason :(

Summarizing all things

- Using PSE bit for swap entries as indicator of soft dirty page is safe because
swap entries as saved in pte as non-presen and when #pf happens kernel generates
valid pte entry from vma->vm_page_prot

- __swp_entry() helper is clearing PSE bit explicitly so even without softdirty
patch it's not saved once page reach swap (with softdirty tracking we simply
reuse this bit for own needs).

- Using PSE bit allows to not modify swap encoding on all 3 page schemes (2level,
3level, 4level) because it's a spare bit there not intersected with swap format.

Thus I would *_really_* like to save current scheme. Probably I should add comment
into header where _PAGE_SWP_SOFT_DIRTY defined that it's valid only when PRESENT
bit clear? Similar to

/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
/* - set: nonlinear file mapping, saved PTE; unset:swap */
#define _PAGE_BIT_FILE _PAGE_BIT_DIRTY

Have I conviced you guys?

The former problem report came from impression that this PSE bit may be touched
(set and clean) on present PTE, but it's not the case for pages being swapped.

2013-08-21 23:42:56

by Andi Kleen

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

Cyrill Gorcunov <[email protected]> writes:
>
> Hi all, I worked on patch which would not touch PSE bit for dirty page
> tracking and the result is not that good:
>
> - 2level pages now always page dirty if page is swapped in and out, because
> there is no space left in PTE (other than PSE bit)

Maybe just don't support soft dirty for 2 level page tables?

2 level page tables should be really on the way out anyways, as they
have severe limits and do not support NX. With 3 levels there is enough
space.

-Andi

--
[email protected] -- Speaking for myself only

2013-08-22 00:51:41

by Dave Jones

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 04:04:54PM -0700, Linus Torvalds wrote:
> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <[email protected]> wrote:
> >
> > I personally don't see bug here because
> >
> > - this swapped page soft dirty bit is set for non-present entries only,
> > never for present ones, just at moment we form swap pte entry
> >
> > - i don't find any code which would test for this bit directly without
> > is_swap_pte call
>
> Ok, having gone through the places that use swp_*soft_dirty(), I have
> to agree. Afaik, it's only ever used on a swap-entry that has (by
> definition) the P bit clear. So with or without Xen, I don't see how
> it can make any difference.
>
> David/Konrad - did you actually see any issues, or was this just from
> (mis)reading the code?

Could this explain what I'm seeing in another thread ?
https://lkml.org/lkml/2013/8/7/27

Dave

2013-08-22 05:44:23

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 08:51:15PM -0400, Dave Jones wrote:
>
> Could this explain what I'm seeing in another thread ?
> https://lkml.org/lkml/2013/8/7/27

Don't think so. This code is merged in -rc6, while your report is saying the
kernel version is -rc4 (also this feature requires CONFIG_MEM_SOFT_DIRTY
to be turned on, do you have it?).

2013-08-22 05:49:23

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, Aug 21, 2013 at 04:42:53PM -0700, Andi Kleen wrote:
> Cyrill Gorcunov <[email protected]> writes:
> >
> > Hi all, I worked on patch which would not touch PSE bit for dirty page
> > tracking and the result is not that good:
> >
> > - 2level pages now always page dirty if page is swapped in and out, because
> > there is no space left in PTE (other than PSE bit)
>
> Maybe just don't support soft dirty for 2 level page tables?
>
> 2 level page tables should be really on the way out anyways, as they
> have severe limits and do not support NX. With 3 levels there is enough
> space.

Look, good thing is that 7th bit also available on the 4level pages
(ie x86-64) without additional code modification, that's why I picked
it in first place. I prepared the patch locally which doesn't use
pse bit for tracking but it only makes code more complex.

2013-08-22 06:37:02

by Minchan Kim

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

Hello Cyrill,

On Thu, Aug 22, 2013 at 09:49:19AM +0400, Cyrill Gorcunov wrote:
> On Wed, Aug 21, 2013 at 04:42:53PM -0700, Andi Kleen wrote:
> > Cyrill Gorcunov <[email protected]> writes:
> > >
> > > Hi all, I worked on patch which would not touch PSE bit for dirty page
> > > tracking and the result is not that good:
> > >
> > > - 2level pages now always page dirty if page is swapped in and out, because
> > > there is no space left in PTE (other than PSE bit)
> >
> > Maybe just don't support soft dirty for 2 level page tables?
> >
> > 2 level page tables should be really on the way out anyways, as they
> > have severe limits and do not support NX. With 3 levels there is enough
> > space.
>
> Look, good thing is that 7th bit also available on the 4level pages
> (ie x86-64) without additional code modification, that's why I picked
> it in first place. I prepared the patch locally which doesn't use
> pse bit for tracking but it only makes code more complex.

I'm not sure you read this thread.
http://comments.gmane.org/gmane.linux.kernel.mm/101756
In summary, I'd like to use it to track sub-ranges of some processes.

I already had a time to investigate and it enhanced our workload x2 on ARM
so I'd like to expand the concept for more general purpose.

For it, arch-specific stuff would be hurdle for port.
So, I support your non-arch-specific solutions.
Just FYI. Such future plan shouldn't force you.

Thanks.

--
Kind regards,
Minchan Kim

2013-08-22 06:41:47

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On 08/22/2013 04:51 AM, Dave Jones wrote:
> On Wed, Aug 21, 2013 at 04:04:54PM -0700, Linus Torvalds wrote:
> > On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <[email protected]> wrote:
> > >
> > > I personally don't see bug here because
> > >
> > > - this swapped page soft dirty bit is set for non-present entries only,
> > > never for present ones, just at moment we form swap pte entry
> > >
> > > - i don't find any code which would test for this bit directly without
> > > is_swap_pte call
> >
> > Ok, having gone through the places that use swp_*soft_dirty(), I have
> > to agree. Afaik, it's only ever used on a swap-entry that has (by
> > definition) the P bit clear. So with or without Xen, I don't see how
> > it can make any difference.
> >
> > David/Konrad - did you actually see any issues, or was this just from
> > (mis)reading the code?
>
> Could this explain what I'm seeing in another thread ?
> https://lkml.org/lkml/2013/8/7/27

Was it caught with CONFIG_MEM_SOFT_DIRTY on or off? In the latter case all new
bits manipulations are no-op and couldn't cause this.

> Dave

Thanks,
Pavel

2013-08-22 06:56:32

by Jan Beulich

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

>>> On 21.08.13 at 18:19, Cyrill Gorcunov <[email protected]> wrote:
> On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
>> >
>> > Only to non-present ptes, as far as I know.
>>
>> That's not really any guarantee. And the accessor functions also
>> don't check that they'd be used on non-present PTEs only.
>
> Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
> in only one place -- in try_to_unmap_one(). The PTE get non-present then
> and consists of swap entry format. I don't see any accessor to such entry
> without testing if it's swap entry or pte-none. What I'm missing?

Fact is that this

static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
{
return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}

has no checking whatsoever that the PTE being modified is a
non-present one, not even in any of the debugging modes. It
would be a different thing if the above acted on a swp_entry_t.

The fact that there currently may be just a single call site (where
the caller guarantees the non-present state) is no guarantee that
in the future another one won't appear, and then result in very
hard to debug problems.

Jan

2013-08-22 07:03:09

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Thu, Aug 22, 2013 at 07:56:26AM +0100, Jan Beulich wrote:
> >>> On 21.08.13 at 18:19, Cyrill Gorcunov <[email protected]> wrote:
> > On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
> >> >
> >> > Only to non-present ptes, as far as I know.
> >>
> >> That's not really any guarantee. And the accessor functions also
> >> don't check that they'd be used on non-present PTEs only.
> >
> > Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
> > in only one place -- in try_to_unmap_one(). The PTE get non-present then
> > and consists of swap entry format. I don't see any accessor to such entry
> > without testing if it's swap entry or pte-none. What I'm missing?
>
> Fact is that this
>
> static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
> {
> return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
> }
>
> has no checking whatsoever that the PTE being modified is a
> non-present one, not even in any of the debugging modes. It
> would be a different thing if the above acted on a swp_entry_t.
>
> The fact that there currently may be just a single call site (where
> the caller guarantees the non-present state) is no guarantee that
> in the future another one won't appear, and then result in very
> hard to debug problems.

Ok, how about this?

static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
{
BUG_ON(pte_present(pte));
return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}

2013-08-22 07:27:51

by Jan Beulich

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

>>> On 22.08.13 at 09:03, Cyrill Gorcunov <[email protected]> wrote:
> Ok, how about this?
>
> static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
> {
> BUG_ON(pte_present(pte));
> return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
> }

Sure, fine with me. Perhaps VM_BUG_ON() or some other similar
construct limiting the scope when any extra code gets generated
would do too.

But as said, even better would perhaps be to have it act on a
swp_entry_t.

Jan

2013-08-22 07:47:24

by Jan Beulich

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

>>> On 22.08.13 at 01:04, Linus Torvalds <[email protected]> wrote:
> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <[email protected]> wrote:
>>
>> I personally don't see bug here because
>>
>> - this swapped page soft dirty bit is set for non-present entries only,
>> never for present ones, just at moment we form swap pte entry
>>
>> - i don't find any code which would test for this bit directly without
>> is_swap_pte call
>
> Ok, having gone through the places that use swp_*soft_dirty(), I have
> to agree. Afaik, it's only ever used on a swap-entry that has (by
> definition) the P bit clear. So with or without Xen, I don't see how
> it can make any difference.
>
> David/Konrad - did you actually see any issues, or was this just from
> (mis)reading the code?

It was actually me (mis)reading the code - as pointed out to Cyrill
already, setting _PAGE_PAT in a pte_t without even a comment
saying that this can only ever be done with a non-present entry
made me expect problems on Xen, because it's clear that to date
bare metal Linux doesn't care about the state of _PAGE_PAT in
present entries due to the way the PAT MSR gets set (and hence
quite likely no-one would have noticed the supposed problem
while testing).

So a comment either alongside the definition of _PAGE_SWP_SOFT_DIRTY
or directly in pte_swp_{mk,clear_}soft_dirty() would have been
the minimal thing I'd have expected for this sort of re-use of bits.
Ideally even a VM_BUG_ON(pte_present()) or similar. And perhaps
pte_swp_soft_dirty() should be either looking at the present bit
too or similarly asserting that it's clear...

In any event - I'm sorry for the red herring.

Jan

2013-08-22 07:54:49

by Jan Beulich

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

>>> On 21.08.13 at 19:28, Andy Lutomirski <[email protected]> wrote:
> On Wed, Aug 21, 2013 at 6:48 AM, David Vrabel <[email protected]> wrote:
>> All,
>>
>> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
>> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
>> and _PTE_PAT.
>>
>> With a Xen PV guest, the use of the _PTE_PAT will result in the page
>> having unexpected cachability which will introduce a range of subtle
>> performance and correctness issues. Xen programs the entry 4 in the PAT
>> table with WC so a page that was previously WB will end up as WC.
>>
>
> Kind of off topic, but do you have a summary of how Xen uses the high
> PAT bits? I'm the one who wants WT, and if there's already precedent
> for using the high PAT bits, it'll be helpful.

Xen's public headers have a comment explaining this, with the
main information being this table:

* The PAT MSR is as follow (it is a 64-bit value, each entry is 8 bits):
* PAT4 PAT0
* +---+----+----+----+-----+----+----+
* WC | WC | WB | UC | UC- | WC | WB | <= Linux
* +---+----+----+----+-----+----+----+
* WC | WT | WB | UC | UC- | WT | WB | <= BIOS (default when machine boots)
* +---+----+----+----+-----+----+----+
* WC | WP | WC | UC | UC- | WT | WB | <= Xen
* +---+----+----+----+-----+----+----+

i.e. Xen is retaining the BIOS (and legacy from non-PAT times)
meaning of the low four entries, putting WC and WP up into
the high half. The fact that the entry 6 is defined to be WC
is perhaps a mistake - it should really be considered reserved
for an eventual future memory type (just like entry 7). It also
seems like entry 6 is documented incorrectly here for Linux and
BIOS.

Jan

2013-08-22 09:32:53

by David Vrabel

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On 22/08/13 00:04, Linus Torvalds wrote:
> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <[email protected]> wrote:
>>
>> I personally don't see bug here because
>>
>> - this swapped page soft dirty bit is set for non-present entries only,
>> never for present ones, just at moment we form swap pte entry
>>
>> - i don't find any code which would test for this bit directly without
>> is_swap_pte call
>
> Ok, having gone through the places that use swp_*soft_dirty(), I have
> to agree. Afaik, it's only ever used on a swap-entry that has (by
> definition) the P bit clear. So with or without Xen, I don't see how
> it can make any difference.
>
> David/Konrad - did you actually see any issues, or was this just from
> (mis)reading the code?

There are no Xen related bugs in the code, we were misreading it.

It was my call to raise this as a regression without a repro and clearly
this was the wrong decision.

However, having looked at the soft dirty implementation and specifically
the userspace ABI I think that it is far to closely coupled to the
current implementation. I think this will constrain future development
of the feature should userspace require a more efficient ABI than
scanning all of /proc/<pid>/pagemaps.

Minimal downtime during 'live' checkpointing of a running task needs the
checkpointer to find and write out dirty pages faster than the task can
dirty them. This seems less likely to be possible if every iteration
all PTEs have to be scanned by the checkpointer instead of (e.g.,)
accessing a separate list of dirtied pages.

David

2013-08-22 10:16:18

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On 08/22/2013 01:32 PM, David Vrabel wrote:
> On 22/08/13 00:04, Linus Torvalds wrote:
>> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <[email protected]> wrote:
>>>
>>> I personally don't see bug here because
>>>
>>> - this swapped page soft dirty bit is set for non-present entries only,
>>> never for present ones, just at moment we form swap pte entry
>>>
>>> - i don't find any code which would test for this bit directly without
>>> is_swap_pte call
>>
>> Ok, having gone through the places that use swp_*soft_dirty(), I have
>> to agree. Afaik, it's only ever used on a swap-entry that has (by
>> definition) the P bit clear. So with or without Xen, I don't see how
>> it can make any difference.
>>
>> David/Konrad - did you actually see any issues, or was this just from
>> (mis)reading the code?
>
> There are no Xen related bugs in the code, we were misreading it.
>
> It was my call to raise this as a regression without a repro and clearly
> this was the wrong decision.
>
> However, having looked at the soft dirty implementation and specifically
> the userspace ABI I think that it is far to closely coupled to the
> current implementation. I think this will constrain future development
> of the feature should userspace require a more efficient ABI than
> scanning all of /proc/<pid>/pagemaps.
>
> Minimal downtime during 'live' checkpointing of a running task needs the
> checkpointer to find and write out dirty pages faster than the task can
> dirty them.

Absolutely, but in "find and write" the "write" component is likely to take
the majority of time -- we can scan PTEs of a mapping MUCH faster, than
transmitting those over even 10Gbit link.

We actually see this IRL -- in CRIU there's an atomic test, that checks
mappings get dumped and restored properly. One of sub-tests is one 512Mb mapping.
With it total dump time _minus_ memory dump time (which includes not only pagemap
file scan, but also files, registers, process tree, sessions, etc.) is fractions
of one second, while only the memory dump part's time is several seconds.

That said, the super-fast API for getting "what has changed" is not as tempting
to have as faster network/disk.

What is _more_ time consuming in iterative migration in our case is the need to
re-scan the whole /proc tree to get which processes had died and appeared, mess
with /proc/pid/fd finding out what files were (re-)opened/closed/changed, talking
to sock_diag subsystem for sockets information and alike. However, we haven't yet
done careful analysis for what the slowest part is, but pagemap scans is definitely
not.

> This seems less likely to be possible if every iteration
> all PTEs have to be scanned by the checkpointer instead of (e.g.,)
> accessing a separate list of dirtied pages.

But we don't scan all the x64 virtual address space's PTEs, instead we first
analyze the /proc/pid/maps and scan only PTEs sitting in private mappings.

> David
> .
>

Thanks,
Pavel

2013-08-22 11:27:35

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Thu, Aug 22, 2013 at 08:27:45AM +0100, Jan Beulich wrote:
> >>> On 22.08.13 at 09:03, Cyrill Gorcunov <[email protected]> wrote:
> > Ok, how about this?
> >
> > static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
> > {
> > BUG_ON(pte_present(pte));
> > return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
> > }
>
> Sure, fine with me. Perhaps VM_BUG_ON() or some other similar
> construct limiting the scope when any extra code gets generated
> would do too.

Sorry for delay, the patch is below.

>
> But as said, even better would perhaps be to have it act on a
> swp_entry_t.

swp_entry_t is too small already to keep additional status bit,
unfortunately.
---
From: Cyrill Gorcunov <[email protected]>
Subject: [PATCH] mm: Make sure _PAGE_SWP_SOFT_DIRTY bit is not set on present pte

_PAGE_SOFT_DIRTY bit should never be set on present pte so add
VM_BUG_ON to catch any potential future abuse.

Also add a comment on _PAGE_SWP_SOFT_DIRTY definition explaining
scope of its usage.

Signed-off-by: Cyrill Gorcunov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Matt Mackall <[email protected]>
Cc: Xiao Guangrong <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: David Vrabel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Jan Beulich <[email protected]>
---
arch/x86/include/asm/pgtable.h | 34 +++++++++++++++++++---------------
arch/x86/include/asm/pgtable_types.h | 3 +++
2 files changed, 22 insertions(+), 15 deletions(-)

Index: linux-2.6.git/arch/x86/include/asm/pgtable.h
===================================================================
--- linux-2.6.git.orig/arch/x86/include/asm/pgtable.h
+++ linux-2.6.git/arch/x86/include/asm/pgtable.h
@@ -314,21 +314,6 @@ static inline pmd_t pmd_mksoft_dirty(pmd
return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY);
}

-static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
-{
- return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
-}
-
-static inline int pte_swp_soft_dirty(pte_t pte)
-{
- return pte_flags(pte) & _PAGE_SWP_SOFT_DIRTY;
-}
-
-static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
-{
- return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
-}
-
static inline pte_t pte_file_clear_soft_dirty(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_SOFT_DIRTY);
@@ -445,6 +430,7 @@ pte_t *populate_extra_pte(unsigned long

#ifndef __ASSEMBLY__
#include <linux/mm_types.h>
+#include <linux/mmdebug.h>
#include <linux/log2.h>

static inline int pte_none(pte_t pte)
@@ -863,6 +849,24 @@ static inline void update_mmu_cache_pmd(
{
}

+static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
+{
+ VM_BUG_ON(pte_present(pte));
+ return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pte_swp_soft_dirty(pte_t pte)
+{
+ VM_BUG_ON(pte_present(pte));
+ return pte_flags(pte) & _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
+{
+ VM_BUG_ON(pte_present(pte));
+ return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
+}
+
#include <asm-generic/pgtable.h>
#endif /* __ASSEMBLY__ */

Index: linux-2.6.git/arch/x86/include/asm/pgtable_types.h
===================================================================
--- linux-2.6.git.orig/arch/x86/include/asm/pgtable_types.h
+++ linux-2.6.git/arch/x86/include/asm/pgtable_types.h
@@ -75,6 +75,9 @@
* with swap entry format. On x86 bits 6 and 7 are *not* involved
* into swap entry computation, but bit 6 is used for nonlinear
* file mapping, so we borrow bit 7 for soft dirty tracking.
+ *
+ * Please note that this bit must be treated as swap dirty page
+ * mark if and only if the PTE has present bit clear!
*/
#ifdef CONFIG_MEM_SOFT_DIRTY
#define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE

2013-08-22 11:34:04

by Jan Beulich

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

>>> On 22.08.13 at 13:27, Cyrill Gorcunov <[email protected]> wrote:
> On Thu, Aug 22, 2013 at 08:27:45AM +0100, Jan Beulich wrote:
>> >>> On 22.08.13 at 09:03, Cyrill Gorcunov <[email protected]> wrote:
>> > Ok, how about this?
>> >
>> > static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
>> > {
>> > BUG_ON(pte_present(pte));
>> > return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
>> > }
>>
>> Sure, fine with me. Perhaps VM_BUG_ON() or some other similar
>> construct limiting the scope when any extra code gets generated
>> would do too.
>
> Sorry for delay, the patch is below.
>
>>
>> But as said, even better would perhaps be to have it act on a
>> swp_entry_t.
>
> swp_entry_t is too small already to keep additional status bit,
> unfortunately.
> ---
> From: Cyrill Gorcunov <[email protected]>
> Subject: [PATCH] mm: Make sure _PAGE_SWP_SOFT_DIRTY bit is not set on
> present pte
>
> _PAGE_SOFT_DIRTY bit should never be set on present pte so add
> VM_BUG_ON to catch any potential future abuse.
>
> Also add a comment on _PAGE_SWP_SOFT_DIRTY definition explaining
> scope of its usage.
>
> Signed-off-by: Cyrill Gorcunov <[email protected]>

Acked-by: Jan Beulich <[email protected]>

Thanks, Jan

2013-08-22 12:18:32

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On 08/22/2013 03:33 PM, Jan Beulich wrote:

>> From: Cyrill Gorcunov <[email protected]>
>> Subject: [PATCH] mm: Make sure _PAGE_SWP_SOFT_DIRTY bit is not set on
>> present pte
>>
>> _PAGE_SOFT_DIRTY bit should never be set on present pte so add
>> VM_BUG_ON to catch any potential future abuse.
>>
>> Also add a comment on _PAGE_SWP_SOFT_DIRTY definition explaining
>> scope of its usage.
>>
>> Signed-off-by: Cyrill Gorcunov <[email protected]>
>
> Acked-by: Jan Beulich <[email protected]>

Acked-by: Pavel Emelyanov <[email protected]>

2013-08-22 13:12:26

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Thu, Aug 22, 2013 at 03:37:33PM +0900, Minchan Kim wrote:
> >
> > Look, good thing is that 7th bit also available on the 4level pages
> > (ie x86-64) without additional code modification, that's why I picked
> > it in first place. I prepared the patch locally which doesn't use
> > pse bit for tracking but it only makes code more complex.
>
> I'm not sure you read this thread.
> http://comments.gmane.org/gmane.linux.kernel.mm/101756
> In summary, I'd like to use it to track sub-ranges of some processes.
>
> I already had a time to investigate and it enhanced our workload x2 on ARM
> so I'd like to expand the concept for more general purpose.
>
> For it, arch-specific stuff would be hurdle for port.
> So, I support your non-arch-specific solutions.
> Just FYI. Such future plan shouldn't force you.

Hi Minchan, thanks for info, I'll read the thread (sorry for delay,
managed to miss your email).

2013-08-27 22:05:53

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

On Wed, 2013-08-21 at 09:30 -0700, Linus Torvalds wrote:
> I will be reverting the whole soft-dirty mess. I thought the
> bit-mapping games it played were already too complicated (the patch to
> pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
> came in very late, so I'm not positive about the whole soft-dirty mess
> in the first place). I really am not at all inclined to want to play
> games in this area any more. It's too damn late in the release window.

Anything that makes me try to scavenge a new PTE bits makes me
scream :-) Dunno if I'll manage to support this on power.

Also, it sort-of duplicates what KVM does for dirty tracking (for
migration, framebuffer updates, etc...). I wonder if KVM could consider
switching to this scheme, but then we end up with a user "break KVM"
file in /proc since the user can clear the refs.

I'd have been happier if the whole thing had essentially used a parallel
set of dirty tracking bitmaps (hooked up with the VMAs maybe). Add the
overhead there for as many "clients" as you want who will use the
facility and leave the PTE mostly alone basically. (I suppose we still
need to play PTE tricks to differenciate soft dirty RO vs. COW RO on
anonymous memory ?)

Cheers,
Ben.