The current implementation of memory hotremoval relies on the fact that
pages can be unmapped from process address spaces. After successful
unmapping, subsequent accesses to the pages are blocked and don't
interfere with the hotremoval operation.
However, this code
if (PageSwapCache(page) &&
page_count(page) != page_mapcount(page) + 2) {
ret = SWAP_FAIL;
goto out_unmap;
}
in try_to_unmap_one() prevents unmapping pages that are referenced via
get_user_pages(), and such references can be held for a long time if
they are due to something such as direct IO.
I've made a test program that issues multiple direct IO read requests
against a single read buffer, and pages that belong to the buffer
cannot be hotremoved because they aren't unmapped.
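For reference, a rough sketch of the kind of test I mean (not the actual
program; the file name, block size, and the use of POSIX AIO are only for
illustration). It queues several O_DIRECT reads into one aligned buffer
without waiting for the earlier requests to complete, so the buffer's
pages stay referenced by in-flight direct IO. Compile with -lrt.

#define _GNU_SOURCE
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096   /* assumed multiple of the device block size */
#define NREQ  8      /* number of overlapping read requests */

int main(void)
{
    struct aiocb cb[NREQ];
    void *buf;
    int fd, i;

    fd = open("testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        exit(1);
    }
    /* O_DIRECT wants an aligned buffer; all requests share this one. */
    if (posix_memalign(&buf, BLKSZ, BLKSZ)) {
        perror("posix_memalign");
        exit(1);
    }
    for (;;) {
        memset(cb, 0, sizeof(cb));
        /* queue the whole batch before waiting, so the reads overlap */
        for (i = 0; i < NREQ; i++) {
            cb[i].aio_fildes = fd;
            cb[i].aio_buf = buf;              /* single shared read buffer */
            cb[i].aio_nbytes = BLKSZ;
            cb[i].aio_offset = (off_t)i * BLKSZ;
            if (aio_read(&cb[i]) < 0) {
                perror("aio_read");
                exit(1);
            }
        }
        for (i = 0; i < NREQ; i++) {
            const struct aiocb *list[1] = { &cb[i] };
            aio_suspend(list, 1, NULL);
            aio_return(&cb[i]);
        }
    }
    return 0;
}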
The following patch, which is against linux-2.6.11-rc1-mm1 and has also
been tested with linux-2.6.11-rc2-mm2, fixes this issue. The purpose of
this patch is to make it possible to unmap pages whose page_count has
been incremented. To do that consistently, the COW detection logic needs
to be modified not to rely on page_count. I'm aware that such
extensive use of page_mapcount is discouraged and there is a plan to
kill page_mapcount (*), but I cannot think of a better alternative
solution.
(*) c.f. http://www.ussg.iu.edu/hypermail/linux/kernel/0406.0/0483.html
Some notes about my code:
- I think it's safe to rely on page_mapcount in do_swap_page(),
because its use is protected by lock_page().
- The can_share_swap_page() call in do_swap_page() always returns
false. It is inefficient but should be harmless. Incrementing
page_mapcount before calling that function should fix the problem,
but it may cause bad side effects.
- Another obvious solution to this issue is to find the "offending"
process from an un-unmappable page and suspend it until the page is
unmapped. I'm afraid the implementation would be much more complicated.
- I could not test the following situation. It should be possible
to write some kernel code to do that, but please let me know if
you know any such test cases.
- A page_count is incremented by get_user_pages().
- The page gets unmapped.
- The process causes a write fault for the page, before the
incremented page_count is dropped.
Also, while I've tried carefully not to make mistakes and have done some
testing, I'm not entirely sure this is bug free. Please comment.
--- mm/memory.c.orig 2005-01-17 14:47:11.000000000 +0900
+++ mm/memory.c 2005-01-17 14:55:51.000000000 +0900
@@ -1786,10 +1786,6 @@ static int do_swap_page(struct mm_struct
}
/* The page isn't present yet, go ahead with the fault. */
-
- swap_free(entry);
- if (vm_swap_full())
- remove_exclusive_swap_page(page);
mm->rss++;
acct_update_integrals();
@@ -1800,6 +1796,10 @@ static int do_swap_page(struct mm_struct
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
write_access = 0;
}
+
+ swap_free(entry);
+ if (vm_swap_full())
+ remove_exclusive_swap_page(page);
unlock_page(page);
flush_icache_page(vma, page);
--- mm/rmap.c.orig 2005-01-17 14:40:08.000000000 +0900
+++ mm/rmap.c 2005-01-21 12:34:06.000000000 +0900
@@ -569,8 +569,11 @@ static int try_to_unmap_one(struct page
*/
if (PageSwapCache(page) &&
page_count(page) != page_mapcount(page) + 2) {
- ret = SWAP_FAIL;
- goto out_unmap;
+ if (page_mapcount(page) > 1) { /* happens when COW is in progress */
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+ printk("unmapping GUPed page\n");
}
/* Nuke the page table entry. */
--- mm/swapfile.c.orig 2005-01-17 14:47:11.000000000 +0900
+++ mm/swapfile.c 2005-01-31 16:59:11.000000000 +0900
@@ -293,7 +293,7 @@ static int exclusive_swap_page(struct pa
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
write_lock_irq(&swapper_space.tree_lock);
- if (page_count(page) == 2)
+ if (page_mapcount(page) == 1)
retval = 1;
write_unlock_irq(&swapper_space.tree_lock);
}
@@ -316,22 +316,17 @@ int can_share_swap_page(struct page *pag
if (!PageLocked(page))
BUG();
- switch (page_count(page)) {
- case 3:
- if (!PagePrivate(page))
- break;
- /* Fallthrough */
- case 2:
- if (!PageSwapCache(page))
- break;
- retval = exclusive_swap_page(page);
- break;
- case 1:
- if (PageReserved(page))
- break;
- retval = 1;
+
+ if (!PageSwapCache(page)) {
+ if (PageAnon(page) && page_mapcount(page) == 1 &&
+ !PageReserved(page))
+ return 1;
+ return 0;
}
- return retval;
+ if (page_mapcount(page) <= 1 && exclusive_swap_page(page))
+ return 1;
+
+ return 0;
}
/*
On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
> The current implementation of memory hotremoval relies on the fact that
> pages can be unmapped from process address spaces. After successful
> unmapping, subsequent accesses to the pages are blocked and don't
> interfere with the hotremoval operation.
>
> However, this code
>
> if (PageSwapCache(page) &&
> page_count(page) != page_mapcount(page) + 2) {
> ret = SWAP_FAIL;
> goto out_unmap;
> }
Yes, that is odd code. It would be nice to have a solution without it.
> in try_to_unmap_one() prevents unmapping pages that are referenced via
> get_user_pages(), and such references can be held for a long time if
> they are due to something such as direct IO.
> I've made a test program that issues multiple direct IO read requests
> against a single read buffer, and pages that belong to the buffer
> cannot be hotremoved because they aren't unmapped.
I haven't looked at the rest of your hotremoval, so it's not obvious
to me how a change here would help you - obviously you wouldn't want
to be migrating pages while direct IO to them was in progress.
I presume your patch works for you by letting the page count fall
to a point where migration moves it automatically as soon as the
got_user_pages are put, where without your patch the count is held
too high, and you keep doing scans which tend to miss the window
in which those pages are put?
> The following patch, which is against linux-2.6.11-rc1-mm1 and has also
> been tested with linux-2.6.11-rc2-mm2, fixes this issue. The purpose of
> this patch is to make it possible to unmap pages whose page_count has
> been incremented. To do that consistently, the COW detection logic needs
> to be modified not to rely on page_count. I'm aware that such
> extensive use of page_mapcount is discouraged and there is a plan to
> kill page_mapcount (*), but I cannot think of a better alternative
> solution.
>
> (*) c.f. http://www.ussg.iu.edu/hypermail/linux/kernel/0406.0/0483.html
I apologize for scaring you off page mapcount. I have no current
plans to scrap it, and feel a lot more satisfied with it than at the
time of that comment. Partly because it's now manipulated atomically
rather than under bitspin lock. Partly because I realize that although
64-bit systems are overdue for an atomic64 page count and page mapcount,
we can actually just use one atomic64 for them both, keeping, say, lower
24 bits for count and upper 40 for mapcount (and not repeating mapcount
in count on these arches), so mapcount won't increase struct page size.
Go right ahead and use page mapcount if it's appropriate.
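(Purely as a toy illustration of that packing idea, and emphatically not
existing kernel code: a single 64-bit atomic word holding the reference
count in the low 24 bits and the mapcount in the upper 40, written here
with C11 atomics; all the names are made up and only echo the kernel ones.)

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define COUNT_BITS   24
#define COUNT_MASK   ((UINT64_C(1) << COUNT_BITS) - 1)
#define MAPCOUNT_ONE (UINT64_C(1) << COUNT_BITS)

typedef _Atomic uint64_t page_counts_t;

/* count lives in the low 24 bits */
static void toy_get_page(page_counts_t *pc)         { atomic_fetch_add(pc, 1); }
static void toy_put_page(page_counts_t *pc)         { atomic_fetch_sub(pc, 1); }
/* mapcount lives in the upper 40 bits */
static void toy_page_add_rmap(page_counts_t *pc)    { atomic_fetch_add(pc, MAPCOUNT_ONE); }
static void toy_page_remove_rmap(page_counts_t *pc) { atomic_fetch_sub(pc, MAPCOUNT_ONE); }

static unsigned int toy_page_count(page_counts_t *pc)
{
    return (unsigned int)(atomic_load(pc) & COUNT_MASK);
}

static unsigned int toy_page_mapcount(page_counts_t *pc)
{
    return (unsigned int)(atomic_load(pc) >> COUNT_BITS);
}

int main(void)
{
    page_counts_t pc = 0;

    toy_get_page(&pc);        /* e.g. the swap cache's reference */
    toy_page_add_rmap(&pc);   /* one pte maps the page */
    toy_page_add_rmap(&pc);   /* a second pte after fork */
    printf("count=%u mapcount=%u\n", toy_page_count(&pc), toy_page_mapcount(&pc));

    toy_page_remove_rmap(&pc);
    toy_put_page(&pc);
    return 0;
}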
> Some notes about my code:
>
> - I think it's safe to rely on page_mapcount in do_swap_page(),
> because its use is protected by lock_page().
I think so too.
> - The can_share_swap_page() call in do_swap_page() always returns
> false. It is inefficient but should be harmless. Incrementing
> page_mapcount before calling that function should fix the problem,
> but it may cause bad side effects.
Odd that your patch moves it if it now doesn't even work!
But I think some more movement should be able to solve that.
> - Another obvious solution to this issue is to find the "offending"
> process from an un-unmappable page and suspend it until the page is
> unmapped. I'm afraid the implementation would be much more complicated.
Agreed, let's not get into that.
> - I could not test the following situation. It should be possible
> to write some kernel code to do that, but please let me know if
> you know any such test cases.
> - A page_count is incremented by get_user_pages().
> - The page gets unmapped.
> - The process causes a write fault for the page, before the
> incremented page_count is dropped.
I confess I don't have such a test case ready myself.
> Also, while I've tried carefully not to make mistakes and have done some
> testing, I'm not entirely sure this is bug free. Please comment.
>
> --- mm/memory.c.orig 2005-01-17 14:47:11.000000000 +0900
> +++ mm/memory.c 2005-01-17 14:55:51.000000000 +0900
> @@ -1786,10 +1786,6 @@ static int do_swap_page(struct mm_struct
> }
>
> /* The page isn't present yet, go ahead with the fault. */
> -
> - swap_free(entry);
> - if (vm_swap_full())
> - remove_exclusive_swap_page(page);
>
> mm->rss++;
> acct_update_integrals();
> @@ -1800,6 +1796,10 @@ static int do_swap_page(struct mm_struct
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> write_access = 0;
> }
> +
> + swap_free(entry);
> + if (vm_swap_full())
> + remove_exclusive_swap_page(page);
> unlock_page(page);
>
> flush_icache_page(vma, page);
> --- mm/rmap.c.orig 2005-01-17 14:40:08.000000000 +0900
> +++ mm/rmap.c 2005-01-21 12:34:06.000000000 +0900
> @@ -569,8 +569,11 @@ static int try_to_unmap_one(struct page
> */
> if (PageSwapCache(page) &&
> page_count(page) != page_mapcount(page) + 2) {
> - ret = SWAP_FAIL;
> - goto out_unmap;
> + if (page_mapcount(page) > 1) { /* happens when COW is in progress */
> + ret = SWAP_FAIL;
> + goto out_unmap;
> + }
> + printk("unmapping GUPed page\n");
> }
>
> /* Nuke the page table entry. */
I'm disappointed to see you making this more complicated, it's
even harder to understand than before. I believe that if we're
going to make good use of page_mapcount in can_share_swap_page,
it should be possible to delete this block from try_to_unmap_one
altogether. Or did you try that, and find it goes wrong?
(The swapoff progress case can be dealt with differently,
by shifting unuse_pmd's activate_page to the start of unuse_process,
before it might drop page lock.)
> --- mm/swapfile.c.orig 2005-01-17 14:47:11.000000000 +0900
> +++ mm/swapfile.c 2005-01-31 16:59:11.000000000 +0900
> @@ -293,7 +293,7 @@ static int exclusive_swap_page(struct pa
> if (p->swap_map[swp_offset(entry)] == 1) {
> /* Recheck the page count with the swapcache lock held.. */
> write_lock_irq(&swapper_space.tree_lock);
> - if (page_count(page) == 2)
> + if (page_mapcount(page) == 1)
> retval = 1;
> write_unlock_irq(&swapper_space.tree_lock);
> }
I think that the write_lock_irq stuff here is unnecessary,
given that we already have the page lock - do you agree?
But you're quite right not to change that in the same patch.
> @@ -316,22 +316,17 @@ int can_share_swap_page(struct page *pag
>
> if (!PageLocked(page))
> BUG();
> - switch (page_count(page)) {
> - case 3:
> - if (!PagePrivate(page))
> - break;
> - /* Fallthrough */
> - case 2:
> - if (!PageSwapCache(page))
> - break;
> - retval = exclusive_swap_page(page);
> - break;
> - case 1:
> - if (PageReserved(page))
> - break;
> - retval = 1;
> +
> + if (!PageSwapCache(page)) {
> + if (PageAnon(page) && page_mapcount(page) == 1 &&
> + !PageReserved(page))
> + return 1;
> + return 0;
> }
> - return retval;
> + if (page_mapcount(page) <= 1 && exclusive_swap_page(page))
> + return 1;
> +
> + return 0;
> }
>
> /*
This can_share_swap_page looks messier than I'd want.
I was hoping to append my own patch to this response, but I haven't
got it working right yet (swap seems too full). I hope tomorrow,
but thought I'd better not delay these comments any longer.
Hugh
On Mon, 7 Feb 2005, Hugh Dickins wrote:
> On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
> > The current implementation of memory hotremoval relies on the fact that
> > pages can be unmapped from process address spaces. After successful
> > unmapping, subsequent accesses to the pages are blocked and don't
> > interfere with the hotremoval operation.
> >
> > However, this code
> >
> > if (PageSwapCache(page) &&
> > page_count(page) != page_mapcount(page) + 2) {
> > ret = SWAP_FAIL;
> > goto out_unmap;
> > }
>
> Yes, that is odd code. It would be nice to have a solution without it.
>
> > in try_to_unmap_one() prevents unmapping pages that are referenced via
> > get_user_pages(), and such references can be held for a long time if
> > they are due to something such as direct IO.
> > I've made a test program that issues multiple direct IO read requests
> > against a single read buffer, and pages that belong to the buffer
> > cannot be hotremoved because they aren't unmapped.
>
> ....
>
> I was hoping to append my own patch to this response, but I haven't
> got it working right yet (swap seems too full). I hope tomorrow,
> but thought I'd better not delay these comments any longer.
Seems it was okay after all, I got confused by an unrelated issue.
Here's what I had in mind, what do you think? Does it indeed help
with your problem? I'm copying Andrea because it was he who devised
that fix to the get_user_pages issue, and also (I think, longer ago)
can_share_swap_page itself.
This does rely on us moving 1 from mapcount to swapcount or back only
while page is locked - there are places (e.g. exit) where mapcount
comes down without page being locked, but that's not an issue: we just
have to be sure that once we have mapcount, it can't go up while reading
swapcount (I've probably changed more than is strictly necessary, but
this seemed clearest to me).
I've left out how we ensure swapoff makes progress on a page with high
mapcount, haven't quite made up my mind about that: it's less of an
issue now there's an activate_page in unuse_pte, plus it's not an
issue which will much bother anyone but me, can wait until after.
That PageAnon check in do_wp_page: seems worthwhile to avoid locking
and unlocking the page if it's a file page, leaves can_share_swap_page
simpler (a PageReserved is never PageAnon). But the patch is against
2.6.11-rc3-mm1, so I appear to be breaking the intention of
do_wp_page_mk_pte_writable ("on a shared-writable page"),
believe that's already broken but need to study it more.
Hugh
--- 2.6.11-rc3-mm1/mm/memory.c 2005-02-05 07:09:40.000000000 +0000
+++ linux/mm/memory.c 2005-02-07 19:50:47.000000000 +0000
@@ -1339,7 +1339,7 @@ static int do_wp_page(struct mm_struct *
}
old_page = pfn_to_page(pfn);
- if (!TestSetPageLocked(old_page)) {
+ if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
int reuse = can_share_swap_page(old_page);
unlock_page(old_page);
if (reuse) {
@@ -1779,10 +1779,6 @@ static int do_swap_page(struct mm_struct
}
/* The page isn't present yet, go ahead with the fault. */
-
- swap_free(entry);
- if (vm_swap_full())
- remove_exclusive_swap_page(page);
mm->rss++;
pte = mk_pte(page, vma->vm_page_prot);
@@ -1790,12 +1786,16 @@ static int do_swap_page(struct mm_struct
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
write_access = 0;
}
- unlock_page(page);
flush_icache_page(vma, page);
set_pte(page_table, pte);
page_add_anon_rmap(page, vma, address);
+ swap_free(entry);
+ if (vm_swap_full())
+ remove_exclusive_swap_page(page);
+ unlock_page(page);
+
if (write_access) {
if (do_wp_page(mm, vma, address,
page_table, pmd, pte) == VM_FAULT_OOM)
--- 2.6.11-rc3-mm1/mm/rmap.c 2005-02-05 07:09:40.000000000 +0000
+++ linux/mm/rmap.c 2005-02-07 12:59:21.000000000 +0000
@@ -551,27 +551,6 @@ static int try_to_unmap_one(struct page
goto out_unmap;
}
- /*
- * Don't pull an anonymous page out from under get_user_pages.
- * GUP carefully breaks COW and raises page count (while holding
- * page_table_lock, as we have here) to make sure that the page
- * cannot be freed. If we unmap that page here, a user write
- * access to the virtual address will bring back the page, but
- * its raised count will (ironically) be taken to mean it's not
- * an exclusive swap page, do_wp_page will replace it by a copy
- * page, and the user never get to see the data GUP was holding
- * the original page for.
- *
- * This test is also useful for when swapoff (unuse_process) has
- * to drop page lock: its reference to the page stops existing
- * ptes from being unmapped, so swapoff can make progress.
- */
- if (PageSwapCache(page) &&
- page_count(page) != page_mapcount(page) + 2) {
- ret = SWAP_FAIL;
- goto out_unmap;
- }
-
/* Nuke the page table entry. */
flush_cache_page(vma, address);
pteval = ptep_clear_flush(vma, address, pte);
--- 2.6.11-rc3-mm1/mm/swapfile.c 2005-02-05 07:09:40.000000000 +0000
+++ linux/mm/swapfile.c 2005-02-07 19:48:50.000000000 +0000
@@ -257,61 +257,37 @@ void swap_free(swp_entry_t entry)
}
/*
- * Check if we're the only user of a swap page,
- * when the page is locked.
+ * How many references to page are currently swapped out?
*/
-static int exclusive_swap_page(struct page *page)
+static inline int page_swapcount(struct page *page)
{
- int retval = 0;
- struct swap_info_struct * p;
+ int count = 0;
+ struct swap_info_struct *p;
swp_entry_t entry;
entry.val = page->private;
p = swap_info_get(entry);
if (p) {
- /* Is the only swap cache user the cache itself? */
- if (p->swap_map[swp_offset(entry)] == 1) {
- /* Recheck the page count with the swapcache lock held.. */
- write_lock_irq(&swapper_space.tree_lock);
- if (page_count(page) == 2)
- retval = 1;
- write_unlock_irq(&swapper_space.tree_lock);
- }
+ /* Subtract the 1 for the swap cache itself */
+ count = p->swap_map[swp_offset(entry)] - 1;
swap_info_put(p);
}
- return retval;
+ return count;
}
/*
* We can use this swap cache entry directly
* if there are no other references to it.
- *
- * Here "exclusive_swap_page()" does the real
- * work, but we opportunistically check whether
- * we need to get all the locks first..
*/
int can_share_swap_page(struct page *page)
{
- int retval = 0;
+ int count;
- if (!PageLocked(page))
- BUG();
- switch (page_count(page)) {
- case 3:
- if (!PagePrivate(page))
- break;
- /* Fallthrough */
- case 2:
- if (!PageSwapCache(page))
- break;
- retval = exclusive_swap_page(page);
- break;
- case 1:
- if (PageReserved(page))
- break;
- retval = 1;
- }
- return retval;
+ BUG_ON(!PageLocked(page));
+ count = page_mapcount(page);
+ if (count <= 1 && PageSwapCache(page))
+ count += page_swapcount(page);
+ return count == 1;
}
/*
At Mon, 7 Feb 2005 21:24:59 +0000 (GMT),
Hugh Dickins wrote:
>
> On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
> > The current implementation of memory hotremoval relies on the fact that
> > pages can be unmapped from process address spaces. After successful
> > unmapping, subsequent accesses to the pages are blocked and don't
> > interfere with the hotremoval operation.
> >
> > However, this code
> >
> > if (PageSwapCache(page) &&
> > page_count(page) != page_mapcount(page) + 2) {
> > ret = SWAP_FAIL;
> > goto out_unmap;
> > }
>
> Yes, that is odd code. It would be nice to have a solution without it.
>
> > in try_to_unmap_one() prevents unmapping pages that are referenced via
> > get_user_pages(), and such references can be held for a long time if
> > they are due to something such as direct IO.
> > I've made a test program that issues multiple direct IO read requests
> > against a single read buffer, and pages that belong to the buffer
> > cannot be hotremoved because they aren't unmapped.
>
> I haven't looked at the rest of your hotremoval, so it's not obvious
> to me how a change here would help you - obviously you wouldn't want
> to be migrating pages while direct IO to them was in progress.
>
> I presume your patch works for you by letting the page count fall
> to a point where migration moves it automatically as soon as the
> got_user_pages are put, where without your patch the count is held
> too high, and you keep doing scans which tend to miss the window
> in which those pages are put?
Yes. And my test program has no such time window because IO requests
are issued without waiting for completion of older requests.
I think issuing IO requests in such a manner is nonsense, but
a misbehaving process shouldn't be able to prevent memory hotremoval.
If the hotremoval code can unmap a page from a process space, the
process will be blocked when it causes a page fault against the page.
So, a process cannot continuously issue direct IO requests to keep
page counts high. (get_user_pages() causes a (simulated) page fault.)
> > - The can_share_swap_page() call in do_swap_page() always returns
> > false. It is inefficient but should be harmless. Incrementing
> > page_mapcount before calling that function should fix the problem,
> > but it may cause bad side effects.
>
> Odd that your patch moves it if it now doesn't even work!
> But I think some more movement should be able to solve that.
Moving swap_free() is necessary to avoid can_share_swap_page()
returning true when it shouldn't. And write access cases are handled
later by the do_wp_page() call, anyway.
> > - Another obvious solution to this issue is to find the "offending"
> > process from an un-unmappable page and suspend it until the page is
> > unmapped. I'm afraid the implementation would be much more complicated.
>
> Agreed, let's not get into that.
Nice to hear that.
> > --- mm/memory.c.orig 2005-01-17 14:47:11.000000000 +0900
> > +++ mm/memory.c 2005-01-17 14:55:51.000000000 +0900
> > @@ -1786,10 +1786,6 @@ static int do_swap_page(struct mm_struct
> > }
> >
> > /* The page isn't present yet, go ahead with the fault. */
> > -
> > - swap_free(entry);
> > - if (vm_swap_full())
> > - remove_exclusive_swap_page(page);
> >
> > mm->rss++;
> > acct_update_integrals();
> > @@ -1800,6 +1796,10 @@ static int do_swap_page(struct mm_struct
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > write_access = 0;
> > }
> > +
> > + swap_free(entry);
> > + if (vm_swap_full())
> > + remove_exclusive_swap_page(page);
> > unlock_page(page);
> >
> > flush_icache_page(vma, page);
> > --- mm/rmap.c.orig 2005-01-17 14:40:08.000000000 +0900
> > +++ mm/rmap.c 2005-01-21 12:34:06.000000000 +0900
> > @@ -569,8 +569,11 @@ static int try_to_unmap_one(struct page
> > */
> > if (PageSwapCache(page) &&
> > page_count(page) != page_mapcount(page) + 2) {
> > - ret = SWAP_FAIL;
> > - goto out_unmap;
> > + if (page_mapcount(page) > 1) { /* happens when COW is in progress */
> > + ret = SWAP_FAIL;
> > + goto out_unmap;
> > + }
> > + printk("unmapping GUPed page\n");
> > }
> >
> > /* Nuke the page table entry. */
>
> I'm disappointed to see you making this more complicated, it's
> even harder to understand than before. I believe that if we're
> going to make good use of page_mapcount in can_share_swap_page,
> it should be possible to delete this block from try_to_unmap_one
> altogether. Or did you try that, and find it goes wrong?
I just wanted to be conservative to get a working patch.
I think this block can be deleted.
> > --- mm/swapfile.c.orig 2005-01-17 14:47:11.000000000 +0900
> > +++ mm/swapfile.c 2005-01-31 16:59:11.000000000 +0900
> This can_share_swap_page looks messier than I'd want.
>
> I was hoping to append my own patch to this response, but I haven't
> got it working right yet (swap seems too full). I hope tomorrow,
> but thought I'd better not delay these comments any longer.
I saw your patch in the other mail, and it looks better and should fix
my problem. I'll try and report.
--
IWAMOTO Toshihiro
At Tue, 8 Feb 2005 16:26:26 +0000 (GMT),
Hugh Dickins wrote:
>
> On Mon, 7 Feb 2005, Hugh Dickins wrote:
> > On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
> > > The current implementation of memory hotremoval relies on the fact that
> > > pages can be unmapped from process address spaces. After successful
> > > unmapping, subsequent accesses to the pages are blocked and don't
> > > interfere with the hotremoval operation.
> > >
> > > However, this code
> > >
> > > if (PageSwapCache(page) &&
> > > page_count(page) != page_mapcount(page) + 2) {
> > > ret = SWAP_FAIL;
> > > goto out_unmap;
> > > }
> >
> > Yes, that is odd code. It would be nice to have a solution without it.
> >
> > > in try_to_unmap_one() prevents unmapping pages that are referenced via
> > > get_user_pages(), and such references can be held for a long time if
> > > they are due to something such as direct IO.
> > > I've made a test program that issues multiple direct IO read requests
> > > against a single read buffer, and pages that belong to the buffer
> > > cannot be hotremoved because they aren't unmapped.
> >
> > ....
> >
> > I was hoping to append my own patch to this response, but I haven't
> > got it working right yet (swap seems too full). I hope tomorrow,
> > but thought I'd better not delay these comments any longer.
>
> Seems it was okay after all, I got confused by an unrelated issue.
> Here's what I had in mind, what do you think? Does it indeed help
> with your problem? I'm copying Andrea because it was he who devised
> that fix to the get_user_pages issue, and also (I think, longer ago)
> can_share_swap_page itself.
I've tested with linux-2.6.10-rc2-mm3 + my hotremoval patch, and it
passed hotremoval tests including the direct IO related one.
Thanks.
> --- 2.6.11-rc3-mm1/mm/memory.c 2005-02-05 07:09:40.000000000 +0000
> +++ linux/mm/memory.c 2005-02-07 19:50:47.000000000 +0000
--
IWAMOTO Toshihiro
On Tue, Feb 08, 2005 at 04:26:26PM +0000, Hugh Dickins wrote:
> Seems it was okay after all, I got confused by an unrelated issue.
> Here's what I had in mind, what do you think? Does it indeed help
> with your problem? I'm copying Andrea because it was he who devised
> that fix to the get_user_pages issue, and also (I think, longer ago)
> can_share_swap_page itself.
>
> This does rely on us moving 1 from mapcount to swapcount or back only
> while page is locked - there are places (e.g. exit) where mapcount
> comes down without page being locked, but that's not an issue: we just
> have to be sure that once we have mapcount, it can't go up while reading
> swapcount (I've probably changed more than is strictly necessary, but
> this seemed clearest to me).
>
> I've left out how we ensure swapoff makes progress on a page with high
> mapcount, haven't quite made up my mind about that: it's less of an
> issue now there's an activate_page in unuse_pte, plus it's not an
> issue which will much bother anyone but me, can wait until after.
>
> That PageAnon check in do_wp_page: seems worthwhile to avoid locking
> and unlocking the page if it's a file page, leaves can_share_swap_page
> simpler (a PageReserved is never PageAnon). But the patch is against
> 2.6.11-rc3-mm1, so I appear to be breaking the intention of
> do_wp_page_mk_pte_writable ("on a shared-writable page"),
> believe that's already broken but need to study it more.
The reason pinned pages cannot be unmapped is that if they're being
unmapped while a rawio read is in progress, a cow on some shared
swapcache under I/O could happen.
If a try_to_unmap + COW over a shared swapcache happens during a rawio
read, then the rawio reads will get lost generating data corruption.
I haven't had much time to check the patch yet, but it's quite complex;
could you explain again how you prevent a COW from happening while the
rawio is in flight if you don't pin the pte? The PG_locked bitflag
cannot matter at all because the rawio read won't lock the page. What
you have to prevent is the pte being zeroed out; it must be kept
writeable during the whole I/O. I don't see how you can allow unmapping
without running into COWs later on.
This is the only thing I care to understand really, since it's the only
case that the pte pinning was fixing.
Overall I see nothing wrong in preventing memory removal while rawio is
in flight. rawio cannot be in flight forever (ptrace and core dump as
well should complete eventually). Why can't you simply drop pages from
the freelist one by one until all of them are removed from the freelist?
On Thu, 2005-02-10 at 20:05 +0100, Andrea Arcangeli wrote:
> Overall I see nothing wrong in preventing memory removal while rawio is
> in flight. rawio cannot be in flight forever (ptrace and core dump as
> well should complete eventually). Why can't you simply drop pages from
> the freelist one by one until all of them are removed from the freelist?
We actually do that, in addition to the more active methods of capturing
the memory that we're about to remove.
You're right, I don't really see a problem with ignoring those pages, at
least in the active migration code. We would, of course, like to keep
the number of things that we depend on good faith to get migrated to a
minimum, but things under I/O are the least of our problems.
The only thing we might want to do is put something in the rawio code to
look for the PG_capture pages to ensure that it gives the migration code
a shot at them every once in a while (when I/O is not in flight,
obviously).
-- Dave
On Thu, Feb 10, 2005 at 11:16:44AM -0800, Dave Hansen wrote:
> We actually do that, in addition to the more active methods of capturing
> the memory that we're about to remove.
This sounds like the best way to go. btw, is this a different codebase from
the one that Toshihiro is testing?
> You're right, I don't really see a problem with ignoring those pages, at
> least in the active migration code. We would, of course, like to keep
> the number of things that we depend on good faith to get migrated to a
> minimum, but things under I/O are the least of our problems.
Indeed. It's very similar to locked pages. All pages can be pinned for a
transient amount of time, either in the pte or with
PG_pinned/PG_writeback (now perhaps Hugh found a way to drop the pin on
the pte [I'm still unconvinced about that], but you're still left with
transient pinning with PG_locked or PG_writeback).
> The only thing we might want to do is put something in the rawio code to
> look for the PG_capture pages to ensure that it gives the migration code
> a shot at them every once in a while (when I/O is not in flight,
> obviously).
If there are persistent usages, PG_capture sounds like a good idea. Perhaps
the whole point that Toshihiro has a problem with is that there really are
persistent users that require PG_capture? He mentioned direct IO; that's
not a long time, while something like a core dump could be a long time.
On Thu, 10 Feb 2005, Andrea Arcangeli wrote:
>
> The reason pinned pages cannot be unmapped is that if they're being
> unmapped while a rawio read is in progress, a cow on some shared
> swapcache under I/O could happen.
>
> If a try_to_unmap + COW over a shared swapcache happens during a rawio
> read, then the rawio reads will get lost generating data corruption.
Yes, but...
get_user_pages broke COW in advance of the I/O. The reason for
subsequent COW if the page gets unmapped is precisely because
can_share_swap_page used page_count to judge exclusivity, and
get_user_pages has bumped that up, so the page appears (in danger
of being) non-exclusive when actually it is exclusive. By changing
can_share_swap_page to use page_mapcount, that frustration vanishes.
(Of course, if the process forks while async I/O is in progress,
these I/O pages could still get COWed, and either parent or child
lose some of the data being read - but behaviour in that case is
anyway undefined, isn't it? Just so long as kernel doesn't crash.)
> I had not much time to check the patch yet, but it's quite complex and
> could you explain again how do you prevent a cow to happen while the
> rawio is in flight if you don't pin the pte? The PG_locked bitflag
> cannot matter at all because the rawio read won't lock the page. What
> you have to prevent is the pte to be zeroed out, it must be kept
> writeable during the whole I/O. I don't see how you can allow unmapping
> without running into COWs later on.
The page_mapcount+page_swapcount test is simply a better test for
exclusivity than the page_count test. Since it says
exclusive when the page is exclusive, no COW occurs (and I've just
remembered that we're actually talking of the odd case where the
process itself writes into another portion of the page under I/O,
to get that write fault: but that still works).
page_count does still play a vital role, in ensuring that the page
stays held in the swapcache, cannot actually be reused for something
else while it's unmapped.
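(If it helps to see numbers, here's a toy model of the two tests, with
plain user-space integers standing in for the real page fields rather
than kernel code: one pte maps the page, the swap cache holds it, and
direct IO holds one get_user_pages reference.)

#include <stdio.h>

int main(void)
{
    /* one pte maps the page, the swap cache holds it (swap_map == 1),
     * and get_user_pages() has taken one extra reference for direct IO */
    int mapcount  = 1;                       /* page_mapcount(page) */
    int swap_map  = 1;                       /* swap count, including the cache's own ref */
    int gup_refs  = 1;                       /* raised by get_user_pages() */
    int count     = mapcount + 1 + gup_refs; /* page_count: ptes + swap cache + GUP */
    int swapcount = swap_map - 1;            /* swapped-out references, minus the cache */

    /* old test (exclusive_swap_page): exclusive iff page_count == 2 */
    printf("old test says exclusive: %d\n", count == 2);

    /* new test (can_share_swap_page): exclusive iff mapcount + swapcount == 1 */
    printf("new test says exclusive: %d\n", mapcount + swapcount == 1);
    return 0;
}

The old test wrongly reports the page as non-exclusive merely because GUP
raised page_count; the mapcount+swapcount test still reports it exclusive,
so do_wp_page can reuse it instead of COWing.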
> This is the only thing I care to understand really, since it's the only
> case that the pte pinning was fixing.
>
> Overall I see nothing wrong in preventing memory removal while rawio is
> in flight. rawio cannot be in flight forever (ptrace and core dump as
> well should complete eventually). Why can't you simply drop pages from
> the freelist one by one until all of them are removed from the freelist?
I did ask Toshihiro about that earlier in the thread. Best look up his
actual reply, but as I understand it, his test is sufficiently thorough
that it's actually doing direct I/Os in parallel, so there's never a
moment when the page is free. He believes that an unprivileged process
should not be allowed to hold pages indefinitely unmigratable.
I have little appreciation of the hotplug/memmigration issues here.
But whatever the merits of that, I do think the new can_share_swap_page
is an improvement over the old (written in the days before page_mapcount
existed), and that it's preferable to remove the page_count heuristic
from try_to_unmap_one.
Hugh
On Thu, Feb 10, 2005 at 08:19:47PM +0000, Hugh Dickins wrote:
> On Thu, 10 Feb 2005, Andrea Arcangeli wrote:
> >
> > The reason pinned pages cannot be unmapped is that if they're being
> > unmapped while a rawio read is in progress, a cow on some shared
> > swapcache under I/O could happen.
> >
> > If a try_to_unmap + COW over a shared swapcache happens during a rawio
> > read, then the rawio reads will get lost generating data corruption.
>
> Yes, but...
>
> get_user_pages broke COW in advance of the I/O. The reason for
> subsequent COW if the page gets unmapped is precisely because
> can_share_swap_page used page_count to judge exclusivity, and
> get_user_pages has bumped that up, so the page appears (in danger
> of being) non-exclusive when actually it is exclusive. By changing
> can_share_swap_page to use page_mapcount, that frustration vanishes.
What if a second thread does a fork() while the rawio is in progress?
The page will again be non-shareable, and if you unmap it the rawio will
be corrupted if the thread that executed the fork while the I/O is in
progress writes to a part of the page again after it has been unmapped
(obviously the part of the page that isn't under read-rawio; rawio works
with the hardblocksize, not on the whole page).
On Thu, 10 Feb 2005, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2005 at 08:19:47PM +0000, Hugh Dickins wrote:
> >
> > get_user_pages broke COW in advance of the I/O. The reason for
> > subsequent COW if the page gets unmapped is precisely because
> > can_share_swap_page used page_count to judge exclusivity, and
> > get_user_pages has bumped that up, so the page appears (in danger
> > of being) non-exclusive when actually it is exclusive. By changing
> > can_share_swap_page to use page_mapcount, that frustration vanishes.
>
> What if a second thread does a fork() while the rawio is in progress?
> The page will again be non-shareable, and if you unmap it the rawio will
> be corrupted if the thread that executed the fork while the I/O is in
> progress writes to a part of the page again after it has been unmapped
> (obviously the part of the page that isn't under read-rawio; rawio works
> with the hardblocksize, not on the whole page).
I think there's no difference here between the two techniques.
With the new can_share_swap_page, yes, "page_swapcount" goes up with
the fork, the page will be judged non-exclusive, a user write to
another part of the page will replace it by a copy page, and it's
undefined how much of the I/O will be captured in the copy.
But even with the old can_share_swap_page, and try_to_unmap_one
refusing to unmap the page, copy_page_range's COW checking would
still remove write permission from the pte of an anonymous
page, so a user write to another part of the page would find
raised page_count and replace it by a copy page: same behaviour.
And it's fine for the behaviour to be somewhat undefined in this
peculiar case: the important thing is just that the page must not
be freed and reused while I/O occurs, hence get_user_page raising
the page_count - which I'm _not_ proposing to change!
Hugh
On Fri, Feb 11, 2005 at 07:23:09AM +0000, Hugh Dickins wrote:
> And it's fine for the behaviour to be somewhat undefined in this
> peculiar case: the important thing is just that the page must not
> be freed and reused while I/O occurs, hence get_user_page raising
> the page_count - which I'm _not_ proposing to change!
Ok, I'm quite convinced it's correct now. The only thing that could make
mapcount go up without the page lock and without userspace
intervention (and userspace intervention would make it an undefined
behaviour, like in my example with fork) was the swapin, and you covered
it by moving the unlock after page_add_anon_rmap (so mapcount changes
atomically with the page_swapcount there too). Swapoff was already doing
it under the page lock.
Then we should use the mapcount/swapcount in remove_exclusive_swap_page
too.
On Fri, 11 Feb 2005, Andrea Arcangeli wrote:
>
> Ok, I'm quite convinced it's correct now. The only thing that could make
> mapcount go up without the page lock and without userspace
> intervention (and userspace intervention would make it an undefined
> behaviour, like in my example with fork) was the swapin, and you covered
> it by moving the unlock after page_add_anon_rmap (so mapcount changes
> atomically with the page_swapcount there too). Swapoff was already doing
> it under the page lock.
Thanks a lot for thinking it through, yes, that's how it is.
(For a while I felt nervous about moving that unlock_page below
the arch-defined flush_icache_page; but then realized that since it's
already done with page_table spinlock, PG_locked cannot be an issue.)
> Then we should use the mapcount/swapcount in remove_exclusive_swap_page
> too.
Originally I thought so, but later wasn't so sure. There might be
somewhere which stabilizes PageSwapCache by incrementing page_count,
rechecks it, waits to get lock_page, then assumes still PageSwapCache?
(Though it's hard to see why it would need to make such an assumption,
and in the equivalent file case would have to allow for truncation.)
It just needs a wider audit than the simpler can_share_swap_page case,
and can be done independently later on.
By the way, while we're talking of remove_exclusive_swap_page:
a more functional issue I sometimes wonder about, why don't we
remove_exclusive_swap_page on write fault? Keeping the swap slot
is valuable if read fault, but once the page is dirtied, wouldn't
it usually be better to free that slot and allocate another later?
But I'm always scared of making such changes to swapping, because
I cannot imagine a good enough range of swap performance tests.
Hugh
> By the way, while we're talking of remove_exclusive_swap_page:
> a more functional issue I sometimes wonder about, why don't we
> remove_exclusive_swap_page on write fault? Keeping the swap slot
> is valuable if read fault, but once the page is dirtied, wouldn't
> it usually be better to free that slot and allocate another later?
Avoiding swap fragmentation is one reason to leave it allocated. So you
can swapin/swapout/swapin/swapout always in the same place on disk as
long as there's plenty of swap still available. I'm not sure how much
speedup this provides, but certainly it makes sense.
On Mon, 14 Feb 2005, Andrea Arcangeli wrote:
> > By the way, while we're talking of remove_exclusive_swap_page:
> > a more functional issue I sometimes wonder about, why don't we
> > remove_exclusive_swap_page on write fault? Keeping the swap slot
> > is valuable if read fault, but once the page is dirtied, wouldn't
> > it usually be better to free that slot and allocate another later?
>
> Avoiding swap fragmentation is one reason to leave it allocated. So you
> can swapin/swapout/swapin/swapout always in the same place on disk as
> long as there's plenty of swap still available. I'm not sure how much
> speedup this provides, but certainly it makes sense.
I rather thought it would tend to increase swap fragmentation: that
the next time (if) this page has to be written out to swap, the disk
has to seek back to some ancient position to write this page, when
the rest of the cluster being written is more likely to come from a
recently allocated block of contiguous swap pages (though if many of
them are being rewritten rather than newly allocated, they'll all be
all over the disk, no contiguity at all).
Of course, freeing as soon as dirty does leave a hole behind, which
tends towards swap fragmentation: but I thought the swap allocator
tried for contiguous clusters before it fell back on isolated pages
(I haven't checked, and akpm has changes to swap allocation in -mm).
Hmm, I think you're thinking of the overall fragmentation of swap,
and are correct about that; whereas I'm saying "fragmentation"
when what I'm really concerned about is increased seeking.
But only research would tell the true story. I suspect the change
from 2.4's linear vma scanning to 2.6's rmap against the bottom of
LRU may have changed the rules for swap layout strategy.
Hugh
On Mon, Feb 14, 2005 at 06:36:43PM +0000, Hugh Dickins wrote:
> On Mon, 14 Feb 2005, Andrea Arcangeli wrote:
> > > By the way, while we're talking of remove_exclusive_swap_page:
> > > a more functional issue I sometimes wonder about, why don't we
> > > remove_exclusive_swap_page on write fault? Keeping the swap slot
> > > is valuable if read fault, but once the page is dirtied, wouldn't
> > > it usually be better to free that slot and allocate another later?
> >
> > Avoiding swap fragmentation is one reason to leave it allocated. So you
> > can swapin/swapout/swapin/swapout always in the same place on disk as
> > long as there's plenty of swap still available. I'm not sure how much
> > speedup this provides, but certainly it makes sense.
>
> I rather thought it would tend to increase swap fragmentation: that
> the next time (if) this page has to be written out to swap, the disk
> has to seek back to some ancient position to write this page, when
> the rest of the cluster being written is more likely to come from a
> recently allocated block of contiguous swap pages (though if many of
> them are being rewritten rather than newly allocated, they'll all be
> all over the disk, no contiguity at all).
>
> Of course, freeing as soon as dirty does leave a hole behind, which
> tends towards swap fragmentation: but I thought the swap allocator
> tried for contiguous clusters before it fell back on isolated pages
> (I haven't checked, and akpm has changes to swap allocation in -mm).
>
> Hmm, I think you're thinking of the overall fragmentation of swap,
> and are correct about that; whereas I'm saying "fragmentation"
> when what I'm really concerned about is increased seeking.
Swapouts aren't the problem. The swapins with physical readahead are the
ones that benefit from the reduced overall fragmentation. Or at least
this was the case in 2.4, you're right something might be different now
that we don't follow a swapout virtual address space order anymore (but
there is probably still some localization effect during the ->nopage
faults).
Andrea Arcangeli <[email protected]> wrote:
>
> you're right something might be different now
> that we don't follow a swapout virtual address space order anymore
There's a patch in -mm which attempts to do so, and afair, succeeds.
However the performance seems to be crappy. Its main benefit at present
is in reducing worst-case scheduling latencies (scan_swap_map).
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc3/2.6.11-rc3-mm2/broken-out/swapspace-layout-improvements-fix.patch