Date: Thu, 03 Feb 2005 12:56:05 +0900
From: IWAMOTO Toshihiro
To: linux-kernel@vger.kernel.org
Cc: lhms-devel@lists.sourceforge.net
Subject: [RFC] Changing COW detection to be memory hotplug friendly
Message-Id: <20050203035605.C981A7046E@sv1.valinux.co.jp>

The current implementation of memory hotremoval relies on pages being
unmappable from process address spaces.  After a successful unmapping,
subsequent accesses to the pages are blocked and don't interfere with
the hotremoval operation.

However, this code in try_to_unmap_one():

	if (PageSwapCache(page) &&
	    page_count(page) != page_mapcount(page) + 2) {
		ret = SWAP_FAIL;
		goto out_unmap;
	}

prevents unmapping pages that are referenced via get_user_pages(), and
such references can be held for a long time if they come from, for
example, direct IO.  I've made a test program that issues multiple
direct IO read requests against a single read buffer, and the pages
that belong to the buffer cannot be hotremoved because they are never
unmapped.  (A sketch of such a test program follows below.)

The following patch, which is against linux-2.6.11-rc1-mm1 and was
also tested with linux-2.6.11-rc2-mm2, fixes this issue.  The purpose
of this patch is to make it possible to unmap pages whose page_count
has been incremented.  To do that consistently, the COW detection
logic needs to be modified so that it does not rely on page_count.
I'm aware that such extensive use of page_mapcount is discouraged and
that there is a plan to kill page_mapcount (*), but I cannot think of
a better alternative solution.

(*) c.f. http://www.ussg.iu.edu/hypermail/linux/kernel/0406.0/0483.html

Some notes about my code:

- I think it's safe to rely on page_mapcount in do_swap_page(),
  because its use is protected by lock_page().

- The can_share_swap_page() call in do_swap_page() always returns
  false.  This is inefficient but should be harmless.  Incrementing
  page_mapcount before calling that function should fix the problem,
  but it may cause bad side effects.

- Another obvious solution to this issue would be to find the
  "offending" process from an un-unmappable page and suspend it until
  the page is unmapped.  I'm afraid the implementation would be much
  more complicated.

- I could not test the following situation.  It should be possible to
  write some kernel code to do that; please let me know if you know of
  any such test cases.

  - A page's page_count is incremented by get_user_pages().
  - The page gets unmapped.
  - The process causes a write fault for the page before the
    incremented page_count is dropped.

Also, while I've tried carefully not to make mistakes and have done
some testing, I'm not completely sure this is bug free.  Please
comment.
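For reference, here is a minimal sketch of the sort of test program
described above: several threads issue direct IO reads into one shared
buffer, so the buffer's pages keep an elevated page_count via
get_user_pages() for as long as reads are in flight.  This is not the
actual test program; the file name, buffer size, and thread count are
invented for illustration.

/*
 * Hypothetical test sketch: concurrent direct IO reads into one
 * shared buffer.  Assumes /tmp/testfile exists and is at least 1 MB.
 * Build: gcc -pthread sketch.c
 */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUFSIZE		(1 << 20)	/* single shared 1 MB buffer */
#define NTHREADS	8

static char *buf;
static int fd;

static void *reader(void *arg)
{
	/*
	 * Each in-flight direct IO read pins the buffer's pages with
	 * get_user_pages(); while the count stays elevated, the
	 * page_count != page_mapcount + 2 check in try_to_unmap_one()
	 * fails and the pages cannot be unmapped for hotremoval.
	 */
	for (;;) {
		if (pread(fd, buf, BUFSIZE, 0) < 0) {
			perror("pread");
			break;
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t th[NTHREADS];
	int i;

	fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* O_DIRECT requires an aligned buffer. */
	if (posix_memalign((void **)&buf, 4096, BUFSIZE) != 0) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, reader, NULL);
	pause();	/* run until interrupted */
	return 0;
}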
--- mm/memory.c.orig	2005-01-17 14:47:11.000000000 +0900
+++ mm/memory.c	2005-01-17 14:55:51.000000000 +0900
@@ -1786,10 +1786,6 @@ static int do_swap_page(struct mm_struct
 	}
 
 	/* The page isn't present yet, go ahead with the fault. */
-
-	swap_free(entry);
-	if (vm_swap_full())
-		remove_exclusive_swap_page(page);
 
 	mm->rss++;
 	acct_update_integrals();
@@ -1800,6 +1796,10 @@ static int do_swap_page(struct mm_struct
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		write_access = 0;
 	}
+
+	swap_free(entry);
+	if (vm_swap_full())
+		remove_exclusive_swap_page(page);
 
 	unlock_page(page);
 	flush_icache_page(vma, page);
--- mm/rmap.c.orig	2005-01-17 14:40:08.000000000 +0900
+++ mm/rmap.c	2005-01-21 12:34:06.000000000 +0900
@@ -569,8 +569,11 @@ static int try_to_unmap_one(struct page
 	 */
 	if (PageSwapCache(page) &&
 	    page_count(page) != page_mapcount(page) + 2) {
-		ret = SWAP_FAIL;
-		goto out_unmap;
+		if (page_mapcount(page) > 1) { /* happens when COW is in progress */
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+		printk("unmapping GUPed page\n");
 	}
 
 	/* Nuke the page table entry. */
--- mm/swapfile.c.orig	2005-01-17 14:47:11.000000000 +0900
+++ mm/swapfile.c	2005-01-31 16:59:11.000000000 +0900
@@ -293,7 +293,7 @@ static int exclusive_swap_page(struct pa
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
 		write_lock_irq(&swapper_space.tree_lock);
-		if (page_count(page) == 2)
+		if (page_mapcount(page) == 1)
 			retval = 1;
 		write_unlock_irq(&swapper_space.tree_lock);
 	}
@@ -316,22 +316,17 @@ int can_share_swap_page(struct page *pag
 
 	if (!PageLocked(page))
 		BUG();
-	switch (page_count(page)) {
-	case 3:
-		if (!PagePrivate(page))
-			break;
-		/* Fallthrough */
-	case 2:
-		if (!PageSwapCache(page))
-			break;
-		retval = exclusive_swap_page(page);
-		break;
-	case 1:
-		if (PageReserved(page))
-			break;
-		retval = 1;
+
+	if (!PageSwapCache(page)) {
+		if (PageAnon(page) && page_mapcount(page) == 1 &&
+		    !PageReserved(page))
+			return 1;
+		return 0;
 	}
-	return retval;
+	if (page_mapcount(page) <= 1 && exclusive_swap_page(page))
+		return 1;
+
+	return 0;
 }
 
 /*
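On the untested sequence listed in the notes above: the unmap step
cannot be forced from userspace, but a related condition (a write
fault taken on a page whose page_count is still elevated by
get_user_pages()) can be approximated by forking while a direct IO
read is in flight, because fork() write-protects anonymous pages and
the child's store then goes through the COW detection path.  This is
a hedged sketch under those assumptions, not a test of the exact
scenario; the file name and sizes are invented.
Build: gcc sketch2.c -lrt

/*
 * Hedged sketch only: sets up a write fault on a buffer whose pages
 * may still carry a get_user_pages() reference from an in-flight
 * direct IO read.  It deliberately races the store against the read;
 * it cannot force the unmap step, which would require swap-out or
 * the hotremoval path itself.
 */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	struct aiocb cb;
	char *buf;
	int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign((void **)&buf, 4096, 4096) != 0)
		return 1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = 4096;
	cb.aio_offset = 0;

	/* The in-flight read holds a get_user_pages() reference. */
	if (aio_read(&cb) < 0) {
		perror("aio_read");
		return 1;
	}

	if (fork() == 0) {
		/*
		 * fork() write-protects the anonymous buffer pages in
		 * parent and child; this store takes a write fault and
		 * runs the COW detection while the parent's direct IO
		 * may still hold its extra page reference.
		 */
		buf[0] = 1;
		_exit(0);
	}

	while (aio_error(&cb) == EINPROGRESS)
		;	/* busy-wait; a sketch, not production code */
	(void)aio_return(&cb);
	wait(NULL);
	return 0;
}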