Date: Thu, 03 Feb 2005 12:56:05 +0900
From: IWAMOTO Toshihiro
To: linux-kernel@vger.kernel.org
Cc: lhms-devel@lists.sourceforge.net
Subject: [RFC] Changing COW detection to be memory hotplug friendly
Message-Id: <20050203035605.C981A7046E@sv1.valinux.co.jp>

The current implementation of memory hotremoval relies on pages being
unmappable from process address spaces.  After a successful unmapping,
subsequent accesses to the pages are blocked and don't interfere with
the hotremoval operation.

However, this code in try_to_unmap_one():

	if (PageSwapCache(page) &&
	    page_count(page) != page_mapcount(page) + 2) {
		ret = SWAP_FAIL;
		goto out_unmap;
	}

prevents unmapping pages that are referenced via get_user_pages(), and
such references can be held for a long time if they come from, for
example, direct IO.  I've made a test program that issues multiple
direct IO read requests against a single read buffer, and the pages
that belong to the buffer cannot be hotremoved because they are never
unmapped.  (A sketch of such a test program follows below.)

The following patch, which is against linux-2.6.11-rc1-mm1 and was
also tested with linux-2.6.11-rc2-mm2, fixes this issue.  The purpose
of this patch is to make it possible to unmap pages whose page_count
has been incremented.  To do that consistently, the COW detection
logic needs to be modified so that it does not rely on page_count.
I'm aware that such extensive use of page_mapcount is discouraged and
that there is a plan to kill page_mapcount (*), but I cannot think of
a better alternative solution.

(*) c.f. http://www.ussg.iu.edu/hypermail/linux/kernel/0406.0/0483.html

Some notes about my code:

- I think it's safe to rely on page_mapcount in do_swap_page(),
  because its use is protected by lock_page().

- The can_share_swap_page() call in do_swap_page() always returns
  false.  This is inefficient but should be harmless.  Incrementing
  page_mapcount before calling that function should fix the problem,
  but it may cause bad side effects.

- Another obvious solution to this issue would be to find the
  "offending" process from an un-unmappable page and suspend it until
  the page is unmapped.  I'm afraid the implementation would be much
  more complicated.

- I could not test the following situation.  It should be possible to
  write some kernel code to do that; please let me know if you know of
  any such test cases.

  - A page's page_count is incremented by get_user_pages().
  - The page gets unmapped.
  - The process causes a write fault for the page before the
    incremented page_count is dropped.

Also, while I've tried carefully not to make mistakes and have done
some testing, I'm not completely sure this is bug free.  Please
comment.
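For reference, here is a minimal sketch of the sort of test program
described above: several threads issue direct IO reads into one shared
buffer, so the buffer's pages keep an elevated page_count via
get_user_pages() for as long as reads are in flight.  This is not the
actual test program; the file name, buffer size, and thread count are
invented for illustration.

/*
 * Hypothetical test sketch: concurrent direct IO reads into one
 * shared buffer.  Assumes /tmp/testfile exists and is at least 1 MB.
 * Build: gcc -pthread sketch.c
 */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUFSIZE		(1 << 20)	/* single shared 1 MB buffer */
#define NTHREADS	8

static char *buf;
static int fd;

static void *reader(void *arg)
{
	/*
	 * Each in-flight direct IO read pins the buffer's pages with
	 * get_user_pages(); while the count stays elevated, the
	 * page_count != page_mapcount + 2 check in try_to_unmap_one()
	 * fails and the pages cannot be unmapped for hotremoval.
	 */
	for (;;) {
		if (pread(fd, buf, BUFSIZE, 0) < 0) {
			perror("pread");
			break;
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t th[NTHREADS];
	int i;

	fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* O_DIRECT requires an aligned buffer. */
	if (posix_memalign((void **)&buf, 4096, BUFSIZE) != 0) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, reader, NULL);
	pause();	/* run until interrupted */
	return 0;
}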
--- mm/memory.c.orig	2005-01-17 14:47:11.000000000 +0900
+++ mm/memory.c	2005-01-17 14:55:51.000000000 +0900
@@ -1786,10 +1786,6 @@ static int do_swap_page(struct mm_struct
 	}
 
 	/* The page isn't present yet, go ahead with the fault. */
-
-	swap_free(entry);
-	if (vm_swap_full())
-		remove_exclusive_swap_page(page);
 
 	mm->rss++;
 	acct_update_integrals();
@@ -1800,6 +1796,10 @@ static int do_swap_page(struct mm_struct
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		write_access = 0;
 	}
+
+	swap_free(entry);
+	if (vm_swap_full())
+		remove_exclusive_swap_page(page);
 
 	unlock_page(page);
 	flush_icache_page(vma, page);
--- mm/rmap.c.orig	2005-01-17 14:40:08.000000000 +0900
+++ mm/rmap.c	2005-01-21 12:34:06.000000000 +0900
@@ -569,8 +569,11 @@ static int try_to_unmap_one(struct page
 	 */
 	if (PageSwapCache(page) &&
 	    page_count(page) != page_mapcount(page) + 2) {
-		ret = SWAP_FAIL;
-		goto out_unmap;
+		if (page_mapcount(page) > 1) { /* happens when COW is in progress */
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+		printk("unmapping GUPed page\n");
 	}
 
 	/* Nuke the page table entry. */
--- mm/swapfile.c.orig	2005-01-17 14:47:11.000000000 +0900
+++ mm/swapfile.c	2005-01-31 16:59:11.000000000 +0900
@@ -293,7 +293,7 @@ static int exclusive_swap_page(struct pa
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
 		write_lock_irq(&swapper_space.tree_lock);
-		if (page_count(page) == 2)
+		if (page_mapcount(page) == 1)
 			retval = 1;
 		write_unlock_irq(&swapper_space.tree_lock);
 	}
@@ -316,22 +316,17 @@ int can_share_swap_page(struct page *pag
 
 	if (!PageLocked(page))
 		BUG();
-	switch (page_count(page)) {
-	case 3:
-		if (!PagePrivate(page))
-			break;
-		/* Fallthrough */
-	case 2:
-		if (!PageSwapCache(page))
-			break;
-		retval = exclusive_swap_page(page);
-		break;
-	case 1:
-		if (PageReserved(page))
-			break;
-		retval = 1;
+
+	if (!PageSwapCache(page)) {
+		if (PageAnon(page) && page_mapcount(page) == 1 &&
+		    !PageReserved(page))
+			return 1;
+		return 0;
 	}
-	return retval;
+	if (page_mapcount(page) <= 1 && exclusive_swap_page(page))
+		return 1;
+
+	return 0;
 }
 
 /*
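On the untested sequence listed in the notes above: the unmap step
cannot be forced from userspace, but a related condition (a write
fault taken on a page whose page_count is still elevated by
get_user_pages()) can be approximated by forking while a direct IO
read is in flight, because fork() write-protects anonymous pages and
the child's store then goes through the COW detection path.  This is
a hedged sketch under those assumptions, not a test of the exact
scenario; the file name and sizes are invented.
Build: gcc sketch2.c -lrt

/*
 * Hedged sketch only: sets up a write fault on a buffer whose pages
 * may still carry a get_user_pages() reference from an in-flight
 * direct IO read.  It deliberately races the store against the read;
 * it cannot force the unmap step, which would require swap-out or
 * the hotremoval path itself.
 */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	struct aiocb cb;
	char *buf;
	int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign((void **)&buf, 4096, 4096) != 0)
		return 1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = 4096;
	cb.aio_offset = 0;

	/* The in-flight read holds a get_user_pages() reference. */
	if (aio_read(&cb) < 0) {
		perror("aio_read");
		return 1;
	}

	if (fork() == 0) {
		/*
		 * fork() write-protects the anonymous buffer pages in
		 * parent and child; this store takes a write fault and
		 * runs the COW detection while the parent's direct IO
		 * may still hold its extra page reference.
		 */
		buf[0] = 1;
		_exit(0);
	}

	while (aio_error(&cb) == EINPROGRESS)
		;	/* busy-wait; a sketch, not production code */
	(void)aio_return(&cb);
	wait(NULL);
	return 0;
}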