Date: Sat, 20 Mar 2004 18:48:57 +0100
From: Andrea Arcangeli <andrea@suse.de>
To: William Lee Irwin III <wli@holomorphy.com>, linux-kernel@vger.kernel.org,
       Andrew Morton <akpm@osdl.org>, Linus Torvalds <torvalds@osdl.org>
Subject: Re: can device drivers return non-ram via vm_ops->nopage?
Message-ID: <20040320174857.GA9009@dualathlon.random>
References: <20040320133025.GH9009@dualathlon.random> <20040320144022.GC2045@holomorphy.com> <20040320150621.GO9009@dualathlon.random> <20040320154419.A6726@flint.arm.linux.org.uk> <20040320155739.GQ9009@dualathlon.random> <20040320161538.C6726@flint.arm.linux.org.uk> <20040320162534.GU9009@dualathlon.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20040320162534.GU9009@dualathlon.random>
User-Agent: Mutt/1.4.1i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5160
Lines: 148

I noticed there was indeed a bug in anon_vma that would have caused a
VM_FAULT_OOM/SIGBUS return from ->nopage to be mistaken for non-ram (I
checked for the fault return values only after the pfn_valid check, oh
well). So it's possible that what Jens experienced was a SIGBUS and not
a non-ram condition, sorry.
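
To illustrate the ordering problem, here is a small user-space model (a
sketch only, not kernel code). The struct page, the toy mem_map, the
model_page_is_ram() helper and the outcome enum are all stand-ins invented
for the illustration; only the NOPAGE_SIGBUS and NOPAGE_OOM names come from
the patch, and their values here are assumed to be a NULL pointer and an
all-ones pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins, not the kernel definitions. */
struct page { unsigned long flags; };
#define NOPAGE_SIGBUS ((struct page *)NULL)           /* assumed value */
#define NOPAGE_OOM    ((struct page *)(intptr_t)-1)   /* assumed value */

#define NR_PAGES 16
static struct page mem_map[NR_PAGES];	/* toy mem_map covering all "ram" */

/* Toy pfn_valid(page_to_pfn(p)): true only for pointers inside mem_map. */
static int model_page_is_ram(const struct page *p)
{
	uintptr_t a = (uintptr_t)p, base = (uintptr_t)mem_map;

	return a >= base && a < base + sizeof(mem_map);
}

enum outcome { OUT_MAPPED, OUT_SIGBUS, OUT_OOM, OUT_NON_RAM };

/* Wrong order: the ram check sees the sentinels before they are decoded. */
static enum outcome classify_ram_check_first(const struct page *p)
{
	if (!model_page_is_ram(p))
		return OUT_NON_RAM;
	if (p == NOPAGE_SIGBUS)
		return OUT_SIGBUS;
	if (p == NOPAGE_OOM)
		return OUT_OOM;
	return OUT_MAPPED;
}

/* Order used by the patch: decode the sentinels, then look at the page. */
static enum outcome classify_sentinels_first(const struct page *p)
{
	if (p == NOPAGE_SIGBUS)
		return OUT_SIGBUS;
	if (p == NOPAGE_OOM)
		return OUT_OOM;
	if (!model_page_is_ram(p))
		return OUT_NON_RAM;
	return OUT_MAPPED;
}
```

In this model neither sentinel points into mem_map, so the first variant
reports both as non-ram, while the second reports them as SIGBUS and OOM.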

So it's no longer certain that alsa was returning non-ram, but it's
still possible in theory, and it would have worked in practice up to
2.6.4 (not anymore in 2.6.5-rc1). The non-ram page_t would fall into the
direct mapping (for a 4G phys address, the page_t would be at address
3G+120M, triggering an oops only on machines with less than 128M of ram,
and if the mmio region were below 4G even less ram would be needed to
hide the bug). This untested patch should make it work with non-ram
too, so it sounds safer for the short term. I will test it a little bit
and then upload an update.
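
As a rough sketch of why a non-ram page_t can land inside the direct
mapping: with a flat mem_map, the page_t for a physical address sits at
mem_map + pfn * sizeof(page_t). The helpers below are hypothetical, and
the mem_map location and page_t size are parameters, so the placeholder
values used in any call are assumptions and are not meant to reproduce
the 3G+120M figure above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_OFFSET 0xC0000000ULL	/* 3G, start of the direct mapping */

/* Virtual address of the page_t a flat mem_map lookup would compute. */
static uint64_t page_t_vaddr(uint64_t mem_map_vaddr, uint64_t phys,
			     uint64_t page_t_size)
{
	return mem_map_vaddr + (phys >> PAGE_SHIFT) * page_t_size;
}

/* Does vaddr fall inside the direct mapping of lowmem_bytes of ram? */
static int in_direct_mapping(uint64_t vaddr, uint64_t lowmem_bytes)
{
	return vaddr >= PAGE_OFFSET && vaddr < PAGE_OFFSET + lowmem_bytes;
}
```

If the computed address is inside the direct mapping, dereferencing the
bogus page_t reads whatever happens to live there instead of oopsing,
which is how enough ram can hide the bug.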

--- x/mm/memory.c.~1~	2004-03-20 15:47:31.000000000 +0100
+++ x/mm/memory.c	2004-03-20 18:35:19.000000000 +0100
@@ -1384,7 +1384,7 @@ do_no_page(struct mm_struct *mm, struct 
 	struct page * new_page;
 	struct address_space *mapping = NULL;
 	pte_t entry;
-	int sequence = 0, reserved, anon;
+	int sequence = 0, reserved, anon, pageable, ram, as;
 	int ret = VM_FAULT_MINOR;
 
 	if (!vma->vm_ops || !vma->vm_ops->nopage)
@@ -1401,36 +1401,40 @@ do_no_page(struct mm_struct *mm, struct 
 retry:
 	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
 
+	/* no page was available -- either SIGBUS or OOM */
+	if (new_page == NOPAGE_SIGBUS)
+		return VM_FAULT_SIGBUS;
+	if (new_page == NOPAGE_OOM)
+		return VM_FAULT_OOM;
+
 	/*
-	 * non-ram cannot be mapped via ->nopage, it must
-	 * be mapped via remap_page_range instead synchronously
-	 * in the ->mmap device driver callback.
-	 *
-	 * PageReserved pages can be mapped as far as they're under
-	 * a VM_RESERVED vma.
+	 * To handle non-ram properly, ->nopage should return a PFN
+	 * rather than a page_t. Still, we have to make non-ram work
+	 * with page_t too, since a number of device drivers may
+	 * return non-ram via ->nopage.
 	 */
-	BUG_ON(!pfn_valid(page_to_pfn(new_page)));
-
-	/* ->nopage cannot return swapcache */
-	BUG_ON(PageSwapCache(new_page));
-	/* ->nopage cannot return anonymous pages */
-	BUG_ON(PageAnon(new_page));
+	pageable = ram = pfn_valid(page_to_pfn(new_page));
+	if (likely(ram)) {
+		pageable = !PageReserved(new_page);
+		as = !!new_page->mapping;
+
+		BUG_ON(!pageable && as);
+
+		pageable &= as;
+
+		/* ->nopage cannot return swapcache */
+		BUG_ON(PageSwapCache(new_page));
+		/* ->nopage cannot return anonymous pages */
+		BUG_ON(PageAnon(new_page));
+	}
 
 	/*
 	 * This is the entry point for memory under VM_RESERVED vmas.
 	 * That memory will not be tracked by the vm. These aren't
 	 * real anonymous pages, they're "device" reserved pages instead.
-	 * These pages under VM_RESERVED vmas are the only pages mapped
-	 * by the VM into userspace with page->as.mapping = NULL.
 	 */
-	reserved = vma->vm_flags & VM_RESERVED;
-	BUG_ON(!reserved && (!new_page->mapping || PageReserved(new_page)));
-
-	/* no page was available -- either SIGBUS or OOM */
-	if (new_page == NOPAGE_SIGBUS)
-		return VM_FAULT_SIGBUS;
-	if (new_page == NOPAGE_OOM)
-		return VM_FAULT_OOM;
+	reserved = !!(vma->vm_flags & VM_RESERVED);
+	BUG_ON(reserved == pageable);
 
 	/*
 	 * Should we do an early C-O-W break?
@@ -1438,6 +1442,8 @@ retry:
 	anon = 0;
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
 		struct page * page;
+		if (unlikely(!ram))
+			return VM_FAULT_SIGBUS;
 		if (unlikely(anon_vma_prepare(vma)))
 			goto oom;
 		page = alloc_page(GFP_HIGHUSER);
@@ -1460,7 +1466,8 @@ retry:
 	      (unlikely(sequence != atomic_read(&mapping->truncate_count)))) {
 		sequence = atomic_read(&mapping->truncate_count);
 		spin_unlock(&mm->page_table_lock);
-		page_cache_release(new_page);
+		if (likely(ram))
+			page_cache_release(new_page);
 		goto retry;
 	}
 	page_table = pte_offset_map(pmd, address);
@@ -1477,20 +1484,21 @@ retry:
 	 */
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
-		if (!PageReserved(new_page))
+		if (likely(ram && !PageReserved(new_page)))
 			++mm->rss;
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		set_pte(page_table, entry);
-		if (likely(!reserved))
+		if (likely(ram))
 			page_add_rmap(new_page, vma, address, anon);
 		pte_unmap(page_table);
 	} else {
 		/* One of our sibling threads was faster, back out. */
 		pte_unmap(page_table);
-		page_cache_release(new_page);
+		if (likely(ram))
+			page_cache_release(new_page);
 		spin_unlock(&mm->page_table_lock);
 		goto out;
 	}
@@ -1502,7 +1510,8 @@ retry:
 	return ret;
 
  oom:
-	page_cache_release(new_page);
+	if (likely(ram))
+		page_cache_release(new_page);
 	ret = VM_FAULT_OOM;
 	goto out;
 }
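
For reference, the way the patch derives its ram and pageable flags from
the returned page can be modelled in user space as below. This is only a
sketch: struct model_page, struct nopage_flags and derive_flags() are
invented names, and would_bug stands in for the BUG_ON(!pageable && as)
check rather than actually trapping.

```c
#include <assert.h>

/* Hypothetical description of what ->nopage returned. */
struct model_page {
	int in_ram;		/* pfn_valid(page_to_pfn(page)) */
	int page_reserved;	/* PageReserved(page) */
	int has_mapping;	/* page->mapping != NULL */
};

struct nopage_flags {
	int ram;
	int pageable;
	int would_bug;		/* models BUG_ON(!pageable && as) */
};

static struct nopage_flags derive_flags(struct model_page p)
{
	struct nopage_flags f = { 0, 0, 0 };
	int as;

	f.pageable = f.ram = p.in_ram;
	if (f.ram) {
		/* Only a real page_t may be inspected. */
		f.pageable = !p.page_reserved;
		as = !!p.has_mapping;
		f.would_bug = (!f.pageable && as);
		f.pageable &= as;
	}
	return f;
}
```

In this model only a non-reserved ram page with a mapping ends up
pageable; non-ram never has its page_t fields looked at.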