Hi!
I was preparing some lecture last night and stumbled onto this bug. Maybe
some of you can shed some light on it.
Basically, I just memory map /dev/mem at 0xb8000 (text mode - yes I know you
shouldn't do this, but it was to illustrate something), reads 4k, changes it
writes it back.
Now everything works fine unless I do an msync() in which case I get an oops:
------------< snip <------< snip <------< snip <------------
root@oasis:~/tuesday-clug-talk/text-mode# ./text-mode
Unable to handle kernel NULL pointer dereference at virtual address 00000008
printing eip:
c011e752
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c011e752>]
EFLAGS: 00010246
eax: c10030f0 ebx: c10030f0 ecx: 00000000 edx: c10030f0
esi: 00017000 edi: cc75d404 ebp: cc7cf05c esp: cc75ff38
ds: 0018 es: 0018 ss: 0018
Process text-mode (pid: 516, stackpage=cc75f000)
Stack: 00417000 c0120515 c10030f0 00000004 ccddf1a0 40018000 00000000 40017000
cc75d404 40417000 cc75e000 00018000 40000000 00018000 40000000 00000000
40018000 c0120627 cceed420 40017000 00001000 00000004 cceed420 40017000
Call Trace: [<c0120515>] [<c0120627>] [<c012072d>] [<c0106d6b>]
Code: 8b 51 08 89 42 04 89 10 8d 51 08 89 50 04 89 41 08 8b 41 20
Segmentation fault
root@oasis:~/tuesday-clug-talk/text-mode#
root@oasis:~# cat /proc/ksyms | grep c011f
c011f000 ___wait_on_page_Rf9de9e1d
c011fb74 generic_file_read_Reaf528e4
c011f614 do_generic_file_read_Rc634b32a
c011f1d8 __find_lock_page_R31595861
c011ff68 filemap_nopage_R50c6d0e7
c011f154 lock_page_R7c236ed9
root@oasis:~# cat /proc/ksyms | grep c0120
c012057c generic_file_mmap_Rf655e8fd
c0120368 filemap_sync_R86ba6830
root@oasis:~# cat /proc/ksyms | grep c0106
root@oasis:~# cat /proc/ksyms | grep c0105
c01051bc machine_real_restart_R3da1b07a
c01055ac dump_thread_Rae90b20c
c01053fc kernel_thread_R7e9ebb05
c0105a60 __down_failed
c0105a6c __down_failed_interruptible
c0105a78 __down_failed_trylock
c0105a84 __up_wakeup
c0105aac __down_write_failed
c0105a90 __down_read_failed
c0105cd0 __rwsem_wake
c0105840 get_wchan_R280eab06
c0105110 disable_hlt_R794487ee
c0105118 enable_hlt_R9c7077bd
c0105264 machine_restart_Re6e3ef70
c01052e4 machine_halt_R9aa32630
c01052e8 machine_power_off_R091c824a
root@oasis:~# cat /proc/ksyms | grep c011e
c011e1b4 do_brk_R9eecde16
c011e778 invalidate_inode_pages_R93da71b9
c011e98c truncate_inode_pages_R721e7180
c011eb28 generic_buffer_fdatasync_Ra1c05088
------------< snip <------< snip <------< snip <------------
So call trace translates to something like:
filemap_nopage(), generic_file_mmap(), __down_failed_interruptible()
and we were somewhere in do_brk() when it oopsed.
Any ideas why this happened? Also, if I don't use msync() the problem goes
away so my guess is something's wrong with the msync() system call.
I've attached the small C program that caused this. Also, I can only get it
to oops once (after that I have to reboot again to get it to oops).
--
Regards
Abraham
Best of all is never to have been born. Second best is to die soon.
__________________________________________________________
Abraham vd Merwe - 2d3D, Inc.
Device Driver Development, Outsourcing, Embedded Systems
Cell: +27 82 565 4451 Snailmail:
Tel: +27 21 761 7549 Block C, Antree Park
Fax: +27 21 761 7648 Doncaster Road
Email: [email protected] Kenilworth, 7700
Http: http://www.2d3d.com South Africa
Abraham vd Merwe wrote:
>
> Hi!
>
> I was preparing some lecture last night and stumbled onto this bug. Maybe
> some of you can shed some light on it.
>
> Basically, I just memory map /dev/mem at 0xb8000 (text mode - yes I know you
> shouldn't do this, but it was to illustrate something), reads 4k, changes it
> writes it back.
>
The actual call trace is:
__set_page_dirty
filemap_sync_pte
filemap_sync_pte_range
filemap_sync_pmd_range
filemap_sync
msync_interval
sys_msync
We're crashing because __set_page_dirty dereferences page->mapping,
but pages from a mmap() of /dev/mem seem to have a NULL ->mapping.
One of the very frustrating things about Linux kernel development
is that the main source of tuition is merely the source code. You
can stare at that for months (as I have) and still not have a firm
grasp on the big-picture semantic *meaning* behind something as
simple as a page having a null ->mapping. Sigh.
So one is reduced to mimicry:
--- linux-2.4.7-pre3/mm/filemap.c Wed Jul 4 18:21:32 2001
+++ linux-akpm/mm/filemap.c Mon Jul 9 22:22:46 2001
@@ -1652,7 +1652,8 @@ static inline int filemap_sync_pte(pte_t
if (pte_present(pte) && ptep_test_and_clear_dirty(ptep)) {
struct page *page = pte_page(pte);
flush_tlb_page(vma, address);
- set_page_dirty(page);
+ if (page->mapping)
+ set_page_dirty(page);
}
return 0;
}
-
On Mon, Jul 09, 2001 at 10:32:11PM +1000, Andrew Morton wrote:
> Abraham vd Merwe wrote:
> >
> > Hi!
> >
> > I was preparing some lecture last night and stumbled onto this bug. Maybe
> > some of you can shed some light on it.
> >
> > Basically, I just memory map /dev/mem at 0xb8000 (text mode - yes I know you
> > shouldn't do this, but it was to illustrate something), reads 4k, changes it
> > writes it back.
> >
>
> The actual call trace is:
>
> __set_page_dirty
> filemap_sync_pte
> filemap_sync_pte_range
> filemap_sync_pmd_range
> filemap_sync
> msync_interval
> sys_msync
>
> We're crashing because __set_page_dirty dereferences page->mapping,
> but pages from a mmap() of /dev/mem seem to have a NULL ->mapping.
>
> One of the very frustrating things about Linux kernel development
> is that the main source of tuition is merely the source code. You
> can stare at that for months (as I have) and still not have a firm
> grasp on the big-picture semantic *meaning* behind something as
> simple as a page having a null ->mapping. Sigh.
>
> So one is reduced to mimicry:
>
> --- linux-2.4.7-pre3/mm/filemap.c Wed Jul 4 18:21:32 2001
> +++ linux-akpm/mm/filemap.c Mon Jul 9 22:22:46 2001
> @@ -1652,7 +1652,8 @@ static inline int filemap_sync_pte(pte_t
> if (pte_present(pte) && ptep_test_and_clear_dirty(ptep)) {
> struct page *page = pte_page(pte);
> flush_tlb_page(vma, address);
> - set_page_dirty(page);
> + if (page->mapping)
> + set_page_dirty(page);
> }
> return 0;
> }
Wrong fix, `page' is just garbage if some non memory was mapped in
userspace (like framebuffers or similar mmio regions were mapped etc..).
I fixed it right ages ago (also sumbitted to Linus but got not
integrated and I forgotten to resend):
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.7pre3aa1/00_msync-fb0-1
Please Linus include this time:
--- 2.4.5pre1aa3/mm/filemap.c.~1~ Fri May 11 02:08:28 2001
+++ 2.4.5pre1aa3/mm/filemap.c Mon May 14 18:48:59 2001
@@ -1808,10 +1808,12 @@
{
pte_t pte = *ptep;
- if (pte_present(pte) && ptep_test_and_clear_dirty(ptep)) {
+ if (pte_present(pte)) {
struct page *page = pte_page(pte);
- flush_tlb_page(vma, address);
- set_page_dirty(page);
+ if (VALID_PAGE(page) && !PageReserved(page) && ptep_test_and_clear_dirty(ptep)) {
+ flush_tlb_page(vma, address);
+ set_page_dirty(page);
+ }
}
return 0;
}
Andrea
Andrea Arcangeli wrote:
>
> Wrong fix, `page' is just garbage if some non memory was mapped in
> userspace (like framebuffers or similar mmio regions were mapped etc..).
Now we're getting somewhere. Thanks. Tell me if this is right:
> if (VALID_PAGE(page)
If the physical address of the page is somewhere inside our
working RAM.
> !PageReserved(page)
And it's not a reserved page (discontigmem?)
> ptep_test_and_clear_dirty(ptep))
And if it was modified via this mapping
> + flush_tlb_page(vma, address);
> + set_page_dirty(page);
Question: What happens if a program mmap's a part of /dev/mem
which passes all of these tests? Couldn't it then pick some
arbitrary member of mem_map[] which may or may not have
a non-zero ->mapping?
On Tue, Jul 10, 2001 at 12:43:12AM +1000, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > Wrong fix, `page' is just garbage if some non memory was mapped in
> > userspace (like framebuffers or similar mmio regions were mapped etc..).
>
> Now we're getting somewhere. Thanks. Tell me if this is right:
>
>
> > if (VALID_PAGE(page)
>
> If the physical address of the page is somewhere inside our
> working RAM.
correct.
>
> > !PageReserved(page)
>
> And it's not a reserved page (discontigmem?)
yes, but it's not discontigmem issue, it is the other way around (page
structure is valid but it maps to non ram, like the 640k-1M region that
we have the page structure for, we don't use discontigmem for it because
the hole is too smalle, but it is non ram, or also normal ram mapped by
some device as dma region).
> > ptep_test_and_clear_dirty(ptep))
>
> And if it was modified via this mapping
yes.
>
> > + flush_tlb_page(vma, address);
> > + set_page_dirty(page);
>
> Question: What happens if a program mmap's a part of /dev/mem
> which passes all of these tests? Couldn't it then pick some
that cannot happen, remap_pte_range only maps invalid pages or reserved
pages.
> arbitrary member of mem_map[] which may or may not have
> a non-zero ->mapping?
Andrea
On Mon, 9 Jul 2001, Andrea Arcangeli wrote:
> On Tue, Jul 10, 2001 at 12:43:12AM +1000, Andrew Morton wrote:
> >
> > Question: What happens if a program mmap's a part of /dev/mem
> > which passes all of these tests? Couldn't it then pick some
>
> that cannot happen, remap_pte_range only maps invalid pages or reserved
> pages.
>
> > arbitrary member of mem_map[] which may or may not have
> > a non-zero ->mapping?
Anyone know why mmap() of /dev/mem behaves in this way - solves the
problem Andrew raises, but surely that could be solved in a better way?
Seems strange that mmap() cannot give you what read() and write() can.
Hugh
On Mon, 9 Jul 2001, Hugh Dickins wrote:
> >
> > that cannot happen, remap_pte_range only maps invalid pages or reserved
> > pages.
>
> Anyone know why mmap() of /dev/mem behaves in this way - solves the
> problem Andrew raises, but surely that could be solved in a better way?
> Seems strange that mmap() cannot give you what read() and write() can.
Simple: we MUST NOT muck around with the page counts of random pages.
And if the pages aren't marked Reserved, the page counts _would_ get
corrupted by the VM page handling.
Ergo: you can only mmap Reserved pages (or you have to create a nice
mapping and allocate the pages properly, which in the case of /dev/mem is
obviously not a possibility)
Linus
On Mon, 9 Jul 2001, Linus Torvalds wrote:
> On Mon, 9 Jul 2001, Hugh Dickins wrote:
> > >
> > > that cannot happen, remap_pte_range only maps invalid pages or reserved
> > > pages.
> >
> > Anyone know why mmap() of /dev/mem behaves in this way - solves the
> > problem Andrew raises, but surely that could be solved in a better way?
> > Seems strange that mmap() cannot give you what read() and write() can.
>
> Simple: we MUST NOT muck around with the page counts of random pages.
Of course.
> And if the pages aren't marked Reserved, the page counts _would_ get
> corrupted by the VM page handling.
As it stands, yes. But shouldn't there be some kind of VM_ASIFRESERVED
vma flag to treat all its pages as if they were reserved, for /dev/mem?
Hugh
On Mon, 9 Jul 2001, Hugh Dickins wrote:
>
> As it stands, yes. But shouldn't there be some kind of VM_ASIFRESERVED
> vma flag to treat all its pages as if they were reserved, for /dev/mem?
Sure, that is doable. We do in fact already have the flag - you could
think of the VM_IO as that kind of flag already. I wouldn't object to that
kind of change in the 2.5.x timeframe. We already have vmscan ignoring
VM_IO objects, we could do the same to copy_mm() and to mmdrop().
Linus
(cc' list snipped)
This stuff is gold.
Anyone clueful want to comment things so this will end up in the
documentation when you do a make docbook or something?
Having access to simple facts like this is of so much value to those
of us who simply just don't have the time and/or exposure and many of
you (oh, and of course some of us just aren't as clever and need a
little more help!)
--cw
On Mon, Jul 09, 2001 at 05:08:35PM +0200, Andrea Arcangeli wrote:
On Tue, Jul 10, 2001 at 12:43:12AM +1000, Andrew Morton wrote:
> If the physical address of the page is somewhere inside our
> working RAM.
correct.
> > !PageReserved(page)
>
> And it's not a reserved page (discontigmem?)
yes, but it's not discontigmem issue, it is the other way around (page
structure is valid but it maps to non ram, like the 640k-1M region that
we have the page structure for, we don't use discontigmem for it because
the hole is too smalle, but it is non ram, or also normal ram mapped by
some device as dma region).
> > ptep_test_and_clear_dirty(ptep))
>
> And if it was modified via this mapping
yes.
>
> > + flush_tlb_page(vma, address);
> > + set_page_dirty(page);
>
> Question: What happens if a program mmap's a part of /dev/mem
> which passes all of these tests? Couldn't it then pick some
that cannot happen, remap_pte_range only maps invalid pages or reserved
pages.
Chris Wedgwood wrote:
>
> (cc' list snipped)
>
> This stuff is gold.
>
> Anyone clueful want to comment things so this will end up in the
> documentation when you do a make docbook or something?
>
One can but try.
Linus included the test for non-null page->mapping
as well. I wonder why.
Also, the change means that the TLB flush is not performed
for invalid or reserved pages. Is this correct? (Why is there
a TLB flush there in the first place?)
There's a flush_tlb_range() down in filemap_sync() as well, so
we appear to end up flushing the affected TLBs twice?
--- linux-2.4.7-pre5/mm/memory.c Tue Jul 10 22:32:53 2001
+++ linux-akpm/mm/memory.c Tue Jul 10 23:37:45 2001
@@ -766,6 +766,8 @@ int zeromap_page_range(unsigned long add
* maps a range of physical memory into the requested pages. the old
* mappings are removed. any references to nonexistent pages results
* in null mappings (currently treated as "copy-on-access")
+ * We forbid mapping of valid, unreserved pages because that would
+ * allow corruption of their reference counts via this additional mapping.
*/
static inline void remap_pte_range(pte_t * pte, unsigned long address, unsigned long size,
unsigned long phys_addr, pgprot_t prot)
--- linux-2.4.7-pre5/mm/filemap.c Tue Jul 10 22:32:53 2001
+++ linux-akpm/mm/filemap.c Tue Jul 10 23:45:28 2001
@@ -1643,6 +1643,8 @@ page_not_uptodate:
/* Called with mm->page_table_lock held to protect against other
* threads/the swapper from ripping pte's out from under us.
+ * Mappings from remap_pte_range() can cover invalid or reserved
+ * pages, so we must check for that here.
*/
static inline int filemap_sync_pte(pte_t * ptep, struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
On Wed, Jul 11, 2001 at 12:03:11AM +1000, Andrew Morton wrote:
> Linus included the test for non-null page->mapping
> as well. I wonder why.
I think it is because if you try hard you can map a dma page via
/dev/mem and then the driver can release the dma page (at rmmod for
example) before you munmap /dev/mem, and then the page goes in the
freelist and it is allocated as pagecache.
Andrea