2008-06-22 19:04:58

by Greg KH

Subject: [patch 0/5] 2.6.25-stable review

This is the start of the stable review cycle for the 2.6.25.9 release.
There are 5 patches in this series, all of which will be posted as a response to
this one. If anyone has any issues with these being applied, please let
us know. If anyone is a maintainer of the proper subsystem and wants
to add a Signed-off-by: line to the patch, please respond with it.

These patches are sent out with a number of different people on the Cc:
line. If you wish to be a reviewer, please email [email protected] to
add your name to the list. If you want to be removed from the reviewer
list, also email us.

Responses should be made by Tuesday, June 24, 18:00:00 UTC. Anything
received after that time might be too late.

The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v2.6/stable-review/patch-2.6.25.9-rc1.gz
and the diffstat can be found below.


thanks,

the -stable release team

Makefile | 2 +-
arch/powerpc/kernel/vdso.c | 2 +-
arch/x86/kernel/setup_32.c | 10 ++++++++--
drivers/net/atl1/atl1_hw.c | 1 -
include/asm-x86/page_32.h | 3 ++-
mm/memory.c | 17 +++++++++++++----
mm/migrate.c | 10 ++++++++++
net/sctp/socket.c | 4 +++-
8 files changed, 38 insertions(+), 11 deletions(-)


2008-06-22 19:05:26

by Greg KH

Subject: [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit

2.6.25-stable review patch. If anyone has any objections, please let us
know.

------------------
From: Bernhard Walle <[email protected]>

commit d3942cff620bea073fc4e3c8ed878eb1e84615ce upstream

This patch uses BOOTMEM_EXCLUSIVE for the crashkernel reservation on
i386 as well, and prints an error message on failure.

The patch is still intended for 2.6.26 since it is only a bug fix. The
unification of reserve_crashkernel() between i386 and x86_64 should be
done for 2.6.27.

Signed-off-by: Bernhard Walle <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/x86/kernel/setup_32.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/setup_32.c
+++ b/arch/x86/kernel/setup_32.c
@@ -483,10 +483,16 @@ static void __init reserve_crashkernel(v
(unsigned long)(crash_size >> 20),
(unsigned long)(crash_base >> 20),
(unsigned long)(total_mem >> 20));
+
+ if (reserve_bootmem(crash_base, crash_size,
+ BOOTMEM_EXCLUSIVE) < 0) {
+ printk(KERN_INFO "crashkernel reservation "
+ "failed - memory is in use\n");
+ return;
+ }
+
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
- reserve_bootmem(crash_base, crash_size,
- BOOTMEM_DEFAULT);
} else
printk(KERN_INFO "crashkernel reservation failed - "
"you have to specify a base address\n");

--

2008-06-22 19:05:52

by Greg KH

Subject: [patch 3/5] sctp: Make sure N * sizeof(union sctp_addr) does not overflow.

2.6.25-stable review patch. If anyone has any objections, please let us
know.

------------------
From: David S. Miller <[email protected]>

commit 735ce972fbc8a65fb17788debd7bbe7b4383cc62 upstream

As noticed by Gabriel Campana, the kmalloc() length arg
passed in by sctp_getsockopt_local_addrs_old() can overflow
if ->addr_num is large enough.

Therefore, enforce an appropriate limit.
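
To make the arithmetic concrete, here is a minimal userspace sketch (not
kernel code; the 28-byte sizeof(union sctp_addr) is only an assumption
for the example) of how the unchecked multiplication wraps on a 32-bit
kernel, and why bounding addr_num before multiplying prevents it:

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Assumed element size; only the magnitude matters here. */
	const uint32_t addr_size = 28;

	int addr_num = 153391690;	/* passes the old "<= 0" check */
	uint32_t len = (uint32_t)addr_num * addr_size;	/* 32-bit wrap */

	/* 153391690 * 28 = 4294967320, which wraps modulo 2^32 to 24:
	 * a tiny buffer that the later copy loop would massively overrun. */
	printf("addr_num=%d -> allocation length %u bytes\n",
	       addr_num, (unsigned)len);

	/* The fix rejects addr_num before any multiplication happens: */
	if (addr_num <= 0 || addr_num >= (int)(INT_MAX / addr_size))
		puts("new check: -EINVAL");
	return 0;
}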

Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
net/sctp/socket.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4421,7 +4421,9 @@ static int sctp_getsockopt_local_addrs_o
if (copy_from_user(&getaddrs, optval, len))
return -EFAULT;

- if (getaddrs.addr_num <= 0) return -EINVAL;
+ if (getaddrs.addr_num <= 0 ||
+ getaddrs.addr_num >= (INT_MAX / sizeof(union sctp_addr)))
+ return -EINVAL;
/*
* For UDP-style sockets, id specifies the association to query.
* If the id field is set to the value '0' then the locally bound

--

2008-06-22 19:06:15

by Greg KH

Subject: [patch 2/5] Reinstate ZERO_PAGE optimization in get_user_pages() and fix XIP

2.6.25-stable review patch. If anyone has any objections, please let us
know.

------------------
From: Linus Torvalds <[email protected]>

commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f upstream

KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
generally now populate the VM with real empty pages needlessly.

We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
since fault handling no longer uses ZERO_PAGE for new anonymous pages,
we now need to handle that special case in follow_page() instead.

In particular, the removal of ZERO_PAGE effectively removed the core
file writing optimization where we would skip writing pages that had not
been populated at all, and increased memory pressure a lot by allocating
all those useless newly zeroed pages.

This reinstates the optimization by making the unmapped PTE case the
same as for a non-existent page table, which already did this correctly.

While at it, this also fixes the XIP case for follow_page(), where the
caller could not differentiate between the case of a page that simply
could not be used (because it had no "struct page" associated with it)
and a page that just wasn't mapped.

We do that by simply returning an error pointer for pages that could not
be turned into a "struct page *". The error is arbitrarily picked to be
EFAULT, since that was what get_user_pages() already used for the
equivalent IO-mapped page case.

[ Also removed an impossible test for pte_offset_map_lock() failing:
that's not how that function works ]
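
For readers who don't know the convention, a simplified sketch of the
ERR_PTR()/IS_ERR()/PTR_ERR() encoding (modeled on the kernel's
<linux/err.h>, trimmed for illustration) that lets follow_page() report
three distinct outcomes through a single pointer return:

/* Error codes live in the top, never-mapped page of the address
 * space, so they cannot collide with real pointers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

struct page;	/* opaque for this sketch */

void handle_follow_page_result(struct page *page)
{
	if (IS_ERR(page)) {
		/* no usable struct page behind the mapping,
		 * e.g. -EFAULT for an IO-mapped page */
		long err = PTR_ERR(page);
		(void)err;
	} else if (!page) {
		/* nothing mapped there: ZERO_PAGE / -ENOENT handling */
	} else {
		/* a normal page we can use */
	}
}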

Acked-by: Oleg Nesterov <[email protected]>
Acked-by: Nick Piggin <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/powerpc/kernel/vdso.c | 2 +-
mm/memory.c | 17 +++++++++++++----
mm/migrate.c | 10 ++++++++++
3 files changed, 24 insertions(+), 5 deletions(-)

--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -141,7 +141,7 @@ static void dump_one_vdso_page(struct pa
printk("kpg: %p (c:%d,f:%08lx)", __va(page_to_pfn(pg) << PAGE_SHIFT),
page_count(pg),
pg->flags);
- if (upg/* && pg != upg*/) {
+ if (upg && !IS_ERR(upg) /* && pg != upg*/) {
printk(" upg: %p (c:%d,f:%08lx)", __va(page_to_pfn(upg)
<< PAGE_SHIFT),
page_count(upg),
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -943,17 +943,15 @@ struct page *follow_page(struct vm_area_
}

ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!ptep)
- goto out;

pte = *ptep;
if (!pte_present(pte))
- goto unlock;
+ goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
page = vm_normal_page(vma, address, pte);
if (unlikely(!page))
- goto unlock;
+ goto bad_page;

if (flags & FOLL_GET)
get_page(page);
@@ -968,6 +966,15 @@ unlock:
out:
return page;

+bad_page:
+ pte_unmap_unlock(ptep, ptl);
+ return ERR_PTR(-EFAULT);
+
+no_page:
+ pte_unmap_unlock(ptep, ptl);
+ if (!pte_none(pte))
+ return page;
+ /* Fall through to ZERO_PAGE handling */
no_page_table:
/*
* When core dumping an enormous anonymous area that nobody
@@ -1104,6 +1111,8 @@ int get_user_pages(struct task_struct *t

cond_resched();
}
+ if (IS_ERR(page))
+ return i ? i : PTR_ERR(page);
if (pages) {
pages[i] = page;

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -858,6 +858,11 @@ static int do_move_pages(struct mm_struc
goto set_status;

page = follow_page(vma, pp->addr, FOLL_GET);
+
+ err = PTR_ERR(page);
+ if (IS_ERR(page))
+ goto set_status;
+
err = -ENOENT;
if (!page)
goto set_status;
@@ -921,6 +926,11 @@ static int do_pages_stat(struct mm_struc
goto set_status;

page = follow_page(vma, pm->addr, 0);
+
+ err = PTR_ERR(page);
+ if (IS_ERR(page))
+ goto set_status;
+
err = -ENOENT;
/* Use PageReserved to check for zero page */
if (!page || PageReserved(page))

--

2008-06-22 19:06:34

by Greg KH

Subject: [patch 1/5] atl1: relax eeprom mac address error check

2.6.25-stable review patch. If anyone has any objections, please let us know.

------------------

From: Radu Cristescu <[email protected]>

upstream commit: 58c7821c4264a7ddd6f0c31c5caaf393b3897f10

The atl1 driver tries to determine the MAC address as follows:

- If an EEPROM exists, read the MAC address from EEPROM and
validate it.
- If an EEPROM doesn't exist, try to read a MAC address from
SPI flash.
- If that fails, try to read a MAC address directly from the
MAC Station Address register.
- If that fails, assign a random MAC address provided by the
kernel.

We now have a report of a system fitted with an EEPROM containing all
zeros where we expect the MAC address to be, and we currently handle
this as an error condition. Turns out, on this system the BIOS writes
a valid MAC address to the NIC's MAC Station Address register, but we
never try to read it because we return an error when we find the all-
zeros address in EEPROM.

This patch relaxes the error check and continues looking for a MAC
address even if it finds an illegal one in EEPROM.
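
The resulting lookup order, sketched as a compilable toy (the probe
helpers and the validity test are stand-ins invented for this sketch,
not the driver's real functions):

#include <string.h>

#define ETH_ALEN 6

/* Stub probes: each fills addr and returns 1 if the source responded. */
static int read_eeprom_mac(unsigned char *addr)
{
	memset(addr, 0, ETH_ALEN);	/* the reported case: all zeros */
	return 1;
}
static int read_spi_flash_mac(unsigned char *addr)
{
	(void)addr;
	return 0;
}
static int read_station_reg_mac(unsigned char *addr)
{
	static const unsigned char m[ETH_ALEN] = { 0, 0x13, 0xd4, 1, 2, 3 };
	memcpy(addr, m, ETH_ALEN);	/* the BIOS-programmed address */
	return 1;
}

static int mac_is_valid(const unsigned char *a)
{
	unsigned char acc = 0;
	int i;

	for (i = 0; i < ETH_ALEN; i++)
		acc |= a[i];
	return acc != 0 && !(a[0] & 1);	/* not all-zeros, not multicast */
}

int get_permanent_address(unsigned char *out)
{
	unsigned char addr[ETH_ALEN];

	if (read_eeprom_mac(addr) && mac_is_valid(addr))
		goto found;
	/* The old code returned an error here on a bad EEPROM address;
	 * the fix falls through and keeps looking. */
	if (read_spi_flash_mac(addr) && mac_is_valid(addr))
		goto found;
	if (read_station_reg_mac(addr) && mac_is_valid(addr))
		goto found;
	return 1;	/* caller assigns a random MAC */
found:
	memcpy(out, addr, ETH_ALEN);
	return 0;
}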

http://ubuntuforums.org/showthread.php?t=562617

[[email protected]: backport to 2.6.25.7]

Signed-off-by: Radu Cristescu <[email protected]>
Signed-off-by: Jay Cliburn <[email protected]>
Signed-off-by: Jeff Garzik <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/net/atl1/atl1_hw.c | 1 -
1 file changed, 1 deletion(-)

--- a/drivers/net/atl1/atl1_hw.c
+++ b/drivers/net/atl1/atl1_hw.c
@@ -250,7 +250,6 @@ static int atl1_get_permanent_address(st
memcpy(hw->perm_mac_addr, eth_addr, ETH_ALEN);
return 0;
}
- return 1;
}

/* see if SPI FLAGS exist ? */

--

2008-06-22 19:06:47

by Greg KH

Subject: [patch 5/5] x86: set PAE PHYSICAL_MASK_SHIFT to 44 bits.

2.6.25-stable review patch. If anyone has any objections, please let us
know.

------------------
From: Jeremy Fitzhardinge <[email protected]>

commit ad524d46f36bbc32033bb72ba42958f12bf49b06 upstream

When a 64-bit x86 processor runs in 32-bit PAE mode, a pte can
potentially have the same number of physical address bits as the
64-bit host ("Enhanced Legacy PAE Paging"). This means, in theory,
we could have up to 52 bits of physical address in a pte.

The 32-bit kernel uses a 32-bit unsigned long to represent a pfn.
This means that it can only represent physical addresses up to 32+12=44
bits wide. Rather than widening pfns everywhere, just set 2^44 as the
Linux x86_32-PAE architectural limit for physical address size.

This is a bugfix for two cases:
1. running a 32-bit PAE kernel on a machine with more than 64GB RAM.
2. running a 32-bit PAE Xen guest on a host machine with more than 64GB RAM.

In both cases, a pte could need to have more than 36 bits of physical
address, and masking it to 36 bits will cause fairly severe havoc.
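
A small illustrative calculation (plain userspace C, not kernel code) of
what the wider mask buys: with the old 36-bit mask, a pte whose frame
sits above 64GB silently aliases a much lower frame, while the 44-bit
mask leaves it intact:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t page_mask = ~0xfffULL;			/* 4 KiB pages */
	uint64_t mask36 = ((1ULL << 36) - 1) & page_mask;
	uint64_t mask44 = ((1ULL << 44) - 1) & page_mask;

	/* A PAE pte pointing at a frame above 64 GiB (= 2^36): */
	uint64_t pte_addr = 0x1080000000ULL;		/* 66 GiB */

	printf("36-bit mask: %#llx\n",		/* 0x80000000: wrong frame */
	       (unsigned long long)(pte_addr & mask36));
	printf("44-bit mask: %#llx\n",		/* unchanged */
	       (unsigned long long)(pte_addr & mask44));

	/* The limit itself: a 32-bit unsigned long pfn plus the 12-bit
	 * page offset covers 32 + 12 = 44 physical address bits. */
	return 0;
}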

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Jan Beulich <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/asm-x86/page_32.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- a/include/asm-x86/page_32.h
+++ b/include/asm-x86/page_32.h
@@ -14,7 +14,8 @@
#define __PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)

#ifdef CONFIG_X86_PAE
-#define __PHYSICAL_MASK_SHIFT 36
+/* 44=32+12, the limit we can fit into an unsigned long pfn */
+#define __PHYSICAL_MASK_SHIFT 44
#define __VIRTUAL_MASK_SHIFT 32
#define PAGETABLE_LEVELS 3


--

2008-06-22 19:23:45

by David Miller

Subject: Re: [patch 3/5] sctp: Make sure N * sizeof(union sctp_addr) does not overflow.

From: Greg KH <[email protected]>
Date: Sun, 22 Jun 2008 12:01:36 -0700

> 2.6.25-stable review patch. If anyone has any objections, please let us
> know.

Unfortunately, Vlad found another case in SCTP which has
an overflow bug similar to this one. I'll work on a
fix for that today and submit.

2008-06-22 19:26:41

by Linus Torvalds

Subject: Re: [patch 2/5] Reinstate ZERO_PAGE optimization in get_user_pages() and fix XIP



On Sun, 22 Jun 2008, Greg KH wrote:
>
> 2.6.25-stable review patch. If anyone has any objections, please let us
> know.

Let's wait for the vmware breakage report to sort out first.

http://lkml.org/lkml/2008/6/22/10

before moving it to -stable.

Linus

2008-06-22 20:23:20

by Johannes Weiner

Subject: Re: [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit

Hi,

Greg KH <[email protected]> writes:

> 2.6.25-stable review patch. If anyone has any objections, please let us
> know.
>
> ------------------
> From: Bernhard Walle <[email protected]>
>
> commit d3942cff620bea073fc4e3c8ed878eb1e84615ce upstream
>
> This patch uses BOOTMEM_EXCLUSIVE for the crashkernel reservation on
> i386 as well, and prints an error message on failure.
>
> The patch is still intended for 2.6.26 since it is only a bug fix. The
> unification of reserve_crashkernel() between i386 and x86_64 should be
> done for 2.6.27.
>
> Signed-off-by: Bernhard Walle <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> ---
> arch/x86/kernel/setup_32.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/kernel/setup_32.c
> +++ b/arch/x86/kernel/setup_32.c
> @@ -483,10 +483,16 @@ static void __init reserve_crashkernel(v
> (unsigned long)(crash_size >> 20),
> (unsigned long)(crash_base >> 20),
> (unsigned long)(total_mem >> 20));
> +
> + if (reserve_bootmem(crash_base, crash_size,
> + BOOTMEM_EXCLUSIVE) < 0) {
> + printk(KERN_INFO "crashkernel reservation "
> + "failed - memory is in use\n");
> + return;
> + }

You will also need the patch from http://lkml.org/lkml/2008/6/21/103 to
make sure reserve_bootmem() is not void (*)().

Hannes

2008-06-22 20:32:24

by Greg KH

Subject: Re: [patch 3/5] sctp: Make sure N * sizeof(union sctp_addr) does not overflow.

On Sun, Jun 22, 2008 at 12:23:32PM -0700, David Miller wrote:
> From: Greg KH <[email protected]>
> Date: Sun, 22 Jun 2008 12:01:36 -0700
>
> > 2.6.25-stable review patch. If anyone has any objections, please let us
> > know.
>
> Unfortunately, Vlad found another case in SCTP which has
> an overflow bug similar to this one. I'll work on a
> fix for that today and submit.

Thanks for letting me know, I'll wait for that one as well before doing
this release.

greg k-h

2008-06-22 20:32:49

by Greg KH

Subject: Re: [patch 2/5] Reinstate ZERO_PAGE optimization in get_user_pages() and fix XIP

On Sun, Jun 22, 2008 at 12:22:47PM -0700, Linus Torvalds wrote:
>
>
> On Sun, 22 Jun 2008, Greg KH wrote:
> >
> > 2.6.25-stable review patch. If anyone has any objections, please let us
> > know.
>
> Let's wait for the vmware breakage report to sort out first.
>
> http://lkml.org/lkml/2008/6/22/10
>
> before moving it to -stable.

Sure, thanks for pointing that out to me, I'll track it as well.

greg k-h

2008-06-22 20:33:08

by Greg KH

Subject: Re: [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit

On Sun, Jun 22, 2008 at 10:22:58PM +0200, Johannes Weiner wrote:
> Hi,
>
> Greg KH <[email protected]> writes:
>
> > 2.6.25-stable review patch. If anyone has any objections, please let us
> > know.
> >
> > ------------------
> > From: Bernhard Walle <[email protected]>
> >
> > commit d3942cff620bea073fc4e3c8ed878eb1e84615ce upstream
> >
> > This patch uses BOOTMEM_EXCLUSIVE for the crashkernel reservation on
> > i386 as well, and prints an error message on failure.
> >
> > The patch is still intended for 2.6.26 since it is only a bug fix. The
> > unification of reserve_crashkernel() between i386 and x86_64 should be
> > done for 2.6.27.
> >
> > Signed-off-by: Bernhard Walle <[email protected]>
> > Signed-off-by: Ingo Molnar <[email protected]>
> > Signed-off-by: Greg Kroah-Hartman <[email protected]>
> >
> > ---
> > arch/x86/kernel/setup_32.c | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > --- a/arch/x86/kernel/setup_32.c
> > +++ b/arch/x86/kernel/setup_32.c
> > @@ -483,10 +483,16 @@ static void __init reserve_crashkernel(v
> > (unsigned long)(crash_size >> 20),
> > (unsigned long)(crash_base >> 20),
> > (unsigned long)(total_mem >> 20));
> > +
> > + if (reserve_bootmem(crash_base, crash_size,
> > + BOOTMEM_EXCLUSIVE) < 0) {
> > + printk(KERN_INFO "crashkernel reservation "
> > + "failed - memory is in use\n");
> > + return;
> > + }
>
> You will also need the patch from http://lkml.org/lkml/2008/6/21/103 to
> make sure reserve_bootmem() is not void (*)().

Ok, let me know when that goes into Linus's tree please.

thanks,

greg k-h

2008-06-22 20:38:18

by Adrian Bunk

Subject: Re: [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit

On Sun, Jun 22, 2008 at 01:30:47PM -0700, Greg KH wrote:
> On Sun, Jun 22, 2008 at 10:22:58PM +0200, Johannes Weiner wrote:
> > Hi,
> >
> > Greg KH <[email protected]> writes:
> >
> > > 2.6.25-stable review patch. If anyone has any objections, please let us
> > > know.
> > >
> > > ------------------
> > > From: Bernhard Walle <[email protected]>
> > >
> > > commit d3942cff620bea073fc4e3c8ed878eb1e84615ce upstream
> > >
> > > This patch uses BOOTMEM_EXCLUSIVE for the crashkernel reservation on
> > > i386 as well, and prints an error message on failure.
> > >
> > > The patch is still intended for 2.6.26 since it is only a bug fix. The
> > > unification of reserve_crashkernel() between i386 and x86_64 should be
> > > done for 2.6.27.
> > >
> > > Signed-off-by: Bernhard Walle <[email protected]>
> > > Signed-off-by: Ingo Molnar <[email protected]>
> > > Signed-off-by: Greg Kroah-Hartman <[email protected]>
> > >
> > > ---
> > > arch/x86/kernel/setup_32.c | 10 ++++++++--
> > > 1 file changed, 8 insertions(+), 2 deletions(-)
> > >
> > > --- a/arch/x86/kernel/setup_32.c
> > > +++ b/arch/x86/kernel/setup_32.c
> > > @@ -483,10 +483,16 @@ static void __init reserve_crashkernel(v
> > > (unsigned long)(crash_size >> 20),
> > > (unsigned long)(crash_base >> 20),
> > > (unsigned long)(total_mem >> 20));
> > > +
> > > + if (reserve_bootmem(crash_base, crash_size,
> > > + BOOTMEM_EXCLUSIVE) < 0) {
> > > + printk(KERN_INFO "crashkernel reservation "
> > > + "failed - memory is in use\n");
> > > + return;
> > > + }
> >
> > You will also need the patch from http://lkml.org/lkml/2008/6/21/103 to
> > make sure reserve_bootmem() is not void (*)().
>
> Ok, let me know when that goes into Linus's tree please.

It's commit 71c2742f5e6348d76ee62085cf0a13e5eff0f00e

> thanks,
>
> greg k-h

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2008-06-22 20:40:52

by Linus Torvalds

Subject: Re: [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit



On Sun, 22 Jun 2008, Greg KH wrote:
>
> > You will also need the patch from http://lkml.org/lkml/2008/6/21/103 to
> > make sure reserve_bootmem() is not void (*)().
>
> Ok, let me know when that goes into Linus's tree please.

It already is: 71c2742f5e6348d76ee62085cf0a13e5eff0f00e.

Linus

2008-06-23 08:11:20

by Ingo Molnar

Subject: Re: [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit


* Linus Torvalds <[email protected]> wrote:

> On Sun, 22 Jun 2008, Greg KH wrote:
> >
> > > You will also need the patch from http://lkml.org/lkml/2008/6/21/103 to
> > > make sure reserve_bootmem() is not void (*)().
> >
> > Ok, let me know when that goes into Linus's tree please.
>
> It already is: 71c2742f5e6348d76ee62085cf0a13e5eff0f00e.

thanks. This patch (which was not a build fix but an infrastructure fix
that the kexec fix in arch/x86 depended on) is well-tested as well; it
was queued in -tip on June 10th:

| commit 91d48fc80f22817332170082e10de60a75851640
| Author: Bernhard Walle <[email protected]>
| Date: Sun Jun 8 15:46:29 2008 +0200
| CommitDate: Tue Jun 10 14:41:56 2008 +0200
|
| bootmem: add return value to reserve_bootmem_node()
|
| This patch changes the function reserve_bootmem_node() from void to
| int, returning -ENOMEM if the allocation fails.
|
| Signed-off-by: Bernhard Walle <[email protected]>
| Signed-off-by: Ingo Molnar <[email protected]>

so it is a -stable candidate just as much as the kexec fix. (These are
all fixes for long-standing problems so i guess it can go all the way
back to all stable kernels that are being maintained.)

Ingo

2008-06-23 10:34:00

by Bernhard Walle

Subject: Re: [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit

* Ingo Molnar <[email protected]> [2008-06-23 10:09]:

>
> | commit 91d48fc80f22817332170082e10de60a75851640
> | Author: Bernhard Walle <[email protected]>
> | Date: Sun Jun 8 15:46:29 2008 +0200
> | CommitDate: Tue Jun 10 14:41:56 2008 +0200
> |
> | bootmem: add return value to reserve_bootmem_node()
> |
> | This patch changes the function reserve_bootmem_node() from void to
> | int, returning -ENOMEM if the allocation fails.
> |
> | Signed-off-by: Bernhard Walle <[email protected]>
> | Signed-off-by: Ingo Molnar <[email protected]>
>
> so it is a -stable candidate just as much as the kexec fix. (These are
> all fixes for long-standing problems so i guess it can go all the way
> back to all stable kernels that are being maintained.)

Ingo,

shouldn't we add the reserve_bootmem_generic() fix [1] to 2.6.26-* at
least?


Bernhard

[1] 62b5ebe062c2801f6d40480ae3b91a64c8c8e6cb
--
Bernhard Walle, SUSE LINUX Products GmbH, Architecture Development

2008-06-23 10:54:20

by Ingo Molnar

Subject: Re: [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit


* Bernhard Walle <[email protected]> wrote:

> * Ingo Molnar <[email protected]> [2008-06-23 10:09]:
>
> >
> > | commit 91d48fc80f22817332170082e10de60a75851640
> > | Author: Bernhard Walle <[email protected]>
> > | Date: Sun Jun 8 15:46:29 2008 +0200
> > | CommitDate: Tue Jun 10 14:41:56 2008 +0200
> > |
> > | bootmem: add return value to reserve_bootmem_node()
> > |
> > | This patch changes the function reserve_bootmem_node() from void to
> > | int, returning -ENOMEM if the allocation fails.
> > |
> > | Signed-off-by: Bernhard Walle <[email protected]>
> > | Signed-off-by: Ingo Molnar <[email protected]>
> >
> > so it is a -stable candidate just as much as the kexec fix. (These are
> > all fixes for long-standing problems so i guess it can go all the way
> > back to all stable kernels that are being maintained.)
>
> Ingo,
>
> shouldn't we add the reserve_bootmem_generic() fix [1] to 2.6.26-* at
> least?
>
>
> Bernhard
>
> [1] 62b5ebe062c2801f6d40480ae3b91a64c8c8e6cb

but note that this too has dependencies, it relies on:

# tip/x86/numa: ddeb8ef: x86: add flags parameter to reserve_bootmem_generic()
# tip/x86/numa: 62b5ebe: x86: use reserve_bootmem_generic() to reserve crashkernel memory on x86_64

so i've initially delayed the whole topic to v2.6.27.

I've attached both patches below - are they really urgent enough to be
propagated to tip/x86/urgent and be sent to Linus? AFAICS these are
ancient issues with kernel crashdumping.

Ingo

---------------------->
commit ddeb8ef812cbe41739ea3d836681005e9646f922
Author: Bernhard Walle <[email protected]>
Date: Sun Jun 8 15:46:30 2008 +0200

x86: add flags parameter to reserve_bootmem_generic()

This patch adds a 'flags' parameter to reserve_bootmem_generic() like it
already has been added in reserve_bootmem() with commit
72a7fe3967dbf86cb34e24fbf1d957fe24d2f246.

It also changes all users to use BOOTMEM_DEFAULT, which doesn't effectively
change the behaviour. Since the change is x86-specific, I don't think it's
necessary to add a new API for migration. There are only 4 users of that
function.

The change is necessary for the next patch, using reserve_bootmem_generic()
for crashkernel reservation.

Signed-off-by: Bernhard Walle <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

diff --git a/arch/x86/kernel/mpparse.c b/arch/x86/kernel/mpparse.c
index 404683b..4901ae3 100644
--- a/arch/x86/kernel/mpparse.c
+++ b/arch/x86/kernel/mpparse.c
@@ -729,10 +729,11 @@ static int __init smp_scan_config(unsigned long base, unsigned long length,
if (!reserve)
return 1;

- reserve_bootmem_generic(virt_to_phys(mpf), PAGE_SIZE);
+ reserve_bootmem_generic(virt_to_phys(mpf), PAGE_SIZE,
+ BOOTMEM_DEFAULT);
if (mpf->mpf_physptr)
reserve_bootmem_generic(mpf->mpf_physptr,
- PAGE_SIZE);
+ PAGE_SIZE, BOOTMEM_DEFAULT);
#endif
return 1;
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 32ba13b..747c351 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -798,12 +798,13 @@ void free_initrd_mem(unsigned long start, unsigned long end)
}
#endif

-void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
+int __init reserve_bootmem_generic(unsigned long phys, unsigned len, int flags)
{
#ifdef CONFIG_NUMA
int nid, next_nid;
#endif
unsigned long pfn = phys >> PAGE_SHIFT;
+ int ret;

if (pfn >= end_pfn) {
/*
@@ -811,11 +812,11 @@ void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
* firmware tables:
*/
if (pfn < max_pfn_mapped)
- return;
+ return -EFAULT;

printk(KERN_ERR "reserve_bootmem: illegal reserve %lx %u\n",
phys, len);
- return;
+ return -EFAULT;
}

/* Should check here against the e820 map to avoid double free */
@@ -823,9 +824,13 @@ void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
nid = phys_to_nid(phys);
next_nid = phys_to_nid(phys + len - 1);
if (nid == next_nid)
- reserve_bootmem_node(NODE_DATA(nid), phys, len, BOOTMEM_DEFAULT);
+ ret = reserve_bootmem_node(NODE_DATA(nid), phys, len, flags);
else
- reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
+ ret = reserve_bootmem(phys, len, flags);
+
+ if (ret != 0)
+ return ret;
+
#else
reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
#endif
@@ -834,6 +839,8 @@ void __init reserve_bootmem_generic(unsigned long phys, unsigned len)
dma_reserve += len / PAGE_SIZE;
set_dma_reserve(dma_reserve);
}
+
+ return 0;
}

int kern_addr_valid(unsigned long addr)
diff --git a/include/asm-x86/proto.h b/include/asm-x86/proto.h
index 6c8b41b..a9f5147 100644
--- a/include/asm-x86/proto.h
+++ b/include/asm-x86/proto.h
@@ -14,7 +14,7 @@ extern void ia32_syscall(void);
extern void ia32_cstar_target(void);
extern void ia32_sysenter_target(void);

-extern void reserve_bootmem_generic(unsigned long phys, unsigned len);
+extern int reserve_bootmem_generic(unsigned long phys, unsigned len, int flags);

extern void syscall32_cpu_init(void);


--------------->
commit 62b5ebe062c2801f6d40480ae3b91a64c8c8e6cb
Author: Bernhard Walle <[email protected]>
Date: Sun Jun 8 15:46:31 2008 +0200

x86: use reserve_bootmem_generic() to reserve crashkernel memory on x86_64

This patch uses reserve_bootmem_generic() instead of reserve_bootmem()
to reserve the crashkernel memory on x86_64. That's necessary for NUMA
machines, see 00212fef814612245ed0261cbac8426d0c9a31a5:

[PATCH] Fix kdump Crash Kernel boot memory reservation for NUMA machines

This patch will fix a boot memory reservation bug that trashes memory on
the ES7000 when loading the kdump crash kernel.

The code in arch/x86_64/kernel/setup.c to reserve boot memory for the crash
kernel uses the non-numa aware "reserve_bootmem" function instead of the
NUMA aware "reserve_bootmem_generic". I checked to make sure that no other
function was using "reserve_bootmem" and found none, except the ones that
had NUMA ifdef'ed out.

I have tested this patch only on an ES7000 with NUMA on and off (numa=off)
in a single (non-NUMA) and multi-cell (NUMA) configurations.

Signed-off-by: Amul Shah <[email protected]>
Looks-good-to: Vivek Goyal <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

The switch-back to reserve_bootmem() was accidentally introduced in
5c3391f9f749023a49c64d607da4fb49263690eb when adding the BOOTMEM_EXCLUSIVE
parameter.

Signed-off-by: Bernhard Walle <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index e8df64f..4a666cd 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -243,7 +243,7 @@ static void __init reserve_crashkernel(void)
return;
}

- if (reserve_bootmem(crash_base, crash_size,
+ if (reserve_bootmem_generic(crash_base, crash_size,
BOOTMEM_EXCLUSIVE) < 0) {
printk(KERN_INFO "crashkernel reservation failed - "
"memory is in use\n");

2008-06-23 11:19:25

by S.Çağlar Onur

Subject: Re: [patch 0/5] 2.6.25-stable review

Hi Greg and -stable team;

On Sunday, 22 June 2008, Greg KH wrote:
> Responses should be made by Tuesday, June 24, 18:00:00 UTC. Anything
> received after that time might be too late.

Please consider the following commit for -stable also; it definitely fixes a boot failure caused by the reported oops:

commit 1f6ef2342972dc7fd623f360f84006e2304eb935
Author: Linus Torvalds <[email protected]>
Date: Fri Jun 20 12:19:28 2008 -0700

[watchdog] hpwdt: fix use of inline assembly

The inline assembly in drivers/watchdog/hpwdt.c was incredibly broken,
and included all the function prologue and epilogue stuff, even though
it was itself then inside a C function where the compiler would add its
own prologue and epilogue on top of it all.

This then just _happened_ to work if you had exactly the right compiler
version and exactly the right compiler flags, so that gcc just happened
to not create any prologue at all (the gcc-generated epilogue wouldn't
matter, since it would never be reached).

But the more proper way to fix it is to simply not do this. Move the
inline asm to the top level, with no surrounding function at all (the
better alternative would be to remove the prologue and make it actually
use proper description of the arguments to the inline asm, but that's a
bigger change than the one I'm willing to make right now).

Tested-by: S.Çağlar Onur <[email protected]>
Acked-by: Thomas Mingarelli <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
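
Schematically, the broken and fixed shapes look like this (a hand-written
i386 sketch to illustrate the commit message, not the hpwdt code itself):

/* Broken shape: a complete routine, prologue and all, written as inline
 * asm *inside* a C function. The compiler wraps its own prologue and
 * epilogue around it, so this only "works" if gcc happens to emit an
 * empty prologue for the enclosing function. */
static void call_bios_broken(void)
{
	asm volatile("pushl %ebp\n\t"
		     "movl  %esp, %ebp\n\t"
		     /* ... BIOS call sequence ... */
		     "leave\n\t"
		     "ret");
}

/* Fixed shape: the same asm moved to the top level, outside any C
 * function, so it defines a standalone routine with exactly the
 * prologue it writes and nothing more. */
asm(".text\n"
    ".globl call_bios\n"
    "call_bios:\n\t"
    "pushl %ebp\n\t"
    "movl  %esp, %ebp\n\t"
    /* ... BIOS call sequence ... */
    "leave\n\t"
    "ret\n");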


Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

2008-06-23 13:21:34

by Bernhard Walle

Subject: Re: [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit

* Ingo Molnar <[email protected]> [2008-06-23 12:53]:
>
> > [1] 62b5ebe062c2801f6d40480ae3b91a64c8c8e6cb
>
> but note that this too has dependencies, it relies on:
>
> # tip/x86/numa: ddeb8ef: x86: add flags parameter to reserve_bootmem_generic()
> # tip/x86/numa: 62b5ebe: x86: use reserve_bootmem_generic() to reserve crashkernel memory on x86_64

The 2nd is not the dependency but the commit itself.

> so i've initially delayed the whole topic to v2.6.27.

Ok, you have more experience with which patches should go into 2.6.26
at this point in time, so it's acceptable to me.

> I've attached both patches below - are they really urgent enough to be
> propagated to tip/x86/urgent and be sent to Linus? AFAICS these are
> ancient issues with kernel crashdumping.

I only brought up that topic again because it's a regression between
2.6.22 and 2.6.23 caused by 5c3391f9f749023a49c64d607da4fb49263690eb.



Bernhard
--
Bernhard Walle, SUSE LINUX Products GmbH, Architecture Development

2008-06-23 15:32:51

by Jeff Chua

Subject: Re: [patch 2/5] Reinstate ZERO_PAGE optimization in get_user_pages() and fix XIP

On Mon, Jun 23, 2008 at 4:29 AM, Greg KH <[email protected]> wrote:
> On Sun, Jun 22, 2008 at 12:22:47PM -0700, Linus Torvalds wrote:
>> Let's wait for the vmware breakage report to sort out first.
>> http://lkml.org/lkml/2008/6/22/10
>> before moving it to -stable.
>
> Sure, thanks for pointing that out to me, I'll track it as well.

I can confirm that the 2nd patch from Linus fixed the problem.

http://lkml.org/lkml/2008/6/22/107

Sorry it took so long. Traveling.

Thanks,
Jeff.

2008-06-23 16:09:12

by Hugh Dickins

Subject: Re: [patch 2/5] Reinstate ZERO_PAGE optimization in get_user_pages() and fix XIP

On Mon, 23 Jun 2008, Jeff Chua wrote:
> On Mon, Jun 23, 2008 at 4:29 AM, Greg KH <[email protected]> wrote:
> > On Sun, Jun 22, 2008 at 12:22:47PM -0700, Linus Torvalds wrote:
> >> Let's wait for the vmware breakage report to sort out first.
> >> http://lkml.org/lkml/2008/6/22/10
> >> before moving it to -stable.
> >
> > Sure, thanks for pointing that out to me, I'll track it as well.
>
> I can confirm that the 2nd patch from Linus fixed the problem.
>
> http://lkml.org/lkml/2008/6/22/107
>
> Sorry it took so long. Traveling.

Long?! That was very quick, thanks for reporting back.

But I'm afraid you've pushed me into taking another look at that
patch, and I see a problem with it. To be honest, I've lost the
plot on this issue, and didn't really get what your problem is,
nor how Linus expected to be fixing it.

The problem is that "insane" VM_LOCKED test which he has removed.
I've remembered now what that's about: it's for make_pages_present.
We do want mlocking a readonly area to make its pages present, even
if they're not at this moment writable: we don't want the ZERO_PAGE
substitution in that case.

So I think Linus needs to factor that into the final patch,
whilst at the same time solving whatever is the vmware breakage.

Hugh

2008-06-23 16:43:26

by Linus Torvalds

Subject: Re: [patch 2/5] Reinstate ZERO_PAGE optimization in get_user_pages() and fix XIP



On Mon, 23 Jun 2008, Hugh Dickins wrote:

> On Mon, 23 Jun 2008, Jeff Chua wrote:
> >
> > I can confirm that the 2nd patch from Linus fixed the problem.
> >
> > http://lkml.org/lkml/2008/6/22/107
>
> But I'm afraid you've pushed me into taking another look at that
> patch, and I see a problem with it. To be honest, I've lost the
> plot on this issue, and didn't really get what your problem is,
> nor how Linus expected to be fixing it.

The problem is that the old code said:

- we can use FOLL_ANON, assuming that the vma has no vm_ops, or has no
"fault" callback.

That was fundamentally broken. Because you can have a "nopfn" callback.
But it's hard to notice, since the whole FOLL_ANON code only _used_ to
trigger if a whole page table was missing.

The VM_LOCKED test was just crazy, but I doubt it was the cause of the
bug.

> The problem is that "insane" VM_LOCKED test which he has removed.
> I've remembered now what that's about: it's for make_pages_present.

That's still crazy. make_pages_present() already does:

write = (vma->vm_flags & VM_WRITE) != 0;

and passes that in to "get_user_pages()". So for a writable mapping, we'll
elide the FOLL_ANON case anyway, and for a read-only mapping we should
have used ZERO_PAGE. Damn. Oh, well.

We can certainly re-instate the insane behaviour for mlock(). Not that we
historically used to - we used to just map in ZERO_PAGE.

> So I think Linus needs to factor that into the final patch,
> whilst at the same time solving whatever is the vmware breakage.

So here's a third patch to test. It removes the VM_SHARED thing just to
get us closer to the original code (and because do_no_page() didn't do it
historically, so let's not do it either), and it re-instates the insane
VM_LOCKED test with a comment.

Jeff, does this still work with vmware?

Linus

---
mm/memory.c | 20 ++++++++++++++++++--
1 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 9aefaae..a2ce28d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1045,6 +1045,23 @@ no_page_table:
return page;
}

+/* Can we do the FOLL_ANON optimization? */
+static inline int use_zero_page(struct vm_area_struct *vma)
+{
+ /*
+ * We don't want to optimize FOLL_ANON for make_pages_present()
+ * when it tries to page in a VM_LOCKED region.
+ */
+ if (vma->vm_flags & VM_LOCKED)
+ return 0;
+ /*
+ * And if we have a fault or a nopfn routine, it's not an
+ * anonymous region.
+ */
+ return !vma->vm_ops ||
+ (!vma->vm_ops->fault && !vma->vm_ops->nopfn);
+}
+
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
@@ -1119,8 +1136,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
foll_flags = FOLL_TOUCH;
if (pages)
foll_flags |= FOLL_GET;
- if (!write && !(vma->vm_flags & VM_LOCKED) &&
- (!vma->vm_ops || !vma->vm_ops->fault))
+ if (!write && use_zero_page(vma))
foll_flags |= FOLL_ANON;

do {

2008-06-23 17:05:24

by Jeff Chua

Subject: Re: [patch 2/5] Reinstate ZERO_PAGE optimization in get_user_pages() and fix XIP

On Tue, Jun 24, 2008 at 12:39 AM, Linus Torvalds
<[email protected]> wrote:

> So here's a third patch to test. It removes the VM_SHARED thing just to
> get us closer to the original code (and because do_no_page() didn't do it
> historically, so let's not do it either), and it re-instates the insane
> VM_LOCKED test with a comment.
>
> Jeff, does this still work with vmware?


No, this breaks vmware. Does this trace help?

Jun 24 00:54:49.325: vmx| NOT_IMPLEMENTED
/build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774
Jun 24 00:54:49.325: vmx| Backtrace:
Jun 24 00:54:49.325: vmx| Backtrace[0] 0xbffc30c8 eip 0x8052f10
Jun 24 00:54:49.325: vmx| Backtrace[1] 0xbffc34f8 eip 0x80f2f7d
Jun 24 00:54:49.325: vmx| Backtrace[2] 0xbffc3548 eip 0x80e4b15
Jun 24 00:54:49.325: vmx| Backtrace[3] 0xbffc3638 eip 0x837b341
Jun 24 00:54:49.325: vmx| Backtrace[4] 0xbffc3688 eip 0x837cde4
Jun 24 00:54:49.325: vmx| Backtrace[5] 0xbffc36b8 eip 0x80fda89
Jun 24 00:54:49.325: vmx| Backtrace[6] 0xbffc36e8 eip 0x80f36f5
Jun 24 00:54:49.325: vmx| Backtrace[7] 0xbffc3728 eip 0x80f3bd4
Jun 24 00:54:49.325: vmx| Backtrace[8] 0xbffc3788 eip 0x80511be
Jun 24 00:54:49.325: vmx| Backtrace[9] 0xbffc3878 eip 0x8051561
Jun 24 00:54:49.325: vmx| Backtrace[10] 0xbffc38e8 eip 0xb7e374c0
Jun 24 00:54:49.325: vmx| Backtrace[11] 00000000 eip 0x804e7b1
Jun 24 00:54:49.325: vmx| Core dump limit is 0 kb.
Jun 24 00:54:49.326: vmx| Cannot remap region MonWired (addr=(nil),
size=0x13000, offset=0x19000)
Jun 24 00:54:49.326: vmx| Cannot remap region PShareMPN (addr=(nil),
size=0x1000, offset=0x18000)
Jun 24 00:54:49.326: vmx| Remapping region BusMemFrame1 as MAP_PRIVATE
(addr=0xb7f9c000, size=0x1000, offset=0x17000)
Jun 24 00:54:49.326: vmx| Remapping region BusMemFrame0 as MAP_PRIVATE
(addr=0xb7f9d000, size=0x1000, offset=0x16000)
Jun 24 00:54:49.326: vmx| Cannot remap region PhysRegion0 (addr=(nil),
size=0x1000, offset=0x15000)
Jun 24 00:54:49.326: vmx| Msg_Post: Error
Jun 24 00:54:49.326: vmx| [msg.log.error.unrecoverable] VMware
Workstation unrecoverable error: (vmx)
Jun 24 00:54:49.326: vmx| NOT_IMPLEMENTED
/build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774


Thanks,
Jeff.




> mm/memory.c | 20 ++++++++++++++++++--
> 1 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9aefaae..a2ce28d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1045,6 +1045,23 @@ no_page_table:
> return page;
> }
>
> +/* Can we do the FOLL_ANON optimization? */
> +static inline int use_zero_page(struct vm_area_struct *vma)
> +{
> + /*
> + * We don't want to optimize FOLL_ANON for make_pages_present()
> + * when it tries to page in a VM_LOCKED region.
> + */
> + if (vma->vm_flags & VM_LOCKED)
> + return 0;
> + /*
> + * And if we have a fault or a nopfn routine, it's not an
> + * anonymous region.
> + */
> + return !vma->vm_ops ||
> + (!vma->vm_ops->fault && !vma->vm_ops->nopfn);
> +}
> +
> int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, int len, int write, int force,
> struct page **pages, struct vm_area_struct **vmas)
> @@ -1119,8 +1136,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> foll_flags = FOLL_TOUCH;
> if (pages)
> foll_flags |= FOLL_GET;
> - if (!write && !(vma->vm_flags & VM_LOCKED) &&
> - (!vma->vm_ops || !vma->vm_ops->fault))
> + if (!write && use_zero_page(vma))
> foll_flags |= FOLL_ANON;
>
> do {
>

2008-06-23 17:30:40

by Linus Torvalds

Subject: Re: [patch 2/5] Reinstate ZERO_PAGE optimization in get_user_pages() and fix XIP



On Tue, 24 Jun 2008, Jeff Chua wrote:
> On Tue, Jun 24, 2008 at 12:39 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Jeff, does this still work with vmware?
>
>
> No, this breaks vmware. Does this trace help?

Not really. I have no idea what vmware does, so any traces from vmware are
pretty useless.

On the other hand, if you add a trace to the "use_zero_page()" function to
print out the vm_flags and other details, that probably would help.

That said, since the previous patch _did_ work, I bet that one that does
both VM_LOCKED and VM_SHARED works too. There was a reason I wanted to do
that VM_SHARED test. I think the VM_SHARED test is sane, unlike the
VM_LOCKED test (that is a fairly dubious hack for mlock).

So here's the final version. I bet it works.

Linus
---
mm/memory.c | 23 +++++++++++++++++++++--
1 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 9aefaae..423e0e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1045,6 +1045,26 @@ no_page_table:
return page;
}

+/* Can we do the FOLL_ANON optimization? */
+static inline int use_zero_page(struct vm_area_struct *vma)
+{
+ /*
+ * We don't want to optimize FOLL_ANON for make_pages_present()
+ * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
+ * we want to get the page from the page tables to make sure
+ * that we serialize and update with any other user of that
+ * mapping.
+ */
+ if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
+ return 0;
+ /*
+ * And if we have a fault or a nopfn routine, it's not an
+ * anonymous region.
+ */
+ return !vma->vm_ops ||
+ (!vma->vm_ops->fault && !vma->vm_ops->nopfn);
+}
+
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
@@ -1119,8 +1139,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
foll_flags = FOLL_TOUCH;
if (pages)
foll_flags |= FOLL_GET;
- if (!write && !(vma->vm_flags & VM_LOCKED) &&
- (!vma->vm_ops || !vma->vm_ops->fault))
+ if (!write && use_zero_page(vma))
foll_flags |= FOLL_ANON;

do {

2008-06-23 18:15:52

by Jeff Chua

Subject: Re: [patch 2/5] Reinstate ZERO_PAGE optimization in get_user_pages() and fix XIP

On Tue, Jun 24, 2008 at 1:27 AM, Linus Torvalds
<[email protected]> wrote:

> On the other hand, if you add a trace to the "use_zero_page()" function to
> print out the vm_flags and other details, that probably would help.

Let me know if you still want me to test this.

> That said, since the previous patch _did_ work, I bet that one that does
> both VM_LOCKED and VM_SHARED works too. There was a reason I wanted
> to do that VM_SHARED test. I think the VM_SHARED test is sane, unlike the
> VM_LOCKED test (that is a fairly dubious hack for mlock).
> So here's the final version. I bet it works.

Yeh, it works great! Thank you.

Jeff.

> mm/memory.c | 23 +++++++++++++++++++++--
> 1 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9aefaae..423e0e7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1045,6 +1045,26 @@ no_page_table:
> return page;
> }
>
> +/* Can we do the FOLL_ANON optimization? */
> +static inline int use_zero_page(struct vm_area_struct *vma)
> +{
> + /*
> + * We don't want to optimize FOLL_ANON for make_pages_present()
> + * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
> + * we want to get the page from the page tables to make sure
> + * that we serialize and update with any other user of that
> + * mapping.
> + */
> + if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
> + return 0;
> + /*
> + * And if we have a fault or a nopfn routine, it's not an
> + * anonymous region.
> + */
> + return !vma->vm_ops ||
> + (!vma->vm_ops->fault && !vma->vm_ops->nopfn);
> +}
> +
> int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, int len, int write, int force,
> struct page **pages, struct vm_area_struct **vmas)
> @@ -1119,8 +1139,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> foll_flags = FOLL_TOUCH;
> if (pages)
> foll_flags |= FOLL_GET;
> - if (!write && !(vma->vm_flags & VM_LOCKED) &&
> - (!vma->vm_ops || !vma->vm_ops->fault))
> + if (!write && use_zero_page(vma))
> foll_flags |= FOLL_ANON;
>
> do {
>

2008-06-23 18:37:27

by Linus Torvalds

Subject: Re: [patch 2/5] Reinstate ZERO_PAGE optimization in get_user_pages() and fix XIP



On Tue, 24 Jun 2008, Jeff Chua wrote:

> On Tue, Jun 24, 2008 at 1:27 AM, Linus Torvalds
> <[email protected]> wrote:
>
> > On the other hand, if you add a trace to the "use_zero_page()" function to
> > print out the vm_flags and other details, that probably would help.
>
> Let me know if you still want me to test this.

No, it's fine. It really was a bug, and a long-standing one, just one that
was probably practically impossible to hit before (because we used to only
do the FOLL_ANON logic on missing whole page tables, and just about any
access to any mapping even nearby the one you care about will fill in the
page tables - so you would have had to be really unlucky to trigger the
case before).

The patch clearly fixes an issue, and makes the code more readable and
maintainable too, so I don't care what the exact mapping flags etc were.

> Yeh, it works great! Thank you.

Thanks for bisecting, reporting and testing.

Linus

2008-06-23 19:31:42

by Greg KH

Subject: Re: [stable] [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit

On Mon, Jun 23, 2008 at 10:09:40AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > On Sun, 22 Jun 2008, Greg KH wrote:
> > >
> > > > You will also need the patch from http://lkml.org/lkml/2008/6/21/103 to
> > > > make sure reserve_bootmem() is not void (*)().
> > >
> > > Ok, let me know when that goes into Linus's tree please.
> >
> > It already is: 71c2742f5e6348d76ee62085cf0a13e5eff0f00e.

Thanks, I'll go add that one as well.

> thanks. This patch (which was not a build fix but an infrastructure fix
> that the kexec fix in arch/x86 depended on) is well-tested as well; it
> was queued in -tip on June 10th:
>
> | commit 91d48fc80f22817332170082e10de60a75851640
> | Author: Bernhard Walle <[email protected]>
> | Date: Sun Jun 8 15:46:29 2008 +0200
> | CommitDate: Tue Jun 10 14:41:56 2008 +0200
> |
> | bootmem: add return value to reserve_bootmem_node()
> |
> | This patch changes the function reserve_bootmem_node() from void to
> | int, returning -ENOMEM if the allocation fails.
> |
> | Signed-off-by: Bernhard Walle <[email protected]>
> | Signed-off-by: Ingo Molnar <[email protected]>
>
> so it is a -stable candidate just as much as the kexec fix. (These are
> all fixes for long-standing problems so i guess it can go all the way
> back to all stable kernels that are being maintained.)

Hm, but it's not in Linus's tree yet, so I can't take it for stable at
this time :(

thanks,

greg k-h

2008-06-23 19:32:18

by Greg KH

Subject: Re: [stable] [patch 0/5] 2.6.25-stable review

On Mon, Jun 23, 2008 at 02:19:05PM +0300, S.Çağlar Onur wrote:
> Hi Greg and -stable team;
>
> On Sunday, 22 June 2008, Greg KH wrote:
> > Responses should be made by Tuesday, June 24, 18:00:00 UTC. Anything
> > received after that time might be too late.
>
> Please consider the following commit for -stable also; it definitely fixes a boot failure caused by the reported oops:
>
> commit 1f6ef2342972dc7fd623f360f84006e2304eb935

Thanks, I've added that one now as well.

greg k-h

2008-06-23 19:38:20

by Ingo Molnar

Subject: Re: [stable] [patch 4/5] x86: use BOOTMEM_EXCLUSIVE on 32-bit


* Greg KH <[email protected]> wrote:

> > thanks. This patch (which was not a build fix but an infrastructure
> > fix that the kexec fix in arch/x86 depended on) is well-tested as
> > well; it was queued in -tip on June 10th:
> >
> > | commit 91d48fc80f22817332170082e10de60a75851640
> > | Author: Bernhard Walle <[email protected]>
> > | Date: Sun Jun 8 15:46:29 2008 +0200
> > | CommitDate: Tue Jun 10 14:41:56 2008 +0200
> > |
> > | bootmem: add return value to reserve_bootmem_node()
> > |
> > | This patch changes the function reserve_bootmem_node() from void to
> > | int, returning -ENOMEM if the allocation fails.
> > |
> > | Signed-off-by: Bernhard Walle <[email protected]>
> > | Signed-off-by: Ingo Molnar <[email protected]>
> >
> > so it is a -stable candidate just as much as the kexec fix. (These
> > are all fixes for long-standing problems so i guess it can go all
> > the way back to all stable kernels that are being maintained.)
>
> Hm, but it's not in Linus's tree yet, so I can't take it for stable at
> this time :(

it's all fine already: it's the very same patch you just added, but
different sha1. I just pointed out the lineage and the testing status of
the patch.

Ingo

2008-06-23 21:36:33

by David Miller

Subject: Re: [patch 3/5] sctp: Make sure N * sizeof(union sctp_addr) does not overflow.

From: Greg KH <[email protected]>
Date: Sun, 22 Jun 2008 13:28:47 -0700

> On Sun, Jun 22, 2008 at 12:23:32PM -0700, David Miller wrote:
> > From: Greg KH <[email protected]>
> > Date: Sun, 22 Jun 2008 12:01:36 -0700
> >
> > > 2.6.25-stable review patch. If anyone has any objections, please let us
> > > know.
> >
> > Unfortunately, Vlad found another case in SCTP which has
> > an overflow bug similar to this one. I'll work on a
> > fix for that today and submit.
>
> Thanks for letting me know, I'll wait for that one as well before doing
> this release.

This one turned out to be a false alarm, and Vlad confirmed my
analysis today. So there is no other SCTP patch you need to
wait for.

Thanks!

2008-06-23 21:59:20

by Greg KH

Subject: Re: [patch 3/5] sctp: Make sure N * sizeof(union sctp_addr) does not overflow.

On Mon, Jun 23, 2008 at 02:36:21PM -0700, David Miller wrote:
> From: Greg KH <[email protected]>
> Date: Sun, 22 Jun 2008 13:28:47 -0700
>
> > On Sun, Jun 22, 2008 at 12:23:32PM -0700, David Miller wrote:
> > > From: Greg KH <[email protected]>
> > > Date: Sun, 22 Jun 2008 12:01:36 -0700
> > >
> > > > 2.6.25-stable review patch. If anyone has any objections, please let us
> > > > know.
> > >
> > > Unfortunately, Vlad found another case in SCTP which has
> > > an overflow bug similar to this one. I'll work on a
> > > fix for that today and submit.
> >
> > Thanks for letting me know, I'll wait for that one as well before doing
> > this release.
>
> This one turned out to be a false alarm, and Vlad confirmed my
> analysis today. So there is no other SCTP patch you need to
> wait for.

Great, thanks for letting me know.

greg k-h