From: KOSAKI Motohiro <[email protected]>
Hi Linus and all
Sorry for the long long delay.
I think I've finally finished a code audit of mempolicy.c. It was hard work; multiple
mistakes hid the real issues. :-/
Yes, my code auditing is slow as usual, I know. But slow work is probably better
than doing nothing.
Sadly, I found multiple long-standing bugs, and almost all patches in this area were
committed with no review. That's sad. Therefore, I'd like to maintain this area
for a while (hopefully a short while), until the known breakage is fixed or another
volunteer is found. The purpose is not to develop new features; it is to prevent more
corruption until the known issues are fixed. (Of course, if anyone else wants to take
the role, I'd be really happy, and will drop patch 6/6.)
Here is a band-aid patch set for mempolicy. It should fix all known user-visible
corruptions. But, I know, a lot of work remains. Below are the unfixed issues
(they are broken by design; I can't fix them without a design change):
1) The mpol mount option of tmpfs doesn't work correctly.
When the =static mount option modifier is not used, the nodemask is restricted by the
cpuset of the *inode creation* thread and ignores the page allocation thread.
2) cpuset rebinding doesn't work correctly if shmem is shared by multiple processes.
mbind(2) (and set_mempolicy(2)) stores a pre-calculated mask, i.e.
(mbind-mask & cpuset-mask & online-nodes-mask), into struct mempolicy,
and the mempolicy is attached to vmas and shmem regions. But, of course,
the cpuset-mask depends on the thread. Thus we can't calculate the final nodemask
ahead of time.
3) mbind(2) doesn't work correctly if shmem is shared by multiple processes.
The same as (2). mbind(2) stores the final nodemask, but another thread has a
different final nodemask. The calculation is incorrect.
So, I think we should completely reconsider shared mempolicy. If I remember
correctly, Hugh said about 4 years ago that he worried about this feature because
nobody seemed to use it and it had lots of bugs. As usual, he was correct. Now I'm
convinced nobody uses this feature: all of my patches fix trivial paths, and we can
easily reproduce kernel memory corruption. And the feature doesn't even work
properly (see the todo list above). Thus, I'd like to hear his opinion. If he
still dislikes the shared mempolicy feature, I'm OK with removing it entirely instead
of applying the current band-aid patch set.
Thank you.
-----------------------------------------------
KOSAKI Motohiro (6):
Revert "mm: mempolicy: Let vma_merge and vma_split handle
vma->vm_policy linkages"
mempolicy: Kill all mempolicy sharing
mempolicy: fix a race in shared_policy_replace()
mempolicy: fix refcount leak in mpol_set_shared_policy()
mempolicy: fix a memory corruption by refcount imbalance in
alloc_pages_vma()
MAINTAINERS: Added MEMPOLICY entry
MAINTAINERS | 7 +++
mm/mempolicy.c | 151 ++++++++++++++++++++++++++++++++++++++++----------------
mm/shmem.c | 9 ++--
3 files changed, 120 insertions(+), 47 deletions(-)
From: KOSAKI Motohiro <[email protected]>
commit 05f144a0d5 removed the vma->vm_policy update code, but updating vma->vm_policy
is the purpose of mbind_range(). Now, mbind_range() is virtually a no-op. A no-op
function doesn't make any bugs, I agree, but it is probably not the right fix.
Cc: Dave Jones <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/mempolicy.c | 41 ++++++++++++++++++++++++-----------------
1 files changed, 24 insertions(+), 17 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f15c1b2..0a60def 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -607,6 +607,27 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
return first;
}
+/* Apply policy to a single VMA */
+static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
+{
+ int err = 0;
+ struct mempolicy *old = vma->vm_policy;
+
+ pr_debug("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
+ vma->vm_start, vma->vm_end, vma->vm_pgoff,
+ vma->vm_ops, vma->vm_file,
+ vma->vm_ops ? vma->vm_ops->set_policy : NULL);
+
+ if (vma->vm_ops && vma->vm_ops->set_policy)
+ err = vma->vm_ops->set_policy(vma, new);
+ if (!err) {
+ mpol_get(new);
+ vma->vm_policy = new;
+ mpol_put(old);
+ }
+ return err;
+}
+
/* Step 2: apply policy to a range and do splits. */
static int mbind_range(struct mm_struct *mm, unsigned long start,
unsigned long end, struct mempolicy *new_pol)
@@ -655,23 +676,9 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
if (err)
goto out;
}
-
- /*
- * Apply policy to a single VMA. The reference counting of
- * policy for vma_policy linkages has already been handled by
- * vma_merge and split_vma as necessary. If this is a shared
- * policy then ->set_policy will increment the reference count
- * for an sp node.
- */
- pr_debug("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
- vma->vm_start, vma->vm_end, vma->vm_pgoff,
- vma->vm_ops, vma->vm_file,
- vma->vm_ops ? vma->vm_ops->set_policy : NULL);
- if (vma->vm_ops && vma->vm_ops->set_policy) {
- err = vma->vm_ops->set_policy(vma, new_pol);
- if (err)
- goto out;
- }
+ err = policy_vma(vma, new_pol);
+ if (err)
+ goto out;
}
out:
--
1.7.1
From: KOSAKI Motohiro <[email protected]>
Dave Jones' system call fuzz testing tool "trinity" triggered the following
bug with slab debugging enabled:
[ 7613.229315] =============================================================================
[ 7613.229955] BUG numa_policy (Not tainted): Poison overwritten
[ 7613.230560] -----------------------------------------------------------------------------
[ 7613.230560]
[ 7613.231834] INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
[ 7613.232518] INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
[ 7613.233188] __slab_alloc+0x3d3/0x445
[ 7613.233877] kmem_cache_alloc+0x29d/0x2b0
[ 7613.234564] mpol_new+0xa3/0x140
[ 7613.235236] sys_mbind+0x142/0x620
[ 7613.235929] system_call_fastpath+0x16/0x1b
[ 7613.236640] INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
[ 7613.237354] __slab_free+0x2e/0x1de
[ 7613.238080] kmem_cache_free+0x25a/0x260
[ 7613.238799] __mpol_put+0x27/0x30
[ 7613.239515] remove_vma+0x68/0x90
[ 7613.240223] exit_mmap+0x118/0x140
[ 7613.240939] mmput+0x73/0x110
[ 7613.241651] exit_mm+0x108/0x130
[ 7613.242367] do_exit+0x162/0xb90
[ 7613.243074] do_group_exit+0x4f/0xc0
[ 7613.243790] sys_exit_group+0x17/0x20
[ 7613.244507] system_call_fastpath+0x16/0x1b
[ 7613.245212] INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080
[ 7613.246000] INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0
The problem was created by a reference count imbalance. For example, in the following
case, mbind(addr, len) tries to replace the mempolicies of both vma1 and vma2, so they
come to share the same mempolicy, _and_ the new mempolicy has the MPOL_F_SHARED flag.
+-------------------+-------------------+
| vma1 | vma2(shmem) |
+-------------------+-------------------+
| |
addr addr+len
Look at alloc_pages_vma(): it uses the get_vma_policy() and mpol_cond_put() pair
for maintaining the mempolicy refcount. The current rule is: get_vma_policy() does
NOT increment the refcount if the policy is not attached to a shmem vma, while
mpol_cond_put() DOES decrement the refcount if the mpol has MPOL_F_SHARED.
See? In the above case, vma1 is not a shmem vma, but vma->vm_policy has MPOL_F_SHARED!
So the refcount is decreased even though it was never increased, every time
alloc_page_vma() is called. And a mere mbind(MPOL_MF_MOVE) calls alloc_page_vma().
Oh, oh my god... who could imagine alloc_pages_vma() was broken!? It is part of the
central Linux memory management code! This bug was introduced by commit 52cd3b0740
(mempolicy: rework mempolicy Reference Counting) in 2008. I.e. it has been there for
4 years!
Even more unfortunately, mempolicy has another serious breakage. Currently, the
mempolicy rebind logic (called from cpuset rebinding) ignores the refcount of the
mempolicy and overrides it forcibly. Thus, any mempolicy sharing can cause mempolicy
corruption. That bug was introduced by commit 68860ec10b
(cpusets: automatic numa mempolicy rebinding) in 2005!
Maybe we need to rewrite MPOL_F_SHARED and the mempolicy rebinding code entirely.
But as a first step, disabling any policy sharing hides the issue from userland.
Reported-by: Dave Jones <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/mempolicy.c | 49 ++++++++++++++++++++++++++++++++++++-------------
1 files changed, 36 insertions(+), 13 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0a60def..9505cb9 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -607,24 +607,38 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
return first;
}
-/* Apply policy to a single VMA */
-static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
+/*
+ * Apply policy to a single VMA
+ * This must be called with the mmap_sem held for writing.
+ */
+static int policy_vma(struct vm_area_struct *vma, struct mempolicy *pol)
{
- int err = 0;
- struct mempolicy *old = vma->vm_policy;
+ int err;
+ struct mempolicy *old;
+ struct mempolicy *new;
pr_debug("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
vma->vm_start, vma->vm_end, vma->vm_pgoff,
vma->vm_ops, vma->vm_file,
vma->vm_ops ? vma->vm_ops->set_policy : NULL);
- if (vma->vm_ops && vma->vm_ops->set_policy)
+ new = mpol_dup(pol);
+ if (IS_ERR(new))
+ return PTR_ERR(new);
+
+ if (vma->vm_ops && vma->vm_ops->set_policy) {
err = vma->vm_ops->set_policy(vma, new);
- if (!err) {
- mpol_get(new);
- vma->vm_policy = new;
- mpol_put(old);
+ if (err)
+ goto err_out;
}
+
+ old = vma->vm_policy;
+ vma->vm_policy = new; /* protected by mmap_sem */
+ mpol_put(old);
+
+ return 0;
+ err_out:
+ mpol_put(new);
return err;
}
@@ -2147,15 +2161,24 @@ static void sp_delete(struct shared_policy *sp, struct sp_node *n)
static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
struct mempolicy *pol)
{
- struct sp_node *n = kmem_cache_alloc(sn_cache, GFP_KERNEL);
+ struct sp_node *n;
+ struct mempolicy *newpol;
+ n = kmem_cache_alloc(sn_cache, GFP_KERNEL);
if (!n)
return NULL;
+
+ newpol = mpol_dup(pol);
+ if (IS_ERR(newpol)) {
+ kmem_cache_free(sn_cache, n);
+ return NULL;
+ }
+ newpol->flags |= MPOL_F_SHARED;
+
n->start = start;
n->end = end;
- mpol_get(pol);
- pol->flags |= MPOL_F_SHARED; /* for unref */
- n->policy = pol;
+ n->policy = newpol;
+
return n;
}
--
1.7.1
From: KOSAKI Motohiro <[email protected]>
When shared_policy_replace() fails, new->policy is not freed correctly. The real
problem is that the shared mempolicy code calls kmem_cache_free() directly in
multiple places; that makes mistakes easy.
This patch creates a proper wrapper function and uses it. The bug was introduced
in the pre-git age (IOW, before 2.6.12-rc2).
Cc: Dave Jones <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/mempolicy.c | 15 +++++++++------
1 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d97d2db..7fb7d51 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2150,12 +2150,17 @@ mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
return pol;
}
+static void sp_free(struct sp_node *n)
+{
+ mpol_put(n->policy);
+ kmem_cache_free(sn_cache, n);
+}
+
static void sp_delete(struct shared_policy *sp, struct sp_node *n)
{
pr_debug("deleting %lx-l%lx\n", n->start, n->end);
rb_erase(&n->nd, &sp->root);
- mpol_put(n->policy);
- kmem_cache_free(sn_cache, n);
+ sp_free(n);
}
static void sp_node_init(struct sp_node *node, unsigned long start,
@@ -2320,7 +2325,7 @@ int mpol_set_shared_policy(struct shared_policy *info,
}
err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new);
if (err && new)
- kmem_cache_free(sn_cache, new);
+ sp_free(new);
return err;
}
@@ -2337,9 +2342,7 @@ void mpol_free_shared_policy(struct shared_policy *p)
while (next) {
n = rb_entry(next, struct sp_node, nd);
next = rb_next(&n->nd);
- rb_erase(&n->nd, &p->root);
- mpol_put(n->policy);
- kmem_cache_free(sn_cache, n);
+ sp_delete(p, n);
}
spin_unlock(&p->lock);
}
--
1.7.1
From: KOSAKI Motohiro <[email protected]>
commit cc9a6c8776 (cpuset: mm: reduce large amounts of memory barrier related
damage v3) introduced a memory corruption.
shmem_alloc_page() passes a pseudo vma, and it has one significant unique
combination: vma->vm_ops == NULL and (vma->vm_policy->flags & MPOL_F_SHARED).
Now, get_vma_policy() does NOT increase the policy refcount when vma->vm_ops == NULL,
but mpol_cond_put() DOES decrease the refcount when the policy has MPOL_F_SHARED.
Therefore, when the cpuset race happens and alloc_pages_vma() falls into the
'goto retry_cpuset' path, the policy refcount is decreased too many times,
causing memory corruption.
This patch fixes it.
Cc: Dave Jones <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: <[email protected]>
Cc: Miao Xie <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/mempolicy.c | 13 ++++++++++++-
mm/shmem.c | 9 +++++----
2 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7fb7d51..0da0969 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1544,18 +1544,29 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = task->mempolicy;
+ int got_ref = 0;
if (vma) {
if (vma->vm_ops && vma->vm_ops->get_policy) {
struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
addr);
- if (vpol)
+ if (vpol) {
pol = vpol;
+ got_ref = 1;
+ }
} else if (vma->vm_policy)
pol = vma->vm_policy;
}
if (!pol)
pol = &default_policy;
+
+ /*
+ * shmem_alloc_page() passes MPOL_F_SHARED policy with vma->vm_ops=NULL.
+ * Thus, we need to take additional ref for avoiding refcount imbalance.
+ */
+ if (!got_ref && mpol_needs_cond_ref(pol))
+ mpol_get(pol);
+
return pol;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index d576b84..eb5f1eb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -919,6 +919,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
struct vm_area_struct pvma;
+ struct page *page;
/* Create a pseudo vma that just contains the policy */
pvma.vm_start = 0;
@@ -926,10 +927,10 @@ static struct page *shmem_alloc_page(gfp_t gfp,
pvma.vm_ops = NULL;
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
- /*
- * alloc_page_vma() will drop the shared policy reference
- */
- return alloc_page_vma(gfp, &pvma, 0);
+ page = alloc_page_vma(gfp, &pvma, 0);
+
+ mpol_put(pvma.vm_policy);
+ return page;
}
#else /* !CONFIG_NUMA */
#ifdef CONFIG_TMPFS
--
1.7.1
From: KOSAKI Motohiro <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
MAINTAINERS | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index a246490..6f4a8e2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4467,6 +4467,13 @@ F: drivers/mtd/
F: include/linux/mtd/
F: include/mtd/
+MEMPOLICY
+M: KOSAKI Motohiro <[email protected]>
+L: [email protected]
+S: Maintained
+F: include/linux/mempolicy.h
+F: mm/mempolicy.c
+
MICROBLAZE ARCHITECTURE
M: Michal Simek <[email protected]>
L: [email protected] (moderated for non-subscribers)
--
1.7.1
From: KOSAKI Motohiro <[email protected]>
shared_policy_replace() uses sp_alloc() wrongly: 1) an sp_node can't be dereferenced
when not holding sp->lock, and 2) another thread can modify the sp_node between the
spin_unlock for restarting and the next spin_lock, so we shouldn't assume n->end and
n->policy were left unmodified.
This patch fixes both.
Note: the bug was introduced in the pre-git age (IOW, before 2.6.12-rc2). I believe
nobody uses this feature on production systems.
Cc: Dave Jones <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/mempolicy.c | 55 ++++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 38 insertions(+), 17 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9505cb9..d97d2db 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2158,6 +2158,14 @@ static void sp_delete(struct shared_policy *sp, struct sp_node *n)
kmem_cache_free(sn_cache, n);
}
+static void sp_node_init(struct sp_node *node, unsigned long start,
+ unsigned long end, struct mempolicy *pol)
+{
+ node->start = start;
+ node->end = end;
+ node->policy = pol;
+}
+
static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
struct mempolicy *pol)
{
@@ -2174,10 +2182,7 @@ static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
return NULL;
}
newpol->flags |= MPOL_F_SHARED;
-
- n->start = start;
- n->end = end;
- n->policy = newpol;
+ sp_node_init(n, start, end, newpol);
return n;
}
@@ -2186,7 +2191,9 @@ static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
unsigned long end, struct sp_node *new)
{
+ int err;
struct sp_node *n, *new2 = NULL;
+ struct mempolicy *new2_pol = NULL;
restart:
spin_lock(&sp->lock);
@@ -2202,16 +2209,16 @@ restart:
} else {
/* Old policy spanning whole new range. */
if (n->end > end) {
- if (!new2) {
- spin_unlock(&sp->lock);
- new2 = sp_alloc(end, n->end, n->policy);
- if (!new2)
- return -ENOMEM;
- goto restart;
- }
- n->end = start;
+ if (!new2)
+ goto alloc_new2;
+
+ *new2_pol = *n->policy;
+ atomic_set(&new2_pol->refcnt, 1);
+ sp_node_init(new2, end, n->end, new2_pol);
sp_insert(sp, new2);
+ n->end = start;
new2 = NULL;
+ new2_pol = NULL;
break;
} else
n->end = start;
@@ -2223,11 +2230,25 @@ restart:
if (new)
sp_insert(sp, new);
spin_unlock(&sp->lock);
- if (new2) {
- mpol_put(new2->policy);
- kmem_cache_free(sn_cache, new2);
- }
- return 0;
+ err = 0;
+
+ err_out:
+ if (new2_pol)
+ mpol_put(new2_pol);
+ if (new2)
+ kmem_cache_free(sn_cache, new2);
+ return err;
+
+ alloc_new2:
+ spin_unlock(&sp->lock);
+ err = -ENOMEM;
+ new2 = kmem_cache_alloc(sn_cache, GFP_KERNEL);
+ if (!new2)
+ goto err_out;
+ new2_pol = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+ if (!new2_pol)
+ goto err_out;
+ goto restart;
}
/**
--
1.7.1
On Wed, May 30, 2012 at 2:02 AM, <[email protected]> wrote:
>
> So, I think we should reconsider about shared mempolicy completely.
Quite frankly, I'd prefer that approach. The code is subtle and
horribly bug-fraught, and I absolutely detest the way it looks too.
Reading your patches was actually somewhat painful.
If we could just remove the support for it entirely, that would be
*much* preferable to continue working with this code.
Could we just try that removal, and see if anybody screams?
Linus
On Wed, 30 May 2012, Linus Torvalds wrote:
> On Wed, May 30, 2012 at 2:02 AM, <[email protected]> wrote:
> >
> > So, I think we should reconsider about shared mempolicy completely.
>
> Quite frankly, I'd prefer that approach. The code is subtle and
> horribly bug-fraught, and I absolutely detest the way it looks too.
> Reading your patches was actually somewhat painful.
It is so bad mostly because the integration of shared memory policies with
cpusets is not really working. Using either in isolation is ok; it is the
combination that breaks, since shared mempolicies do not play well with cpusets.
> If we could just remove the support for it entirely, that would be
> *much* preferable to continue working with this code.
Well shm support needs memory policies to spread data across nodes etc.
AFAICT support was put in due to requirements to support large database
vendors (oracle). Andi?
It's not going to be easy to remove.
On Wed, May 30, 2012 at 01:34:21PM -0500, Christoph Lameter wrote:
> On Wed, 30 May 2012, Linus Torvalds wrote:
>
> > On Wed, May 30, 2012 at 2:02 AM, <[email protected]> wrote:
> > >
> > > So, I think we should reconsider about shared mempolicy completely.
> >
> > Quite frankly, I'd prefer that approach. The code is subtle and
> > horribly bug-fraught, and I absolutely detest the way it looks too.
> > Reading your patches was actually somewhat painful.
>
> It is so bad mostly because the integration of shared memory policies with
> cpusets is not really working. Using either in isolation is ok especially
> shared mempolicies do not play well with cpusets.
Yes the cpusets did some horrible things.
I always regretted that cpusets were not done with custom node lists.
That would have been much cleaner, and also likely faster, than what we have.
> > If we could just remove the support for it entirely, that would be
> > *much* preferable to continue working with this code.
>
> Well shm support needs memory policies to spread data across nodes etc.
> AFAICT support was put in due to requirements to support large database
> vendors (oracle). Andi?
Yes, we need shared policy for the big databases.
Maybe we could stop supporting cpusets in combination with it, though. Not sure
they really use that.
> Its not going to be easy to remove.
Shared policies? I don't think you can remove them.
cpusets+shared policy? maybe, but still will be hard.
-Andi
>
--
[email protected] -- Speaking for myself only.
On Wed, 30 May 2012, Andi Kleen wrote:
> I always regretted that cpusets were no done with custom node lists.
> That would have been much cleaner and also likely faster than what we have.
Could shared memory policies ignore cpuset constraints?
On Wed, May 30, 2012 at 11:34 AM, Christoph Lameter <[email protected]> wrote:
>
> Well shm support needs memory policies to spread data across nodes etc.
> AFAICT support was put in due to requirements to support large database
> vendors (oracle). Andi?
>
> Its not going to be easy to remove.
Ok. So can the people involved with this please review this patch-set
and comment on this one? I think the commit commentary needs a bit of language
editing, and I'd like to see a *lot* of acks from the relevant vm people.
Please?
Linus
On Wed, 30 May 2012, [email protected] wrote:
> From: KOSAKI Motohiro <[email protected]>
>
> commit 05f144a0d5 removed vma->vm_policy updates code and it is a purpose of
> mbind_range(). Now, mbind_range() is virtually no-op. no-op function don't
> makes any bugs, I agree. but maybe it is not right fix.
I don't really understand the changelog. But restoring policy_vma() is
the right thing to do, since there are potentially multiple use cases where
we want to apply a policy to a vma.
Proposed new changelog:
Commit 05f144a0d5 folded policy_vma() into mbind_range(). There are
other use cases of policy_vma(*) though and so revert a piece of
that commit in order to have a policy_vma() function again.
> @@ -655,23 +676,9 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
> if (err)
> goto out;
> }
> -
> - /*
> - * Apply policy to a single VMA. The reference counting of
> - * policy for vma_policy linkages has already been handled by
> - * vma_merge and split_vma as necessary. If this is a shared
> - * policy then ->set_policy will increment the reference count
> - * for an sp node.
> - */
You are dropping the nice comments by Mel that explain the refcounting.
Glad that someone is willing to take this on. Shared memory policies are
an ugly thing.
Adding Andi. He needs to be cc'ed on future postings of this patchset.
Acked-by: Christoph Lameter <[email protected]>
On Wed, May 30, 2012 at 01:50:02PM -0500, Christoph Lameter wrote:
> On Wed, 30 May 2012, Andi Kleen wrote:
>
> > I always regretted that cpusets were no done with custom node lists.
> > That would have been much cleaner and also likely faster than what we have.
>
> Could shared memory policies ignore cpuset constraints?
Only if no one uses cpusets as a "security" mechanism, just as a "soft policy".
Even with soft policy you could well break someone's setup.
Maybe there are better ways to do that now with memcg; not fully sure.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, May 30, 2012 at 3:17 PM, Christoph Lameter <[email protected]> wrote:
> On Wed, 30 May 2012, [email protected] wrote:
>
>> From: KOSAKI Motohiro <[email protected]>
>>
>> commit 05f144a0d5 removed vma->vm_policy updates code and it is a purpose of
>> mbind_range(). Now, mbind_range() is virtually no-op. no-op function don't
>> makes any bugs, I agree. but maybe it is not right fix.
>
> I dont really understand the changelog. But to restore the policy_vma() is
> the right thing to do since there are potential multiple use cases where
> we want to apply a policy to a vma.
>
> Proposed new changelog:
>
> Commit 05f144a0d5 folded policy_vma() into mbind_range(). There are
> other use cases of policy_vma(*) though and so revert a piece of
> that commit in order to have a policy_vma() function again.
>
>> @@ -655,23 +676,9 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
>>			if (err)
>>				goto out;
>>		}
>> -
>> -		/*
>> -		 * Apply policy to a single VMA. The reference counting of
>> -		 * policy for vma_policy linkages has already been handled by
>> -		 * vma_merge and split_vma as necessary. If this is a shared
>> -		 * policy then ->set_policy will increment the reference count
>> -		 * for an sp node.
>> -		 */
>
> You are dropping the nice comments by Mel that explain the refcounting.
Because that comment is not strictly correct. 1) vma_merge() and split_vma() don't
care about the mempolicy refcount; they only dup and drop it. 2) This mpol_get() is
for vma attaching. This function doesn't need to care about sp_node internals.
On Wed, 30 May 2012, [email protected] wrote:
> refcount will be decreased even though was not increased whenever alloc_page_vma()
> is called. As you know, mere mbind(MPOL_MF_MOVE) calls alloc_page_vma().
Most of these issues are about memory migration and shared memory. If we
exempt shared memory from memory migration (after all, that shared memory
has its own distinct memory policies already!) then a lot of these issues
won't arise.
On Wed, 30 May 2012, Andi Kleen wrote:
> On Wed, May 30, 2012 at 01:50:02PM -0500, Christoph Lameter wrote:
> > On Wed, 30 May 2012, Andi Kleen wrote:
> >
> > > I always regretted that cpusets were no done with custom node lists.
> > > That would have been much cleaner and also likely faster than what we have.
> >
> > Could shared memory policies ignore cpuset constraints?
>
> Only if noone uses cpusets as a "security" mechanism, just for a "soft policy"
> Even with soft policy you could well break someone's setup.
Well, at least let's exempt shared memory from memory migration and memory
policy updates. That seems to be causing many of these issues.
On Wed, May 30, 2012 at 3:41 PM, Christoph Lameter <[email protected]> wrote:
> On Wed, 30 May 2012, [email protected] wrote:
>
>> refcount will be decreased even though was not increased whenever alloc_page_vma()
>> is called. As you know, mere mbind(MPOL_MF_MOVE) calls alloc_page_vma().
>
> Most of these issues are about memory migration and shared memory. If we
> exempt shared memory from memory migration (after all that shared memory
> has its own distinct memory policies already!) then a lot of these issues
> wont arise.
The final point is to implement proper COW for struct mempolicy. But until the
cpuset binding issue is fixed, we can't use mempolicy sharing anyway.
Moreover, the core pieces of the mempolicy life cycle, namely split_vma()
and dup_mmap(), no longer use mempolicy sharing. Only mbind() does.
Thus, this patch doesn't increase memory usage for normal workloads.
On Wed, May 30, 2012 at 02:41:22PM -0500, Christoph Lameter wrote:
> On Wed, 30 May 2012, [email protected] wrote:
>
> > refcount will be decreased even though was not increased whenever alloc_page_vma()
> > is called. As you know, mere mbind(MPOL_MF_MOVE) calls alloc_page_vma().
>
> Most of these issues are about memory migration and shared memory. If we
> exempt shared memory from memory migration (after all that shared memory
> has its own distinct memory policies already!) then a lot of these issues
> wont arise.
Soft memory offlining needs migration. It's fairly important that this
works: on database systems most memory is in shared memory, and they have
a lot of memory, so predictive failure analysis and soft offlining
help a lot.
Classic migration is probably not too important here, but they pretty
much rely on the same low-level mechanism.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, May 30, 2012 at 02:42:42PM -0500, Christoph Lameter wrote:
> On Wed, 30 May 2012, Andi Kleen wrote:
>
> > On Wed, May 30, 2012 at 01:50:02PM -0500, Christoph Lameter wrote:
> > > On Wed, 30 May 2012, Andi Kleen wrote:
> > >
> > > > I always regretted that cpusets were no done with custom node lists.
> > > > That would have been much cleaner and also likely faster than what we have.
> > >
> > > Could shared memory policies ignore cpuset constraints?
> >
> > Only if noone uses cpusets as a "security" mechanism, just for a "soft policy"
> > Even with soft policy you could well break someone's setup.
>
> Well at least lets exempt shared memory from memory migration and memory
> policy updates. That seems to be causing many of these issues.
Migration on the page level is needed for the memory error handling.
Updates: you mean not allowing the policy to be set when there are already
multiple mappers? I could see that causing some unexpected behaviour. Presumably
a standard database will only set it at the beginning, but I don't know
if that would work for all users.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, May 30, 2012 at 3:42 PM, Christoph Lameter <[email protected]> wrote:
> On Wed, 30 May 2012, Andi Kleen wrote:
>
>> On Wed, May 30, 2012 at 01:50:02PM -0500, Christoph Lameter wrote:
>> > On Wed, 30 May 2012, Andi Kleen wrote:
>> >
>> > > I always regretted that cpusets were no done with custom node lists.
>> > > That would have been much cleaner and also likely faster than what we have.
>> >
>> > Could shared memory policies ignore cpuset constraints?
>>
>> Only if noone uses cpusets as a "security" mechanism, just for a "soft policy"
>> Even with soft policy you could well break someone's setup.
>
> Well at least lets exempt shared memory from memory migration and memory
> policy updates. That seems to be causing many of these issues.
Yes, that's the right direction, I think. Currently, shmem_set_policy() can't handle
nonlinear mappings. The vma -> file offset translation is not easy work, and I doubt
we should do it.
On Wed, May 30, 2012 at 3:52 PM, Andi Kleen <[email protected]> wrote:
> On Wed, May 30, 2012 at 02:42:42PM -0500, Christoph Lameter wrote:
>> On Wed, 30 May 2012, Andi Kleen wrote:
>>
>> > On Wed, May 30, 2012 at 01:50:02PM -0500, Christoph Lameter wrote:
>> > > On Wed, 30 May 2012, Andi Kleen wrote:
>> > >
>> > > > I always regretted that cpusets were no done with custom node lists.
>> > > > That would have been much cleaner and also likely faster than what we have.
>> > >
>> > > Could shared memory policies ignore cpuset constraints?
>> >
>> > Only if noone uses cpusets as a "security" mechanism, just for a "soft policy"
>> > Even with soft policy you could well break someone's setup.
>>
>> Well at least lets exempt shared memory from memory migration and memory
>> policy updates. That seems to be causing many of these issues.
>
> Migration on the page level is needed for the memory error handling.
>
> Updates: you mean not allowing to set the policy when there are already
> multiple mappers? I could see that causing some unexpected behaviour. Presumably
> a standard database will only set it at the beginning, but I don't know
> if that would work for all users.
We don't need to kill the migration core. We only need to stop mbind(2) from
updating vma->policy of shmem.
Page migration for hwpoison is harmless, because an attacker can't inject
hwpoison intentionally in a production environment (HWPOISON_INJECTION=N).
On Wed, 30 May 2012, KOSAKI Motohiro wrote:
> > You are dropping the nice comments by Mel that explain the refcounting.
>
> Because this is not strictly correct. 1) vma_merge() and split_vma() don't
> care about the mempolicy refcount. They only dup and drop it. 2) This mpol_get()
> is for vma attachment. This function doesn't need to care about sp_node internals.
Ok then say so in the changelog.
> Yes, that's right direction, I think. Currently, shmem_set_policy() can't handle
> nonlinear mapping.
I've been mulling for some time to just remove non linear mappings.
AFAIK they were only useful on 32bit and are obsolete and could be
emulated with VMAs instead.
-Andi
On Wed, May 30, 2012 at 2:26 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, May 30, 2012 at 2:02 AM, KOSAKI Motohiro <[email protected]> wrote:
>>
>> So, I think we should reconsider about shared mempolicy completely.
>
> Quite frankly, I'd prefer that approach. The code is subtle and
> horribly bug-fraught, and I absolutely detest the way it looks too.
> Reading your patches was actually somewhat painful.
Oh, very sorry. I made an effort to make the smallest and simplest patches,
but I couldn't do better. The current MPOL_F_SHARED is cra^H^H^H
complex. ;-)
> If we could just remove the support for it entirely, that would be
> *much* preferable to continue working with this code.
>
> Could we just try that removal, and see if anybody screams?
I'm staying neutral for a while and hope to hear other developers' opinions.
Thank you.
On Wed, May 30, 2012 at 4:10 PM, Andi Kleen <[email protected]> wrote:
>> Yes, that's right direction, I think. Currently, shmem_set_policy() can't handle
>> nonlinear mapping.
>
> I've been mulling for some time to just remove non linear mappings.
> AFAIK they were only useful on 32bit and are obsolete and could be
> emulated with VMAs instead.
I agree. It is only useful on 32bit, and current enterprise users don't use
32bit anymore. So, I don't think emulating it with vmas causes a user-visible issue.
[email protected] writes:
> From: KOSAKI Motohiro <[email protected]>
>
> Dave Jones' system call fuzz testing tool "trinity" triggered the following
> error with slab debugging enabled
We have to fix it properly, sorry. There are users who benefit from it,
and just disabling it is not gonna fly.
-Andi
--
[email protected] -- Speaking for myself only
[email protected] writes:
> From: KOSAKI Motohiro <[email protected]>
>
> commit cc9a6c8776 (cpuset: mm: reduce large amounts of memory barrier related
> damage v3) introduced a memory corruption.
>
> shmem_alloc_page() passes pseudo vma and it has one significant unique
> combination, vma->vm_ops=NULL and (vma->policy->flags & MPOL_F_SHARED).
>
> Now, get_vma_policy() does NOT take a policy reference when vma->vm_ops is NULL,
> but mpol_cond_put() DOES drop a reference when the policy has MPOL_F_SHARED.
> Therefore, when a cpuset race happens and alloc_pages_vma() falls into the
> 'goto retry_cpuset' path, the policy refcount is decremented too many times,
> causing memory corruption.
>
> This patch fixes it.
Looks good.
Acked-by: Andi Kleen <[email protected]>
-Andi
--
[email protected] -- Speaking for myself only
(5/30/12 4:31 PM), Andi Kleen wrote:
> [email protected] writes:
>
>> From: KOSAKI Motohiro<[email protected]>
>>
>> Dave Jones' system call fuzz testing tool "trinity" triggered the following
>> bug error with slab debugging enabled
>
> We have to fix it properly sorry. There are users who benefit from it
> and just disabling it is not gonna fly.
Why? This patch doesn't make any user-visible behavior change. I haven't
understood what you mean by "proper".
On Wed, May 30, 2012 at 04:00:55PM -0400, KOSAKI Motohiro wrote:
> On Wed, May 30, 2012 at 3:52 PM, Andi Kleen <[email protected]> wrote:
> > On Wed, May 30, 2012 at 02:42:42PM -0500, Christoph Lameter wrote:
> >> On Wed, 30 May 2012, Andi Kleen wrote:
> >>
> >> > On Wed, May 30, 2012 at 01:50:02PM -0500, Christoph Lameter wrote:
> >> > > On Wed, 30 May 2012, Andi Kleen wrote:
> >> > >
> >> > > > I always regretted that cpusets were no done with custom node lists.
> >> > > > That would have been much cleaner and also likely faster than what we have.
> >> > >
> >> > > Could shared memory policies ignore cpuset constraints?
> >> >
> >> > Only if noone uses cpusets as a "security" mechanism, just for a "soft policy"
> >> > Even with soft policy you could well break someone's setup.
> >>
> >> Well at least lets exempt shared memory from memory migration and memory
> >> policy updates. That seems to be causing many of these issues.
> >
> > Migration on the page level is needed for the memory error handling.
> >
> > Updates: you mean not allowing to set the policy when there are already
> > multiple mappers? I could see that causing some unexpected behaviour. Presumably
> > a standard database will only set it at the beginning, but I don't know
> > if that would work for all users.
>
> We don't need to kill migration core. We only need to kill that mbind(2) updates
> vma->policy of shmem.
[...]
So should I (and Greg) drop 'mm: mempolicy: Let vma_merge and
vma_split handle vma->vm_policy linkages' from the pending stable
releases? Or is that OK as an interim fix until these changes go
into mainline?
Ben.
--
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
- Albert Camus
(5/30/12 5:22 PM), Ben Hutchings wrote:
> On Wed, May 30, 2012 at 04:00:55PM -0400, KOSAKI Motohiro wrote:
>> On Wed, May 30, 2012 at 3:52 PM, Andi Kleen<[email protected]> wrote:
>>> On Wed, May 30, 2012 at 02:42:42PM -0500, Christoph Lameter wrote:
>>>> On Wed, 30 May 2012, Andi Kleen wrote:
>>>>
>>>>> On Wed, May 30, 2012 at 01:50:02PM -0500, Christoph Lameter wrote:
>>>>>> On Wed, 30 May 2012, Andi Kleen wrote:
>>>>>>
>>>>>>> I always regretted that cpusets were no done with custom node lists.
>>>>>>> That would have been much cleaner and also likely faster than what we have.
>>>>>>
>>>>>> Could shared memory policies ignore cpuset constraints?
>>>>>
>>>>> Only if noone uses cpusets as a "security" mechanism, just for a "soft policy"
>>>>> Even with soft policy you could well break someone's setup.
>>>>
>>>> Well at least lets exempt shared memory from memory migration and memory
>>>> policy updates. That seems to be causing many of these issues.
>>>
>>> Migration on the page level is needed for the memory error handling.
>>>
>>> Updates: you mean not allowing to set the policy when there are already
>>> multiple mappers? I could see that causing some unexpected behaviour. Presumably
>>> a standard database will only set it at the beginning, but I don't know
>>> if that would work for all users.
>>
>> We don't need to kill migration core. We only need to kill that mbind(2) updates
>> vma->policy of shmem.
> [...]
>
> So should I (and Greg) drop 'mm: mempolicy: Let vma_merge and
> vma_split handle vma->vm_policy linkages' from the pending stable
> releases? Or is that OK as an interim fix until these changes go
> into mainline?
Please drop it. It screws up mbind(2).
(5/30/12 3:17 PM), Christoph Lameter wrote:
> On Wed, 30 May 2012, [email protected] wrote:
>
>> From: KOSAKI Motohiro<[email protected]>
>>
>> commit 05f144a0d5 removed the vma->vm_policy update code, which is the purpose
>> of mbind_range(). Now, mbind_range() is virtually a no-op. A no-op function
>> doesn't make any bugs, I agree, but maybe it is not the right fix.
>
> I don't really understand the changelog. But restoring policy_vma() is
> the right thing to do, since there are potential multiple use cases where
> we want to apply a policy to a vma.
>
> Proposed new changelog:
>
> Commit 05f144a0d5 folded policy_vma() into mbind_range(). There are
> other use cases of policy_vma(*) though and so revert a piece of
> that commit in order to have a policy_vma() function again.
Sorry, I overlooked this. Commit 05f144a0d5 works for neither regular vmas
nor shmem vmas, so I can't take this proposal. Sorry.
On Wed, 30 May 2012, KOSAKI Motohiro wrote:
> On Wed, May 30, 2012 at 4:10 PM, Andi Kleen <[email protected]> wrote:
>>> Yes, that's right direction, I think. Currently, shmem_set_policy() can't handle
>>> nonlinear mapping.
>>
>> I've been mulling for some time to just remove non linear mappings.
>> AFAIK they were only useful on 32bit and are obsolete and could be
>> emulated with VMAs instead.
>
> I agree. It is only userful on 32bit and current enterprise users don't use
> 32bit anymore. So, I don't think emulated by vmas cause user visible issue.
I wish this was true; there are a lot of systems out there still running
32 bit Linux, even on 64 bit capable hardware. This is especially true in
enterprises where they have either homegrown or proprietary software that
isn't 64 bit clean.
David Lang
On Thu, 31 May 2012, [email protected] wrote:
> On Wed, 30 May 2012, KOSAKI Motohiro wrote:
>
> > On Wed, May 30, 2012 at 4:10 PM, Andi Kleen <[email protected]> wrote:
> > > > Yes, that's right direction, I think. Currently, shmem_set_policy()
> > > > can't handle
> > > > nonlinear mapping.
> > >
> > > I've been mulling for some time to just remove non linear mappings.
> > > AFAIK they were only useful on 32bit and are obsolete and could be
> > > emulated with VMAs instead.
> >
> > I agree. It is only userful on 32bit and current enterprise users don't use
> > 32bit anymore. So, I don't think emulated by vmas cause user visible issue.
>
> I wish this was true, there are a lot of systems out there still running 32
> bit linux, even on 64 bit capible hardware. This is especially true in
> enterprises where they have either homegrown or proprietary software that
> isn't 64 bit clean.
32 bit binaries (and entire distros) run fine under 64 bit kernels.
On Fri, 1 Jun 2012, Christoph Lameter wrote:
> Subject: Re: [PATCH 0/6] mempolicy memory corruption fixlet
>
> On Thu, 31 May 2012, [email protected] wrote:
>
>> On Wed, 30 May 2012, KOSAKI Motohiro wrote:
>>
>>> On Wed, May 30, 2012 at 4:10 PM, Andi Kleen <[email protected]> wrote:
>>>>> Yes, that's right direction, I think. Currently, shmem_set_policy()
>>>>> can't handle
>>>>> nonlinear mapping.
>>>>
>>>> I've been mulling for some time to just remove non linear mappings.
>>>> AFAIK they were only useful on 32bit and are obsolete and could be
>>>> emulated with VMAs instead.
>>>
>>> I agree. It is only userful on 32bit and current enterprise users don't use
>>> 32bit anymore. So, I don't think emulated by vmas cause user visible issue.
>>
>> I wish this was true, there are a lot of systems out there still running 32
>> bit linux, even on 64 bit capible hardware. This is especially true in
>> enterprises where they have either homegrown or proprietary software that
>> isn't 64 bit clean.
>
> 32 bit binaries (and entire distros) run fine under 64 bit kernels.
Unfortunately, not quite 100% of the time. It's very good, but the
automount bug a month or so ago is an example of how you can run into rare
problems. Many "enterprise" systems are not willing to risk it.
David Lang
>>>>>> Yes, that's right direction, I think. Currently, shmem_set_policy()
>>>>>> can't handle
>>>>>> nonlinear mapping.
>>>>>
>>>>> I've been mulling for some time to just remove non linear mappings.
>>>>> AFAIK they were only useful on 32bit and are obsolete and could be
>>>>> emulated with VMAs instead.
>>>>
>>>>
>>>> I agree. It is only userful on 32bit and current enterprise users don't
>>>> use
>>>> 32bit anymore. So, I don't think emulated by vmas cause user visible
>>>> issue.
>>>
>>>
>>> I wish this was true, there are a lot of systems out there still running
>>> 32
>>> bit linux, even on 64 bit capible hardware. This is especially true in
>>> enterprises where they have either homegrown or proprietary software that
>>> isn't 64 bit clean.
>>
>> 32 bit binaries (and entire distros) run fine under 64 bit kernels.
>
> unfortunantly, not quite 100% of the time. It's very good, but the automount
> bug a month or so ago is an example of how you can run into rare problems.
> Many "enterprise" systems are not willing to risk it.
Then we can remove the feature safely. Risk-averse users will continue to use
old distro kernels, and in a few years distros will discontinue 32bit support.
At that point the problem will vanish automatically.
I'm coming back to this email thread, because I didn't apply the
series due to all the ongoing discussion and hoping that somebody
would put changelog fixes and ack notices etc together.
I'd also really like to know that the people who saw the problem that
caused the current single patch (that this series reverts) would test
the whole series. Maybe that happened and I didn't notice it in the
threads, but I don't think so.
In fact, right now I'm assuming that the series will eventually come
to me through Andrew. Andrew, correct?
Linus
On Tue, 5 Jun 2012 12:02:25 -0700
Linus Torvalds <[email protected]> wrote:
> I'm coming back to this email thread, because I didn't apply the
> series due to all the ongoing discussion and hoping that somebody
> would put changelog fixes and ack notices etc together.
>
> I'd also really like to know that the people who saw the problem that
> caused the current single patch (that this series reverts) would test
> the whole series. Maybe that happened and I didn't notice it in the
> threads, but I don't think so.
>
> In fact, right now I'm assuming that the series will eventually come
> to me through Andrew. Andrew, correct?
>
yup.
I expect there will be a v2 series (at least). It's unclear what
we'll be doing with [2/6]: whether the patch will be reworked, or
whether Andi misunderstood its effects?
(6/5/12 3:17 PM), Andrew Morton wrote:
> On Tue, 5 Jun 2012 12:02:25 -0700
> Linus Torvalds<[email protected]> wrote:
>
>> I'm coming back to this email thread, because I didn't apply the
>> series due to all the ongoing discussion and hoping that somebody
>> would put changelog fixes and ack notices etc together.
>>
>> I'd also really like to know that the people who saw the problem that
>> caused the current single patch (that this series reverts) would test
>> the whole series. Maybe that happened and I didn't notice it in the
>> threads, but I don't think so.
I'm not surprised by this. If many people were interested in reviewing this
area, mempolicy wouldn't have broken so badly.
>> In fact, right now I'm assuming that the series will eventually come
>> to me through Andrew. Andrew, correct?
>
> yup.
>
> I expect there will be a v2 series (at least). It's unclear what
> we'll be doing with [2/6]: whether the patch will be reworked, or
> whether Andi misunderstood its effects?
Maybe because Andi hasn't joined the bug fixing work in this area for several
years? Currently, mbind(2) is completely broken. A primary role of mbind(2) is
to update the memory policy of some vmas, and Mel's fix removed that. mbind is
now almost a no-op; it's a regression.
I'm not sure which point seems unclear to you, so let me repeat the description
of [2/6].
There are two problems now. First, alloc_pages_vma() has a strong and wrong
assumption: vma->policy never has MPOL_F_SHARED, while shared_policy->policy
must have it. Second, cpuset rebinding ignores mpol->refcnt and updates the
policy forcibly.
The eventual fix is to implement COW, but for that we need to rewrite the
mpol->rebind family completely. That doesn't fit the 3.5 timeframe.
The downside of patch [2/6] is very small, because a mempolicy is only shared
in three cases: 1) mbind() updates multiple vmas, 2) mbind() updates a shmem
vma, 3) a shared policy is split into two regions by a partial update.
All of them are rare; nobody hit a kernel panic until now. So I don't think my
patch increases the memory footprint.