At 09/14/2012 09:36 AM, Hugh Dickins Wrote:
> On Thu, 13 Sep 2012, Johannes Weiner wrote:
>> On Thu, Sep 13, 2012 at 03:14:28PM +0800, Wen Congyang wrote:
>>> root_mem_cgroup->info.nodeinfo is initialized when the system boots.
>>> But NODE_DATA(nid) is null if the node is not onlined, so
>>> root_mem_cgroup->info.nodeinfo[nid]->zoneinfo[zone].lruvec.zone contains
>>> an invalid pointer. If we use numactl to bind a program to the node
>>> after onlining the node and its memory, it will cause a kernel
>>> panic:
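
(A minimal userspace sketch of that failure mode, using illustrative
stand-in structs rather than the kernel's real layouts: for a
not-yet-onlined node NODE_DATA(nid) is NULL, so
&NODE_DATA(nid)->node_zones[zid] is plain pointer arithmetic from
address zero, and the resulting small bogus pointer only faults later,
on dereference.)

#include <stdio.h>

struct zone { long vm_stat[64]; };		/* stand-in layout */
struct pglist_data {				/* stand-in layout */
	char other_fields[0xf00];
	struct zone node_zones[4];
};

int main(void)
{
	struct pglist_data *pgdat = NULL;	/* NODE_DATA(nid), node offline */

	/* Address arithmetic only, no dereference yet -- this mirrors
	 * 3.5's lruvec_init(&mz->lruvec, &NODE_DATA(node)->node_zones[zone])
	 * running before the node's pgdat exists: */
	struct zone *zone = &pgdat->node_zones[0];

	/* Prints a small offset (0xf00 here), not (nil); a pointer like
	 * this sits in lruvec->zone until something such as
	 * __mod_zone_page_state() dereferences it and oopses. */
	printf("stale lruvec->zone: %p\n", (void *)zone);
	return 0;
}
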
>>
>> Is there any chance we could get rid of the zone backpointer in lruvec
>> again instead?
>
> It could be done, but it would make me sad :(
>
>> Adding new nodes is a rare event and so updating every
>> single memcg in the system might be just borderline crazy.
>
> Not horribly crazy, but rather ugly, yes.
>
>> But can't
>> we just go back to passing the zone along with the lruvec down
>> vmscan.c paths? I agree it's ugly to pass both, given their
>> relationship. But I don't think the backpointer is any cleaner but in
>> addition less robust.
>
> It's like how we use vma->mm: we could change everywhere to pass mm with
> vma, but it looks cleaner and cuts down on long arglists to have mm in vma.
> From past experience, one of the things I worried about was adding extra
> args to the reclaim stack.
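
(The field is spelled vma->vm_mm in the tree; a simplified sketch of
the backpointer pattern being described, with only the vm_mm name taken
from include/linux/mm_types.h and the rest illustrative:)

struct mm_struct;

struct vm_area_struct {
	struct mm_struct *vm_mm;	/* the address space we belong to */
	/* ... */
};

/* A callee reaches the mm through the vma, instead of every function
 * on a deep (reclaim-like) call chain growing one more argument
 * (frob_vma is a made-up name for illustration): */
static void frob_vma(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;
	/* ... operate on mm and vma ... */
	(void)mm;
}
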
>
>>
>> That being said, the crashing code in particular makes me wonder:
>>
>> static __always_inline void add_page_to_lru_list(struct page *page,
>> struct lruvec *lruvec, enum lru_list lru)
>> {
>> int nr_pages = hpage_nr_pages(page);
>> mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
>> list_add(&page->lru, &lruvec->lists[lru]);
>> __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
>> }
>>
>> Why did we ever pass zone in here and then felt the need to replace it
>> with lruvec->zone in fa9add6 "mm/memcg: apply add/del_page to lruvec"?
>> A page does not roam between zones, its zone is a static property that
>> can be retrieved with page_zone().
>
> Just as in vmscan.c, we have the lruvec to hand, and that's what we
> mainly want to operate upon, but there is also some need for zone.
>
> (Both Konstantin and I were looking towards the day when we move the
> lru_lock into the lruvec, removing more dependence on "zone". Pretty
> much the only reason that hasn't happened yet, is that we have not found
> time to make a performance case convincingly - but that's another topic.)
>
> Yes, page_zone(page) is a static property of the page, but it's not
> necessarily cheap to evaluate: depends on how complex the memory model
> and the spare page flags space, doesn't it? We both preferred to
> derive zone from lruvec where convenient.
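
(For contrast, the variant Johannes is suggesting would presumably be
the add_page_to_lru_list() quoted above with page_zone() in place of
the backpointer -- a sketch, not a posted patch:)

static __always_inline void add_page_to_lru_list(struct page *page,
				struct lruvec *lruvec, enum lru_list lru)
{
	int nr_pages = hpage_nr_pages(page);

	mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
	list_add(&page->lru, &lruvec->lists[lru]);
	/* page_zone() recomputes the zone from the page's flags (and,
	 * on sparse memory models, its section) -- the evaluation cost
	 * Hugh is pointing at above: */
	__mod_zone_page_state(page_zone(page), NR_LRU_BASE + lru, nr_pages);
}
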
>
> How do you feel about this patch, and does it work for you guys?
>
> You'd be right if you guessed that I started out without the
> mem_cgroup_zone_lruvec part of it, but oops in get_scan_count
> told me that's needed too.
>
> Description to be filled in later: would it be needed for -stable,
> or is onlining already broken in other ways that you're now fixing up?
>
> Reported-by: Tang Chen <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
Hi, all:
What about the status of this patch?
Thanks
Wen Congyang
> ---
>
> include/linux/mmzone.h | 2 -
> mm/memcontrol.c | 40 ++++++++++++++++++++++++++++++++-------
> mm/mmzone.c | 6 -----
> mm/page_alloc.c | 2 -
> 4 files changed, 36 insertions(+), 14 deletions(-)
>
> --- 3.6-rc5/include/linux/mmzone.h 2012-08-03 08:31:26.892842267 -0700
> +++ linux/include/linux/mmzone.h 2012-09-13 17:07:51.893772372 -0700
> @@ -744,7 +744,7 @@ extern int init_currently_empty_zone(str
> unsigned long size,
> enum memmap_context context);
>
> -extern void lruvec_init(struct lruvec *lruvec, struct zone *zone);
> +extern void lruvec_init(struct lruvec *lruvec);
>
> static inline struct zone *lruvec_zone(struct lruvec *lruvec)
> {
> --- 3.6-rc5/mm/memcontrol.c 2012-08-03 08:31:27.060842270 -0700
> +++ linux/mm/memcontrol.c 2012-09-13 17:46:36.870804625 -0700
> @@ -1061,12 +1061,25 @@ struct lruvec *mem_cgroup_zone_lruvec(st
> struct mem_cgroup *memcg)
> {
> struct mem_cgroup_per_zone *mz;
> + struct lruvec *lruvec;
>
> - if (mem_cgroup_disabled())
> - return &zone->lruvec;
> + if (mem_cgroup_disabled()) {
> + lruvec = &zone->lruvec;
> + goto out;
> + }
>
> mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
> - return &mz->lruvec;
> + lruvec = &mz->lruvec;
> +out:
> + /*
> + * Since a node can be onlined after the mem_cgroup was created,
> + * we have to be prepared to initialize lruvec->zone here.
> + */
> + if (unlikely(lruvec->zone != zone)) {
> + VM_BUG_ON(lruvec->zone);
> + lruvec->zone = zone;
> + }
> + return lruvec;
> }
>
> /*
> @@ -1093,9 +1106,12 @@ struct lruvec *mem_cgroup_page_lruvec(st
> struct mem_cgroup_per_zone *mz;
> struct mem_cgroup *memcg;
> struct page_cgroup *pc;
> + struct lruvec *lruvec;
>
> - if (mem_cgroup_disabled())
> - return &zone->lruvec;
> + if (mem_cgroup_disabled()) {
> + lruvec = &zone->lruvec;
> + goto out;
> + }
>
> pc = lookup_page_cgroup(page);
> memcg = pc->mem_cgroup;
> @@ -1113,7 +1129,17 @@ struct lruvec *mem_cgroup_page_lruvec(st
> pc->mem_cgroup = memcg = root_mem_cgroup;
>
> mz = page_cgroup_zoneinfo(memcg, page);
> - return &mz->lruvec;
> + lruvec = &mz->lruvec;
> +out:
> + /*
> + * Since a node can be onlined after the mem_cgroup was created,
> + * we have to be prepared to initialize lruvec->zone here.
> + */
> + if (unlikely(lruvec->zone != zone)) {
> + VM_BUG_ON(lruvec->zone);
> + lruvec->zone = zone;
> + }
> + return lruvec;
> }
>
> /**
> @@ -4742,7 +4768,7 @@ static int alloc_mem_cgroup_per_zone_inf
>
> for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> mz = &pn->zoneinfo[zone];
> - lruvec_init(&mz->lruvec, &NODE_DATA(node)->node_zones[zone]);
> + lruvec_init(&mz->lruvec);
> mz->usage_in_excess = 0;
> mz->on_tree = false;
> mz->memcg = memcg;
> --- 3.6-rc5/mm/mmzone.c 2012-08-03 08:31:27.064842271 -0700
> +++ linux/mm/mmzone.c 2012-09-13 17:06:28.921766001 -0700
> @@ -87,7 +87,7 @@ int memmap_valid_within(unsigned long pf
> }
> #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
>
> -void lruvec_init(struct lruvec *lruvec, struct zone *zone)
> +void lruvec_init(struct lruvec *lruvec)
> {
> enum lru_list lru;
>
> @@ -95,8 +95,4 @@ void lruvec_init(struct lruvec *lruvec,
>
> for_each_lru(lru)
> INIT_LIST_HEAD(&lruvec->lists[lru]);
> -
> -#ifdef CONFIG_MEMCG
> - lruvec->zone = zone;
> -#endif
> }
> --- 3.6-rc5/mm/page_alloc.c 2012-08-22 14:25:39.508279046 -0700
> +++ linux/mm/page_alloc.c 2012-09-13 17:06:08.265763526 -0700
> @@ -4456,7 +4456,7 @@ static void __paginginit free_area_init_
> zone->zone_pgdat = pgdat;
>
> zone_pcp_init(zone);
> - lruvec_init(&zone->lruvec, zone);
> + lruvec_init(&zone->lruvec);
> if (!size)
> continue;
>
>

On Tue, 16 Oct 2012, Wen Congyang wrote:
> At 09/14/2012 09:36 AM, Hugh Dickins Wrote:
> >
> > Description to be filled in later: would it be needed for -stable,
> > or is onlining already broken in other ways that you're now fixing up?
> >
> > Reported-by: Tang Chen <[email protected]>
> > Signed-off-by: Hugh Dickins <[email protected]>
>
> Hi, all:
>
> What about the status of this patch?
Sorry I'm being so unresponsive at the moment (or, as usual).
When I sent the fixed version afterwards (minus mistaken VM_BUG_ON,
plus safer mem_cgroup_force_empty_list), I expected you or Konstantin
to respond with a patch to fix it as you preferred (at offline/online);
so this was on hold until we could compare and decide between them.
In the meantime, I assume, we've all come to feel that this way is
simple, and probably the best way for now; or at least good enough,
and we all have better things to do than play with alternatives.
I'll write up the description of the fixed version, and post it for
3.7, including the Acks from Hannes and KAMEZAWA (assuming they carry
forward to the second version) - but probably not today or tomorrow.
But please help me: I still don't know if it's needed for -stable.
We introduced the bug in 3.5, but I see lots of memory hotplug fixes
coming by from you and others, so I do not know if this lruvec->zone
fix is useful by itself in 3.5 and 3.6, or not - please tell me.
Thanks,
Hugh
>
> Thanks
> Wen Congyang
>
> > ---
> >
> > include/linux/mmzone.h | 2 -
> > mm/memcontrol.c | 40 ++++++++++++++++++++++++++++++++-------
> > mm/mmzone.c | 6 -----
> > mm/page_alloc.c | 2 -
> > 4 files changed, 36 insertions(+), 14 deletions(-)
> >
> > --- 3.6-rc5/include/linux/mmzone.h 2012-08-03 08:31:26.892842267 -0700
> > +++ linux/include/linux/mmzone.h 2012-09-13 17:07:51.893772372 -0700
> > @@ -744,7 +744,7 @@ extern int init_currently_empty_zone(str
> > unsigned long size,
> > enum memmap_context context);
> >
> > -extern void lruvec_init(struct lruvec *lruvec, struct zone *zone);
> > +extern void lruvec_init(struct lruvec *lruvec);
> >
> > static inline struct zone *lruvec_zone(struct lruvec *lruvec)
> > {
> > --- 3.6-rc5/mm/memcontrol.c 2012-08-03 08:31:27.060842270 -0700
> > +++ linux/mm/memcontrol.c 2012-09-13 17:46:36.870804625 -0700
> > @@ -1061,12 +1061,25 @@ struct lruvec *mem_cgroup_zone_lruvec(st
> > struct mem_cgroup *memcg)
> > {
> > struct mem_cgroup_per_zone *mz;
> > + struct lruvec *lruvec;
> >
> > - if (mem_cgroup_disabled())
> > - return &zone->lruvec;
> > + if (mem_cgroup_disabled()) {
> > + lruvec = &zone->lruvec;
> > + goto out;
> > + }
> >
> > mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
> > - return &mz->lruvec;
> > + lruvec = &mz->lruvec;
> > +out:
> > + /*
> > + * Since a node can be onlined after the mem_cgroup was created,
> > + * we have to be prepared to initialize lruvec->zone here.
> > + */
> > + if (unlikely(lruvec->zone != zone)) {
> > + VM_BUG_ON(lruvec->zone);
> > + lruvec->zone = zone;
> > + }
> > + return lruvec;
> > }
> >
> > /*
> > @@ -1093,9 +1106,12 @@ struct lruvec *mem_cgroup_page_lruvec(st
> > struct mem_cgroup_per_zone *mz;
> > struct mem_cgroup *memcg;
> > struct page_cgroup *pc;
> > + struct lruvec *lruvec;
> >
> > - if (mem_cgroup_disabled())
> > - return &zone->lruvec;
> > + if (mem_cgroup_disabled()) {
> > + lruvec = &zone->lruvec;
> > + goto out;
> > + }
> >
> > pc = lookup_page_cgroup(page);
> > memcg = pc->mem_cgroup;
> > @@ -1113,7 +1129,17 @@ struct lruvec *mem_cgroup_page_lruvec(st
> > pc->mem_cgroup = memcg = root_mem_cgroup;
> >
> > mz = page_cgroup_zoneinfo(memcg, page);
> > - return &mz->lruvec;
> > + lruvec = &mz->lruvec;
> > +out:
> > + /*
> > + * Since a node can be onlined after the mem_cgroup was created,
> > + * we have to be prepared to initialize lruvec->zone here.
> > + */
> > + if (unlikely(lruvec->zone != zone)) {
> > + VM_BUG_ON(lruvec->zone);
> > + lruvec->zone = zone;
> > + }
> > + return lruvec;
> > }
> >
> > /**
> > @@ -4742,7 +4768,7 @@ static int alloc_mem_cgroup_per_zone_inf
> >
> > for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> > mz = &pn->zoneinfo[zone];
> > - lruvec_init(&mz->lruvec, &NODE_DATA(node)->node_zones[zone]);
> > + lruvec_init(&mz->lruvec);
> > mz->usage_in_excess = 0;
> > mz->on_tree = false;
> > mz->memcg = memcg;
> > --- 3.6-rc5/mm/mmzone.c 2012-08-03 08:31:27.064842271 -0700
> > +++ linux/mm/mmzone.c 2012-09-13 17:06:28.921766001 -0700
> > @@ -87,7 +87,7 @@ int memmap_valid_within(unsigned long pf
> > }
> > #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
> >
> > -void lruvec_init(struct lruvec *lruvec, struct zone *zone)
> > +void lruvec_init(struct lruvec *lruvec)
> > {
> > enum lru_list lru;
> >
> > @@ -95,8 +95,4 @@ void lruvec_init(struct lruvec *lruvec,
> >
> > for_each_lru(lru)
> > INIT_LIST_HEAD(&lruvec->lists[lru]);
> > -
> > -#ifdef CONFIG_MEMCG
> > - lruvec->zone = zone;
> > -#endif
> > }
> > --- 3.6-rc5/mm/page_alloc.c 2012-08-22 14:25:39.508279046 -0700
> > +++ linux/mm/page_alloc.c 2012-09-13 17:06:08.265763526 -0700
> > @@ -4456,7 +4456,7 @@ static void __paginginit free_area_init_
> > zone->zone_pgdat = pgdat;
> >
> > zone_pcp_init(zone);
> > - lruvec_init(&zone->lruvec, zone);
> > + lruvec_init(&zone->lruvec);
> > if (!size)
> > continue;
> >
> >
>
>

On Thu, Oct 18, 2012 at 12:03:44PM -0700, Hugh Dickins wrote:
> On Tue, 16 Oct 2012, Wen Congyang wrote:
> > At 09/14/2012 09:36 AM, Hugh Dickins Wrote:
> > >
> > > Description to be filled in later: would it be needed for -stable,
> > > or is onlining already broken in other ways that you're now fixing up?
> > >
> > > Reported-by: Tang Chen <[email protected]>
> > > Signed-off-by: Hugh Dickins <[email protected]>
> >
> > Hi, all:
> >
> > What about the status of this patch?
>
> Sorry I'm being so unresponsive at the moment (or, as usual).
>
> When I sent the fixed version afterwards (minus mistaken VM_BUG_ON,
> plus safer mem_cgroup_force_empty_list), I expected you or Konstantin
> to respond with a patch to fix it as you preferred (at offline/online);
> so this was on hold until we could compare and decide between them.
>
> In the meantime, I assume, we've all come to feel that this way is
> simple, and probably the best way for now; or at least good enough,
> and we all have better things to do than play with alternatives.
>
> I'll write up the description of the fixed version, and post it for
> 3.7, including the Acks from Hannes and KAMEZAWA (assuming they carry
> forward to the second version) - but probably not today or tomorrow.

Mine does, thanks for asking :-)

When MEMCG is configured on (even when it's disabled by boot option),
when adding or removing a page to/from its lru list, the zone pointer
used for stats updates is nowadays taken from the struct lruvec.
(On many configurations, calculating zone from page is slower.)
But we have no code to update all the lruvecs (per zone, per memcg)
when a memory node is hotadded. Here's an extract from the oops which
results when running numactl to bind a program to a newly onlined node:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
IP: [<ffffffff811870b9>] __mod_zone_page_state+0x9/0x60
PGD 0
Oops: 0000 [#1] SMP
CPU 2
Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
Stack:
ffff880039abdaf8 ffffffff8117390f ffff880039abdaf8 000000008167c601
ffffffff81174162 ffff88003a480f00 0000000000000001 ffff8800395e0000
ffff88003dbd0e80 0000000000000282 ffff880039abdb48 ffffffff81174181
Call Trace:
[<ffffffff8117390f>] __pagevec_lru_add_fn+0xdf/0x140
[<ffffffff81174181>] pagevec_lru_move_fn+0xb1/0x100
[<ffffffff811741ec>] __pagevec_lru_add+0x1c/0x30
[<ffffffff81174383>] lru_add_drain_cpu+0xa3/0x130
[<ffffffff8117443f>] lru_add_drain+0x2f/0x40
...
The natural solution might be to use a memcg callback whenever memory
is hotadded; but that solution has not been scoped out, and it happens
that we do have an easy location at which to update lruvec->zone. The
lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by
mem_cgroup_page_lruvec(), and both of those do know the right zone.
So check and set lruvec->zone in those; and remove the inadequate
attempt to set lruvec->zone from lruvec_init(), which is called
before NODE_DATA(node) has been allocated in such cases.
Ah, there was one exception. For no particularly good reason,
mem_cgroup_force_empty_list() has its own code for deciding lruvec.
Change it to use the standard mem_cgroup_zone_lruvec() and
mem_cgroup_get_lru_size() too. In fact it was already safe against
such an oops (the lru lists in danger could only be empty),
but we're better proofed against future changes this way.
Reported-by: Tang Chen <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Wen Congyang <[email protected]>
Cc: [email protected]
---
I've marked this for stable (3.6) since we introduced the problem
in 3.5 (now closed to stable); but I have no idea if this is the
only fix needed to get memory hotadd working with memcg in 3.6,
and received no answer when I enquired twice before.
include/linux/mmzone.h | 2 -
mm/memcontrol.c | 46 +++++++++++++++++++++++++++++----------
mm/mmzone.c | 6 -----
mm/page_alloc.c | 2 -
4 files changed, 38 insertions(+), 18 deletions(-)
--- 3.7-rc3/include/linux/mmzone.h 2012-10-14 16:16:57.665308933 -0700
+++ linux/include/linux/mmzone.h 2012-11-01 14:31:04.284185741 -0700
@@ -752,7 +752,7 @@ extern int init_currently_empty_zone(str
unsigned long size,
enum memmap_context context);
-extern void lruvec_init(struct lruvec *lruvec, struct zone *zone);
+extern void lruvec_init(struct lruvec *lruvec);
static inline struct zone *lruvec_zone(struct lruvec *lruvec)
{
--- 3.7-rc3/mm/memcontrol.c 2012-10-14 16:16:58.341309118 -0700
+++ linux/mm/memcontrol.c 2012-11-01 14:31:04.284185741 -0700
@@ -1055,12 +1055,24 @@ struct lruvec *mem_cgroup_zone_lruvec(st
struct mem_cgroup *memcg)
{
struct mem_cgroup_per_zone *mz;
+ struct lruvec *lruvec;
- if (mem_cgroup_disabled())
- return &zone->lruvec;
+ if (mem_cgroup_disabled()) {
+ lruvec = &zone->lruvec;
+ goto out;
+ }
mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
- return &mz->lruvec;
+ lruvec = &mz->lruvec;
+out:
+ /*
+ * Since a node can be onlined after the mem_cgroup was created,
+ * we have to be prepared to initialize lruvec->zone here;
+ * and if offlined then reonlined, we need to reinitialize it.
+ */
+ if (unlikely(lruvec->zone != zone))
+ lruvec->zone = zone;
+ return lruvec;
}
/*
@@ -1087,9 +1099,12 @@ struct lruvec *mem_cgroup_page_lruvec(st
struct mem_cgroup_per_zone *mz;
struct mem_cgroup *memcg;
struct page_cgroup *pc;
+ struct lruvec *lruvec;
- if (mem_cgroup_disabled())
- return &zone->lruvec;
+ if (mem_cgroup_disabled()) {
+ lruvec = &zone->lruvec;
+ goto out;
+ }
pc = lookup_page_cgroup(page);
memcg = pc->mem_cgroup;
@@ -1107,7 +1122,16 @@ struct lruvec *mem_cgroup_page_lruvec(st
pc->mem_cgroup = memcg = root_mem_cgroup;
mz = page_cgroup_zoneinfo(memcg, page);
- return &mz->lruvec;
+ lruvec = &mz->lruvec;
+out:
+ /*
+ * Since a node can be onlined after the mem_cgroup was created,
+ * we have to be prepared to initialize lruvec->zone here;
+ * and if offlined then reonlined, we need to reinitialize it.
+ */
+ if (unlikely(lruvec->zone != zone))
+ lruvec->zone = zone;
+ return lruvec;
}
/**
@@ -3688,17 +3712,17 @@ unsigned long mem_cgroup_soft_limit_recl
static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
int node, int zid, enum lru_list lru)
{
- struct mem_cgroup_per_zone *mz;
+ struct lruvec *lruvec;
unsigned long flags, loop;
struct list_head *list;
struct page *busy;
struct zone *zone;
zone = &NODE_DATA(node)->node_zones[zid];
- mz = mem_cgroup_zoneinfo(memcg, node, zid);
- list = &mz->lruvec.lists[lru];
+ lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+ list = &lruvec->lists[lru];
- loop = mz->lru_size[lru];
+ loop = mem_cgroup_get_lru_size(lruvec, lru);
/* give some margin against EBUSY etc...*/
loop += 256;
busy = NULL;
@@ -4736,7 +4760,7 @@ static int alloc_mem_cgroup_per_zone_inf
for (zone = 0; zone < MAX_NR_ZONES; zone++) {
mz = &pn->zoneinfo[zone];
- lruvec_init(&mz->lruvec, &NODE_DATA(node)->node_zones[zone]);
+ lruvec_init(&mz->lruvec);
mz->usage_in_excess = 0;
mz->on_tree = false;
mz->memcg = memcg;
--- 3.7-rc3/mm/mmzone.c 2012-09-30 16:47:46.000000000 -0700
+++ linux/mm/mmzone.c 2012-11-01 14:31:04.284185741 -0700
@@ -87,7 +87,7 @@ int memmap_valid_within(unsigned long pf
}
#endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
-void lruvec_init(struct lruvec *lruvec, struct zone *zone)
+void lruvec_init(struct lruvec *lruvec)
{
enum lru_list lru;
@@ -95,8 +95,4 @@ void lruvec_init(struct lruvec *lruvec,
for_each_lru(lru)
INIT_LIST_HEAD(&lruvec->lists[lru]);
-
-#ifdef CONFIG_MEMCG
- lruvec->zone = zone;
-#endif
}
--- 3.7-rc3/mm/page_alloc.c 2012-10-28 13:48:00.021774166 -0700
+++ linux/mm/page_alloc.c 2012-11-01 14:31:04.284185741 -0700
@@ -4505,7 +4505,7 @@ static void __paginginit free_area_init_
zone->zone_pgdat = pgdat;
zone_pcp_init(zone);
- lruvec_init(&zone->lruvec, zone);
+ lruvec_init(&zone->lruvec);
if (!size)
continue;

On Thu 01-11-12 18:28:02, Hugh Dickins wrote:
> When MEMCG is configured on (even when it's disabled by boot option),
> when adding or removing a page to/from its lru list, the zone pointer
> used for stats updates is nowadays taken from the struct lruvec.
> (On many configurations, calculating zone from page is slower.)
>
> But we have no code to update all the lruvecs (per zone, per memcg)
> when a memory node is hotadded. Here's an extract from the oops which
> results when running numactl to bind a program to a newly onlined node:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
> IP: [<ffffffff811870b9>] __mod_zone_page_state+0x9/0x60
> PGD 0
> Oops: 0000 [#1] SMP
> CPU 2
> Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
> Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
> Stack:
> ffff880039abdaf8 ffffffff8117390f ffff880039abdaf8 000000008167c601
> ffffffff81174162 ffff88003a480f00 0000000000000001 ffff8800395e0000
> ffff88003dbd0e80 0000000000000282 ffff880039abdb48 ffffffff81174181
> Call Trace:
> [<ffffffff8117390f>] __pagevec_lru_add_fn+0xdf/0x140
> [<ffffffff81174181>] pagevec_lru_move_fn+0xb1/0x100
> [<ffffffff811741ec>] __pagevec_lru_add+0x1c/0x30
> [<ffffffff81174383>] lru_add_drain_cpu+0xa3/0x130
> [<ffffffff8117443f>] lru_add_drain+0x2f/0x40
> ...
>
> The natural solution might be to use a memcg callback whenever memory
> is hotadded; but that solution has not been scoped out, and it happens
> that we do have an easy location at which to update lruvec->zone. The
> lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by
> mem_cgroup_page_lruvec(), and both of those do know the right zone.
>
> So check and set lruvec->zone in those; and remove the inadequate
> attempt to set lruvec->zone from lruvec_init(), which is called
> before NODE_DATA(node) has been allocated in such cases.
>
> Ah, there was one exception. For no particularly good reason,
> mem_cgroup_force_empty_list() has its own code for deciding lruvec.
> Change it to use the standard mem_cgroup_zone_lruvec() and
> mem_cgroup_get_lru_size() too. In fact it was already safe against
> such an oops (the lru lists in danger could only be empty),
> but we're better proofed against future changes this way.
>
> Reported-by: Tang Chen <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Konstantin Khlebnikov <[email protected]>
> Cc: Wen Congyang <[email protected]>
> Cc: [email protected]
OK, it adds "an overhead" even when there is no hotadd going on, but this
is just one additional mem access&cmp&je so it shouldn't be noticeable
(lruvec->zone is used most of the time later so it's not a pointless
load).
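
(That estimate, annotated against the hunk above; the comments are
editorial, the code is from the patch:)

	if (unlikely(lruvec->zone != zone))	/* load + cmp + predicted-not-taken je */
		lruvec->zone = zone;		/* taken only around node online */
	return lruvec;
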
It is also easier to backport for stable. But is there any reason to fix
it later properly in the hotadd hook?
Anyway
Acked-by: Michal Hocko <[email protected]>
Thanks!
> ---
> I've marked this for stable (3.6) since we introduced the problem
> in 3.5 (now closed to stable); but I have no idea if this is the
> only fix needed to get memory hotadd working with memcg in 3.6,
> and received no answer when I enquired twice before.
>
> include/linux/mmzone.h | 2 -
> mm/memcontrol.c | 46 +++++++++++++++++++++++++++++----------
> mm/mmzone.c | 6 -----
> mm/page_alloc.c | 2 -
> 4 files changed, 38 insertions(+), 18 deletions(-)
>
> --- 3.7-rc3/include/linux/mmzone.h 2012-10-14 16:16:57.665308933 -0700
> +++ linux/include/linux/mmzone.h 2012-11-01 14:31:04.284185741 -0700
> @@ -752,7 +752,7 @@ extern int init_currently_empty_zone(str
> unsigned long size,
> enum memmap_context context);
>
> -extern void lruvec_init(struct lruvec *lruvec, struct zone *zone);
> +extern void lruvec_init(struct lruvec *lruvec);
>
> static inline struct zone *lruvec_zone(struct lruvec *lruvec)
> {
> --- 3.7-rc3/mm/memcontrol.c 2012-10-14 16:16:58.341309118 -0700
> +++ linux/mm/memcontrol.c 2012-11-01 14:31:04.284185741 -0700
> @@ -1055,12 +1055,24 @@ struct lruvec *mem_cgroup_zone_lruvec(st
> struct mem_cgroup *memcg)
> {
> struct mem_cgroup_per_zone *mz;
> + struct lruvec *lruvec;
>
> - if (mem_cgroup_disabled())
> - return &zone->lruvec;
> + if (mem_cgroup_disabled()) {
> + lruvec = &zone->lruvec;
> + goto out;
> + }
>
> mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
> - return &mz->lruvec;
> + lruvec = &mz->lruvec;
> +out:
> + /*
> + * Since a node can be onlined after the mem_cgroup was created,
> + * we have to be prepared to initialize lruvec->zone here;
> + * and if offlined then reonlined, we need to reinitialize it.
> + */
> + if (unlikely(lruvec->zone != zone))
> + lruvec->zone = zone;
> + return lruvec;
> }
>
> /*
> @@ -1087,9 +1099,12 @@ struct lruvec *mem_cgroup_page_lruvec(st
> struct mem_cgroup_per_zone *mz;
> struct mem_cgroup *memcg;
> struct page_cgroup *pc;
> + struct lruvec *lruvec;
>
> - if (mem_cgroup_disabled())
> - return &zone->lruvec;
> + if (mem_cgroup_disabled()) {
> + lruvec = &zone->lruvec;
> + goto out;
> + }
>
> pc = lookup_page_cgroup(page);
> memcg = pc->mem_cgroup;
> @@ -1107,7 +1122,16 @@ struct lruvec *mem_cgroup_page_lruvec(st
> pc->mem_cgroup = memcg = root_mem_cgroup;
>
> mz = page_cgroup_zoneinfo(memcg, page);
> - return &mz->lruvec;
> + lruvec = &mz->lruvec;
> +out:
> + /*
> + * Since a node can be onlined after the mem_cgroup was created,
> + * we have to be prepared to initialize lruvec->zone here;
> + * and if offlined then reonlined, we need to reinitialize it.
> + */
> + if (unlikely(lruvec->zone != zone))
> + lruvec->zone = zone;
> + return lruvec;
> }
>
> /**
> @@ -3688,17 +3712,17 @@ unsigned long mem_cgroup_soft_limit_recl
> static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
> int node, int zid, enum lru_list lru)
> {
> - struct mem_cgroup_per_zone *mz;
> + struct lruvec *lruvec;
> unsigned long flags, loop;
> struct list_head *list;
> struct page *busy;
> struct zone *zone;
>
> zone = &NODE_DATA(node)->node_zones[zid];
> - mz = mem_cgroup_zoneinfo(memcg, node, zid);
> - list = &mz->lruvec.lists[lru];
> + lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> + list = &lruvec->lists[lru];
>
> - loop = mz->lru_size[lru];
> + loop = mem_cgroup_get_lru_size(lruvec, lru);
> /* give some margin against EBUSY etc...*/
> loop += 256;
> busy = NULL;
> @@ -4736,7 +4760,7 @@ static int alloc_mem_cgroup_per_zone_inf
>
> for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> mz = &pn->zoneinfo[zone];
> - lruvec_init(&mz->lruvec, &NODE_DATA(node)->node_zones[zone]);
> + lruvec_init(&mz->lruvec);
> mz->usage_in_excess = 0;
> mz->on_tree = false;
> mz->memcg = memcg;
> --- 3.7-rc3/mm/mmzone.c 2012-09-30 16:47:46.000000000 -0700
> +++ linux/mm/mmzone.c 2012-11-01 14:31:04.284185741 -0700
> @@ -87,7 +87,7 @@ int memmap_valid_within(unsigned long pf
> }
> #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
>
> -void lruvec_init(struct lruvec *lruvec, struct zone *zone)
> +void lruvec_init(struct lruvec *lruvec)
> {
> enum lru_list lru;
>
> @@ -95,8 +95,4 @@ void lruvec_init(struct lruvec *lruvec,
>
> for_each_lru(lru)
> INIT_LIST_HEAD(&lruvec->lists[lru]);
> -
> -#ifdef CONFIG_MEMCG
> - lruvec->zone = zone;
> -#endif
> }
> --- 3.7-rc3/mm/page_alloc.c 2012-10-28 13:48:00.021774166 -0700
> +++ linux/mm/page_alloc.c 2012-11-01 14:31:04.284185741 -0700
> @@ -4505,7 +4505,7 @@ static void __paginginit free_area_init_
> zone->zone_pgdat = pgdat;
>
> zone_pcp_init(zone);
> - lruvec_init(&zone->lruvec, zone);
> + lruvec_init(&zone->lruvec);
> if (!size)
> continue;
>
--
Michal Hocko
SUSE Labs

On Fri 02-11-12 11:21:59, Michal Hocko wrote:
> On Thu 01-11-12 18:28:02, Hugh Dickins wrote:
[...]
And I forgot to mention that the following hunk will clash with
"memcg: Simplify mem_cgroup_force_empty_list error handling" which is in
linux-next already (via Tejun's tree).
Would it be easier to split the patch into the real fix and the hunk
below? That one doesn't have to go into stable anyway and we would save
some merging conflicts. The updated fix on top of the -mm tree is below for
your convenience.
> > /**
> > @@ -3688,17 +3712,17 @@ unsigned long mem_cgroup_soft_limit_recl
> > static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
> > int node, int zid, enum lru_list lru)
> > {
> > - struct mem_cgroup_per_zone *mz;
> > + struct lruvec *lruvec;
> > unsigned long flags, loop;
> > struct list_head *list;
> > struct page *busy;
> > struct zone *zone;
> >
> > zone = &NODE_DATA(node)->node_zones[zid];
> > - mz = mem_cgroup_zoneinfo(memcg, node, zid);
> > - list = &mz->lruvec.lists[lru];
> > + lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > + list = &lruvec->lists[lru];
> >
> > - loop = mz->lru_size[lru];
> > + loop = mem_cgroup_get_lru_size(lruvec, lru);
> > /* give some margin against EBUSY etc...*/
> > loop += 256;
> > busy = NULL;
---
From 0e55c305a050502a4b2f5167efed58bb6585520b Mon Sep 17 00:00:00 2001
From: Hugh Dickins <[email protected]>
Date: Fri, 2 Nov 2012 11:47:48 +0100
Subject: [PATCH] memcg: use mem_cgroup_zone_lruvec in
mem_cgroup_force_empty_list
For no particularly good reason, mem_cgroup_force_empty_list() has
its own code for deciding lruvec. Change it to use the standard
mem_cgroup_zone_lruvec() and mem_cgroup_get_lru_size() too.
In fact it was already safe against oops after memory hotadd (see
"memcg: fix hotplugged memory zone oops" for more details) when lruvec
is not initialized yet (the lru lists in danger could only be empty),
but we're better proofed against future changes this way.
Reported-by: Tang Chen <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Wen Congyang <[email protected]>
---
mm/memcontrol.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6c72d65..32f0ecf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3710,15 +3710,15 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
int node, int zid, enum lru_list lru)
{
- struct mem_cgroup_per_zone *mz;
+ struct lruvec *lruvec;
unsigned long flags;
struct list_head *list;
struct page *busy;
struct zone *zone;
zone = &NODE_DATA(node)->node_zones[zid];
- mz = mem_cgroup_zoneinfo(memcg, node, zid);
- list = &mz->lruvec.lists[lru];
+ lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+ list = &lruvec->lists[lru];
busy = NULL;
do {
--
1.7.10.4
--
Michal Hocko
SUSE Labs

On Fri, 2 Nov 2012, Michal Hocko wrote:
>
> OK, it adds "an overhead" even when there is no hotadd going on, but this
> is just one additional mem access&cmp&je so it shouldn't be noticeable
> (lruvec->zone is used most of the time later so it's not a pointless
> load).
I think so too.
> It is also easier to backport for stable.
Yes.
> But is there any reason to fix it later properly in the hotadd hook?
I don't know. Not everybody liked it fixed this way: it's not
unreasonable to see this as a quick hack rather than the right fix.
I was expecting objectors to post alternative hotadd hook patches,
then we could compare and decide. That didn't happen; but we can
certainly look to taking out these lines later if something we
agree is better comes along. Not high on anyone's agenda, I think.
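
(For comparison, a rough sketch of the shape such a hotadd hook might
have taken -- hypothetical, never posted in this thread;
hotplug_memory_notifier(), struct memory_notify and mem_cgroup_iter()
are real 3.7-era interfaces, but this callback, its placement and its
locking are editorial assumptions:)

/* Hypothetical alternative, sketched only for comparison; would live
 * in mm/memcontrol.c where mem_cgroup_zoneinfo() is visible, and be
 * registered with hotplug_memory_notifier(). */
static int memcg_memory_callback(struct notifier_block *self,
				 unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;
	int nid = mn->status_change_nid;
	struct mem_cgroup *memcg;
	int zid;

	if (action != MEM_ONLINE || nid < 0)
		return NOTIFY_OK;

	/* Repoint every memcg's lruvecs at the node's now-valid zones
	 * (iteration as done by for_each_mem_cgroup): */
	for (memcg = mem_cgroup_iter(NULL, NULL, NULL); memcg;
	     memcg = mem_cgroup_iter(NULL, memcg, NULL)) {
		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
			struct mem_cgroup_per_zone *mz =
				mem_cgroup_zoneinfo(memcg, nid, zid);

			mz->lruvec.zone = &NODE_DATA(nid)->node_zones[zid];
		}
	}
	return NOTIFY_OK;
}
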
>
> Anyway
> Acked-by: Michal Hocko <[email protected]>
Thanks,
Hugh

On Fri, 2 Nov 2012, Michal Hocko wrote:
> On Fri 02-11-12 11:21:59, Michal Hocko wrote:
> > On Thu 01-11-12 18:28:02, Hugh Dickins wrote:
> [...]
>
> And I forgot to mention that the following hunk will clash with
> "memcg: Simplify mem_cgroup_force_empty_list error handling" which is in
> linux-next already (via Tejun's tree).
Oh, via Tejun's tree. Right, when I checked mmotm there was no problem.
> Would it be easier to split the patch into the real fix and the hunk
> below? That one doesn't have to go into stable anyway and we would save
> some merging conflicts. The updated fix on top of the -mm tree is below for
> your convenience.
I'd prefer to leave it as one patch, so even the "future proof" part
of the fix goes into 3.7 and stable. But your point is that you have
already seen the future, and it forks in a slightly different direction!
Well, I don't want to be obstructive, but it doesn't look difficult
to resolve. Perhaps if I hold off on splitting them, and see if akpm
barks at me or not :)
Hugh
> > > /**
> > > @@ -3688,17 +3712,17 @@ unsigned long mem_cgroup_soft_limit_recl
> > > static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
> > > int node, int zid, enum lru_list lru)
> > > {
> > > - struct mem_cgroup_per_zone *mz;
> > > + struct lruvec *lruvec;
> > > unsigned long flags, loop;
> > > struct list_head *list;
> > > struct page *busy;
> > > struct zone *zone;
> > >
> > > zone = &NODE_DATA(node)->node_zones[zid];
> > > - mz = mem_cgroup_zoneinfo(memcg, node, zid);
> > > - list = &mz->lruvec.lists[lru];
> > > + lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > > + list = &lruvec->lists[lru];
> > >
> > > - loop = mz->lru_size[lru];
> > > + loop = mem_cgroup_get_lru_size(lruvec, lru);
> > > /* give some margin against EBUSY etc...*/
> > > loop += 256;
> > > busy = NULL;

On Fri 02-11-12 16:37:37, Hugh Dickins wrote:
> On Fri, 2 Nov 2012, Michal Hocko wrote:
> > On Fri 02-11-12 11:21:59, Michal Hocko wrote:
> > > On Thu 01-11-12 18:28:02, Hugh Dickins wrote:
> > [...]
> >
> > And I forgot to mention that the following hunk will clash with
> > "memcg: Simplify mem_cgroup_force_empty_list error handling" which is in
> > linux-next already (via Tejun's tree).
>
> Oh, via Tejun's tree. Right, when I checked mmotm there was no problem.
Yeah, that whole thing goes through Tejun's tree because there are many
follow up clean ups depending on that change.
> > Would it be easier to split the patch into the real fix and the hunk
> > below? That one doesn't have to go into stable anyway and we would save
> > some merging conflicts. The updated fix on top of the -mm tree is below for
> > your convenience.
>
> I'd prefer to leave it as one patch, so even the "future proof" part
> of the fix goes into 3.7 and stable. But your point is that you have
> already seen the future, and it forks in a slightly different direction!
>
> Well, I don't want to be obstructive, but it doesn't look difficult
> to resolve.
True.
> Perhaps if I hold off on splitting them, and see if akpm barks at me
> or not :)
>
> Hugh
>
> > > > /**
> > > > @@ -3688,17 +3712,17 @@ unsigned long mem_cgroup_soft_limit_recl
> > > > static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
> > > > int node, int zid, enum lru_list lru)
> > > > {
> > > > - struct mem_cgroup_per_zone *mz;
> > > > + struct lruvec *lruvec;
> > > > unsigned long flags, loop;
> > > > struct list_head *list;
> > > > struct page *busy;
> > > > struct zone *zone;
> > > >
> > > > zone = &NODE_DATA(node)->node_zones[zid];
> > > > - mz = mem_cgroup_zoneinfo(memcg, node, zid);
> > > > - list = &mz->lruvec.lists[lru];
> > > > + lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > > > + list = &lruvec->lists[lru];
> > > >
> > > > - loop = mz->lru_size[lru];
> > > > + loop = mem_cgroup_get_lru_size(lruvec, lru);
> > > > /* give some margin against EBUSY etc...*/
> > > > loop += 256;
> > > > busy = NULL;
--
Michal Hocko
SUSE Labs