2009-12-28 07:47:27

by KOSAKI Motohiro

Subject: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node

The zone->lock is one of the performance-critical locks, so it shouldn't
be held for a long time. Currently, we have four walk_zones_in_node()
callers, and most of them don't need to hold zone->lock.

Thus, this patch moves the locking responsibility from walk_zones_in_node()
to its callback functions. It also kills the unnecessary zone->lock taking.

Cc: Mel Gorman <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmstat.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6051fba..a5d45bc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
{
struct zone *zone;
struct zone *node_zones = pgdat->node_zones;
- unsigned long flags;

for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!populated_zone(zone))
continue;

- spin_lock_irqsave(&zone->lock, flags);
print(m, pgdat, zone);
- spin_unlock_irqrestore(&zone->lock, flags);
}
}

@@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
int order, mtype;
+ unsigned long flags;

for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,

area = &(zone->free_area[order]);

+ spin_lock_irqsave(&zone->lock, flags);
list_for_each(curr, &area->free_list[mtype])
freecount++;
+ spin_unlock_irqrestore(&zone->lock, flags);
+
seq_printf(m, "%6lu ", freecount);
}
seq_putc(m, '\n');
@@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
struct zone *zone)
{
int i;
+
seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
seq_printf(m,
"\n pages free %lu"
--
1.6.5.2




2009-12-28 07:48:12

by KOSAKI Motohiro

Subject: [PATCH 2/4] vmscan: get_scan_ratio cleanup

get_scan_ratio() should contain all scan-ratio related calculations.
Thus, this patch moves some of those calculations into get_scan_ratio().

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 23 ++++++++++++++---------
1 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bbee91..640486b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1501,6 +1501,13 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

+ /* If we have no swap space, do not bother scanning anon pages. */
+ if (!sc->may_swap || (nr_swap_pages <= 0)) {
+ percent[0] = 0;
+ percent[1] = 100;
+ return;
+ }
+
anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
@@ -1598,22 +1605,20 @@ static void shrink_zone(int priority, struct zone *zone,
unsigned long nr_reclaimed = sc->nr_reclaimed;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
- int noswap = 0;

- /* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || (nr_swap_pages <= 0)) {
- noswap = 1;
- percent[0] = 0;
- percent[1] = 100;
- } else
- get_scan_ratio(zone, sc, percent);
+ get_scan_ratio(zone, sc, percent);

for_each_evictable_lru(l) {
int file = is_file_lru(l);
unsigned long scan;

+ if (percent[file] == 0) {
+ nr[l] = 0;
+ continue;
+ }
+
scan = zone_nr_lru_pages(zone, sc, l);
- if (priority || noswap) {
+ if (priority) {
scan >>= priority;
scan = (scan * percent[file]) / 100;
}
--
1.6.5.2


2009-12-28 07:48:59

by KOSAKI Motohiro

Subject: [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo

The vmscan folks have been asked "why does my system swap out so much?"
on lkml several times. On at least three of those occasions, I made a
debug patch to show the recent_anon_{scanned/rotated} parameters.

Thus, this parameter should be shown in /proc/zoneinfo. It helps the
vmscan folks with debugging.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/swap.h | 2 ++
mm/vmscan.c | 15 +++++++++++++++
mm/vmstat.c | 7 +++++--
3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2602a8..e95d7ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -280,6 +280,8 @@ extern void scan_unevictable_unregister_node(struct node *node);
extern int kswapd_run(int nid);
extern void kswapd_stop(int nid);

+unsigned long get_anon_scan_ratio(struct zone *zone, struct mem_cgroup *memcg, int swappiness);
+
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 640486b..1c39a74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1572,6 +1572,21 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
percent[1] = 100 - percent[0];
}

+unsigned long get_anon_scan_ratio(struct zone *zone, struct mem_cgroup *memcg, int swappiness)
+{
+ unsigned long percent[2];
+ struct scan_control sc = {
+ .may_swap = 1,
+ .swappiness = swappiness,
+ .mem_cgroup = memcg,
+ };
+
+ get_scan_ratio(zone, &sc, percent);
+
+ return percent[0];
+}
+
+
/*
* Smallish @nr_to_scan's are deposited in @nr_saved_scan,
* until we collected @swap_cluster_max pages to scan.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a5d45bc..24383b4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -15,6 +15,7 @@
#include <linux/cpu.h>
#include <linux/vmstat.h>
#include <linux/sched.h>
+#include <linux/swap.h>

#ifdef CONFIG_VM_EVENT_COUNTERS
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -762,11 +763,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n all_unreclaimable: %u"
"\n prev_priority: %i"
"\n start_pfn: %lu"
- "\n inactive_ratio: %u",
+ "\n inactive_ratio: %u"
+ "\n anon_scan_ratio: %lu",
zone_is_all_unreclaimable(zone),
zone->prev_priority,
zone->zone_start_pfn,
- zone->inactive_ratio);
+ zone->inactive_ratio,
+ get_anon_scan_ratio(zone, NULL, vm_swappiness));
seq_putc(m, '\n');
}

--
1.6.5.2
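
For illustration, after this patch the tail of one zone's /proc/zoneinfo
entry would look something like the following (the values here are
invented, not from a real system):

  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         4096
  inactive_ratio:    3
  anon_scan_ratio:   37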


2009-12-28 07:49:31

by KOSAKI Motohiro

Subject: [PATCH 4/4] memcg: add anon_scan_ratio to memory.stat file

The anon_scan_ratio feature isn't only useful for global VM pressure
analysis; it is also useful for memcg memory pressure analysis.

Thus, this patch adds an anon_scan_ratio field to the memory.stat file too.

In exchange, the following debug statistics are removed. They aren't very
user and/or developer friendly.

- recent_rotated_anon
- recent_rotated_file
- recent_scanned_anon
- recent_scanned_file

This removal doesn't cause an ABI issue, because these statistics were
enclosed in CONFIG_DEBUG_VM.

Cc: Balbir Singh <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/memcontrol.c | 43 +++++++++++++++++++------------------------
1 files changed, 19 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 325df12..daa027c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2950,6 +2950,9 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
{
struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont);
struct mcs_total_stat mystat;
+ struct zone *zone;
+ unsigned long total_anon = 0;
+ unsigned long total_scan_anon = 0;
int i;

memset(&mystat, 0, sizeof(mystat));
@@ -2978,34 +2981,26 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
cb->fill(cb, memcg_stat_strings[i].total_name, mystat.stat[i]);
}

-#ifdef CONFIG_DEBUG_VM
cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem_cont, NULL));

- {
- int nid, zid;
+ for_each_populated_zone(zone) {
+ int nid = zone->zone_pgdat->node_id;
+ int zid = zone_idx(zone);
struct mem_cgroup_per_zone *mz;
- unsigned long recent_rotated[2] = {0, 0};
- unsigned long recent_scanned[2] = {0, 0};
-
- for_each_online_node(nid)
- for (zid = 0; zid < MAX_NR_ZONES; zid++) {
- mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
-
- recent_rotated[0] +=
- mz->reclaim_stat.recent_rotated[0];
- recent_rotated[1] +=
- mz->reclaim_stat.recent_rotated[1];
- recent_scanned[0] +=
- mz->reclaim_stat.recent_scanned[0];
- recent_scanned[1] +=
- mz->reclaim_stat.recent_scanned[1];
- }
- cb->fill(cb, "recent_rotated_anon", recent_rotated[0]);
- cb->fill(cb, "recent_rotated_file", recent_rotated[1]);
- cb->fill(cb, "recent_scanned_anon", recent_scanned[0]);
- cb->fill(cb, "recent_scanned_file", recent_scanned[1]);
+ unsigned long anon;
+ unsigned long ratio;
+
+ mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
+
+ anon = MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+ anon += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON);
+
+ ratio = get_anon_scan_ratio(zone, mem_cont, mem_cont->swappiness);
+
+ total_anon += anon;
+ total_scan_anon += anon * ratio;
}
-#endif
+ cb->fill(cb, "anon_scan_ratio", total_scan_anon / total_anon);

return 0;
}
--
1.6.5.2
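
To make the new field concrete: the loop computes an anon-weighted average
of the per-zone scan ratios. As an invented example, a zone with 1000 anon
pages and ratio 20 plus a zone with 3000 anon pages and ratio 40 gives
(1000*20 + 3000*40) / 4000 = 35, so memory.stat would contain a line like:

  anon_scan_ratio 35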


2009-12-29 04:05:39

by Minchan Kim

Subject: Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node

On Mon, Dec 28, 2009 at 4:47 PM, KOSAKI Motohiro
<[email protected]> wrote:
> The zone->lock is one of the performance-critical locks, so it shouldn't
> be held for a long time. Currently, we have four walk_zones_in_node()
> callers, and most of them don't need to hold zone->lock.

I agree.

We can use walk_zones_in_node() freely to show zone-related information:

- frag_show_print : the number of free pages per order
- pagetypeinfo_showfree_print : the number of free pages per migration type
- pagetypeinfo_showblockcount_print : the number of pages in the zone per
migration type
- zoneinfo_show_print : general information about the zone

Do we want to show exact values? No.
Even if we did, zone->lock alone wouldn't be enough.
After all, these are all transient values anyway.

>
> Thus, this patch move locking responsibility from walk_zones_in_node
> to its sub function. Also this patch kill unnecessary zone->lock taking.
>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2009-12-29 06:34:46

by Balbir Singh

Subject: Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node

* KOSAKI Motohiro <[email protected]> [2009-12-28 16:47:22]:

> The zone->lock is one of the performance-critical locks, so it shouldn't
> be held for a long time. Currently, we have four walk_zones_in_node()
> callers, and most of them don't need to hold zone->lock.
>
> Thus, this patch moves the locking responsibility from walk_zones_in_node()
> to its callback functions. It also kills the unnecessary zone->lock taking.
>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmstat.c | 8 +++++---
> 1 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6051fba..a5d45bc 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
> {
> struct zone *zone;
> struct zone *node_zones = pgdat->node_zones;
> - unsigned long flags;
>
> for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
> if (!populated_zone(zone))
> continue;
>
> - spin_lock_irqsave(&zone->lock, flags);
> print(m, pgdat, zone);
> - spin_unlock_irqrestore(&zone->lock, flags);
> }
> }
>

> @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> int order, mtype;
> + unsigned long flags;
>
> for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>
> area = &(zone->free_area[order]);
>
> + spin_lock_irqsave(&zone->lock, flags);
> list_for_each(curr, &area->free_list[mtype])
> freecount++;
> + spin_unlock_irqrestore(&zone->lock, flags);
> +
> seq_printf(m, "%6lu ", freecount);
> }
> seq_putc(m, '\n');
> @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> struct zone *zone)
> {
> int i;
> +
> seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
> seq_printf(m,
> "\n pages free %lu"

While this is a noble cause, is printing all this information correct
without the lock? I am not worried about old information, but about
incorrect data. Should the read side be RCU'ed? Is just removing the
lock and accessing the data safe across architectures?

--
Balbir

2009-12-29 07:34:20

by Balbir Singh

Subject: Re: [PATCH 2/4] vmscan: get_scan_ratio cleanup

* KOSAKI Motohiro <[email protected]> [2009-12-28 16:48:06]:

> get_scan_ratio() should contain all scan-ratio related calculations.
> Thus, this patch moves some of those calculations into get_scan_ratio().
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmscan.c | 23 ++++++++++++++---------
> 1 files changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2bbee91..640486b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1501,6 +1501,13 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
> unsigned long ap, fp;
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>
> + /* If we have no swap space, do not bother scanning anon pages. */
> + if (!sc->may_swap || (nr_swap_pages <= 0)) {
> + percent[0] = 0;
> + percent[1] = 100;
> + return;
> + }
> +


> anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
> zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
> file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
> @@ -1598,22 +1605,20 @@ static void shrink_zone(int priority, struct zone *zone,
> unsigned long nr_reclaimed = sc->nr_reclaimed;
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> - int noswap = 0;
>
> - /* If we have no swap space, do not bother scanning anon pages. */
> - if (!sc->may_swap || (nr_swap_pages <= 0)) {
> - noswap = 1;
> - percent[0] = 0;
> - percent[1] = 100;
> - } else
> - get_scan_ratio(zone, sc, percent);
> + get_scan_ratio(zone, sc, percent);

Where do we set noswap? Is percent[0] == 0 used to indicate noswap =
1?

>
> for_each_evictable_lru(l) {
> int file = is_file_lru(l);
> unsigned long scan;
>
> + if (percent[file] == 0) {
> + nr[l] = 0;
> + continue;
> + }
> +

Is this really needed? Won't nr_scan_try_batch handle it correctly?

> scan = zone_nr_lru_pages(zone, sc, l);
> - if (priority || noswap) {
> + if (priority) {
> scan >>= priority;
> scan = (scan * percent[file]) / 100;
> }

--
Balbir

2009-12-29 14:09:23

by Balbir Singh

Subject: Re: [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo

* KOSAKI Motohiro <[email protected]> [2009-12-28 16:48:51]:

> The vmscan folks have been asked "why does my system swap out so much?"
> on lkml several times. On at least three of those occasions, I made a
> debug patch to show the recent_anon_{scanned/rotated} parameters.
>
> Thus, this parameter should be shown in /proc/zoneinfo. It helps the
> vmscan folks with debugging.
>

Hmmm.. I think this should come under DEBUG_VM; a lot of tools use
/proc/zoneinfo, and the additional overhead may be high, no? Also,
I would recommend adding the additional details at the end, so as
not to break existing tools (specifically, tools that parse output
by line number).

--
Balbir

2009-12-29 14:10:54

by Balbir Singh

Subject: Re: [PATCH 4/4] memcg: add anon_scan_ratio to memory.stat file

* KOSAKI Motohiro <[email protected]> [2009-12-28 16:49:27]:

> The anon_scan_ratio feature isn't only useful for global VM pressure
> analysis; it is also useful for memcg memory pressure analysis.
>
> Thus, this patch adds an anon_scan_ratio field to the memory.stat file too.
>
> In exchange, the following debug statistics are removed. They aren't very
> user and/or developer friendly.
>
> - recent_rotated_anon
> - recent_rotated_file
> - recent_scanned_anon
> - recent_scanned_file
>

I've been using these to look at statistics - specifically reclaim
data on my developer kernels.

--
Balbir

2009-12-30 12:50:35

by KOSAKI Motohiro

Subject: Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node

> * KOSAKI Motohiro <[email protected]> [2009-12-28 16:47:22]:
>
> > The zone->lock is one of the performance-critical locks, so it shouldn't
> > be held for a long time. Currently, we have four walk_zones_in_node()
> > callers, and most of them don't need to hold zone->lock.
> >
> > Thus, this patch moves the locking responsibility from walk_zones_in_node()
> > to its callback functions. It also kills the unnecessary zone->lock taking.
> >
> > Cc: Mel Gorman <[email protected]>
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > ---
> > mm/vmstat.c | 8 +++++---
> > 1 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 6051fba..a5d45bc 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
> > {
> > struct zone *zone;
> > struct zone *node_zones = pgdat->node_zones;
> > - unsigned long flags;
> >
> > for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
> > if (!populated_zone(zone))
> > continue;
> >
> > - spin_lock_irqsave(&zone->lock, flags);
> > print(m, pgdat, zone);
> > - spin_unlock_irqrestore(&zone->lock, flags);
> > }
> > }
> >
>
> > @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > pg_data_t *pgdat, struct zone *zone)
> > {
> > int order, mtype;
> > + unsigned long flags;
> >
> > for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >
> > area = &(zone->free_area[order]);
> >
> > + spin_lock_irqsave(&zone->lock, flags);
> > list_for_each(curr, &area->free_list[mtype])
> > freecount++;
> > + spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > seq_printf(m, "%6lu ", freecount);
> > }
> > seq_putc(m, '\n');
> > @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> > struct zone *zone)
> > {
> > int i;
> > +
> > seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
> > seq_printf(m,
> > "\n pages free %lu"
>
> While this is a noble cause, is printing all this information correct
> without the lock? I am not worried about old information, but about
> incorrect data. Should the read side be RCU'ed? Is just removing the
> lock and accessing the data safe across architectures?

Hm.
Actually, the current memory-hotplug implementation changes various zone
data without zone->lock, so my patch doesn't cause a regression. I'm not
sure whether RCU protection would be worthwhile, but I think that can be
a separate patch.



2009-12-30 12:54:23

by KOSAKI Motohiro

Subject: Re: [PATCH 4/4] memcg: add anon_scan_ratio to memory.stat file

> * KOSAKI Motohiro <[email protected]> [2009-12-28 16:49:27]:
>
> > The anon_scan_ratio feature isn't only useful for global VM pressure
> > analysis; it is also useful for memcg memory pressure analysis.
> >
> > Thus, this patch adds an anon_scan_ratio field to the memory.stat file too.
> >
> > In exchange, the following debug statistics are removed. They aren't very
> > user and/or developer friendly.
> >
> > - recent_rotated_anon
> > - recent_rotated_file
> > - recent_scanned_anon
> > - recent_scanned_file
>
> I've been using these to look at statistics - specifically reclaim
> data on my developer kernels.

OK, I'll drop the removal part and just add anon_scan_ratio.

2009-12-30 13:07:24

by KOSAKI Motohiro

Subject: Re: [PATCH 2/4] vmscan: get_scan_ratio cleanup

> * KOSAKI Motohiro <[email protected]> [2009-12-28 16:48:06]:
>
> > get_scan_ratio() should contain all scan-ratio related calculations.
> > Thus, this patch moves some of those calculations into get_scan_ratio().
> >
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > ---
> > mm/vmscan.c | 23 ++++++++++++++---------
> > 1 files changed, 14 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2bbee91..640486b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1501,6 +1501,13 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
> > unsigned long ap, fp;
> > struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> >
> > + /* If we have no swap space, do not bother scanning anon pages. */
> > + if (!sc->may_swap || (nr_swap_pages <= 0)) {
> > + percent[0] = 0;
> > + percent[1] = 100;
> > + return;
> > + }
> > +
>
>
> > anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
> > zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
> > file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
> > @@ -1598,22 +1605,20 @@ static void shrink_zone(int priority, struct zone *zone,
> > unsigned long nr_reclaimed = sc->nr_reclaimed;
> > unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> > - int noswap = 0;
> >
> > - /* If we have no swap space, do not bother scanning anon pages. */
> > - if (!sc->may_swap || (nr_swap_pages <= 0)) {
> > - noswap = 1;
> > - percent[0] = 0;
> > - percent[1] = 100;
> > - } else
> > - get_scan_ratio(zone, sc, percent);
> > + get_scan_ratio(zone, sc, percent);
>
> Where do we set noswap? Is percent[0] == 0 used to indicate noswap =
> 1?

Yes, that was my intention. I guess your question can be rephrased as:
the following case produces a different result; is that intentional?
- there is free swap
- sc->may_swap == 1
- priority == 0
- percent[0] == 0

My answer is that this doesn't happen in practical workloads. If priority
reaches 0, vmscan always scans and reclaims some anon pages (please recall
that vmscan now automatically enables lumpy reclaim when priority < 10), so
recent_rotated_anon is practically never 0.

Do you think this is a wrong assumption?
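
To make that corner case concrete, here is the relevant shrink_zone()
fragment from patch 2, with explanatory comments added by me (they are
not part of the patch):

	if (percent[file] == 0) {	/* e.g. no swap: skip anon LRUs */
		nr[l] = 0;
		continue;
	}

	scan = zone_nr_lru_pages(zone, sc, l);
	if (priority) {			/* not taken when priority == 0 */
		scan >>= priority;
		scan = (scan * percent[file]) / 100;
	}

Without the percent[file] == 0 check, a priority == 0 pass would leave
scan at the full LRU size even when percent[0] == 0, which is the case
the old "priority || noswap" test guarded against.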


> > for_each_evictable_lru(l) {
> > int file = is_file_lru(l);
> > unsigned long scan;
> >
> > + if (percent[file] == 0) {
> > + nr[l] = 0;
> > + continue;
> > + }
> > +
>
> Is this really needed? Won't nr_scan_try_batch handle it correctly?

These two if branches are nicer than "priority || noswap", I think.
They clearly explain what the code does.

>
> > scan = zone_nr_lru_pages(zone, sc, l);
> > - if (priority || noswap) {
> > + if (priority) {
> > scan >>= priority;
> > scan = (scan * percent[file]) / 100;
> > }
>
> --
> Balbir


2009-12-30 13:14:36

by KOSAKI Motohiro

Subject: Re: [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo

> * KOSAKI Motohiro <[email protected]> [2009-12-28 16:48:51]:
>
> > The vmscan folks have been asked "why does my system swap out so much?"
> > on lkml several times. On at least three of those occasions, I made a
> > debug patch to show the recent_anon_{scanned/rotated} parameters.
> >
> > Thus, this parameter should be shown in /proc/zoneinfo. It helps the
> > vmscan folks with debugging.
> >
>
> Hmmm.. I think this should come under DEBUG_VM; a lot of tools use
> /proc/zoneinfo, and the additional overhead may be high, no? Also,
> I would recommend adding the additional details at the end, so as
> not to break existing tools (specifically, tools that parse output
> by line number).

Thanks, I have three answers. 1) I really hope not to put this under
DEBUG_VM; otherwise my problem isn't solved. 2) But your performance
worry is fair enough; I plan to stop grabbing zone->lru_lock when reading
/proc/zoneinfo. 3) Appending a new line doesn't break existing tools:
zoneinfo shows vmstat counters and we have often appended new vmstat
counters over the years, yet I have never seen a zoneinfo breakage bug
report. Because the zoneinfo file shows information for multiple zones,
nobody accesses it by line number.




2010-01-01 14:56:41

by Rik van Riel

Subject: Re: [PATCH 2/4] vmscan: get_scan_ratio cleanup

On 12/28/2009 02:48 AM, KOSAKI Motohiro wrote:
> get_scan_ratio() should contain all scan-ratio related calculations.
> Thus, this patch moves some of those calculations into get_scan_ratio().
>
> Signed-off-by: KOSAKI Motohiro<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed.

2010-01-01 15:00:31

by Rik van Riel

Subject: Re: [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo

On 12/28/2009 02:48 AM, KOSAKI Motohiro wrote:
> The vmscan folks have been asked "why does my system swap out so much?"
> on lkml several times. On at least three of those occasions, I made a
> debug patch to show the recent_anon_{scanned/rotated} parameters.
>
> Thus, this parameter should be shown in /proc/zoneinfo. It helps the
> vmscan folks with debugging.
>
> Signed-off-by: KOSAKI Motohiro<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed.

2010-01-03 19:00:12

by Mel Gorman

Subject: Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node

On Mon, Dec 28, 2009 at 04:47:22PM +0900, KOSAKI Motohiro wrote:
> The zone->lock is one of the performance-critical locks, so it shouldn't
> be held for a long time. Currently, we have four walk_zones_in_node()
> callers, and most of them don't need to hold zone->lock.
>
> Thus, this patch moves the locking responsibility from walk_zones_in_node()
> to its callback functions. It also kills the unnecessary zone->lock taking.
>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmstat.c | 8 +++++---
> 1 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6051fba..a5d45bc 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
> {
> struct zone *zone;
> struct zone *node_zones = pgdat->node_zones;
> - unsigned long flags;
>
> for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
> if (!populated_zone(zone))
> continue;
>
> - spin_lock_irqsave(&zone->lock, flags);
> print(m, pgdat, zone);
> - spin_unlock_irqrestore(&zone->lock, flags);
> }
> }
>
> @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> int order, mtype;
> + unsigned long flags;
>
> for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>
> area = &(zone->free_area[order]);
>
> + spin_lock_irqsave(&zone->lock, flags);
> list_for_each(curr, &area->free_list[mtype])
> freecount++;
> + spin_unlock_irqrestore(&zone->lock, flags);
> +

It's not clear why you feel this information requires the lock and the
others do not.

For the most part, I agree that the accuracy of the information is
not critical. Assuming partial writes of the data are not a problem,
the information is not going to go so badly out of sync that it would be
noticeable, even if the information is out of date within the zone.

However, inconsistent reads in zoneinfo really could be a problem. I am
concerned that under heavy allocation load "pages free" would not match
"nr_pages_free", for example. Another example: adding all the counters
together may or may not equal the total number of pages in the zone.

Let's say, for example, there was a subtle bug related to
__inc_zone_page_state() that meant the counters were getting slightly out
of sync, but it was very marginal and/or difficult to reproduce. With this
patch applied, we could not be absolutely sure the counters were correct,
because the read could always have raced with someone holding the
zone->lock.

Minimally, I think zoneinfo should be taking the zone lock.

Secondly, has increased zone->lock contention due to reading /proc
really been shown to be a problem? The only situation that I can think
of is a badly-written monitor program that is copying all of /proc
instead of the files of interest. If a monitor program is doing
something like that, it's likely to be incurring performance problems in
a large number of different areas. If that is not the trigger case, what
is?

Thanks

> seq_printf(m, "%6lu ", freecount);
> }
> seq_putc(m, '\n');
> @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> struct zone *zone)
> {
> int i;
> +

Unnecessary whitespace change.

> seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
> seq_printf(m,
> "\n pages free %lu"
> --
> 1.6.5.2
>
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-01-03 23:48:48

by Kamezawa Hiroyuki

Subject: Re: [PATCH 2/4] vmscan: get_scan_ratio cleanup

On Mon, 28 Dec 2009 16:48:06 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> get_scan_ratio() should contain all scan-ratio related calculations.
> Thus, this patch moves some of those calculations into get_scan_ratio().
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2010-01-03 23:50:16

by Kamezawa Hiroyuki

Subject: Re: [PATCH 3/4] vmstat: add anon_scan_ratio field to zoneinfo

On Mon, 28 Dec 2009 16:48:51 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> The vmscan folks have been asked "why does my system swap out so much?"
> on lkml several times. On at least three of those occasions, I made a
> debug patch to show the recent_anon_{scanned/rotated} parameters.
>
> Thus, this parameter should be shown in /proc/zoneinfo. It helps the
> vmscan folks with debugging.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2010-01-03 23:51:53

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node

On Mon, 28 Dec 2009 16:47:22 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> The zone->lock is one of the performance-critical locks, so it shouldn't
> be held for a long time. Currently, we have four walk_zones_in_node()
> callers, and most of them don't need to hold zone->lock.
>
> Thus, this patch moves the locking responsibility from walk_zones_in_node()
> to its callback functions. It also kills the unnecessary zone->lock taking.
>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2010-01-05 02:05:13

by KOSAKI Motohiro

Subject: Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node

Hi

> On Mon, Dec 28, 2009 at 04:47:22PM +0900, KOSAKI Motohiro wrote:
> > The zone->lock is one of the performance-critical locks, so it shouldn't
> > be held for a long time. Currently, we have four walk_zones_in_node()
> > callers, and most of them don't need to hold zone->lock.
> >
> > Thus, this patch moves the locking responsibility from walk_zones_in_node()
> > to its callback functions. It also kills the unnecessary zone->lock taking.
> >
> > Cc: Mel Gorman <[email protected]>
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > ---
> > mm/vmstat.c | 8 +++++---
> > 1 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 6051fba..a5d45bc 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
> > {
> > struct zone *zone;
> > struct zone *node_zones = pgdat->node_zones;
> > - unsigned long flags;
> >
> > for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
> > if (!populated_zone(zone))
> > continue;
> >
> > - spin_lock_irqsave(&zone->lock, flags);
> > print(m, pgdat, zone);
> > - spin_unlock_irqrestore(&zone->lock, flags);
> > }
> > }
> >
> > @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > pg_data_t *pgdat, struct zone *zone)
> > {
> > int order, mtype;
> > + unsigned long flags;
> >
> > for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >
> > area = &(zone->free_area[order]);
> >
> > + spin_lock_irqsave(&zone->lock, flags);
> > list_for_each(curr, &area->free_list[mtype])
> > freecount++;
> > + spin_unlock_irqrestore(&zone->lock, flags);
> > +
>
> It's not clear why you feel this information requires the lock and the
> others do not.

I think the above list operation requires the lock to prevent a NULL
pointer access, but for the other parts the lock doesn't protect anything,
because memory hotplug changes them without the zone lock.


> For the most part, I agree that the accuracy of the information is
> not critical. Assuming partial writes of the data are not a problem,
> the information is not going to go so badly out of sync that it would be
> noticeable, even if the information is out of date within the zone.
>
> However, inconsistent reads in zoneinfo really could be a problem. I am
> concerned that under heavy allocation load "pages free" would not match
> "nr_pages_free", for example. Another example: adding all the counters
> together may or may not equal the total number of pages in the zone.
>
> Let's say, for example, there was a subtle bug related to
> __inc_zone_page_state() that meant the counters were getting slightly out
> of sync, but it was very marginal and/or difficult to reproduce. With this
> patch applied, we could not be absolutely sure the counters were correct,
> because the read could always have raced with someone holding the
> zone->lock.
>
> Minimally, I think zoneinfo should be taking the zone lock.

Thanks for the many comments.
Hmm, I'd like to clarify your point. My point is that memory hotplug
doesn't take the zone lock, so the zone lock doesn't protect anything.
So we have two options:

1) Add the zone lock to memory hotplug
2) Remove the zone lock from zoneinfo

I thought (2) was sufficient. Do you mean you prefer (1)? Or do you prefer
to ignore a rare event (of course, memory hotplug is rare)?


> Secondly, has increased zone->lock contention due to reading /proc
> really been shown to be a problem? The only situation that I can think
> of is a badly-written monitor program that is copying all of /proc
> instead of the files of interest. If a monitor program is doing
> something like that, it's likely to be incurring performance problems in
> a large number of different areas. If that is not the trigger case, what
> is?

Ah, no. I haven't observed such an issue. My point is removing a
meaningless lock.


> > seq_printf(m, "%6lu ", freecount);
> > }
> > seq_putc(m, '\n');
> > @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> > struct zone *zone)
> > {
> > int i;
> > +
>
> Unnecessary whitespace change.

Ugh. Thanks, that's my fault.


2010-01-05 10:18:33

by Mel Gorman

Subject: Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node

On Tue, Jan 05, 2010 at 11:04:58AM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > On Mon, Dec 28, 2009 at 04:47:22PM +0900, KOSAKI Motohiro wrote:
> > > The zone->lock is one of the performance-critical locks, so it shouldn't
> > > be held for a long time. Currently, we have four walk_zones_in_node()
> > > callers, and most of them don't need to hold zone->lock.
> > >
> > > Thus, this patch moves the locking responsibility from walk_zones_in_node()
> > > to its callback functions. It also kills the unnecessary zone->lock taking.
> > >
> > > Cc: Mel Gorman <[email protected]>
> > > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > > ---
> > > mm/vmstat.c | 8 +++++---
> > > 1 files changed, 5 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 6051fba..a5d45bc 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
> > > {
> > > struct zone *zone;
> > > struct zone *node_zones = pgdat->node_zones;
> > > - unsigned long flags;
> > >
> > > for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
> > > if (!populated_zone(zone))
> > > continue;
> > >
> > > - spin_lock_irqsave(&zone->lock, flags);
> > > print(m, pgdat, zone);
> > > - spin_unlock_irqrestore(&zone->lock, flags);
> > > }
> > > }
> > >
> > > @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > > pg_data_t *pgdat, struct zone *zone)
> > > {
> > > int order, mtype;
> > > + unsigned long flags;
> > >
> > > for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > >
> > > area = &(zone->free_area[order]);
> > >
> > > + spin_lock_irqsave(&zone->lock, flags);
> > > list_for_each(curr, &area->free_list[mtype])
> > > freecount++;
> > > + spin_unlock_irqrestore(&zone->lock, flags);
> > > +
> >
> > It's not clear why you feel this information requires the lock and the
> > others do not.
>
> I think the above list operation requires the lock to prevent a NULL
> pointer access, but for the other parts the lock doesn't protect anything,
> because memory hotplug changes them without the zone lock.
>

True. Add a comment explaining that. I considered list_for_each_safe()
but it wouldn't work in all cases.
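
Something along these lines (a sketch only, not Mel's wording) would
capture that distinction in pagetypeinfo_showfree_print():

	/*
	 * Unlike the plain counters, the free_list walk needs
	 * zone->lock: without it a concurrent alloc/free could
	 * modify the list under us and the walk could follow a
	 * stale ->next pointer. For counters, a stale read is
	 * merely inaccurate, not unsafe.
	 */
	spin_lock_irqsave(&zone->lock, flags);
	list_for_each(curr, &area->free_list[mtype])
		freecount++;
	spin_unlock_irqrestore(&zone->lock, flags);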

>
> > For the most part, I agree that the accuracy of the information is
> > not critical. Assuming partial writes of the data are not a problem,
> > the information is not going to go so badly out of sync that it would be
> > noticeable, even if the information is out of date within the zone.
> >
> > However, inconsistent reads in zoneinfo really could be a problem. I am
> > concerned that under heavy allocation load "pages free" would not match
> > "nr_pages_free", for example. Another example: adding all the counters
> > together may or may not equal the total number of pages in the zone.
> >
> > Let's say, for example, there was a subtle bug related to
> > __inc_zone_page_state() that meant the counters were getting slightly out
> > of sync, but it was very marginal and/or difficult to reproduce. With this
> > patch applied, we could not be absolutely sure the counters were correct,
> > because the read could always have raced with someone holding the
> > zone->lock.
> >
> > Minimally, I think zoneinfo should be taking the zone lock.
>
> Thanks for the many comments.
> Hmm, I'd like to clarify your point. My point is that memory hotplug
> doesn't take the zone lock, so the zone lock doesn't protect anything.
> So we have two options:
>
> 1) Add the zone lock to memory hotplug
> 2) Remove the zone lock from zoneinfo
>
> I thought (2) was sufficient. Do you mean you prefer (1)? Or do you prefer
> to ignore a rare event (of course, memory hotplug is rare)?
>

I think (2) will make zoneinfo harder to use for examining all the counters
properly as I explained above. I haven't looked at memory-hotplug in a
while but IIRC, fields like present_pages should be protected by a lock on
the pgdat and a seq lock on the zone. If this is not true at the moment,
it is a problem.

For the free lists, memory hotplug should be taking the zone->lock properly as
the final stage of onlining memory is to walk the sections being hot-added,
init the memmap and then __free_page() each page individually - i.e. the
normal free path.

So, if memory hotplug is not protected by proper locking, it's not intentional.
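
As a rough illustration of that final stage (a simplified sketch, not the
actual mm/memory_hotplug.c code), each hot-added page is handed to the
buddy allocator through the normal free path:

	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
		struct page *page = pfn_to_page(pfn);

		ClearPageReserved(page);
		init_page_count(page);
		__free_page(page);	/* ends up taking zone->lock */
	}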

>
> > Secondly, has increased zone->lock contention due to reading /proc
> > really been shown to be a problem? The only situation that I can think
> > of is a badly-written monitor program that is copying all of /proc
> > instead of the files of interest. If a monitor program is doing
> > something like that, it's likely to be incurring performance problems in
> > a large number of different areas. If that is not the trigger case, what
> > is?
>
> Ah, no. I haven't observed such an issue. My point is removing a
> meaningless lock.
>

Then I believe the zone lock should be preserved so that all entries in
/proc/zoneinfo are consistent.

>
> > > seq_printf(m, "%6lu ", freecount);
> > > }
> > > seq_putc(m, '\n');
> > > @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> > > struct zone *zone)
> > > {
> > > int i;
> > > +
> >
> > Unnecessary whitespace change.
>
> Ug. thanks, it's my fault.
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-01-06 00:33:52

by KOSAKI Motohiro

Subject: Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node

> > Thanks for the many comments.
> > Hmm, I'd like to clarify your point. My point is that memory hotplug
> > doesn't take the zone lock, so the zone lock doesn't protect anything.
> > So we have two options:
> >
> > 1) Add the zone lock to memory hotplug
> > 2) Remove the zone lock from zoneinfo
> >
> > I thought (2) was sufficient. Do you mean you prefer (1)? Or do you prefer
> > to ignore a rare event (of course, memory hotplug is rare)?
> >
>
> I think (2) will make zoneinfo harder to use for examining all the counters
> properly as I explained above. I haven't looked at memory-hotplug in a
> while but IIRC, fields like present_pages should be protected by a lock on
> the pgdat and a seq lock on the zone. If this is not true at the moment,
> it is a problem.
>
> For the free lists, memory hotplug should be taking the zone->lock properly as
> the final stage of onlining memory is to walk the sections being hot-added,
> init the memmap and then __free_page() each page individually - i.e. the
> normal free path.
>
> So, if memory hotplug is not protected by proper locking, it's not intentional.

OK, I'll drop this patch. Thanks.