Simon Kirby reported the following problem:
We're seeing cases on a number of servers where cache never fully
grows to use all available memory. Sometimes we see servers with 4
GB of memory that never seem to have less than 1.5 GB free, even with
a constantly-active VM. In some cases, these servers also swap out
while this happens, even though they are constantly reading the working
set into memory. We have been seeing this happening for a long time;
I don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevailing
theory became that kswapd was being too aggressive about reclaiming the
order-3 pages requested by SLUB.
There are two apparent problems here. First, the target machine has a small
Normal zone in comparison to DMA32. As kswapd tries to balance all zones, it
would continually reclaim for Normal even though DMA32 was already balanced
enough for callers. Second, sleeping_prematurely() uses the requested order,
not the order kswapd finally reclaimed at, which keeps kswapd artificially
awake.
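As a rough illustration of the second problem, here is a standalone toy model
(not the kernel implementation; the "_toy" functions and the numbers are made
up purely to show the order mismatch):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model only: zone_watermark_ok_toy() pretends that only order-0
 * watermarks can be met, standing in for a node where order-3 reclaim
 * has failed. kswapd was woken at order-3, balance_pgdat() fell back
 * to order-0, but the premature-sleep check still uses order-3 and so
 * keeps kswapd awake.
 */
static bool zone_watermark_ok_toy(int order)
{
	return order == 0;
}

static bool sleeping_prematurely_toy(int order)
{
	return !zone_watermark_ok_toy(order);
}

int main(void)
{
	int requested_order = 3;	/* order the waker asked for */
	int reclaimed_order = 0;	/* order balance_pgdat() fell back to */

	printf("checked at requested order: premature=%d\n",
	       sleeping_prematurely_toy(requested_order));	/* 1, stays awake */
	printf("checked at reclaimed order: premature=%d\n",
	       sleeping_prematurely_toy(reclaimed_order));	/* 0, could sleep */
	return 0;
}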
This series aims to alleviate these problems, but it needs testing to confirm
that it addresses the reported problem and wider review to decide whether
there is a better approach. Local tests passed but unfortunately did not
reproduce the problem, so the results are inconclusive.
include/linux/mmzone.h | 3 +-
mm/page_alloc.c | 2 +-
mm/vmscan.c | 90 ++++++++++++++++++++++++++++++++++++++++-------
3 files changed, 79 insertions(+), 16 deletions(-)
When reclaiming for high orders, kswapd is responsible for balancing a
node, but it should not reclaim excessively. It avoids excessive reclaim
by considering the node balanced if any zone in it is balanced. Where
zone sizes are imbalanced (e.g. ZONE_DMA alongside ZONE_DMA32 and
ZONE_NORMAL), kswapd can go to sleep prematurely because just one small
zone was balanced.
This alters the sleep logic of kswapd slightly. It counts the number of pages
that make up the balanced zones. If the total number of balanced pages is
more than a quarter of the node, kswapd will go back to sleep. This should
keep a node balanced without reclaiming an excessive number of pages.
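As a worked example of the threshold, here is a standalone toy (the zone
sizes are hypothetical; only the comparison mirrors the pgdat_balanced()
check added below):

#include <stdbool.h>
#include <stdio.h>

/* Same comparison as pgdat_balanced() below, lifted out for illustration. */
static bool pgdat_balanced_toy(unsigned long node_present_pages,
			       unsigned long balanced_pages)
{
	return balanced_pages > node_present_pages / 4;
}

int main(void)
{
	/* Hypothetical node: 4GB DMA32 plus 512MB Normal, in 4K pages. */
	unsigned long dma32 = 1048576, normal = 131072;
	unsigned long node = dma32 + normal;

	/* Only the small Normal zone balanced: ~11% of the node, keep reclaiming. */
	printf("%d\n", pgdat_balanced_toy(node, normal));	/* 0 */
	/* DMA32 balanced: ~89% of the node, kswapd may sleep. */
	printf("%d\n", pgdat_balanced_toy(node, dma32));	/* 1 */
	return 0;
}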
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 30 ++++++++++++++++++++++--------
1 files changed, 22 insertions(+), 8 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9891efd..77c511f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2117,12 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
}
#endif
+/*
+ * pgdat_balanced is used when checking if a node is balanced for high-order
+ * allocations. Only zones that meet watermarks make up "balanced".
+ * The total of balanced pages must be at least 25% of the node for the
+ * node to be considered balanced. Forcing all zones to be balanced for high
+ * orders can cause excessive reclaim when there are imbalanced zones.
+ * Similarly, we do not want kswapd to go to sleep because ZONE_DMA happens
+ * to be balanced when ZONE_DMA32 is huge in comparison and unbalanced
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
+{
+ return balanced > pgdat->node_present_pages / 4;
+}
+
/* is kswapd sleeping prematurely? */
static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
{
int i;
+ unsigned long balanced = 0;
bool all_zones_ok = true;
- bool any_zone_ok = false;
/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
@@ -2142,7 +2156,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
0, 0))
all_zones_ok = false;
else
- any_zone_ok = true;
+ balanced += zone->present_pages;
}
/*
@@ -2151,7 +2165,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
* For order-0, all zones must be balanced
*/
if (order)
- return !any_zone_ok;
+ return pgdat_balanced(pgdat, balanced);
else
return !all_zones_ok;
}
@@ -2181,7 +2195,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int high_zoneidx)
{
int all_zones_ok;
- int any_zone_ok;
+ unsigned long balanced;
int priority;
int i;
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
@@ -2215,7 +2229,7 @@ loop_again:
disable_swap_token();
all_zones_ok = 1;
- any_zone_ok = 0;
+ balanced = 0;
/*
* Scan in the highmem->dma direction for the highest
@@ -2326,11 +2340,11 @@ loop_again:
*/
zone_clear_flag(zone, ZONE_CONGESTED);
if (i <= high_zoneidx)
- any_zone_ok = 1;
+ balanced += zone->present_pages;
}
}
- if (all_zones_ok || (order && any_zone_ok))
+ if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))
break; /* kswapd: all done */
/*
* OK, kswapd is getting into trouble. Take a nap, then take
@@ -2353,7 +2367,7 @@ loop_again:
break;
}
out:
- if (!(all_zones_ok || (order && any_zone_ok))) {
+ if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))) {
cond_resched();
try_to_freeze();
--
1.7.1
When the allocator enters its slow path, kswapd is woken up to balance the
node. It continues working until all zones within the node are balanced. For
order-0 allocations, this makes perfect sense but for higher orders it can
have unintended side-effects. If the zone sizes are imbalanced, kswapd
may reclaim heavily from a smaller zone, discarding an excessive number of
pages. The user-visible behaviour is that kswapd is awake and reclaiming
even though plenty of pages are free in a suitable zone.
This patch alters the "balance" logic to stop kswapd once any suitable zone
becomes balanced, reducing the number of pages it reclaims from other zones.
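For illustration, a standalone toy model of the old and new stop conditions
(zone names and watermark outcomes are made up, and the real check also
requires the balanced zone to be suitable, i.e. at or below the classzone of
the waking allocation):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model: a node where the big DMA32 zone already meets the order-3
 * high watermark but the small Normal zone does not. The old condition
 * (all zones balanced) keeps kswapd reclaiming Normal; the patched
 * condition (any suitable zone balanced, for order > 0) lets it stop.
 */
struct toy_zone {
	const char *name;
	bool balanced;
};

int main(void)
{
	struct toy_zone zones[] = {
		{ "DMA",    true  },
		{ "DMA32",  true  },
		{ "Normal", false },
	};
	bool all_zones_ok = true, any_zone_ok = false;
	unsigned int i;

	for (i = 0; i < sizeof(zones) / sizeof(zones[0]); i++) {
		if (zones[i].balanced)
			any_zone_ok = true;
		else
			all_zones_ok = false;
	}

	printf("old logic stops kswapd: %d\n", all_zones_ok);	/* 0 */
	printf("new logic stops kswapd: %d\n", any_zone_ok);	/* 1 */
	return 0;
}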
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 2 +-
mm/vmscan.c | 48 +++++++++++++++++++++++++++++++++++++++---------
3 files changed, 42 insertions(+), 11 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..25fe08d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -645,6 +645,7 @@ typedef struct pglist_data {
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
int kswapd_max_order;
+ enum zone_type high_zoneidx;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -660,7 +661,7 @@ typedef struct pglist_data {
extern struct mutex zonelists_mutex;
void build_all_zonelists(void *data);
-void wakeup_kswapd(struct zone *zone, int order);
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx);
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
int classzone_idx, int alloc_flags);
enum memmap_context {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07a6544..344b597 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1921,7 +1921,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
struct zone *zone;
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- wakeup_kswapd(zone, order);
+ wakeup_kswapd(zone, order, high_zoneidx);
}
static inline int
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..67e4283 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2165,11 +2165,14 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
* interoperates with the page allocator fallback scheme to ensure that aging
* of pages is balanced across the zones.
*/
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
+ int high_zoneidx)
{
int all_zones_ok;
+ int any_zone_ok;
int priority;
int i;
+ int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long total_scanned;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc = {
@@ -2192,7 +2195,6 @@ loop_again:
count_vm_event(PAGEOUTRUN);
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
- int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
int has_under_min_watermark_zone = 0;
@@ -2201,6 +2203,7 @@ loop_again:
disable_swap_token();
all_zones_ok = 1;
+ any_zone_ok = 0;
/*
* Scan in the highmem->dma direction for the highest
@@ -2310,10 +2313,12 @@ loop_again:
* spectulatively avoid congestion waits
*/
zone_clear_flag(zone, ZONE_CONGESTED);
+ if (i <= high_zoneidx)
+ any_zone_ok = 1;
}
}
- if (all_zones_ok)
+ if (all_zones_ok || (order && any_zone_ok))
break; /* kswapd: all done */
/*
* OK, kswapd is getting into trouble. Take a nap, then take
@@ -2336,7 +2341,7 @@ loop_again:
break;
}
out:
- if (!all_zones_ok) {
+ if (!(all_zones_ok || (order && any_zone_ok))) {
cond_resched();
try_to_freeze();
@@ -2361,6 +2366,22 @@ out:
goto loop_again;
}
+ /* kswapd should always balance all zones for order-0 */
+ if (order && !all_zones_ok) {
+ order = sc.order = 0;
+ goto loop_again;
+ }
+
+ /*
+ * As kswapd could be going to sleep, unconditionally mark all
+ * zones as uncongested as kswapd is the only mechanism which
+ * clears congestion flags
+ */
+ for (i = 0; i <= end_zone; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+ zone_clear_flag(zone, ZONE_CONGESTED);
+ }
+
return sc.nr_reclaimed;
}
@@ -2380,6 +2401,7 @@ out:
static int kswapd(void *p)
{
unsigned long order;
+ int zone_highidx;
pg_data_t *pgdat = (pg_data_t*)p;
struct task_struct *tsk = current;
DEFINE_WAIT(wait);
@@ -2410,19 +2432,24 @@ static int kswapd(void *p)
set_freezable();
order = 0;
+ zone_highidx = MAX_NR_ZONES;
for ( ; ; ) {
unsigned long new_order;
+ int new_zone_highidx;
int ret;
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
new_order = pgdat->kswapd_max_order;
+ new_zone_highidx = pgdat->high_zoneidx;
pgdat->kswapd_max_order = 0;
- if (order < new_order) {
+ pgdat->high_zoneidx = MAX_NR_ZONES;
+ if (order < new_order || new_zone_highidx < zone_highidx) {
/*
* Don't sleep if someone wants a larger 'order'
- * allocation
+ * allocation or an order at a higher zone
*/
order = new_order;
+ zone_highidx = new_zone_highidx;
} else {
if (!freezing(current) && !kthread_should_stop()) {
long remaining = 0;
@@ -2451,6 +2478,7 @@ static int kswapd(void *p)
}
order = pgdat->kswapd_max_order;
+ zone_highidx = pgdat->high_zoneidx;
}
finish_wait(&pgdat->kswapd_wait, &wait);
@@ -2464,7 +2492,7 @@ static int kswapd(void *p)
*/
if (!ret) {
trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
- balance_pgdat(pgdat, order);
+ balance_pgdat(pgdat, order, zone_highidx);
}
}
return 0;
@@ -2473,7 +2501,7 @@ static int kswapd(void *p)
/*
* A zone is low on free memory, so wake its kswapd task to service it.
*/
-void wakeup_kswapd(struct zone *zone, int order)
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx)
{
pg_data_t *pgdat;
@@ -2483,8 +2511,10 @@ void wakeup_kswapd(struct zone *zone, int order)
pgdat = zone->zone_pgdat;
if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
return;
- if (pgdat->kswapd_max_order < order)
+ if (pgdat->kswapd_max_order < order) {
pgdat->kswapd_max_order = order;
+ pgdat->high_zoneidx = min(pgdat->high_zoneidx, high_zoneidx);
+ }
trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
return;
--
1.7.1
Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
there was a race pushing a zone below its watermark. If the race
happened, it stays awake. However, balance_pgdat() can drop to a lower
reclaim order if it finds that high-order reclaim is not working as
expected. This information is not passed back to sleeping_prematurely().
The impact is that kswapd remains awake, reclaiming pages long after it
should have gone to sleep. This patch passes the adjusted order to
sleeping_prematurely() and uses the same logic as balance_pgdat() to
decide whether it is OK to go to sleep.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 30 ++++++++++++++++++++++++------
1 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 67e4283..9891efd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2118,15 +2118,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
#endif
/* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
{
int i;
+ bool all_zones_ok = true;
+ bool any_zone_ok = false;
/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
return 1;
- /* If after HZ/10, a zone is below the high mark, it's premature */
+ /* Check the watermark levels */
for (i = 0; i < pgdat->nr_zones; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -2138,10 +2140,20 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
0, 0))
- return 1;
+ all_zones_ok = false;
+ else
+ any_zone_ok = true;
}
- return 0;
+ /*
+ * For high-order requests, any zone meeting the watermark is enough
+ * to allow kswapd go back to sleep
+ * For order-0, all zones must be balanced
+ */
+ if (order)
+ return !any_zone_ok;
+ else
+ return !all_zones_ok;
}
/*
@@ -2382,7 +2394,13 @@ out:
zone_clear_flag(zone, ZONE_CONGESTED);
}
- return sc.nr_reclaimed;
+ /*
+ * Return the order we were reclaiming at so sleeping_prematurely()
+ * makes a decision on the order we were last reclaiming at. However,
+ * if another caller entered the allocator slow path while kswapd
+ * was awake, order will remain at the higher level
+ */
+ return order;
}
/*
@@ -2492,7 +2510,7 @@ static int kswapd(void *p)
*/
if (!ret) {
trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
- balance_pgdat(pgdat, order, zone_highidx);
+ order = balance_pgdat(pgdat, order, zone_highidx);
}
}
return 0;
--
1.7.1
On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> When the allocator enters its slow path, kswapd is woken up to balance the
> node. It continues working until all zones within the node are balanced. For
> order-0 allocations, this makes perfect sense but for higher orders it can
> have unintended side-effects. If the zone sizes are imbalanced, kswapd
> may reclaim heavily on a smaller zone discarding an excessive number of
> pages. The user-visible behaviour is that kswapd is awake and reclaiming
> even though plenty of pages are free from a suitable zone.
>
> This patch alters the "balance" logic to stop kswapd if any suitable zone
> becomes balanced to reduce the number of pages it reclaims from other zones.
From my understanding, the patch will stop kswapd reclaiming a higher zone
if a lower zone satisfies the high-order allocation, even though the higher
zone itself does not. This will, for example, make a high-order allocation
aimed at a high zone fall back to a low zone and quickly exhaust it (for
example, DMA). That will break some drivers.
> On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > When the allocator enters its slow path, kswapd is woken up to balance the
> > node. It continues working until all zones within the node are balanced. For
> > order-0 allocations, this makes perfect sense but for higher orders it can
> > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > may reclaim heavily on a smaller zone discarding an excessive number of
> > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > even though plenty of pages are free from a suitable zone.
> >
> > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > becomes balanced to reduce the number of pages it reclaims from other zones.
> from my understanding, the patch will break reclaim high zone if a low
> zone meets the high order allocation, even the high zone doesn't meet
> the high order allocation. This, for example, will make a high order
> allocation from a high zone fallback to low zone and quickly exhaust low
> zone, for example DMA. This will break some drivers.
Have you seen patch [3/3]? I think it mitigates the issue you are pointing out.
On Wed, 2010-12-01 at 10:23 +0800, KOSAKI Motohiro wrote:
> > On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > > When the allocator enters its slow path, kswapd is woken up to balance the
> > > node. It continues working until all zones within the node are balanced. For
> > > order-0 allocations, this makes perfect sense but for higher orders it can
> > > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > > may reclaim heavily on a smaller zone discarding an excessive number of
> > > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > > even though plenty of pages are free from a suitable zone.
> > >
> > > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > > becomes balanced to reduce the number of pages it reclaims from other zones.
> > from my understanding, the patch will break reclaim high zone if a low
> > zone meets the high order allocation, even the high zone doesn't meet
> > the high order allocation. This, for example, will make a high order
> > allocation from a high zone fallback to low zone and quickly exhaust low
> > zone, for example DMA. This will break some drivers.
>
> Have you seen patch [3/3]? I think it migigate your pointed issue.
Yes, it improves things a lot, but the problem is still possible on small systems.
> On Wed, 2010-12-01 at 10:23 +0800, KOSAKI Motohiro wrote:
> > > On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > > > When the allocator enters its slow path, kswapd is woken up to balance the
> > > > node. It continues working until all zones within the node are balanced. For
> > > > order-0 allocations, this makes perfect sense but for higher orders it can
> > > > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > > > may reclaim heavily on a smaller zone discarding an excessive number of
> > > > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > > > even though plenty of pages are free from a suitable zone.
> > > >
> > > > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > > > becomes balanced to reduce the number of pages it reclaims from other zones.
> > > from my understanding, the patch will break reclaim high zone if a low
> > > zone meets the high order allocation, even the high zone doesn't meet
> > > the high order allocation. This, for example, will make a high order
> > > allocation from a high zone fallback to low zone and quickly exhaust low
> > > zone, for example DMA. This will break some drivers.
> >
> > Have you seen patch [3/3]? I think it migigate your pointed issue.
> yes, it improves a lot, but still possible for small systems.
OK, I see. So please define what you mean by "small systems"? We obviously
can't make perfect VM heuristics, so we need to compare the pros and cons.
Of course, I would be glad if you have a better idea and show it.
On Wed, 2010-12-01 at 10:59 +0800, KOSAKI Motohiro wrote:
> > On Wed, 2010-12-01 at 10:23 +0800, KOSAKI Motohiro wrote:
> > > > On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > > > > When the allocator enters its slow path, kswapd is woken up to balance the
> > > > > node. It continues working until all zones within the node are balanced. For
> > > > > order-0 allocations, this makes perfect sense but for higher orders it can
> > > > > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > > > > may reclaim heavily on a smaller zone discarding an excessive number of
> > > > > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > > > > even though plenty of pages are free from a suitable zone.
> > > > >
> > > > > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > > > > becomes balanced to reduce the number of pages it reclaims from other zones.
> > > > from my understanding, the patch will break reclaim high zone if a low
> > > > zone meets the high order allocation, even the high zone doesn't meet
> > > > the high order allocation. This, for example, will make a high order
> > > > allocation from a high zone fallback to low zone and quickly exhaust low
> > > > zone, for example DMA. This will break some drivers.
> > >
> > > Have you seen patch [3/3]? I think it migigate your pointed issue.
> > yes, it improves a lot, but still possible for small systems.
>
> Ok, I got you. so please define your "small systems" word?
an embedded system with less memory, obviously
> we can't make
> perfect VM heuristics obviously, then we need to compare pros/cons.
If you don't care about small systems, consider a normal i386 system with
an 896MB Normal zone and a HighMem zone three times that size. The Normal
zone will quickly be exhausted by high-order allocations that could have
used HighMem, leaving a later allocation that really does need the Normal
zone to fail.
> On Wed, 2010-12-01 at 10:59 +0800, KOSAKI Motohiro wrote:
> > > On Wed, 2010-12-01 at 10:23 +0800, KOSAKI Motohiro wrote:
> > > > > On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > > > > > When the allocator enters its slow path, kswapd is woken up to balance the
> > > > > > node. It continues working until all zones within the node are balanced. For
> > > > > > order-0 allocations, this makes perfect sense but for higher orders it can
> > > > > > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > > > > > may reclaim heavily on a smaller zone discarding an excessive number of
> > > > > > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > > > > > even though plenty of pages are free from a suitable zone.
> > > > > >
> > > > > > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > > > > > becomes balanced to reduce the number of pages it reclaims from other zones.
> > > > > from my understanding, the patch will break reclaim high zone if a low
> > > > > zone meets the high order allocation, even the high zone doesn't meet
> > > > > the high order allocation. This, for example, will make a high order
> > > > > allocation from a high zone fallback to low zone and quickly exhaust low
> > > > > zone, for example DMA. This will break some drivers.
> > > >
> > > > Have you seen patch [3/3]? I think it migigate your pointed issue.
> > > yes, it improves a lot, but still possible for small systems.
> >
> > Ok, I got you. so please define your "small systems" word?
> an embedded system with less memory memory, obviously
Typical embedded systems don't have multiple zones. It's not obvious.
> > we can't make
> > perfect VM heuristics obviously, then we need to compare pros/cons.
> if you don't care about small system, let's consider a NORMAL i386
> system with 896m normal zone, and 896M*3 high zone. normal zone will
> quickly exhaust by high order high zone allocation, leave a latter
> allocation which does need normal zone fail.
That doesn't happen. Slab doesn't allocate from highmem and page cache
allocations always use order-0, so when would a high-order highmem allocation happen?
On Wed, 2010-12-01 at 11:28 +0800, KOSAKI Motohiro wrote:
> > On Wed, 2010-12-01 at 10:59 +0800, KOSAKI Motohiro wrote:
> > > > On Wed, 2010-12-01 at 10:23 +0800, KOSAKI Motohiro wrote:
> > > > > > On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > > > > > > When the allocator enters its slow path, kswapd is woken up to balance the
> > > > > > > node. It continues working until all zones within the node are balanced. For
> > > > > > > order-0 allocations, this makes perfect sense but for higher orders it can
> > > > > > > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > > > > > > may reclaim heavily on a smaller zone discarding an excessive number of
> > > > > > > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > > > > > > even though plenty of pages are free from a suitable zone.
> > > > > > >
> > > > > > > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > > > > > > becomes balanced to reduce the number of pages it reclaims from other zones.
> > > > > > from my understanding, the patch will break reclaim high zone if a low
> > > > > > zone meets the high order allocation, even the high zone doesn't meet
> > > > > > the high order allocation. This, for example, will make a high order
> > > > > > allocation from a high zone fallback to low zone and quickly exhaust low
> > > > > > zone, for example DMA. This will break some drivers.
> > > > >
> > > > > Have you seen patch [3/3]? I think it migigate your pointed issue.
> > > > yes, it improves a lot, but still possible for small systems.
> > >
> > > Ok, I got you. so please define your "small systems" word?
> > an embedded system with less memory memory, obviously
>
> Typical embedded system don't have multiple zone. It's not obvious.
IIRC, ARM supports highmem. But you are right, SLUB doesn't allocate from
highmem.
> > > we can't make
> > > perfect VM heuristics obviously, then we need to compare pros/cons.
> > if you don't care about small system, let's consider a NORMAL i386
> > system with 896m normal zone, and 896M*3 high zone. normal zone will
> > quickly exhaust by high order high zone allocation, leave a latter
> > allocation which does need normal zone fail.
>
> Not happen. slab don't allocate from highmem and page cache allocation
> is always using order-0. When happen high order high zone allocation?
OK, thanks, I missed that. Then how about an x86_64 box with 896MB of DMA32
and a Normal zone three times that size? Some PCI devices can only DMA to the DMA32 zone.
> > > > we can't make
> > > > perfect VM heuristics obviously, then we need to compare pros/cons.
> > > if you don't care about small system, let's consider a NORMAL i386
> > > system with 896m normal zone, and 896M*3 high zone. normal zone will
> > > quickly exhaust by high order high zone allocation, leave a latter
> > > allocation which does need normal zone fail.
> >
> > Not happen. slab don't allocate from highmem and page cache allocation
> > is always using order-0. When happen high order high zone allocation?
> ok, thanks, I missed this. then how about a x86_64 box with 896M DMA32
> and 896*3M NORMAL? some pci devices can only dma to DMA32 zone.
First, DMA32 is 4GB. Second, modern high-end systems don't use 32-bit PCI
devices. Third, if we are thinking of desktop users, 4GB is not a small
amount of room; nowadays a typical desktop has only 2GB or 4GB of memory.
In other words, I agree the issue you point out exists _potentially_, but
I don't think it occurs more frequently than Simon's case.
In other words, when deciding on heuristics, we can't avoid considering
how frequently an issue occurs. It's very important.
Of course, if you have a better idea, I don't oppose it.
On Wed, 2010-12-01 at 15:52 +0800, KOSAKI Motohiro wrote:
> > > > > we can't make
> > > > > perfect VM heuristics obviously, then we need to compare pros/cons.
> > > > if you don't care about small system, let's consider a NORMAL i386
> > > > system with 896m normal zone, and 896M*3 high zone. normal zone will
> > > > quickly exhaust by high order high zone allocation, leave a latter
> > > > allocation which does need normal zone fail.
> > >
> > > Not happen. slab don't allocate from highmem and page cache allocation
> > > is always using order-0. When happen high order high zone allocation?
> > ok, thanks, I missed this. then how about a x86_64 box with 896M DMA32
> > and 896*3M NORMAL? some pci devices can only dma to DMA32 zone.
>
> First, DMA32 is 4GB. Second, modern high end system don't use 32bit PCI
> device. Third, while we are thinking desktop users, 4GB is not small
> room. nowadays, typical desktop have only 2GB or 4GB memory.
DMA32 isn't 4GB, because there is a hole under 4GB for PCI BARs. I don't
think 32-bit PCI devices are rare either. But anyway, if you insist this
isn't a big issue, I'm OK with that.
On Wed, Dec 01, 2010 at 10:13:56AM +0800, Shaohua Li wrote:
> On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > When the allocator enters its slow path, kswapd is woken up to balance the
> > node. It continues working until all zones within the node are balanced. For
> > order-0 allocations, this makes perfect sense but for higher orders it can
> > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > may reclaim heavily on a smaller zone discarding an excessive number of
> > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > even though plenty of pages are free from a suitable zone.
> >
> > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > becomes balanced to reduce the number of pages it reclaims from other zones.
>
> from my understanding, the patch will break reclaim high zone if a low
> zone meets the high order allocation, even the high zone doesn't meet
> the high order allocation.
Indeed this is possible, and it's a situation Simon has confirmed. Patch 3
should cover it by replacing "are any zones ok?" with "are zones
representing at least 25% of the node balanced?"
> This, for example, will make a high order
> allocation from a high zone fallback to low zone and quickly exhaust low
> zone, for example DMA. This will break some drivers.
>
The lowmem reserve would prevent that from happening, so the drivers would be
fine. The real impact is that kswapd would stop when DMA was balanced
even though it was really DMA32 or Normal that needed to be balanced for
proper behaviour.
On lowmem reserves, though, there is another buglet in
sleeping_prematurely(): the classzone_idx it passes means that the wrong
lowmem_reserve is applied for the majority of allocation requests.
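As a rough sketch of why that matters, here is a simplified paraphrase of the
order-0 part of the watermark check with made-up numbers (illustration only,
not the actual kernel function):

#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified sketch: the reserve added to the watermark is selected by
 * classzone_idx. Passing 0, as the sleeping_prematurely() check does,
 * picks an entry that is typically zero instead of the reserve that
 * protects the zone from ordinary GFP_KERNEL/GFP_HIGHUSER fallback.
 */
static bool watermark_ok_sketch(long free_pages, long mark,
				const long *lowmem_reserve, int classzone_idx)
{
	return free_pages > mark + lowmem_reserve[classzone_idx];
}

int main(void)
{
	/* Hypothetical reserves for one zone, indexed by classzone. */
	long reserve[] = { 0, 0, 3000 };	/* DMA, DMA32, Normal */
	long free_pages = 2000, high_wmark = 500;

	printf("classzone_idx=0: ok=%d\n",
	       watermark_ok_sketch(free_pages, high_wmark, reserve, 0)); /* 1 */
	printf("classzone_idx=2: ok=%d\n",
	       watermark_ok_sketch(free_pages, high_wmark, reserve, 2)); /* 0 */
	return 0;
}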
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab