2007-08-09 21:06:29

by Mel Gorman

Subject: [PATCH 0/4] Use one zonelist per node instead of multiple zonelists v3

Changelog since V2
o shrink_zones() uses zonelist instead of zonelist->zones
o hugetlb uses zonelist iterator
o zone_idx information is embedded in zonelist pointers
o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
o Break up the patch into 3 patches
o Introduce iterators for zonelists
o Performance regression test

The following patches replace the multiple zonelists per node with one
zonelist that is filtered based on the GFP flags. As a set, the patches fix
a bug in the interaction between MPOL_BIND and ZONE_MOVABLE. With this
patchset, MPOL_BIND applies to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered an alternative fix for the
MPOL_BIND+ZONE_MOVABLE problem in 2.6.23 to the previously discussed hack
that filters only custom zonelists. As a bonus, the patchset reduces the
cache footprint of the kernel and should improve performance in a number
of cases.
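
The core of the change, as a sketch using names from the patches below
rather than code quoted from the description (nid and gfp_mask are
placeholder variables):

	/* Before: one zonelist per zone type, selected by the GFP flags */
	zonelist = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask);

	/* After: one zonelist per node; the GFP flags instead determine
	 * the highest zone index the iterators will return */
	zonelist = node_zonelist(nid);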

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones while other places use a zonelist.

The second patch replaces multiple zonelists with one zonelist that is
filtered.

The third patch embeds the zone ID of each zone in the zonelist->zones
pointers so that the iterators do not need to call the relatively
expensive zone_idx() for every zone visited.

The final patch is a fix that depends on the previous patches. It
changes policy_zone so that the MPOL_BIND policy gets applied to the
two highest populated zones when the highest populated zone is
ZONE_MOVABLE. Otherwise, MPOL_BIND only applies to the highest populated zone.

The patchset passed regression tests with numactltest. Performance results
varied depending on the machine configuration, but there were usually small
performance gains. The new algorithm relies heavily on the implementation
of zone_idx(), which is currently pretty expensive. Experiments to optimise
this have shown significant improvements for this algorithm, but that work
is beyond the scope of this patchset. Due to the nature of the change, the
results for other people are likely to vary - it'll usually win but
occasionally lose.

In real workloads the gain/loss will depend on how much the userspace
portion of the benchmark benefits from having more cache available due
to reduced referencing of zonelists. I expect it'll be more noticeable on
x86_64 with many zones than on IA64, which typically has only one
active zonelist per node.

These are the ranges of performance losses/gains I found when running
against 2.6.23-rc1-mm2. The test machines are a mix of i386, x86_64 and
ppc64, both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.02% to 0.27%
Elapsed time on Kernbench: -0.21% to 1.26%
page_test from aim9: -3.41% to 3.90%
brk_test from aim9: -0.20% to 40.94%
fork_test from aim9: -0.42% to 4.59%
exec_test from aim9: -0.78% to 1.95%
Size reduction of pg_data_t: 0 to 7808 bytes (depends on alignment)

The TBench figures were too variable between runs to draw conclusions from,
but there didn't appear to be any regressions there. The hackbench results
for both sockets and pipes were within noise. I haven't gone through lmbench.

These four patches are a standalone set which addresses the MPOL_BIND
problem with ZONE_MOVABLE as well as reducing memory usage and, in many
cases, the cache footprint of the kernel. They should be considered a bug
fix due to the MPOL_BIND fixup.

If these patches are accepted, the follow-on work would entail:

o Encode zone_id in the zonelist pointers to avoid zone_idx() (Christoph's idea)
o If zone_id works out, eliminate z_to_n from the zonelist cache as unnecessary
o Remove bind_zonelist() (Patch in progress, very messy right now)
o Eliminate policy_zone (Trickier)

Comments?
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


2007-08-09 21:06:54

by Mel Gorman

Subject: [PATCH 1/4] Use zonelists instead of zones when direct reclaiming pages


The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation. Similarly, direct reclaim of pages
iterates over an array of zones. For consistency, this patch converts direct
reclaim to use a zonelist. No functionality is changed by this patch. This
simplifies zonelist iterators in the next patch.
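
In summary, the interface change (taken from the hunks below) is:

	/* before */
	unsigned long try_to_free_pages(struct zone **zones, int order,
					gfp_t gfp_mask);

	/* after */
	unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
					gfp_t gfp_mask);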

Signed-off-by: Mel Gorman <[email protected]>
---

 include/linux/swap.h |    2 +-
 mm/page_alloc.c      |    2 +-
 mm/vmscan.c          |    9 ++++++---
 3 files changed, 8 insertions(+), 5 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-clean/include/linux/swap.h linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/swap.h
--- linux-2.6.23-rc1-mm2-clean/include/linux/swap.h 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/swap.h 2007-08-08 17:01:12.000000000 +0100
@@ -189,7 +189,7 @@ extern int rotate_reclaimable_page(struc
extern void swap_setup(void);

/* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **zones, int order,
+extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-clean/mm/page_alloc.c linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc1-mm2-clean/mm/page_alloc.c 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/page_alloc.c 2007-08-08 17:01:12.000000000 +0100
@@ -1644,7 +1644,7 @@ nofail_alloc:
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;

- did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);
+ did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);

p->reclaim_state = NULL;
p->flags &= ~PF_MEMALLOC;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-clean/mm/vmscan.c linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/vmscan.c
--- linux-2.6.23-rc1-mm2-clean/mm/vmscan.c 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/vmscan.c 2007-08-09 10:18:59.000000000 +0100
@@ -1086,10 +1086,11 @@ static unsigned long shrink_zone(int pri
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
*/
-static unsigned long shrink_zones(int priority, struct zone **zones,
+static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
struct scan_control *sc)
{
unsigned long nr_reclaimed = 0;
+ struct zone **zones = zonelist->zones;
int i;

sc->all_unreclaimable = 1;
@@ -1127,7 +1128,8 @@ static unsigned long shrink_zones(int pr
* holds filesystem locks which prevent writeout this might not work, and the
* allocation attempt will fail.
*/
-unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+ gfp_t gfp_mask)
{
int priority;
int ret = 0;
@@ -1135,6 +1137,7 @@ unsigned long try_to_free_pages(struct z
unsigned long nr_reclaimed = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long lru_pages = 0;
+ struct zone **zones = zonelist->zones;
int i;
struct scan_control sc = {
.gfp_mask = gfp_mask,
@@ -1162,7 +1165,7 @@ unsigned long try_to_free_pages(struct z
sc.nr_scanned = 0;
if (!priority)
disable_swap_token();
- nr_reclaimed += shrink_zones(priority, zones, &sc);
+ nr_reclaimed += shrink_zones(priority, zonelist, &sc);
shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
if (reclaim_state) {
nr_reclaimed += reclaim_state->reclaimed_slab;

2007-08-09 21:07:30

by Mel Gorman

Subject: [PATCH 2/4] Use one zonelist that is filtered instead of multiple zonelists


Currently a node has a number of zonelists, one for each zone type in
the system. Based on the zones allowed by a gfp mask, one of these zonelists
is selected. All of these zonelists occupy memory and consume cache lines.

This patch replaces the multiple zonelists in the node with a single
zonelist that contains all populated zones in the system. An iterator
macro called for_each_zone_zonelist() is introduced that iterates through
each zone in the zonelist that is allowed by the GFP flags.
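
For illustration, a typical conversion distilled from the hunks below
looks like this sketch (nid and gfp_mask are placeholders):

	struct zonelist *zonelist = node_zonelist(nid);
	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
	struct zone *zone, **z;

	/* visits only zones at or below the index allowed by gfp_mask */
	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		/* ... use zone ... */
	}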

Signed-off-by: Mel Gorman <[email protected]>
---

 arch/parisc/mm/init.c     |   11 ++-
 drivers/char/sysrq.c      |    3
 fs/buffer.c               |    9 +-
 include/linux/gfp.h       |    3
 include/linux/mempolicy.h |    2
 include/linux/mmzone.h    |   39 +++++++++++
 mm/hugetlb.c              |    8 +-
 mm/mempolicy.c            |    6 -
 mm/oom_kill.c             |    8 +-
 mm/page_alloc.c           |  139 +++++++++++++++++------------------------
 mm/slab.c                 |   11 +--
 mm/slub.c                 |   11 +--
 mm/vmscan.c               |   21 ++----
 13 files changed, 146 insertions(+), 125 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/arch/parisc/mm/init.c linux-2.6.23-rc1-mm2-010_use_zonelist/arch/parisc/mm/init.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/arch/parisc/mm/init.c 2007-07-22 21:41:00.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/arch/parisc/mm/init.c 2007-08-09 14:58:42.000000000 +0100
@@ -599,15 +599,18 @@ void show_mem(void)
#ifdef CONFIG_DISCONTIGMEM
{
struct zonelist *zl;
- int i, j, k;
+ int i, j;

for (i = 0; i < npmem_ranges; i++) {
+ zl = node_zonelist(i);
for (j = 0; j < MAX_NR_ZONES; j++) {
- zl = NODE_DATA(i)->node_zonelists + j;
+ struct zone **z;
+ struct zone *zone;

printk("Zone list for zone %d on node %d: ", j, i);
- for (k = 0; zl->zones[k] != NULL; k++)
- printk("[%ld/%s] ", zone_to_nid(zl->zones[k]), zl->zones[k]->name);
+ for_each_zone_zonelist(zone, z, zl, j)
+ printk(KERN_INFO "[%ld/%s] ",
+ zone_to_nid(zone), zone->name);
printk("\n");
}
}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/drivers/char/sysrq.c linux-2.6.23-rc1-mm2-010_use_zonelist/drivers/char/sysrq.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/drivers/char/sysrq.c 2007-08-07 14:45:09.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/drivers/char/sysrq.c 2007-08-09 14:59:06.000000000 +0100
@@ -270,8 +270,7 @@ static struct sysrq_key_op sysrq_term_op

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL],
- GFP_KERNEL, 0);
+ out_of_memory(node_zonelist(0), GFP_KERNEL, 0);
}

static DECLARE_WORK(moom_work, moom_callback);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/fs/buffer.c linux-2.6.23-rc1-mm2-010_use_zonelist/fs/buffer.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/fs/buffer.c 2007-08-07 14:45:10.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/fs/buffer.c 2007-08-09 15:11:18.000000000 +0100
@@ -356,15 +356,16 @@ void invalidate_bdev(struct block_device
static void free_more_memory(void)
{
struct zone **zones;
- pg_data_t *pgdat;
+ int nid;

wakeup_pdflush(1024);
yield();

- for_each_online_pgdat(pgdat) {
- zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones;
+ for_each_online_node(nid) {
+ zones = first_zones_zonelist(node_zonelist(nid),
+ gfp_zone(GFP_NOFS));
if (*zones)
- try_to_free_pages(zones, 0, GFP_NOFS);
+ try_to_free_pages(node_zonelist(nid), 0, GFP_NOFS);
}
}

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/gfp.h linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/gfp.h
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/gfp.h 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/gfp.h 2007-08-09 15:01:10.000000000 +0100
@@ -175,8 +175,7 @@ static inline struct page *alloc_pages_n
if (nid < 0)
nid = numa_node_id();

- return __alloc_pages(gfp_mask, order,
- NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
+ return __alloc_pages(gfp_mask, order, node_zonelist(nid));
}

#ifdef CONFIG_NUMA
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/mempolicy.h linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mempolicy.h
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/mempolicy.h 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mempolicy.h 2007-08-09 15:01:36.000000000 +0100
@@ -240,7 +240,7 @@ static inline void mpol_fix_fork_child_f
static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags)
{
- return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
+ return node_zonelist(0);
}

static inline int do_migrate_pages(struct mm_struct *mm,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/mmzone.h linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mmzone.h
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/mmzone.h 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mmzone.h 2007-08-09 15:08:15.000000000 +0100
@@ -469,7 +469,7 @@ extern struct page *mem_map;
struct bootmem_data;
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
- struct zonelist node_zonelists[MAX_NR_ZONES];
+ struct zonelist node_zonelist;
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP
struct page *node_mem_map;
@@ -670,6 +670,43 @@ extern struct zone *next_zone(struct zon
zone; \
zone = next_zone(zone))

+/* Return the zonelist belonging to a node of a given ID */
+static inline struct zonelist *node_zonelist(int nid)
+{
+ return &NODE_DATA(nid)->node_zonelist;
+}
+
+/* Returns the first zone at or below highest_zoneidx in a zonelist */
+static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
+ enum zone_type highest_zoneidx)
+{
+ struct zone **z;
+ for (z = zonelist->zones; *z && zone_idx(*z) > highest_zoneidx; z++);
+ return z;
+}
+
+/* Returns the next zone at or below highest_zoneidx in a zonelist */
+static inline struct zone **next_zones_zonelist(struct zone **z,
+ enum zone_type highest_zoneidx)
+{
+ for (++z; *z && zone_idx(*z) > highest_zoneidx; z++);
+ return z;
+}
+
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone: The current zone in the iterator
+ * @z: The current pointer within zonelist->zones being iterated
+ * @zlist: The zonelist being iterated
+ * @highidx: The zone index of the highest zone to return
+ *
+ * This iterator iterates through all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+ for (z = first_zones_zonelist(zlist, highidx), zone = *z; \
+ zone; \
+ z = next_zones_zonelist(z, highidx), zone = *z)
+
#ifdef CONFIG_SPARSEMEM
#include <asm/sparsemem.h>
#endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/hugetlb.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/hugetlb.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/hugetlb.c 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/hugetlb.c 2007-08-09 14:53:35.000000000 +0100
@@ -73,11 +73,11 @@ static struct page *dequeue_huge_page(st
struct page *page = NULL;
struct zonelist *zonelist = huge_zonelist(vma, address,
htlb_alloc_mask);
- struct zone **z;
+ struct zone *zone, **z;

- for (z = zonelist->zones; *z; z++) {
- nid = zone_to_nid(*z);
- if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
+ for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
+ nid = zone_to_nid(zone);
+ if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
!list_empty(&hugepage_freelists[nid])) {
page = list_entry(hugepage_freelists[nid].next,
struct page, lru);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/mempolicy.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/mempolicy.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/mempolicy.c 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/mempolicy.c 2007-08-09 18:34:00.000000000 +0100
@@ -1120,7 +1120,7 @@ static struct zonelist *zonelist_policy(
nd = 0;
BUG();
}
- return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
+ return node_zonelist(nd);
}

/* Do dynamic interleaving for a process */
@@ -1216,7 +1216,7 @@ struct zonelist *huge_zonelist(struct vm
unsigned nid;

nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
- return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
+ return node_zonelist(nid);
}
return zonelist_policy(GFP_HIGHUSER, pol);
}
@@ -1230,7 +1230,7 @@ static struct page *alloc_page_interleav
struct zonelist *zl;
struct page *page;

- zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp);
+ zl = node_zonelist(nid);
page = __alloc_pages(gfp, order, zl);
if (page && page_zone(page) == zl->zones[0])
inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/oom_kill.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/oom_kill.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/oom_kill.c 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/oom_kill.c 2007-08-09 10:19:05.000000000 +0100
@@ -177,8 +177,10 @@ static inline int constrained_alloc(stru
{
#ifdef CONFIG_NUMA
struct zone **z;
+ struct zone *zone;
nodemask_t nodes;
int node;
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);

nodes_clear(nodes);
/* node has memory ? */
@@ -186,9 +188,9 @@ static inline int constrained_alloc(stru
if (NODE_DATA(node)->node_present_pages)
node_set(node, nodes);

- for (z = zonelist->zones; *z; z++)
- if (cpuset_zone_allowed_softwall(*z, gfp_mask))
- node_clear(zone_to_nid(*z), nodes);
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+ if (cpuset_zone_allowed_softwall(zone, gfp_mask))
+ node_clear(zone_to_nid(zone), nodes);
else
return CONSTRAINT_CPUSET;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/page_alloc.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/page_alloc.c 2007-08-08 17:01:12.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/page_alloc.c 2007-08-09 15:04:19.000000000 +0100
@@ -1416,6 +1416,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
struct page *page = NULL;
int classzone_idx = zone_idx(zonelist->zones[0]);
struct zone *zone;
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
@@ -1425,13 +1426,10 @@ zonelist_scan:
* Scan zonelist, looking for a zone with enough free.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
- z = zonelist->zones;
-
- do {
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
if (NUMA_BUILD && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
- zone = *z;
if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
break;
@@ -1468,7 +1466,7 @@ try_next_zone:
zlc_active = 1;
did_zlc_setup = 1;
}
- } while (*(++z) != NULL);
+ }

if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
/* Disable zlc cache for second zonelist scan */
@@ -1781,15 +1779,15 @@ EXPORT_SYMBOL(free_pages);

static unsigned int nr_free_zone_pages(int offset)
{
+ enum zone_type high_zoneidx = MAX_NR_ZONES - 1;
+ struct zone **z;
+ struct zone *zone;
+
/* Just pick one node, since fallback list is circular */
- pg_data_t *pgdat = NODE_DATA(numa_node_id());
unsigned int sum = 0;
+ struct zonelist *zonelist = node_zonelist(numa_node_id());

- struct zonelist *zonelist = pgdat->node_zonelists + offset;
- struct zone **zonep = zonelist->zones;
- struct zone *zone;
-
- for (zone = *zonep++; zone; zone = *zonep++) {
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
unsigned long size = zone->present_pages;
unsigned long high = zone->pages_high;
if (size > high)
@@ -2148,17 +2146,14 @@ static int find_next_best_node(int node,
*/
static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
{
- enum zone_type i;
int j;
struct zonelist *zonelist;

- for (i = 0; i < MAX_NR_ZONES; i++) {
- zonelist = pgdat->node_zonelists + i;
- for (j = 0; zonelist->zones[j] != NULL; j++)
- ;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
- zonelist->zones[j] = NULL;
- }
+ zonelist = &pgdat->node_zonelist;
+ for (j = 0; zonelist->zones[j] != NULL; j++)
+ ;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j, MAX_NR_ZONES-1);
+ zonelist->zones[j] = NULL;
}

/*
@@ -2171,27 +2166,24 @@ static int node_order[MAX_NUMNODES];

static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
{
- enum zone_type i;
int pos, j, node;
int zone_type; /* needs to be signed */
struct zone *z;
struct zonelist *zonelist;

- for (i = 0; i < MAX_NR_ZONES; i++) {
- zonelist = pgdat->node_zonelists + i;
- pos = 0;
- for (zone_type = i; zone_type >= 0; zone_type--) {
- for (j = 0; j < nr_nodes; j++) {
- node = node_order[j];
- z = &NODE_DATA(node)->node_zones[zone_type];
- if (populated_zone(z)) {
- zonelist->zones[pos++] = z;
- check_highest_zone(zone_type);
- }
+ zonelist = &pgdat->node_zonelist;
+ pos = 0;
+ for (zone_type = MAX_NR_ZONES-1; zone_type >= 0; zone_type--) {
+ for (j = 0; j < nr_nodes; j++) {
+ node = node_order[j];
+ z = &NODE_DATA(node)->node_zones[zone_type];
+ if (populated_zone(z)) {
+ zonelist->zones[pos++] = z;
+ check_highest_zone(zone_type);
}
}
- zonelist->zones[pos] = NULL;
}
+ zonelist->zones[pos] = NULL;
}

static int default_zonelist_order(void)
@@ -2258,17 +2250,14 @@ static void set_zonelist_order(void)
static void build_zonelists(pg_data_t *pgdat)
{
int j, node, load;
- enum zone_type i;
nodemask_t used_mask;
int local_node, prev_node;
struct zonelist *zonelist;
int order = current_zonelist_order;

- /* initialize zonelists */
- for (i = 0; i < MAX_NR_ZONES; i++) {
- zonelist = pgdat->node_zonelists + i;
- zonelist->zones[0] = NULL;
- }
+ /* initialize zonelist */
+ zonelist = &pgdat->node_zonelist;
+ zonelist->zones[0] = NULL;

/* NUMA-aware ordering of nodes */
local_node = pgdat->node_id;
@@ -2316,18 +2305,15 @@ static void build_zonelists(pg_data_t *p
static void build_zonelist_cache(pg_data_t *pgdat)
{
int i;
+ struct zonelist *zonelist;
+ struct zonelist_cache *zlc;
+ struct zone **z;

- for (i = 0; i < MAX_NR_ZONES; i++) {
- struct zonelist *zonelist;
- struct zonelist_cache *zlc;
- struct zone **z;
-
- zonelist = pgdat->node_zonelists + i;
- zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
- for (z = zonelist->zones; *z; z++)
- zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
- }
+ zonelist = &pgdat->node_zonelist;
+ zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
+ bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+ for (z = zonelist->zones; *z; z++)
+ zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
}


@@ -2341,45 +2327,42 @@ static void set_zonelist_order(void)
static void build_zonelists(pg_data_t *pgdat)
{
int node, local_node;
- enum zone_type i,j;
+ enum zone_type j;
+ struct zonelist *zonelist;

local_node = pgdat->node_id;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- struct zonelist *zonelist;
-
- zonelist = pgdat->node_zonelists + i;

- j = build_zonelists_node(pgdat, zonelist, 0, i);
- /*
- * Now we build the zonelist so that it contains the zones
- * of all the other nodes.
- * We don't want to pressure a particular node, so when
- * building the zones for node N, we make sure that the
- * zones coming right after the local ones are those from
- * node N+1 (modulo N)
- */
- for (node = local_node + 1; node < MAX_NUMNODES; node++) {
- if (!node_online(node))
- continue;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
- }
- for (node = 0; node < local_node; node++) {
- if (!node_online(node))
- continue;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
- }
+ zonelist = &pgdat->node_zonelist;
+ j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES-1);

- zonelist->zones[j] = NULL;
+ /*
+ * Now we build the zonelist so that it contains the zones
+ * of all the other nodes.
+ * We don't want to pressure a particular node, so when
+ * building the zones for node N, we make sure that the
+ * zones coming right after the local ones are those from
+ * node N+1 (modulo N)
+ */
+ for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+ if (!node_online(node))
+ continue;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+ MAX_NR_ZONES-1);
+ }
+ for (node = 0; node < local_node; node++) {
+ if (!node_online(node))
+ continue;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+ MAX_NR_ZONES-1);
}
+
+ zonelist->zones[j] = NULL;
}

/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
static void build_zonelist_cache(pg_data_t *pgdat)
{
- int i;
-
- for (i = 0; i < MAX_NR_ZONES; i++)
- pgdat->node_zonelists[i].zlcache_ptr = NULL;
+ pgdat->node_zonelist.zlcache_ptr = NULL;
}

#endif /* CONFIG_NUMA */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/slab.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/slab.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/slab.c 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/slab.c 2007-08-09 15:04:45.000000000 +0100
@@ -3239,14 +3239,15 @@ static void *fallback_alloc(struct kmem_
struct zonelist *zonelist;
gfp_t local_flags;
struct zone **z;
+ struct zone *zone;
+ enum zone_type high_zoneidx = gfp_zone(flags);
void *obj = NULL;
int nid;

if (flags & __GFP_THISNODE)
return NULL;

- zonelist = &NODE_DATA(slab_node(current->mempolicy))
- ->node_zonelists[gfp_zone(flags)];
+ zonelist = node_zonelist(slab_node(current->mempolicy));
local_flags = (flags & GFP_LEVEL_MASK);

retry:
@@ -3254,10 +3255,10 @@ retry:
* Look through allowed nodes for objects available
* from existing per node queues.
*/
- for (z = zonelist->zones; *z && !obj; z++) {
- nid = zone_to_nid(*z);
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+ nid = zone_to_nid(zone);

- if (cpuset_zone_allowed_hardwall(*z, flags) &&
+ if (cpuset_zone_allowed_hardwall(zone, flags) &&
cache->nodelists[nid] &&
cache->nodelists[nid]->free_objects)
obj = ____cache_alloc_node(cache,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/slub.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/slub.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/slub.c 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/slub.c 2007-08-09 15:05:09.000000000 +0100
@@ -1285,6 +1285,8 @@ static struct page *get_any_partial(stru
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
struct zone **z;
+ struct zone *zone;
+ enum zone_type high_zoneidx = gfp_zone(flags);
struct page *page;

/*
@@ -1308,14 +1310,13 @@ static struct page *get_any_partial(stru
if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
return NULL;

- zonelist = &NODE_DATA(slab_node(current->mempolicy))
- ->node_zonelists[gfp_zone(flags)];
- for (z = zonelist->zones; *z; z++) {
+ zonelist = node_zonelist(slab_node(current->mempolicy));
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
struct kmem_cache_node *n;

- n = get_node(s, zone_to_nid(*z));
+ n = get_node(s, zone_to_nid(zone));

- if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
+ if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
n->nr_partial > MIN_PARTIAL) {
page = get_partial_node(n);
if (page)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/vmscan.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/vmscan.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/vmscan.c 2007-08-09 10:18:59.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/vmscan.c 2007-08-09 14:46:57.000000000 +0100
@@ -1090,13 +1090,11 @@ static unsigned long shrink_zones(int pr
struct scan_control *sc)
{
unsigned long nr_reclaimed = 0;
- struct zone **zones = zonelist->zones;
- int i;
+ struct zone **z;
+ struct zone *zone;

sc->all_unreclaimable = 1;
- for (i = 0; zones[i] != NULL; i++) {
- struct zone *zone = zones[i];
-
+ for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
if (!populated_zone(zone))
continue;

@@ -1137,8 +1135,9 @@ unsigned long try_to_free_pages(struct z
unsigned long nr_reclaimed = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long lru_pages = 0;
- struct zone **zones = zonelist->zones;
- int i;
+ struct zone **z;
+ struct zone *zone;
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
struct scan_control sc = {
.gfp_mask = gfp_mask,
.may_writepage = !laptop_mode,
@@ -1151,9 +1150,7 @@ unsigned long try_to_free_pages(struct z
delay_swap_prefetch();
count_vm_event(ALLOCSTALL);

- for (i = 0; zones[i] != NULL; i++) {
- struct zone *zone = zones[i];
-
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;

@@ -1207,9 +1204,7 @@ out:
*/
if (priority < 0)
priority = 0;
- for (i = 0; zones[i] != 0; i++) {
- struct zone *zone = zones[i];
-
+ for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;

2007-08-09 21:07:52

by Mel Gorman

Subject: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer


Using one zonelist per node requires very frequent use of zone_idx(). This
is costly as it involves a lookup of another structure and a subtraction
operation. struct zone is aligned on a node-interleave boundary, so zone
pointers have plenty of 0s in the least significant bits of the address.

This patch embeds the zone_id of a zone in the zonelist->zones pointers.
The real zone pointer is found using the zonelist_zone() helper function.
The ID of the zone is found using zonelist_zone_idx(). To avoid accidental
references, the zones field is renamed to _zones.
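
As a worked example (an illustration only, not part of the patch): with,
say, ZONES_SHIFT == 2, a struct zone pointer has at least two zero low
bits available, so encoding and decoding reduce to:

	unsigned long mask = (1UL << ZONES_SHIFT) - 1;	/* 0x3 here */
	struct zone *encoded;

	/* encode: fold the zone index into the aligned pointer */
	encoded = (struct zone *)((unsigned long)zone | zone_idx(zone));

	/* decode the real pointer and the zone index again */
	zone = (struct zone *)((unsigned long)encoded & ~mask);
	idx = (unsigned long)encoded & mask;
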
Signed-off-by: Mel Gorman <[email protected]>
---

 include/linux/mmzone.h |   59 ++++++++++++++++++++++++++++++++++++++++----
 mm/mempolicy.c         |   22 ++++++++--------
 mm/page_alloc.c        |   36 +++++++++++++-------------
 mm/vmstat.c            |    4 +-
 4 files changed, 85 insertions(+), 36 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mmzone.h linux-2.6.23-rc1-mm2-015_zoneid_zonelist/include/linux/mmzone.h
--- linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mmzone.h 2007-08-09 15:08:15.000000000 +0100
+++ linux-2.6.23-rc1-mm2-015_zoneid_zonelist/include/linux/mmzone.h 2007-08-09 18:00:41.000000000 +0100
@@ -436,7 +436,9 @@ struct zonelist_cache;

struct zonelist {
struct zonelist_cache *zlcache_ptr; // NULL or &zlcache
- struct zone *zones[MAX_ZONES_PER_ZONELIST + 1]; // NULL delimited
+ struct zone *_zones[MAX_ZONES_PER_ZONELIST + 1]; // NULL delimited
+ // Access with
+ // zonelist_zone()
#ifdef CONFIG_NUMA
struct zonelist_cache zlcache; // optional ...
#endif
@@ -676,12 +678,57 @@ static inline struct zonelist *node_zone
return &NODE_DATA(nid)->node_zonelist;
}

+#if defined(CONFIG_SMP) && INTERNODE_CACHE_SHIFT > ZONES_SHIFT
+
+/* Similar to ZONES_MASK but is not available in this context */
+#define ZONELIST_ZONEIDX_MASK ((1UL << ZONES_SHIFT) - 1)
+
+/* zone_id is small enough to fit at bottom of zone pointer in zonelist */
+static inline struct zone *zonelist_zone(struct zone *zone)
+{
+ return (struct zone *)((unsigned long)zone & ~ZONELIST_ZONEIDX_MASK);
+}
+
+
+static inline int zonelist_zone_idx(struct zone *zone)
+{
+ /* ZONES_MASK not available in this context */
+ return (unsigned long)zone & ZONELIST_ZONEIDX_MASK;
+}
+
+static inline struct zone *encode_zone_idx(struct zone *zone)
+{
+ struct zone *encoded;
+
+ encoded = (struct zone *)((unsigned long)zone | zone_idx(zone));
+ BUG_ON(zonelist_zone(encoded) != zone);
+ return encoded;
+}
+#else
+static inline struct zone *zonelist_zone(struct zone *zone)
+{
+ return zone;
+}
+
+static inline int zonelist_zone_idx(struct zone *zone)
+{
+ return zone_idx(zone);
+}
+
+static inline struct zone *encode_zone_idx(struct zone *zone)
+{
+ return zone;
+}
+#endif
+
/* Returns the first zone at or below highest_zoneidx in a zonelist */
static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
enum zone_type highest_zoneidx)
{
struct zone **z;
- for (z = zonelist->zones; *z && zone_idx(*z) > highest_zoneidx; z++);
+ for (z = zonelist->_zones;
+ *z && zonelist_zone_idx(*z) > highest_zoneidx;
+ z++);
return z;
}

@@ -689,7 +736,9 @@ static inline struct zone **first_zones_
static inline struct zone **next_zones_zonelist(struct zone **z,
enum zone_type highest_zoneidx)
{
- for (++z; *z && zone_idx(*z) > highest_zoneidx; z++);
+ for (++z;
+ *z && zonelist_zone_idx(*z) > highest_zoneidx;
+ z++);
return z;
}

@@ -703,9 +752,9 @@ static inline struct zone **next_zones_z
 * This iterator iterates through all zones at or below a given zone index.
*/
#define for_each_zone_zonelist(zone, z, zlist, highidx) \
- for (z = first_zones_zonelist(zlist, highidx), zone = *z; \
+ for (z = first_zones_zonelist(zlist, highidx), zone = zonelist_zone(*z); \
zone; \
- z = next_zones_zonelist(z, highidx), zone = *z)
+ z = next_zones_zonelist(z, highidx), zone = zonelist_zone(*z))

#ifdef CONFIG_SPARSEMEM
#include <asm/sparsemem.h>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-010_use_zonelist/mm/mempolicy.c linux-2.6.23-rc1-mm2-015_zoneid_zonelist/mm/mempolicy.c
--- linux-2.6.23-rc1-mm2-010_use_zonelist/mm/mempolicy.c 2007-08-09 18:34:00.000000000 +0100
+++ linux-2.6.23-rc1-mm2-015_zoneid_zonelist/mm/mempolicy.c 2007-08-09 18:33:51.000000000 +0100
@@ -156,7 +156,7 @@ static struct zonelist *bind_zonelist(no
for_each_node_mask(nd, *nodes) {
struct zone *z = &NODE_DATA(nd)->node_zones[k];
if (z->present_pages > 0)
- zl->zones[num++] = z;
+ zl->_zones[num++] = encode_zone_idx(z);
}
if (k == 0)
break;
@@ -166,7 +166,7 @@ static struct zonelist *bind_zonelist(no
kfree(zl);
return ERR_PTR(-EINVAL);
}
- zl->zones[num] = NULL;
+ zl->_zones[num] = NULL;
return zl;
}

@@ -486,8 +486,8 @@ static void get_zonemask(struct mempolic
nodes_clear(*nodes);
switch (p->policy) {
case MPOL_BIND:
- for (i = 0; p->v.zonelist->zones[i]; i++)
- node_set(zone_to_nid(p->v.zonelist->zones[i]),
+ for (i = 0; p->v.zonelist->_zones[i]; i++)
+ node_set(zone_to_nid(zonelist_zone(p->v.zonelist->_zones[i])),
*nodes);
break;
case MPOL_DEFAULT:
@@ -1154,7 +1154,7 @@ unsigned slab_node(struct mempolicy *pol
* Follow bind policy behavior and start allocation at the
* first node.
*/
- return zone_to_nid(policy->v.zonelist->zones[0]);
+ return zone_to_nid(zonelist_zone(policy->v.zonelist->_zones[0]));

case MPOL_PREFERRED:
if (policy->v.preferred_node >= 0)
@@ -1232,7 +1232,7 @@ static struct page *alloc_page_interleav

zl = node_zonelist(nid);
page = __alloc_pages(gfp, order, zl);
- if (page && page_zone(page) == zl->zones[0])
+ if (page && page_zone(page) == zonelist_zone(zl->_zones[0]))
inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
return page;
}
@@ -1356,10 +1356,10 @@ int __mpol_equal(struct mempolicy *a, st
return a->v.preferred_node == b->v.preferred_node;
case MPOL_BIND: {
int i;
- for (i = 0; a->v.zonelist->zones[i]; i++)
- if (a->v.zonelist->zones[i] != b->v.zonelist->zones[i])
+ for (i = 0; a->v.zonelist->_zones[i]; i++)
+ if (zonelist_zone(a->v.zonelist->_zones[i]) != zonelist_zone(b->v.zonelist->_zones[i]))
return 0;
- return b->v.zonelist->zones[i] == NULL;
+ return b->v.zonelist->_zones[i] == NULL;
}
default:
BUG();
@@ -1682,8 +1682,8 @@ static void mpol_rebind_policy(struct me
struct zonelist *zonelist;

nodes_clear(nodes);
- for (z = pol->v.zonelist->zones; *z; z++)
- node_set(zone_to_nid(*z), nodes);
+ for (z = pol->v.zonelist->_zones; *z; z++)
+ node_set(zone_to_nid(zonelist_zone(*z)), nodes);
nodes_remap(tmp, nodes, *mpolmask, *newmask);
nodes = tmp;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-010_use_zonelist/mm/page_alloc.c linux-2.6.23-rc1-mm2-015_zoneid_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc1-mm2-010_use_zonelist/mm/page_alloc.c 2007-08-09 15:04:19.000000000 +0100
+++ linux-2.6.23-rc1-mm2-015_zoneid_zonelist/mm/page_alloc.c 2007-08-09 15:52:12.000000000 +0100
@@ -1360,7 +1360,7 @@ static int zlc_zone_worth_trying(struct
if (!zlc)
return 1;

- i = z - zonelist->zones;
+ i = z - zonelist->_zones;
n = zlc->z_to_n[i];

/* This zone is worth trying if it is allowed but not full */
@@ -1381,7 +1381,7 @@ static void zlc_mark_zone_full(struct zo
if (!zlc)
return;

- i = z - zonelist->zones;
+ i = z - zonelist->_zones;

set_bit(i, zlc->fullzones);
}
@@ -1414,7 +1414,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
{
struct zone **z;
struct page *page = NULL;
- int classzone_idx = zone_idx(zonelist->zones[0]);
+ int classzone_idx = zonelist_zone_idx(zonelist->_zones[0]);
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
@@ -1431,7 +1431,7 @@ zonelist_scan:
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
- zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
+ zone->zone_pgdat != zonelist_zone(zonelist->_zones[0])->zone_pgdat))
break;
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
@@ -1554,9 +1554,9 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
return NULL;

restart:
- z = zonelist->zones; /* the list of zones suitable for gfp_mask */
+ z = zonelist->_zones; /* the list of zones suitable for gfp_mask */

- if (unlikely(*z == NULL)) {
+ if (unlikely(zonelist_zone(*z) == NULL)) {
/* Should this ever happen?? */
return NULL;
}
@@ -1577,8 +1577,8 @@ restart:
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;

- for (z = zonelist->zones; *z; z++)
- wakeup_kswapd(*z, order);
+ for (z = zonelist->_zones; *z; z++)
+ wakeup_kswapd(zonelist_zone(*z), order);

/*
* OK, we're below the kswapd watermark and have kicked background
@@ -1974,7 +1974,7 @@ static int build_zonelists_node(pg_data_
zone_type--;
zone = pgdat->node_zones + zone_type;
if (populated_zone(zone)) {
- zonelist->zones[nr_zones++] = zone;
+ zonelist->_zones[nr_zones++] = encode_zone_idx(zone);
check_highest_zone(zone_type);
}

@@ -2150,10 +2150,10 @@ static void build_zonelists_in_node_orde
struct zonelist *zonelist;

zonelist = &pgdat->node_zonelist;
- for (j = 0; zonelist->zones[j] != NULL; j++)
+ for (j = 0; zonelist->_zones[j] != NULL; j++)
;
j = build_zonelists_node(NODE_DATA(node), zonelist, j, MAX_NR_ZONES-1);
- zonelist->zones[j] = NULL;
+ zonelist->_zones[j] = NULL;
}

/*
@@ -2178,12 +2178,12 @@ static void build_zonelists_in_zone_orde
node = node_order[j];
z = &NODE_DATA(node)->node_zones[zone_type];
if (populated_zone(z)) {
- zonelist->zones[pos++] = z;
+ zonelist->_zones[pos++] = encode_zone_idx(z);
check_highest_zone(zone_type);
}
}
}
- zonelist->zones[pos] = NULL;
+ zonelist->_zones[pos] = NULL;
}

static int default_zonelist_order(void)
@@ -2257,7 +2257,7 @@ static void build_zonelists(pg_data_t *p

/* initialize zonelist */
zonelist = &pgdat->node_zonelist;
- zonelist->zones[0] = NULL;
+ zonelist->_zones[0] = NULL;

/* NUMA-aware ordering of nodes */
local_node = pgdat->node_id;
@@ -2304,7 +2304,6 @@ static void build_zonelists(pg_data_t *p
/* Construct the zonelist performance cache - see further mmzone.h */
static void build_zonelist_cache(pg_data_t *pgdat)
{
- int i;
struct zonelist *zonelist;
struct zonelist_cache *zlc;
struct zone **z;
@@ -2312,8 +2311,9 @@ static void build_zonelist_cache(pg_data
zonelist = &pgdat->node_zonelist;
zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
- for (z = zonelist->zones; *z; z++)
- zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
+ for (z = zonelist->_zones; *z; z++)
+ zlc->z_to_n[z - zonelist->_zones] =
+ zone_to_nid(zonelist_zone(*z));
}


@@ -2356,7 +2356,7 @@ static void build_zonelists(pg_data_t *p
MAX_NR_ZONES-1);
}

- zonelist->zones[j] = NULL;
+ zonelist->_zones[j] = NULL;
}

/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-010_use_zonelist/mm/vmstat.c linux-2.6.23-rc1-mm2-015_zoneid_zonelist/mm/vmstat.c
--- linux-2.6.23-rc1-mm2-010_use_zonelist/mm/vmstat.c 2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-015_zoneid_zonelist/mm/vmstat.c 2007-08-09 15:52:12.000000000 +0100
@@ -365,11 +365,11 @@ void refresh_cpu_vm_stats(int cpu)
*/
void zone_statistics(struct zonelist *zonelist, struct zone *z)
{
- if (z->zone_pgdat == zonelist->zones[0]->zone_pgdat) {
+ if (z->zone_pgdat == zonelist_zone(zonelist->_zones[0])->zone_pgdat) {
__inc_zone_state(z, NUMA_HIT);
} else {
__inc_zone_state(z, NUMA_MISS);
- __inc_zone_state(zonelist->zones[0], NUMA_FOREIGN);
+ __inc_zone_state(zonelist_zone(zonelist->_zones[0]), NUMA_FOREIGN);
}
if (z->node == numa_node_id())
__inc_zone_state(z, NUMA_LOCAL);

2007-08-09 21:08:30

by Mel Gorman

Subject: [PATCH 4/4] Apply MPOL_BIND policy to two highest zones when highest is ZONE_MOVABLE


The NUMA layer only supports the MPOL_BIND NUMA policy for the highest
zone. When ZONE_MOVABLE is configured with kernelcore=, the highest
zone becomes ZONE_MOVABLE. The result is that the bind policy is only
applied to allocations, such as anonymous pages and page cache, that are
allocated from ZONE_MOVABLE when that zone is used.

This patch applies policies to the two highest zones when the highest zone
is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
zone, these two zones are equivalent in policy terms. For example, when
ZONE_MOVABLE is carved out of ZONE_HIGHMEM, policy_zone remains
ZONE_HIGHMEM and MPOL_BIND covers allocations from both zones.

Signed-off-by: Mel Gorman <[email protected]>
---

 include/linux/mempolicy.h |    2 +-
 mm/mempolicy.c            |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-015_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.23-rc1-mm2-020_treat_movable_highest/include/linux/mempolicy.h
--- linux-2.6.23-rc1-mm2-015_zoneid_zonelist/include/linux/mempolicy.h 2007-08-09 15:01:36.000000000 +0100
+++ linux-2.6.23-rc1-mm2-020_treat_movable_highest/include/linux/mempolicy.h 2007-08-09 15:52:23.000000000 +0100
@@ -157,7 +157,7 @@ extern enum zone_type policy_zone;

static inline void check_highest_zone(enum zone_type k)
{
- if (k > policy_zone)
+ if (k > policy_zone && k != ZONE_MOVABLE)
policy_zone = k;
}

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-015_zoneid_zonelist/mm/mempolicy.c linux-2.6.23-rc1-mm2-020_treat_movable_highest/mm/mempolicy.c
--- linux-2.6.23-rc1-mm2-015_zoneid_zonelist/mm/mempolicy.c 2007-08-09 18:33:51.000000000 +0100
+++ linux-2.6.23-rc1-mm2-020_treat_movable_highest/mm/mempolicy.c 2007-08-09 18:33:42.000000000 +0100
@@ -151,7 +151,7 @@ static struct zonelist *bind_zonelist(no
lower zones etc. Avoid empty zones because the memory allocator
doesn't like them. If you implement node hot removal you
have to fix that. */
- k = policy_zone;
+ k = MAX_NR_ZONES - 1;
while (1) {
for_each_node_mask(nd, *nodes) {
struct zone *z = &NODE_DATA(nd)->node_zones[k];

2007-08-09 21:19:17

by Christoph Lameter

Subject: Re: [PATCH 0/4] Use one zonelist per node instead of multiple zonelists v3

On Thu, 9 Aug 2007, Mel Gorman wrote:

> o Encode zone_id in the zonelist pointers to avoid zone_idx() (Christoph's idea)

That is addressed by this patch, it seems?

2007-08-09 21:28:41

by Christoph Lameter

Subject: Re: [PATCH 2/4] Use one zonelist that is filtered instead of multiple zonelists

Some of the loops over online nodes assume that memory is present on
those nodes. These will have to be updated at some point to loop only over
nodes with memory (memoryless nodes).
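
As a sketch of what such an update might look like (an illustration, not
code from these patches; it assumes the for_each_node_state()/N_HIGH_MEMORY
interface from the memoryless-node work, which may change before merging),
free_more_memory() from patch 2 would become:

	static void free_more_memory(void)
	{
		int nid;

		wakeup_pdflush(1024);
		yield();

		/* walk only nodes that actually have memory */
		for_each_node_state(nid, N_HIGH_MEMORY)
			try_to_free_pages(node_zonelist(nid), 0, GFP_NOFS);
	}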

Acked-by: Christoph Lameter <[email protected]>

2007-08-09 21:37:48

by Christoph Lameter

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Thu, 9 Aug 2007, Mel Gorman wrote:

> }
>
> +#if defined(CONFIG_SMP) && INTERNODE_CACHE_SHIFT > ZONES_SHIFT

Is this necessary? ZONES_SHIFT is always <= 2 so it will work with
any pointer. Why disable this for UP?

> --- linux-2.6.23-rc1-mm2-010_use_zonelist/mm/vmstat.c 2007-08-07 14:45:11.000000000 +0100
> +++ linux-2.6.23-rc1-mm2-015_zoneid_zonelist/mm/vmstat.c 2007-08-09 15:52:12.000000000 +0100
> @@ -365,11 +365,11 @@ void refresh_cpu_vm_stats(int cpu)
> */
> void zone_statistics(struct zonelist *zonelist, struct zone *z)
> {
> - if (z->zone_pgdat == zonelist->zones[0]->zone_pgdat) {
> + if (z->zone_pgdat == zonelist_zone(zonelist->_zones[0])->zone_pgdat) {
> __inc_zone_state(z, NUMA_HIT);
> } else {
> __inc_zone_state(z, NUMA_MISS);
> - __inc_zone_state(zonelist->zones[0], NUMA_FOREIGN);
> + __inc_zone_state(zonelist_zone(zonelist->_zones[0]), NUMA_FOREIGN);
> }
> if (z->node == numa_node_id())
> __inc_zone_state(z, NUMA_LOCAL);

Hmmmm. I hope the compiler does subexpression optimization on

zonelist_zone(zonelist->_zones[0])

Acked-by: Christoph Lameter <[email protected]>

2007-08-09 23:33:17

by mel

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On (09/08/07 14:37), Christoph Lameter didst pronounce:
> On Thu, 9 Aug 2007, Mel Gorman wrote:
>
> > }
> >
> > +#if defined(CONFIG_SMP) && INTERNODE_CACHE_SHIFT > ZONES_SHIFT
>
> Is this necessary? ZONES_SHIFT is always <= 2 so it will work with
> any pointer. Why disable this for UP?
>

Caution in case the number of zones increases. There was no guarantee of
zone alignment. It's the same reason I have a BUG_ON in the encode
function so that if we don't catch problems at compile-time, it'll go
BANG in a nice predictable fashion.

> > --- linux-2.6.23-rc1-mm2-010_use_zonelist/mm/vmstat.c 2007-08-07 14:45:11.000000000 +0100
> > +++ linux-2.6.23-rc1-mm2-015_zoneid_zonelist/mm/vmstat.c 2007-08-09 15:52:12.000000000 +0100
> > @@ -365,11 +365,11 @@ void refresh_cpu_vm_stats(int cpu)
> > */
> > void zone_statistics(struct zonelist *zonelist, struct zone *z)
> > {
> > - if (z->zone_pgdat == zonelist->zones[0]->zone_pgdat) {
> > + if (z->zone_pgdat == zonelist_zone(zonelist->_zones[0])->zone_pgdat) {
> > __inc_zone_state(z, NUMA_HIT);
> > } else {
> > __inc_zone_state(z, NUMA_MISS);
> > - __inc_zone_state(zonelist->zones[0], NUMA_FOREIGN);
> > + __inc_zone_state(zonelist_zone(zonelist->_zones[0]), NUMA_FOREIGN);
> > }
> > if (z->node == numa_node_id())
> > __inc_zone_state(z, NUMA_LOCAL);
>
> Hmmmm. I hope the compiler does subexpression optimization on
>
> zonelist_zone(zonelist->_zones[0])
>

I'll check

> Acked-by: Christoph Lameter <[email protected]>
>

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2007-08-10 01:44:19

by Christoph Lameter

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer


On Fri, 10 Aug 2007, Mel Gorman wrote:

> > > +#if defined(CONFIG_SMP) && INTERNODE_CACHE_SHIFT > ZONES_SHIFT
> >
> > Is this necessary? ZONES_SHIFT is always <= 2 so it will work with
> > any pointer. Why disable this for UP?
> >
>
> Caution in case the number of zones increases. There was no guarantee of
> zone alignment. It's the same reason I have a BUG_ON in the encode
> function so that if we don't catch problems at compile-time, it'll go
> BANG in a nice predictable fashion.

Caution would lead to a BUG_ON but why the #if? Why exclude UP?

2007-08-10 10:48:00

by mel

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On (09/08/07 18:44), Christoph Lameter didst pronounce:
>
> On Fri, 10 Aug 2007, Mel Gorman wrote:
>
> > > > +#if defined(CONFIG_SMP) && INTERNODE_CACHE_SHIFT > ZONES_SHIFT
> > >
> > > Is this necessary? ZONES_SHIFT is always <= 2 so it will work with
> > > any pointer. Why disable this for UP?
> > >
> >
> > Caution in case the number of zones increases. There was no guarantee of
> > zone alignment. It's the same reason I have a BUG_ON in the encode
> > function so that if we don't catch problems at compile-time, it'll go
> > BANG in a nice predictable fashion.
>
> Caution would lead to a BUG_ON but why the #if? Why exclude UP?

On x86_64 we would have ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM and
ZONE_MOVABLE. On SMP, that's more than two bits' worth and would fail at
runtime. Well, it should at least; I didn't actually try it out.

However, I accept that the SMP check is less than ideal. I considered
comparing it against MAX_NR_ZONES but as it's an enum, it can't be checked
at compile time. What else would make a better check?

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2007-08-10 17:37:21

by Christoph Lameter

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Fri, 10 Aug 2007, Mel Gorman wrote:

> On (09/08/07 18:44), Christoph Lameter didst pronounce:
> >
> > On Fri, 10 Aug 2007, Mel Gorman wrote:
> >
> > > > > +#if defined(CONFIG_SMP) && INTERNODE_CACHE_SHIFT > ZONES_SHIFT
> > > >
> > > > Is this necessary? ZONES_SHIFT is always <= 2 so it will work with
> > > > any pointer. Why disable this for UP?
> > > >
> > >
> > > Caution in case the number of zones increases. There was no guarantee of
> > > zone alignment. It's the same reason I have a BUG_ON in the encode
> > > function so that if we don't catch problems at compile-time, it'll go
> > > BANG in a nice predictable fashion.
> >
> > Caution would lead to a BUG_ON but why the #if? Why exclude UP?
>
> On x86_64 we would have ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM and
> ZONE_MOVABLE. On SMP, that's more than two bits' worth and would fail at
> runtime. Well, it should at least; I didn't actually try it out.

x86_64 does not support ZONE_HIGHMEM. The number of zones
depends on SMP?

> However, I accept that the SMP check is less than ideal. I considered
> comparing it against MAX_NR_ZONES but as it's an enum, it can't be checked
> at compile time. What else would make a better check?

You could do a BUILD_BUG_ON() instead?
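
A minimal sketch of that suggestion (an illustration only, not code from
this thread): fail the build, rather than silently disabling the encoding,
when a zone index cannot fit in the alignment bits of a zone pointer:

	static inline struct zone *encode_zone_idx(struct zone *zone)
	{
		struct zone *encoded;

		/* zone indexes must fit in the pointer's alignment bits */
		BUILD_BUG_ON(ZONES_SHIFT > INTERNODE_CACHE_SHIFT);

		encoded = (struct zone *)((unsigned long)zone | zone_idx(zone));
		BUG_ON(zonelist_zone(encoded) != zone);
		return encoded;
	}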

2007-08-10 18:15:48

by Andi Kleen

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer


> x86_64 does not support ZONE_HIGHMEM.

I also plan to eliminate ZONE_DMA soon (and replace all its users
with a new allocator that sits outside the normal fallback lists)

-Andi

2007-08-10 19:02:30

by Christoph Lameter

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Fri, 10 Aug 2007, Andi Kleen wrote:

>
> > x86_64 does not support ZONE_HIGHMEM.
>
> I also plan to eliminate ZONE_DMA soon (and replace all its users
> with a new allocator that sits outside the normal fallback lists)

Hallelujah. You are my hero! x86_64 will switch off CONFIG_ZONE_DMA?


2007-08-11 01:06:59

by Andi Kleen

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Friday 10 August 2007 21:02, Christoph Lameter wrote:
> On Fri, 10 Aug 2007, Andi Kleen wrote:
> > > x86_64 does not support ZONE_HIGHMEM.
> >
> > I also plan to eliminate ZONE_DMA soon (and replace all its users
> > with a new allocator that sits outside the normal fallback lists)
>
> Hallelujah. You are my hero! x86_64 will switch off CONFIG_ZONE_DMA?

Yes. i386 too actually.

The DMA zone will be still there, but only reachable with special functions.

This is fine because the default zone protection heuristics keep the DMA zone
nearly always free from !GFP_DMA allocations anyway -- so it doesn't make much
difference if it's totally unreachable. swiotlb will also use the same pool.

Also all callers are going to pass masks around so it's always clear
what address range they really need. Actually a lot of them
still pass 16MB simply because it is hard to find out what masks
old undocumented hardware really needs. But this could change.

This also means the DMA support in sl[a-z]b is not needed anymore.

I went through nearly all GFP_DMA users and found they're usually
happy enough with pages. If someone comes up who really needs
lots of subobjects, the right way for them would likely be extending
the pci pool allocator for this case. But I haven't found a need for this yet.

-Andi

2007-08-13 21:25:58

by Christoph Lameter

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Sat, 11 Aug 2007, Andi Kleen wrote:

> > Hallelujah. You are my hero! x86_64 will switch off CONFIG_ZONE_DMA?
>
> Yes. i386 too actually.
>
> The DMA zone will be still there, but only reachable with special functions.

Not too happy with that one but this is going in the right direction.

On NUMA this would still mean allocating space for the DMA zone on all
nodes although we only need this on node 0.

> Also all callers are going to pass masks around so it's always clear
> what address range they really need. Actually a lot of them
> still pass 16MB simply because it is hard to find out what masks
> old undocumented hardware really needs. But this could change.

Good.

> This also means the DMA support in sl[a-z]b is not needed anymore.

Tell me when. SLUB has an #ifdef CONFIG_ZONE_DMA. We can just drop that
code in the #ifdef's if you are ready.

> I went through nearly all GFP_DMA users and found they're usually
> happy enough with pages. If someone comes up who really needs
> lots of subobjects, the right way for them would likely be extending
> the pci pool allocator for this case. But I haven't found a need for this yet.

Great.

2007-08-13 21:56:24

by Andi Kleen

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Mon, Aug 13, 2007 at 02:25:36PM -0700, Christoph Lameter wrote:
> On Sat, 11 Aug 2007, Andi Kleen wrote:
>
> > > Hallelujah. You are my hero! x86_64 will switch off CONFIG_ZONE_DMA?
> >
> > Yes. i386 too actually.
> >
> > The DMA zone will be still there, but only reachable with special functions.
>
> Not too happy with that one but this is going in the right direction.
>
> On NUMA this would still mean allocating space for the DMA zone on all
> nodes although we only need this on node 0.

The DMA allocator is NUMA unaware. This means it doesn't require multiple
dma zones per node, but is happy with a global one that can live somewhere
else outside the pgdat. I also removed PCP and other fanciness so it's
really quite independent and much simpler than the normal one. It also
doesn't need try_to_free_pages() because in an isolated zone there
shouldn't be any freeable pages.

The big difference is that it can go into a slower-than-O(1) mode that tries
to find pages based on the DMA mask.

> > This also means the DMA support in sl[a-z]b is not needed anymore.
>
> Tell me when. SLUB has an #ifdef CONFIG_ZONE_DMA. We can just drop that
> code in the #ifdef's if you are ready.

There are still other architectures that use it. Biggest offender
is s390. I'll leave them to their respective maintainers.

It's not clear s390 really needs the mask allocator anyways.
e.g. I suspect they just need to put data into a specific
range, but if there are no subranges then it might be
overkill.

-Andi

2007-08-13 22:00:32

by Christoph Lameter

Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > > The DMA zone will be still there, but only reachable with special functions.
> >
> > Not too happy with that one but this is going the right direcrtion.
> >
> > On NUMA this would still mean allocating space for the DMA zone on all
> > nodes although we only need this on node 0.
>
> The DMA allocator is NUMA unaware. This means it doesn't require multiple
> dma zones per node, but is happy with a global one that can live somewhere
> else outside the pgdat. I also removed PCP and other fanciness so it's
> really quite independent and much simpler than the normal one. It also
> doesn't need try_to_free_pages() because in an isolated zone there
> shouldn't be any freeable pages.

You said that ZONE_DMA will still be there right? So the zone will be
replicated over all nodes but remain unused except for node 0.

> There are still other architectures that use it. Biggest offender
> is s390. I'll leave them to their respective maintainers.

IA64 also uses ZONE_DMA to support 32bit controllers.

So I think we can only get rid of ZONE_DMA in its 16MB incarnation for
i386 and x86_64.

But you will be keeping ZONE_DMA32?

If so then it may be better to drop ZONE_DMA32 and make ZONE_DMA be below
4GB like other 64bit arches.

2007-08-13 22:04:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Mon, Aug 13, 2007 at 03:00:14PM -0700, Christoph Lameter wrote:
> You said that ZONE_DMA will still be there, right? So the zone will be

There will be a (variable-sized) dma zone, but not a ZONE_DMA entry in pgdat
or in the fallback lists.

>
> > There are still other architectures that use it. Biggest offender
> > is s390. I'll leave them to their respective maintainers.
>
> IA64 also uses ZONE_DMA to support 32bit controllers.

ZONE_DMA32 I thought? That one is not changed.

>
> So I think we can only get rid of ZONE_DMA in its 16MB incarnation for
> i386 and x86_64.

Correct.

>
> But you will be keeping ZONE_DMA32?

Yes.

> If so then it may be better to drop ZONE_DMA32 and make ZONE_DMA be below
> 4GB like other 64bit arches.

That might be possible as a followup, but would change the driver
API. Is it worth it?

-Andi

2007-08-13 22:10:19

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, 14 Aug 2007, Andi Kleen wrote:

> On Mon, Aug 13, 2007 at 03:00:14PM -0700, Christoph Lameter wrote:
> > You said that ZONE_DMA will still be there, right? So the zone will be
>
> There will be a (variable-sized) dma zone, but not a ZONE_DMA entry in pgdat
> or in the fallback lists.

Ahh.. Okay.

> > > There are still other architectures that use it. Biggest offender
> > > is s390. I'll leave them to their respective maintainers.
> >
> > IA64 also uses ZONE_DMA to support 32bit controllers.
>
> ZONE_DMA32 I thought? That one is not changed.

x86_64 is the only platform that uses ZONE_DMA32. IA64 and other 64-bit
platforms use ZONE_DMA for <4GB allocs.

> > If so then it may be better to drop ZONE_DMA32 and make ZONE_DMA be below
> > 4GB like other 64bit arches.
>
> That might be possible as a followup, but would change the driver
> API. Is it worth it?

It would leave the driver API as is for many arches.

I think s/ZONE_DMA32/ZONE_DMA would restore the one DMA zone thing which
is good. We could drop all ZONE_DMA32 stuff that is only needed by a
single arch.


2007-08-13 22:14:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Mon, Aug 13, 2007 at 03:09:55PM -0700, Christoph Lameter wrote:
> > > > There are still other architectures that use it. Biggest offender
> > > > is s390. I'll leave them to their respective maintainers.
> > >
> > > IA64 also uses ZONE_DMA to support 32bit controllers.
> >
> > ZONE_DMA32 I thought? That one is not changed.
>
> x86_64 is the only platform that uses ZONE_DMA32. IA64 and other 64-bit
> platforms use ZONE_DMA for <4GB allocs.

Yes, but ZONE_DMA32 == ZONE_DMA.

Also when the slab users of GFP_DMA are all gone ia64 won't need
the slab support anymore. So either you change your ifdef in slub or
switch to ZONE_DMA32 for IA64.

The trouble is that this cannot be done globally, at least not
until s390 and a few other architectures using GFP_DMA with slab
are all converted.

> I think s/ZONE_DMA32/ZONE_DMA would restore the one DMA zone thing which
> is good. We could drop all ZONE_DMA32 stuff that is only needed by a
> single arch.

But it's not quite the same: GFP_DMA32 has no explicit slab support.

-Andi

2007-08-13 22:22:37

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > x86_64 is the only platform that uses ZONE_DMA32. IA64 and other 64-bit
> > platforms use ZONE_DMA for <4GB allocs.
>
> Yes, but ZONE_DMA32 == ZONE_DMA.

I am not sure what you mean by that. IA64 ZONE_DMA == x86_64 ZONE_DMA32?

> Also when the slab users of GFP_DMA are all gone ia64 won't need
> the slab support anymore. So either you change your ifdef in slub or
> switch to ZONE_DMA32 for IA64.

If you have gotten rid of all slab users of GFP_DMA (and also all arch
uses of it) then we can drop the code in SLAB.

> The trouble is that this cannot be done globally, at least not
> until s390 and a few other architectures using GFP_DMA with slab
> are all converted.

The s390 arch code still contains GFP_DMA uses? No drivers elsewhere
still use GFP_DMA?

>
> > I think s/ZONE_DMA32/ZONE_DMA would restore the one DMA zone thing which
> > is good. We could drop all ZONE_DMA32 stuff that is only needed by a
> > single arch.
>
> But it's not quite the same: GFP_DMA32 has no explicit slab support.

Right. So we could

1. Drop sl?b support for GFP_DMA.

2. Drop GFP_DMA32 support.

Then we only allow page allocator allocs using GFP_DMA? That may be the
least invasive for arch code.

2007-08-13 22:38:37

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

I just did a grep for GFP_DMA and I still see a large list of GFP_DMA
kmallocs???


arch/mips/au1000/common/dbdma.c: if ((desc_base = (u32)kmalloc(i, GFP_KERNEL|GFP_DMA)) == 0)
arch/m68k/kernel/dma.c: map = kmalloc(sizeof(struct page *) << order, flag & ~__GFP_DMA);
arch/s390/hypfs/hypfs_diag.c: diag224_cpu_names = kmalloc(PAGE_SIZE, GFP_KERNEL | GFP_DMA);
arch/s390/mm/extmem.c: struct qin64 *qin = kmalloc (sizeof(struct qin64), GFP_DMA);
arch/s390/mm/extmem.c: struct qout64 *qout = kmalloc (sizeof(struct qout64), GFP_DMA);
arch/s390/kernel/cpcmd.c: lowbuf = kmalloc(rlen, GFP_KERNEL | GFP_DMA);
drivers/block/ps3disk.c: dev->bounce_buf = kmalloc(BOUNCE_SIZE, GFP_DMA);
drivers/scsi/sr_vendor.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/sr_vendor.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/aha1542.c: SCpnt->host_scribble = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/ps3rom.c: dev->bounce_buf = kmalloc(BOUNCE_SIZE, GFP_DMA);
drivers/scsi/aacraid/commctrl.c: p = kmalloc(upsg->sg[i].count,GFP_KERNEL|__GFP_DMA);
drivers/scsi/aacraid/commctrl.c: p = kmalloc(usg->sg[i].count,GFP_KERNEL|__GFP_DMA);
drivers/scsi/aacraid/commctrl.c: p = kmalloc(usg->sg[i].count,GFP_KERNEL|__GFP_DMA);
drivers/scsi/pluto.c: fcs = kmalloc(sizeof (struct ctrl_inquiry) * fcscount, GFP_DMA);
drivers/scsi/sr_ioctl.c: buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr_ioctl.c: buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr_ioctl.c: char *buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr_ioctl.c: raw_sector = kmalloc(2048, GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/sr.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/ch.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/ch.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/net/tokenring/3c359.c: xl_priv->xl_tx_ring = kmalloc((sizeof(struct xl_tx_desc) * XL_TX_RING_SIZE) + 7, GFP_DMA | GFP_KERNEL) ;
drivers/net/tokenring/3c359.c: xl_priv->xl_rx_ring = kmalloc((sizeof(struct xl_rx_desc) * XL_RX_RING_SIZE) +7, GFP_DMA | GFP_KERNEL) ;
drivers/net/znet.c: if (!(znet->rx_start = kmalloc (DMA_BUF_SIZE, GFP_KERNEL | GFP_DMA)))
drivers/net/znet.c: if (!(znet->tx_start = kmalloc (DMA_BUF_SIZE, GFP_KERNEL | GFP_DMA)))
drivers/net/lance.c: rx_buff = kmalloc(PKT_BUF_SZ, GFP_DMA | gfp);
drivers/net/wan/cosa.c: cosa->bouncebuf = kmalloc(COSA_MTU, GFP_KERNEL|GFP_DMA);
drivers/net/wan/cosa.c: if ((chan->rxdata = kmalloc(COSA_MTU, GFP_DMA|GFP_KERNEL)) == NULL) {
drivers/net/wan/cosa.c: if ((kbuf = kmalloc(count, GFP_KERNEL|GFP_DMA)) == NULL) {
drivers/net/irda/sa1100_ir.c: io->head = kmalloc(size, GFP_KERNEL | GFP_DMA);
drivers/net/irda/pxaficp_ir.c: io->head = kmalloc(size, GFP_KERNEL | GFP_DMA);
drivers/net/irda/vlsi_ir.c: rd->buf = kmalloc(len, GFP_KERNEL|GFP_DMA);
drivers/net/ni65.c: ret = ptr = kmalloc(T_BUF_SIZE,GFP_KERNEL | GFP_DMA);
drivers/usb/core/buffer.c:/* sometimes alloc/free could use kmalloc with GFP_DMA, for
drivers/media/video/arv.c: ar->line_buff = kmalloc(MAX_AR_LINE_BYTES, GFP_KERNEL | GFP_DMA);
drivers/media/dvb/dvb-usb/gp8psk.c: buf = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/atm/fore200e.c: data = kmalloc(tx_len, GFP_ATOMIC | GFP_DMA);
drivers/atm/iphase.c: cpcs = kmalloc(sizeof(*cpcs), GFP_KERNEL|GFP_DMA);
drivers/char/synclink.c: info->intermediate_rxbuffer = kmalloc(info->max_frame_size, GFP_KERNEL | GFP_DMA);
drivers/s390/net/qeth_main.c: kmalloc(QETH_BUFSIZE, GFP_DMA|GFP_KERNEL);
drivers/s390/net/smsgiucv.c: buffer = kmalloc(msg->length + 1, GFP_ATOMIC | GFP_DMA);
drivers/s390/char/tape_core.c: device->modeset_byte = kmalloc(1, GFP_KERNEL | GFP_DMA);
drivers/s390/char/vmur.c: kbuf = kmalloc(reclen, GFP_KERNEL | GFP_DMA);
drivers/s390/char/vmur.c: fcb = kmalloc(sizeof(*fcb), GFP_KERNEL | GFP_DMA);
drivers/s390/char/vmur.c: fcb = kmalloc(sizeof(*fcb), GFP_KERNEL | GFP_DMA);
drivers/s390/char/raw3270.c: rq->buffer = kmalloc(size, GFP_KERNEL | GFP_DMA);
drivers/s390/char/raw3270.c: rp = kmalloc(sizeof(struct raw3270), GFP_KERNEL | GFP_DMA);
drivers/s390/char/tape_3590.c: int_kekls = kmalloc(sizeof(*int_kekls), GFP_KERNEL|GFP_DMA);
drivers/s390/char/tape_3590.c: rdc_data = kmalloc(sizeof(*rdc_data), GFP_KERNEL | GFP_DMA);
drivers/s390/cio/device_ops.c: buf = kmalloc(32*sizeof(char), GFP_DMA|GFP_KERNEL);
drivers/s390/cio/device_ops.c: buf2 = kmalloc(32*sizeof(char), GFP_DMA|GFP_KERNEL);
drivers/s390/cio/css.c: sch = kmalloc (sizeof (*sch), GFP_KERNEL | GFP_DMA);

2007-08-13 22:48:23

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Mon, Aug 13, 2007 at 03:22:25PM -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
>
> > > x86_64 is the only platform that uses ZONE_DMA32. IA64 and other 64-bit
> > > platforms use ZONE_DMA for <4GB allocs.
> >
> > Yes, but ZONE_DMA32 == ZONE_DMA.
>
> I am not sure what you mean by that. IA64 ZONE_DMA == x86_64 ZONE_DMA32?

Hmm, when I wrote GFP_DMA32 it was a #define GFP_DMA32 GFP_DMA
on ia64 so that drivers need not ifdef. Someone nasty
seems to have removed that too. I guess it would be best
to re-add it.

>
> > Also when the slab users of GFP_DMA are all gone ia64 won't need
> > the slab support anymore. So either you change your ifdef in slub or
> > switch to ZONE_DMA32 for IA64.
>
> If you have gotten rid of all slab users of GFP_DMA (and also all arch
> uses of it) then we can drop the code in SLAB.

No, e.g. s390 and some other architectures still use it.
You'll need to bug their respective maintainers.


> 1. Drop sl?b support for GFP_DMA.

Not yet.

>
> 2. Drop GFP_DMA32 support.
>
> Then we only allow page allocator allocs using GFP_DMA? That may be the

Kind of yes.

> least invasive for arch code.

I would prefer for GFP_DMA to go away on x86 (but GFP_DMA32 to stay). This way
we get clean compile errors instead of subtle breakage. Silently
changing the semantics would be bad.

But then it wouldn't make sense to have GFP_DMA on ia64 and GFP_DMA32
on x86. Since driver writers are more likely to test on x86
I would recommend ia64 adopt compatible semantics. It'll
save everybody trouble long term. This means it wouldn't
help on IA64 machines that don't have a DMA zone -- they
would still need to validate drivers specially -- but at least
it would help the others.

Also, from my driver review, driver authors often seem to have
trouble understanding what GFP_DMA really does. With GFP_DMA32 it
is clearer that it applies to an address range and is not
some synonym for pci_map_*.

-Andi

2007-08-13 22:49:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Mon, Aug 13, 2007 at 03:38:10PM -0700, Christoph Lameter wrote:
> I just did a grep for GFP_DMA and I still see a large list of GFP_DMA
> kmallocs???

I converted all of those that applied to x86.

-Andi

2007-08-13 22:53:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > I am not sure what you mean by that. IA64 ZONE_DMA == x86_64 ZONE_DMA32?
>
> Hmm, when I wrote GFP_DMA32 it was a #define GFP_DMA32 GFP_DMA
> on ia64 so that drivers need not ifdef. Someone nasty
> seems to have removed that too. I guess it would be best
> to re-add it.

What would be the point?

> But then it wouldn't make sense to have GFP_DMA on ia64 and GFP_DMA32
> on x86. Since driver writers are more likely to test on x86
> I would recommend ia64 adopt compatible semantics. It'll
> save everybody trouble long term. This means it wouldn't
> help on IA64 machines that don't have a DMA zone -- they
> would still need to validate drivers specially -- but at least
> it would help the others.

There are no compatible semantics. ZONE_DMA may mean something different
for each arch depending on its need. An arch may not have a need for a 4GB
boundary (such as s390).

> Also, from my driver review, driver authors often seem to have
> trouble understanding what GFP_DMA really does. With GFP_DMA32 it
> is clearer that it applies to an address range and is not
> some synonym for pci_map_*.

GFP_DMA32 is clear because there are no other arches to muddy up the
waters.

2007-08-13 22:55:27

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, 14 Aug 2007, Andi Kleen wrote:

> On Mon, Aug 13, 2007 at 03:38:10PM -0700, Christoph Lameter wrote:
> > I just did a grep for GFP_DMA and I still see a large list of GFP_DMA
> > kmallocs???
>
> I converted all of those that applied to x86.

Converted to what?

drivers/scsi/sr_ioctl.c: buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr_ioctl.c: buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr_ioctl.c: char *buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr_ioctl.c: raw_sector = kmalloc(2048, GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/sr.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/ch.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/ch.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);

^^^ looks generic to me.

drivers/net/tokenring/3c359.c: xl_priv->xl_tx_ring = kmalloc((sizeof(struct xl_tx_desc) * XL_TX_RING_SIZE) + 7, GFP_DMA | GFP_KERNEL) ;
drivers/net/tokenring/3c359.c: xl_priv->xl_rx_ring = kmalloc((sizeof(struct xl_rx_desc) * XL_RX_RING_SIZE) +7, GFP_DMA | GFP_KERNEL) ;

Tokenring not supported on x86?

drivers/net/znet.c: if (!(znet->rx_start = kmalloc (DMA_BUF_SIZE, GFP_KERNEL | GFP_DMA)))
drivers/net/znet.c: if (!(znet->tx_start = kmalloc (DMA_BUF_SIZE, GFP_KERNEL | GFP_DMA)))
drivers/net/lance.c: rx_buff = kmalloc(PKT_BUF_SZ, GFP_DMA | gfp);
drivers/net/wan/cosa.c: cosa->bouncebuf = kmalloc(COSA_MTU, GFP_KERNEL|GFP_DMA);
drivers/net/wan/cosa.c: if ((chan->rxdata = kmalloc(COSA_MTU, GFP_DMA|GFP_KERNEL)) == NULL) {
drivers/net/wan/cosa.c: if ((kbuf = kmalloc(count, GFP_KERNEL|GFP_DMA)) == NULL) {

None of the above are supported on x86?

2007-08-13 23:01:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Mon, Aug 13, 2007 at 03:52:54PM -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
>
> > > I am not sure what you mean by that. IA64 ZONE_DMA == x86_64 ZONE_DMA32?
> >
> > Hmm, when I wrote GFP_DMA32 it was a #define GFP_DMA32 GFP_DMA
> > on ia64 so that drivers need not ifdef. Someone nasty
> > seems to have removed that too. I guess it would be best
> > to re-add it.
>
> What would be the point?

"so that drivers not need to ifdef"

> > But then it wouldn't make sense to have GFP_DMA on ia64 and GFP_DMA32
> > on x86. Since driver writers are more likely to test on x86
> > I would recommend ia64 adopt compatible semantics. It'll
> > save everybody trouble long term. This means it wouldn't
> > help on IA64 machines that don't have a DMA zone -- they
> > would still need to validate drivers specially -- but at least
> > it would help the others.
>
> There are no compatible semantics. ZONE_DMA may mean something different

Yes, current GFP_DMA is a mess.

But GFP_DMA32 is relatively clean and it means the same
as GFP_DMA on IA64. So ia64 could switch to it relatively
painlessly.

Anyways, I must admit ia64 is not my main concern. If you or Tony
don't like GFP_DMA32 then keep using GFP_DMA if you want, but just don't
complain about driver porting problems. I'm just trying to make
your lives easier (that is why I did the #define originally),
not to be annoying.

> for each arch depending on its need. An arch may not have a need for a 4GB
> boundary (such as s390).

I don't worry about s390 too much because the s390 driver universe
doesn't really overlap with the x86 driver universe. So whatever
semantics they have it's their own issue.

If they ever build s390s with PCI slots that can take arbitrary cards
they might have an issue, but I won't worry about this.

-Andi

2007-08-13 23:06:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Mon, Aug 13, 2007 at 03:54:34PM -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
>
> > On Mon, Aug 13, 2007 at 03:38:10PM -0700, Christoph Lameter wrote:
> > > I just did a grep for GFP_DMA and I still see a large list of GFP_DMA
> > > kmallocs???
> >
> > I converted all of those that applied to x86.
>
> Converted to what?

Hmm, do you actually read my emails? I spelled that out at least two
times now. It's converted to a new dma page allocator that specifies
an address mask.

>
> drivers/net/tokenring/3c359.c: xl_priv->xl_tx_ring = kmalloc((sizeof(struct xl_tx_desc) * XL_TX_RING_SIZE) + 7, GFP_DMA | GFP_KERNEL) ;
> drivers/net/tokenring/3c359.c: xl_priv->xl_rx_ring = kmalloc((sizeof(struct xl_rx_desc) * XL_RX_RING_SIZE) +7, GFP_DMA | GFP_KERNEL) ;
>
> Tokenring not supported on x86?

It can be easily converted to a page allocation.

The only tricky part was skbs in a few drivers, but luckily they are only
needed for bouncing which can be done without an skb too. For RX it adds
one copy, but we can live with that because they're only slow devices.

Right now it is a little inefficient because it's one page per packet
even though two would fit, but these devices all have tiny RX rings
so it's not that big a waste.

-Andi
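
As a sketch of what such a conversion looks like with the existing page
allocator interfaces (the mask allocator's own API isn't spelled out in this
thread, so plain GFP_DMA page allocation stands in for it here; size is a
placeholder):

/* before: sub-page DMA-able object from the slab allocator */
buf = kmalloc(size, GFP_KERNEL | GFP_DMA);
...
kfree(buf);

/* after: whole pages from the page allocator; get_order() rounds
   the size up to a power-of-two number of pages */
buf = (void *)__get_free_pages(GFP_KERNEL | GFP_DMA, get_order(size));
...
free_pages((unsigned long)buf, get_order(size));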

2007-08-13 23:12:36

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > What would be the point?
>
> "so that drivers not need to ifdef"

But they use GFP_DMA right now and drivers cannot use DMA32 if they want
to be cross-platform compatible? Doesn't the DMA API completely do away
with these things?

2007-08-13 23:15:25

by Alan

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

> The only tricky part was skbs in a few drivers, but luckily they are only
> needed for bouncing which can be done without an skb too. For RX it adds
> one copy, but we can live with that because they're only slow devices.

Usually found on slow hardware that can't cope with extra copies very
well.

Alan

2007-08-13 23:16:31

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > > I converted all of those that applied to x86.
> >
> > Converted to what?
>
> Hmm, do you actually read my emails? I spelled that out at least two
> times now. It's converted to a new dma page allocator that specifies
> an address mask.

Yes I do, but frankly I am puzzled by what is going on.

> > drivers/net/tokenring/3c359.c: xl_priv->xl_tx_ring = kmalloc((sizeof(struct xl_tx_desc) * XL_TX_RING_SIZE) + 7, GFP_DMA | GFP_KERNEL) ;
> > drivers/net/tokenring/3c359.c: xl_priv->xl_rx_ring = kmalloc((sizeof(struct xl_rx_desc) * XL_RX_RING_SIZE) +7, GFP_DMA | GFP_KERNEL) ;
> >
> > Tokenring not supported on x86?
>
> It can be easily converted to a page allocation.

Ok then let's do it.

2007-08-13 23:20:44

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, Aug 14, 2007 at 12:22:23AM +0100, Alan Cox wrote:
> > The only tricky part was skbs in a few drivers, but luckily they are only
> > needed for bouncing which can be done without an skb too. For RX it adds
> > one copy, but we can live with that because they're only slow devices.
>
> Usually found on slow hardware that can't cope with extra copies very
> well.

It's essentially only lance, meth, b44 and bcm43xx and lots of s390.

meth is only used on SGI O2s which are not that slow and unlikely
to work in tree anyways.

b44 and bcm43xx run in fast enough new systems to have no trouble
with copies.

s390 won't change.

That only leaves lance. If it runs in a system with <= 16MB
of memory it is fine. I checked with David whether he would consider
adding a second destructor to the skb for this case and he
said no, which was an answer that was fine with me.

So the only systems really affected are lance systems with >16MB.
I don't think we can stop Linux evolution for those, sorry. They'll
just have to live with it.

-Andi

2007-08-13 23:22:30

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Mon, Aug 13, 2007 at 04:12:17PM -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
>
> > > What would be the point?
> >
> > "so that drivers not need to ifdef"
>
> But they use GFP_DMA right now and drivers cannot use DMA32 if they want

The way it was originally designed was that they use GFP_DMA32,
which would map to itself on x86-64, to GFP_DMA on ia64 and to
GFP_KERNEL on i386. Unfortunately that seems to have bitrotted
(perhaps I should have documented it better).

> to be cross-platform compatible? Doesn't the DMA API completely do away
> with these things?

No, GFP_DMA32 in my current plan is still there.

-Andi
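
A sketch of that mapping as it could look in a header (illustrative, not the
historical file; since the flag is OR-ed into other GFP flags, the i386 case
is best expressed as 0, i.e. "no restriction", rather than a literal
GFP_KERNEL):

#if defined(CONFIG_X86_64)             /* real <4GB zone */
# define GFP_DMA32	__GFP_DMA32
#elif defined(CONFIG_IA64)             /* ia64's DMA zone is already <4GB */
# define GFP_DMA32	GFP_DMA
#else                                  /* i386: all kernel memory is reachable */
# define GFP_DMA32	0
#endif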

2007-08-13 23:23:14

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

> Ok then let's do it.

Conversion only makes sense once an API with an explicit mask
is in. Otherwise we have muddled semantics again.

-Andi

2007-08-13 23:25:38

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > But they use GFP_DMA right now and drivers cannot use DMA32 if they want
>
> The way it was originally designed was that they use GFP_DMA32,
> which would map to itself on x86-64, to GFP_DMA on ia64 and to
> GFP_KERNEL on i386. Unfortunately that seems to have bitrotted
> (perhaps I should have documented it better).

The DMA boundaries are hardware dependent. A 4GB boundary
may not make sense on certain platforms.

> > to be cross-platform compatible? Doesn't the DMA API completely do away
> > with these things?
>
> No, GFP_DMA32 in my current plan is still there.

AFAIK GFP_DMA32 is an x86_64 special that would be easy to remove. Dealing
with physical boundaries is currently done via the DMA interface, right? Let's
keep it there?

2007-08-13 23:28:10

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > Ok then let's do it.
>
> Conversion only makes sense once an API with an explicit mask
> is in. Otherwise we have muddled semantics again.

pci_set_consistent_dma_mask

has that.
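
For reference, this is how a driver states an explicit mask through that
interface (real functions of the period; pdev, BUF_SIZE and bus_addr are
placeholders, error handling trimmed):

/* declare that this device can only address the low 16MB */
if (pci_set_consistent_dma_mask(pdev, 0x00fffffful))
	return -EIO;	/* the platform cannot satisfy the mask */

/* memory returned here is guaranteed to satisfy the mask */
buf = pci_alloc_consistent(pdev, BUF_SIZE, &bus_addr);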

2007-08-13 23:31:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Mon, Aug 13, 2007 at 04:25:23PM -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
>
> > > But they use GFP_DMA right now and drivers cannot use DMA32 if they want
> >
> > The way it was originally designed was that they use GFP_DMA32,
> > which would map to itself on x86-64, to GFP_DMA on ia64 and to
> > GFP_KERNEL on i386. Unfortunately that seems to have bitrotted
> > (perhaps I should have documented it better).
>
> The DMA boundaries are hardware dependent. A 4GB boundary
> may not make sense on certain platforms.

If it makes sense for the driver it has to make sense
for the platform too. If not it cannot run the driver
(which might be fine; a lot of architectures only
run their own specialized drivers).

>
> > > to be cross platforms compatible? Doesnt the dma API completely do away
> > > with these things?
> >
> > No, GFP_DMA32 in my current plan is still there.
>
> AFAIK GFP_DMA32 is an x86_64 special that would be easy to remove. Dealing
> with physical boundaries is currently done via the DMA interface, right? Let's
> keep it there?

The difference between the low dma allocation and GFP_DMA32 is that
the low dma allocation zone is isolated. This means it is not shared
with user pages, no LRU, no vmscan, no try_to_free_pages etc.

But that obviously cannot be done for a full 4GB, only for a small
area. Right now on swiotlb systems we reserve 80MB for this
(64MB swiotlb + 16MB ZONE_DMA). So the new zone is by default <100MB
and can afford to be isolated.

If GFP_DMA32 was merged into the mask allocator then it would
need to learn about all that mess by itself. If GFP_DMA32
stays it can just use the current page allocator which avoids
a lot of code duplication.

The DMA allocator does call the normal page allocator implicitly
for a 4GB mask, but that still needs such a bit to be specified.

-Andi

2007-08-13 23:32:51

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Mon, Aug 13, 2007 at 04:27:42PM -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
>
> > > Ok then let's do it.
> >
> > Conversion only makes sense once an API with an explicit mask
> > is in. Otherwise we have muddled semantics again.
>
> pci_set_consistent_dma_mask
>
> has that.

While on x86 it is roughly identical (although the low level
allocator is currently not very reliable) it makes a significant
difference on some platforms. E.g. I was told that on PA-RISC
consistent memory is much more costly than non-consistent memory.
That's probably true on anything that's not fully IO-cache
consistent.

So while it would be reasonable semantics for x86 and IA64
it's not for everybody else.

-Andi

>

2007-08-13 23:37:38

by Alan

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

> So the only systems really affected are lance systems with >16MB.
> I don't think we can stop Linux evolution for those, sorry. They'll
> just have to live with it.

I think that's fair enough too.

2007-08-14 19:12:21

by Andy Isaacson

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, Aug 14, 2007 at 02:14:41AM +0200, Andi Kleen wrote:
> On Tue, Aug 14, 2007 at 12:22:23AM +0100, Alan Cox wrote:
> > > The only tricky part was skbs in a few drivers, but luckily they are only
> > > needed for bouncing which can be done without an skb too. For RX it adds
> > > one copy, but we can live with that because they're only slow devices.
> >
> > Usually found on slow hardware that can't cope with extra copies very
> > well.
>
> It's essentially only lance, meth, b44 and bcm43xx and lots of s390.
>
> meth is only used on SGI O2s which are not that slow and unlikely
> to work in tree anyways.
>
> b44 and bcm43xx run in fast enough new systems to have no trouble
> with copies.

bcm43xx hardware does show up on low-end MIPS boxes (wrt54g anybody?)
that would be sorely hurt by excess copies.

-andy

2007-08-14 19:29:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

> bcm43xx hardware does show up on low-end MIPS boxes (wrt54g anybody?)
> that would be sorely hurt by excess copies.

Low-end boxes don't have more than 1GB of RAM. With <= 1GB you don't
need to copy on bcm43xx.

-Andi

2007-08-14 19:44:30

by Andy Isaacson

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, Aug 14, 2007 at 10:23:51PM +0200, Andi Kleen wrote:
> > bcm43xx hardware does show up on low-end MIPS boxes (wrt54g anybody?)
> > that would be sorely hurt by excess copies.
>
> Low-end boxes don't have more than 1GB of RAM. With <= 1GB you don't
> need to copy on bcm43xx.

OK, that makes sense and is reassuring, but note that some MIPS boxes
have only part of their physical memory below 1GB; IIRC the
BCM4704/BCM5836 supports up to 512MB of physical memory, with 256MB in
the first GB and the second 256MB located above the 1GB line. (But it's
been a while since I've run such a machine, so I could be misremembering
the sizes and offsets.)

Yeah, if you stick a PCI chip with a 30-bit PCI DMA mask into a machine
with memory above 1GB, then copying has to happen. Unless the memory
allocator can avoid returning memory in the un-dma-able region...

-andy

2007-08-14 19:56:22

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > pci_set_consistent_dma_mask
> >
> > has that.
>
> While on x86 it is roughly identical (although the low level
> allocator is currently not very reliable) it makes a significant
> difference on some platforms. E.g. I was told that on PA-RISC
> consistent memory is much more costly than non-consistent memory.
> That's probably true on anything that's not fully IO-cache
> consistent.
>
> So while it would be reasonable semantics for x86 and IA64
> it's not for everybody else.

Right. That is the point of the function. It isolates these strange
platform dependencies. That is why there is no need for ZONE_DMA32 on any
other platform.

2007-08-14 20:12:04

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

> Yeah, if you stick a PCI chip with a 30-bit PCI DMA mask into a machine
> with memory above 1GB, then copying has to happen. Unless the memory
> allocator can avoid returning memory in the un-dma-able region...

With GFP_DMA this was possible, but the capability will be gone
for x86 systems if your required DMA region is < 4GB.

Besides, I'm currently not planning to change mips, so it might
stay the same if Ralf prefers that.

-Andi

2007-08-15 11:38:24

by Ralf Baechle

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Tue, Aug 14, 2007 at 02:14:41AM +0200, Andi Kleen wrote:

> meth is only used on SGI O2s which are not that slow and unlikely
> to work in tree anyways.

O2 doesn't enable CONFIG_ZONE_DMA so there is no point in using GFP_DMA in
an O2-specific device driver. Will send out patch in separate mail.

Ralf

2007-08-15 12:05:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Wed, Aug 15, 2007 at 12:37:49PM +0100, Ralf Baechle wrote:
> On Tue, Aug 14, 2007 at 02:14:41AM +0200, Andi Kleen wrote:
>
> > meth is only used on SGI O2s which are not that slow and unlikely
> > to work in tree anyways.
>
> O2 doesn't enable CONFIG_ZONE_DMA so there is no point in using GFP_DMA in
> an O2-specific device driver. Will send out patch in separate mail.

Great.

BTW I suspect this is true for some other GFP_DMA usages too.
Some driver writers seem to confuse it with the PCI DMA
APIs or believe it is always needed for them.

-Andi

2007-08-15 12:33:17

by Ralf Baechle

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Wed, Aug 15, 2007 at 02:59:20PM +0200, Andi Kleen wrote:

> On Wed, Aug 15, 2007 at 12:37:49PM +0100, Ralf Baechle wrote:
> > On Tue, Aug 14, 2007 at 02:14:41AM +0200, Andi Kleen wrote:
> >
> > > meth is only used on SGI O2s which are not that slow and unlikely
> > > to work in tree anyways.
> >
> > O2 doesn't enable CONFIG_ZONE_DMA so there is no point in using GFP_DMA in
> > an O2-specific device driver. Will send out patch in separate mail.
>
> Great.
>
> BTW I suspect this is true for some other GFP_DMA usages too.
> Some driver writers seem to confuse it with the PCI DMA
> APIs or believe it is always needed for them.

In the case of the O2 the full story is a little more complicated. The O2 is
not I/O coherent. That's a well known and handled configuration for the
R5000. But the R10000, which was never designed for such use, will do very
stupid things such as actually performing speculative stores and scribbling
into DMA buffers left and right. There are two possible workarounds for
this; one is software-only, the other uses a little CPLD codenamed Juice
on the O2 R10000 board and would require a special memory region for DMA
buffers, so ZONE_DMA was really there as some sort of future-proofing.
But in the end the same arguments as for the Lance apply here.

Ralf

2007-08-15 20:00:48

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/4] Embed zone_id information within the zonelist->zones pointer

On Wed, 15 Aug 2007, Andi Kleen wrote:

> BTW I suspect this is true for some other GFP_DMA usages too.
> Some driver writers seem to confuse it with the PCI DMA
> APIs or believe it is always needed for them.

See especially ARM.