2013-07-03 08:35:26

by Joonsoo Kim

Subject: [RFC PATCH 0/5] Support multiple pages allocation

Hello.

This patchset introduces a multiple pages allocation feature to the buddy
allocator. Currently, there is no way to allocate multiple pages at once,
so we have to invoke the single page allocation logic repeatedly. This has
some overheads, such as the cost of a function call with many arguments
and the cost of finding the proper node and zone each time.

With this patchset, we can reduce these overheads.
Here are some experimental results from an allocation test.
I ran the test on the following setup.
CPU: 4 cpus, 3.00GHz
RAM: 4 GB
Kernel: v3.10 vanilla

Each result is the average of 20 runs.

Each entry is the time in us followed by the improvement over "Before"
(the baseline improvement is shown as 0%).

Size    |  Before            |  Patched 1 page    |  Patched 2 page    |  Patched 4 page
--------+--------------------+--------------------+--------------------+-------------------
128KB   |      5.3      0%   |     4.45   16.04%  |     3.25   38.68%  |     3.75   29.25%
256KB   |     13.15     0%   |    10.15   22.81%  |     8.8    33.08%  |     8.5    35.36%
512KB   |     72.3      0%   |    34.65   52.07%  |    82.65  -14.32%  |    25      65.42%
1024KB  |    114.9      0%   |   112.95    1.70%  |    87.55   23.80%  |    64.7    43.69%
2MB     |    131.65     0%   |   102.35   22.26%  |    91.95   30.16%  |   126.05    4.25%
4MB     |    225.55     0%   |   213.2     5.48%  |   181.95   19.33%  |   200.8    10.97%
8MB     |    408.6      0%   |   442.85   -8.38%  |   350.4    14.24%  |   365.15   10.63%
16MB    |    730.55     0%   |   683.35    6.46%  |   735.5    -0.68%  |   698.3     4.41%
32MB    |   1682.6      0%   |  1665.85    1.00%  |  1445.1    14.12%  |  1157.05   31.23%
64MB    |   3229.4      0%   |  3463.2    -7.24%  |  2538.4    21.40%  |  1850.55   42.70%
128MB   |   5465.6      0%   |  4816.2    11.88%  |  4448.3    18.61%  |  3528.25   35.45%
256MB   |   9526.9      0%   | 10091.75   -5.93%  |  8514.5    10.63%  |  7978.2    16.26%
512MB   |  19029.05     0%   | 20079.7    -5.52%  | 17059.05   10.35%  | 14713.65   22.68%
1024MB  |  37284.9      0%   | 39453.75   -5.82%  | 32969.7    11.57%  | 28161.65   24.47%



Size    |  Before            |  Patched 8 page    |  Patched 16 page   |  Patched 32 page
--------+--------------------+--------------------+--------------------+-------------------
128KB   |      5.3      0%   |     3.05   42.45%  |     2.65   50.00%  |     2.85   46.23%
256KB   |     13.15     0%   |     8.2    37.64%  |     7.45   43.35%  |     7.95   39.54%
512KB   |     72.3      0%   |    16.8    76.76%  |    17.7    75.52%  |    14.55   79.88%
1024KB  |    114.9      0%   |    60.05   47.74%  |    93.65   18.49%  |    74.2    35.42%
2MB     |    131.65     0%   |   119.8     9.00%  |    72.6    44.85%  |    84.7    35.66%
4MB     |    225.55     0%   |   227.3    -0.78%  |   149.95   33.52%  |   153.6    31.90%
8MB     |    408.6      0%   |   372.5     8.84%  |   304.95   25.37%  |   340.55   16.65%
16MB    |    730.55     0%   |   772.2    -5.70%  |   567.4    22.33%  |   618.3    15.37%
32MB    |   1682.6      0%   |  1217.7    27.63%  |  1098.25   34.73%  |  1168.7    30.54%
64MB    |   3229.4      0%   |  2237.75   30.71%  |  1817.8    43.71%  |  1998.25   38.12%
128MB   |   5465.6      0%   |  3504.25   35.89%  |  3466.75   36.57%  |  3159.35   42.20%
256MB   |   9526.9      0%   |  7071.2    25.78%  |  7095.05   25.53%  |  6800.9    28.61%
512MB   |  19029.05     0%   | 13640.85   28.32%  | 13098.2    31.17%  | 12778.1    32.85%
1024MB  |  37284.9      0%   | 25897.15   30.54%  | 24875.6    33.28%  | 24179.3    35.15%



For one page allocation at once, this patchset makes allocator slower than
before (-5%). But for allocating more pages at once, it makes the allocator
considerably faster than before.

As a first user, we can apply this feature to the page cache readahead
logic, which currently allocates single pages repeatedly. I attach a sample
implementation to this patchset (Patches 2-5).
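
To give an idea of the intended calling convention, a batched caller would
look roughly like this. This is only a sketch based on the interfaces added
in this series (patches 1, 2 and 5); 'mapping', 'offset' and 'page_pool' are
the locals of __do_page_cache_readahead(), and the batch size of 32 is just
an example:

        struct page *pages[32];
        unsigned long nr_pages = 32;    /* in: number of pages wanted */
        unsigned long i;

        /* one call into the allocator instead of 32 */
        page_cache_alloc_readahead(mapping, &nr_pages, pages);

        /* out: nr_pages is how many pages were actually allocated */
        for (i = 0; i < nr_pages; i++) {
                pages[i]->index = offset + i;
                list_add(&pages[i]->lru, &page_pool);
        }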

The current implementation is not yet complete. Before polishing this
feature, I want to hear experts' opinions. I don't have any trouble with the
current allocator; however, I think that we will need this feature soon,
because device I/O is getting faster rapidly and the allocator should catch
up with this speed.

Thanks.

Joonsoo Kim (5):
mm, page_alloc: support multiple pages allocation
mm, page_alloc: introduce alloc_pages_exact_node_multiple()
radix-tree: introduce radix_tree_[next/prev]_present()
readahead: remove end range check
readahead: support multiple pages allocation for readahead

include/linux/gfp.h | 16 ++++++++++--
include/linux/pagemap.h | 19 +++++++++-----
include/linux/radix-tree.h | 4 +++
lib/radix-tree.c | 34 ++++++++++++++++++++++++
mm/filemap.c | 18 ++++++++-----
mm/mempolicy.c | 6 +++--
mm/page_alloc.c | 62 +++++++++++++++++++++++++++++++++++---------
mm/readahead.c | 46 ++++++++++++++++++++++----------
8 files changed, 162 insertions(+), 43 deletions(-)

--
1.7.9.5


2013-07-03 08:35:30

by Joonsoo Kim

Subject: [RFC PATCH 3/5] radix-tree: introduce radix_tree_[next/prev]_present()

Signed-off-by: Joonsoo Kim <[email protected]>
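
For reference: radix_tree_next_present() scans forward and
radix_tree_prev_present() scans backward from @index, over at most @max_scan
slots, and return the index of the first present slot (or the index at which
the scan stopped if none was found). A usage sketch, mirroring the call site
added in patch 5 ('mapping', 'index' and 'max' are assumed to come from the
caller):

        unsigned long end, nr_missing;

        rcu_read_lock();
        end = radix_tree_next_present(&mapping->page_tree, index, max);
        rcu_read_unlock();

        /*
         * Number of consecutive missing pages starting at 'index';
         * 0 means the page at 'index' is already in the page cache.
         */
        nr_missing = end - index;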

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..045b325 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -230,6 +230,10 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
+unsigned long radix_tree_next_present(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan);
+unsigned long radix_tree_prev_present(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..e02e3b8 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -984,6 +984,40 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
}
EXPORT_SYMBOL(radix_tree_prev_hole);

+unsigned long radix_tree_next_present(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan)
+{
+ unsigned long i;
+
+ for (i = 0; i < max_scan; i++) {
+ if (radix_tree_lookup(root, index))
+ break;
+ index++;
+ if (index == 0)
+ break;
+ }
+
+ return index;
+}
+EXPORT_SYMBOL(radix_tree_next_present);
+
+unsigned long radix_tree_prev_present(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan)
+{
+ unsigned long i;
+
+ for (i = 0; i < max_scan; i++) {
+ if (radix_tree_lookup(root, index))
+ break;
+ index--;
+ if (index == ULONG_MAX)
+ break;
+ }
+
+ return index;
+}
+EXPORT_SYMBOL(radix_tree_prev_present);
+
/**
* radix_tree_gang_lookup - perform multiple lookup on a radix tree
* @root: radix tree root
--
1.7.9.5

2013-07-03 08:35:54

by Joonsoo Kim

Subject: [RFC PATCH 4/5] readahead: remove end range check

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/readahead.c b/mm/readahead.c
index daed28d..3932f28 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -166,6 +166,8 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
goto out;

end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ if (offset + nr_to_read > end_index + 1)
+ nr_to_read = end_index - offset + 1;

/*
* Preallocate as many pages as we will need.
@@ -173,9 +175,6 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
pgoff_t page_offset = offset + page_idx;

- if (page_offset > end_index)
- break;
-
rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
rcu_read_unlock();
--
1.7.9.5

2013-07-03 08:35:53

by Joonsoo Kim

Subject: [RFC PATCH 5/5] readahead: support multiple pages allocation for readahead

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..eb1472c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -217,28 +217,33 @@ static inline void page_unfreeze_refs(struct page *page, int count)
}

#ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp,
+ unsigned long *nr_pages, struct page **pages);
#else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp,
+ unsigned long *nr_pages, struct page **pages)
{
- return alloc_pages(gfp, 0);
+ return alloc_pages_exact_node_multiple(numa_node_id(),
+ gfp, nr_pages, pages);
}
#endif

static inline struct page *page_cache_alloc(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x));
+ return __page_cache_alloc(mapping_gfp_mask(x), NULL, NULL);
}

static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+ return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD, NULL, NULL);
}

-static inline struct page *page_cache_alloc_readahead(struct address_space *x)
+static inline struct page *page_cache_alloc_readahead(struct address_space *x,
+ unsigned long *nr_pages, struct page **pages)
{
return __page_cache_alloc(mapping_gfp_mask(x) |
- __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
+ __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN,
+ nr_pages, pages);
}

typedef int filler_t(void *, struct page *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 7905fe7..0bbfda9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -510,7 +510,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
EXPORT_SYMBOL_GPL(add_to_page_cache_lru);

#ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp,
+ unsigned long *nr_pages, struct page **pages)
{
int n;
struct page *page;
@@ -520,12 +521,14 @@ struct page *__page_cache_alloc(gfp_t gfp)
do {
cpuset_mems_cookie = get_mems_allowed();
n = cpuset_mem_spread_node();
- page = alloc_pages_exact_node(n, gfp, 0);
+ page = alloc_pages_exact_node_multiple(n,
+ gfp, nr_pages, pages);
} while (!put_mems_allowed(cpuset_mems_cookie) && !page);

return page;
}
- return alloc_pages(gfp, 0);
+ return alloc_pages_exact_node_multiple(numa_node_id(), gfp,
+ nr_pages, pages);
}
EXPORT_SYMBOL(__page_cache_alloc);
#endif
@@ -789,7 +792,7 @@ struct page *find_or_create_page(struct address_space *mapping,
repeat:
page = find_lock_page(mapping, index);
if (!page) {
- page = __page_cache_alloc(gfp_mask);
+ page = __page_cache_alloc(gfp_mask, NULL, NULL);
if (!page)
return NULL;
/*
@@ -1053,7 +1056,8 @@ grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
page_cache_release(page);
return NULL;
}
- page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+ page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
+ NULL, NULL);
if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
page_cache_release(page);
page = NULL;
@@ -1806,7 +1810,7 @@ static struct page *__read_cache_page(struct address_space *mapping,
repeat:
page = find_get_page(mapping, index);
if (!page) {
- page = __page_cache_alloc(gfp | __GFP_COLD);
+ page = __page_cache_alloc(gfp | __GFP_COLD, NULL, NULL);
if (!page)
return ERR_PTR(-ENOMEM);
err = add_to_page_cache_lru(page, mapping, index, gfp);
@@ -2281,7 +2285,7 @@ repeat:
if (page)
goto found;

- page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+ page = __page_cache_alloc(gfp_mask & ~gfp_notmask, NULL, NULL);
if (!page)
return NULL;
status = add_to_page_cache_lru(page, mapping, index,
diff --git a/mm/readahead.c b/mm/readahead.c
index 3932f28..3e2c377 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -157,40 +157,61 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
struct inode *inode = mapping->host;
struct page *page;
unsigned long end_index; /* The last page we want to read */
+ unsigned long end, i;
LIST_HEAD(page_pool);
int page_idx;
int ret = 0;
loff_t isize = i_size_read(inode);
+ struct page **pages;

if (isize == 0)
goto out;

end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ if (offset > end_index)
+ goto out;
+
if (offset + nr_to_read > end_index + 1)
nr_to_read = end_index - offset + 1;

+ pages = kmalloc(sizeof(struct page *) * nr_to_read, GFP_KERNEL);
+ if (!pages)
+ goto out;
+
/*
* Preallocate as many pages as we will need.
*/
for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
pgoff_t page_offset = offset + page_idx;
+ unsigned long nr_pages;

rcu_read_lock();
- page = radix_tree_lookup(&mapping->page_tree, page_offset);
+ end = radix_tree_next_present(&mapping->page_tree,
+ page_offset, nr_to_read - page_idx);
rcu_read_unlock();
- if (page)
+ nr_pages = end - page_offset;
+ if (!nr_pages)
continue;

- page = page_cache_alloc_readahead(mapping);
- if (!page)
- break;
- page->index = page_offset;
- list_add(&page->lru, &page_pool);
- if (page_idx == nr_to_read - lookahead_size)
- SetPageReadahead(page);
- ret++;
+ page_cache_alloc_readahead(mapping, &nr_pages, pages);
+ if (!nr_pages)
+ goto start_io;
+
+ for (i = 0; i < nr_pages; i++) {
+ page = pages[i];
+ page->index = page_offset + i;
+ list_add(&page->lru, &page_pool);
+ if (page_idx == nr_to_read - lookahead_size)
+ SetPageReadahead(page);
+ ret++;
+ }
+
+ /* Skip already checked page */
+ page_idx += nr_pages;
}

+start_io:
+ kfree(pages);
/*
* Now start the IO. We ignore I/O errors - if the page is not
* uptodate then the caller will launch readpage again, and
--
1.7.9.5

2013-07-03 08:36:39

by Joonsoo Kim

Subject: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

This patch introduces a multiple pages allocation feature to the buddy
allocator. Currently, there is no way to allocate multiple pages at
once, so we have to invoke the single page allocation logic repeatedly.
This has some overheads, such as the cost of a function call with many
arguments and the cost of finding the proper node and zone each time.

With this patchset, we can reduce these overheads. Device I/O is
getting faster rapidly and the allocator should catch up with this
speed. This patch helps with that situation.

In this patch, I introduce two new arguments, nr_pages and pages, to
the core function of the allocator and try to allocate multiple pages
on the first attempt (fast path). I think that multiple page allocation
is not suitable for the slow path, so the current implementation covers
just the fast path.
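
For illustration, the in/out convention of the new arguments is roughly the
following (a sketch only; 'zonelist' stands for whatever zonelist the caller
would normally pass, and the batch size is arbitrary):

        struct page *pages[16];
        unsigned long nr_pages = 16;    /* in: number of pages wanted */
        struct page *page;

        page = __alloc_pages_nodemask(GFP_KERNEL, 0, zonelist, NULL,
                                        &nr_pages, pages);
        /*
         * out: on fast path success, nr_pages holds how many pages were
         * actually allocated, pages[0..nr_pages-1] are filled and the
         * return value is pages[0]. Passing NULL, NULL for the two new
         * arguments keeps the old single page behaviour.
         */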

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0f615eb..8bfa87b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -298,13 +298,15 @@ static inline void arch_alloc_page(struct page *page, int order) { }

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask);
+ struct zonelist *zonelist, nodemask_t *nodemask,
+ unsigned long *nr_pages, struct page **pages);

static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
{
- return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
+ return __alloc_pages_nodemask(gfp_mask, order,
+ zonelist, NULL, NULL, NULL);
}

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..b17e48c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2004,7 +2004,8 @@ retry_cpuset:
}
page = __alloc_pages_nodemask(gfp, order,
policy_zonelist(gfp, pol, node),
- policy_nodemask(gfp, pol));
+ policy_nodemask(gfp, pol),
+ NULL, NULL);
if (unlikely(mpol_needs_cond_ref(pol)))
__mpol_put(pol);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
@@ -2052,7 +2053,8 @@ retry_cpuset:
else
page = __alloc_pages_nodemask(gfp, order,
policy_zonelist(gfp, pol, numa_node_id()),
- policy_nodemask(gfp, pol));
+ policy_nodemask(gfp, pol),
+ NULL, NULL);

if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3edb62..1fbbc9a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1846,7 +1846,8 @@ static inline void init_zone_allows_reclaim(int nid)
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
- struct zone *preferred_zone, int migratetype)
+ struct zone *preferred_zone, int migratetype,
+ unsigned long *nr_pages, struct page **pages)
{
struct zoneref *z;
struct page *page = NULL;
@@ -1855,6 +1856,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
+ unsigned long count = 0;
+ unsigned long mark;

classzone_idx = zone_idx(preferred_zone);
zonelist_scan:
@@ -1902,7 +1905,6 @@ zonelist_scan:

BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
- unsigned long mark;
int ret;

mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
@@ -1966,10 +1968,30 @@ zonelist_scan:
}

try_this_zone:
- page = buffered_rmqueue(preferred_zone, zone, order,
- gfp_mask, migratetype);
- if (page)
+ do {
+ page = buffered_rmqueue(preferred_zone, zone, order,
+ gfp_mask, migratetype);
+ if (!page)
+ break;
+
+ if (!nr_pages) {
+ count++;
+ break;
+ }
+
+ pages[count++] = page;
+ if (count >= *nr_pages)
+ break;
+
+ mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+ if (!zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags))
+ break;
+ } while (1);
+
+ if (count > 0)
break;
+
this_zone_full:
if (IS_ENABLED(CONFIG_NUMA))
zlc_mark_zone_full(zonelist, z);
@@ -1981,6 +2003,12 @@ this_zone_full:
goto zonelist_scan;
}

+ if (nr_pages) {
+ *nr_pages = count;
+ if (count > 0)
+ page = pages[0];
+ }
+
if (page)
/*
* page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
@@ -2125,7 +2153,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
order, zonelist, high_zoneidx,
ALLOC_WMARK_HIGH|ALLOC_CPUSET,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ NULL, NULL);
if (page)
goto out;

@@ -2188,7 +2217,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
page = get_page_from_freelist(gfp_mask, nodemask,
order, zonelist, high_zoneidx,
alloc_flags & ~ALLOC_NO_WATERMARKS,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ NULL, NULL);
if (page) {
preferred_zone->compact_blockskip_flush = false;
preferred_zone->compact_considered = 0;
@@ -2282,7 +2312,8 @@ retry:
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
alloc_flags & ~ALLOC_NO_WATERMARKS,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ NULL, NULL);

/*
* If an allocation failed after direct reclaim, it could be because
@@ -2312,7 +2343,8 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
do {
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ NULL, NULL);

if (!page && gfp_mask & __GFP_NOFAIL)
wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2449,7 +2481,8 @@ rebalance:
/* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ NULL, NULL);
if (page)
goto got_pg;

@@ -2598,7 +2631,8 @@ got_pg:
*/
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask)
+ struct zonelist *zonelist, nodemask_t *nodemask,
+ unsigned long *nr_pages, struct page **pages)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
struct zone *preferred_zone;
@@ -2608,6 +2642,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
struct mem_cgroup *memcg = NULL;

+ VM_BUG_ON(nr_pages && !pages);
+
gfp_mask &= gfp_allowed_mask;

lockdep_trace_alloc(gfp_mask);
@@ -2647,9 +2683,11 @@ retry_cpuset:
alloc_flags |= ALLOC_CMA;
#endif
/* First allocation attempt */
+ /* We only try to allocate nr_pages in first attempt */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, alloc_flags,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ nr_pages, pages);
if (unlikely(!page)) {
/*
* Runtime PM, block IO and its error handling path
--
1.7.9.5

2013-07-03 08:36:38

by Joonsoo Kim

Subject: [RFC PATCH 2/5] mm, page_alloc: introduce alloc_pages_exact_node_multiple()

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8bfa87b..f8cde28 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -327,6 +327,16 @@ static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

+static inline struct page *alloc_pages_exact_node_multiple(int nid,
+ gfp_t gfp_mask, unsigned long *nr_pages, struct page **pages)
+{
+ VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+
+ return __alloc_pages_nodemask(gfp_mask, 0,
+ node_zonelist(nid, gfp_mask), NULL,
+ nr_pages, pages);
+}
+
#ifdef CONFIG_NUMA
extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);

--
1.7.9.5

2013-07-03 15:28:30

by Michal Hocko

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On Wed 03-07-13 17:34:15, Joonsoo Kim wrote:
[...]
> For one page allocation at once, this patchset makes allocator slower than
> before (-5%).

Slowing down the most used path is a no-go. Where does this slow down
come from?

[...]
--
Michal Hocko
SUSE Labs

2013-07-03 15:52:04

by Zhang Yanfei

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On 07/03/2013 11:28 PM, Michal Hocko wrote:
> On Wed 03-07-13 17:34:15, Joonsoo Kim wrote:
> [...]
>> For one page allocation at once, this patchset makes allocator slower than
>> before (-5%).
>
> Slowing down the most used path is a no-go. Where does this slow down
> come from?

I guess, it might be: for one page allocation at once, comparing to the original
code, this patch adds two parameters nr_pages and pages and will do extra checks
for the parameter nr_pages in the allocation path.

>
> [...]


--
Thanks.
Zhang Yanfei

2013-07-03 15:57:45

by Christoph Lameter

Subject: Re: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

On Wed, 3 Jul 2013, Joonsoo Kim wrote:

> @@ -298,13 +298,15 @@ static inline void arch_alloc_page(struct page *page, int order) { }
>
> struct page *
> __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> - struct zonelist *zonelist, nodemask_t *nodemask);
> + struct zonelist *zonelist, nodemask_t *nodemask,
> + unsigned long *nr_pages, struct page **pages);
>

Add a separate function for the allocation of multiple pages instead?
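
For concreteness, such a separate entry point might look something like the
prototype below. This is only a sketch of the idea; it is not code posted in
this thread, and the name is hypothetical.

        /* Hypothetical: bulk-allocate up to nr_pages order-0 pages,
         * returning how many were actually allocated. */
        unsigned long
        alloc_pages_bulk_nodemask(gfp_t gfp_mask, struct zonelist *zonelist,
                                  nodemask_t *nodemask, unsigned long nr_pages,
                                  struct page **pages);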

2013-07-03 16:01:58

by Zhang Yanfei

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On 07/03/2013 11:51 PM, Zhang Yanfei wrote:
> On 07/03/2013 11:28 PM, Michal Hocko wrote:
>> On Wed 03-07-13 17:34:15, Joonsoo Kim wrote:
>> [...]
>>> For one page allocation at once, this patchset makes allocator slower than
>>> before (-5%).
>>
>> Slowing down the most used path is a no-go. Where does this slow down
>> come from?
>
> I guess, it might be: for one page allocation at once, comparing to the original
> code, this patch adds two parameters nr_pages and pages and will do extra checks
> for the parameter nr_pages in the allocation path.
>

If so, adding a separate path for the multiple allocations seems better.

>>
>> [...]
>
>


--
Thanks.
Zhang Yanfei

2013-07-04 04:24:54

by Joonsoo Kim

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On Thu, Jul 04, 2013 at 12:01:43AM +0800, Zhang Yanfei wrote:
> On 07/03/2013 11:51 PM, Zhang Yanfei wrote:
> > On 07/03/2013 11:28 PM, Michal Hocko wrote:
> >> On Wed 03-07-13 17:34:15, Joonsoo Kim wrote:
> >> [...]
> >>> For one page allocation at once, this patchset makes allocator slower than
> >>> before (-5%).
> >>
> >> Slowing down the most used path is a no-go. Where does this slow down
> >> come from?
> >
> > I guess, it might be: for one page allocation at once, comparing to the original
> > code, this patch adds two parameters nr_pages and pages and will do extra checks
> > for the parameter nr_pages in the allocation path.
> >
>
> If so, adding a separate path for the multiple allocations seems better.

Hello, all.

I modify the code for optimizing one page allocation via likely macro.
I attach a new one at the end of this mail.

In this case, performance degradation for one page allocation at once is -2.5%.
I guess, remained overhead comes from two added parameters.
Is it unreasonable cost to support this new feature?
I think that readahead path is one of the most used path, so this penalty looks
endurable. And after supporting this feature, we can find more use cases.

I will try to add a new function for the multiple allocations and test it.
But, IMHO, adding a new function is not a good idea, because we would have to
duplicate various checks which are already in __alloc_pages_nodemask(), and
even if we introduce a new function, we cannot avoid passing two more
parameters to get_page_from_freelist(), so a slight performance degradation
on one page allocation is inevitable. Anyway, I will implement and test it.

Thanks.

-------------------------------8<----------------------------
From cee05ad3bcf1c5774fabf797b5dc8f78f812ca36 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <[email protected]>
Date: Wed, 26 Jun 2013 13:37:57 +0900
Subject: [PATCH] mm, page_alloc: support multiple pages allocation

This patch introduces a multiple pages allocation feature to the buddy
allocator. Currently, there is no way to allocate multiple pages at
once, so we have to invoke the single page allocation logic repeatedly.
This has some overheads, such as the cost of a function call with many
arguments and the cost of finding the proper node and zone each time.

With this patchset, we can reduce these overheads. Device I/O is
getting faster rapidly and the allocator should catch up with this
speed. This patch helps with that situation.

In this patch, I introduce two new arguments, nr_pages and pages, to
the core function of the allocator and try to allocate multiple pages
on the first attempt (fast path). I think that multiple page allocation
is not suitable for the slow path, so the current implementation covers
just the fast path.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0f615eb..8bfa87b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -298,13 +298,15 @@ static inline void arch_alloc_page(struct page *page, int order) { }

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask);
+ struct zonelist *zonelist, nodemask_t *nodemask,
+ unsigned long *nr_pages, struct page **pages);

static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
{
- return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
+ return __alloc_pages_nodemask(gfp_mask, order,
+ zonelist, NULL, NULL, NULL);
}

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..b17e48c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2004,7 +2004,8 @@ retry_cpuset:
}
page = __alloc_pages_nodemask(gfp, order,
policy_zonelist(gfp, pol, node),
- policy_nodemask(gfp, pol));
+ policy_nodemask(gfp, pol),
+ NULL, NULL);
if (unlikely(mpol_needs_cond_ref(pol)))
__mpol_put(pol);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
@@ -2052,7 +2053,8 @@ retry_cpuset:
else
page = __alloc_pages_nodemask(gfp, order,
policy_zonelist(gfp, pol, numa_node_id()),
- policy_nodemask(gfp, pol));
+ policy_nodemask(gfp, pol),
+ NULL, NULL);

if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3edb62..0ba9f63 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1846,7 +1846,8 @@ static inline void init_zone_allows_reclaim(int nid)
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
- struct zone *preferred_zone, int migratetype)
+ struct zone *preferred_zone, int migratetype,
+ unsigned long *nr_pages, struct page **pages)
{
struct zoneref *z;
struct page *page = NULL;
@@ -1968,8 +1969,33 @@ zonelist_scan:
try_this_zone:
page = buffered_rmqueue(preferred_zone, zone, order,
gfp_mask, migratetype);
- if (page)
+ if (page) {
+ unsigned long mark;
+ unsigned long count;
+ unsigned long nr;
+
+ if (likely(!nr_pages))
+ break;
+
+ count = 0;
+ pages[count++] = page;
+ mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+ nr = *nr_pages;
+ while (count < nr) {
+ if (!zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags))
+ break;
+ page = buffered_rmqueue(preferred_zone, zone,
+ order, gfp_mask, migratetype);
+ if (!page)
+ break;
+ pages[count++] = page;
+ }
+ *nr_pages = count;
+ page = pages[0];
break;
+ }
+
this_zone_full:
if (IS_ENABLED(CONFIG_NUMA))
zlc_mark_zone_full(zonelist, z);
@@ -2125,7 +2151,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
order, zonelist, high_zoneidx,
ALLOC_WMARK_HIGH|ALLOC_CPUSET,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ NULL, NULL);
if (page)
goto out;

@@ -2188,7 +2215,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
page = get_page_from_freelist(gfp_mask, nodemask,
order, zonelist, high_zoneidx,
alloc_flags & ~ALLOC_NO_WATERMARKS,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ NULL, NULL);
if (page) {
preferred_zone->compact_blockskip_flush = false;
preferred_zone->compact_considered = 0;
@@ -2282,7 +2310,8 @@ retry:
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
alloc_flags & ~ALLOC_NO_WATERMARKS,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ NULL, NULL);

/*
* If an allocation failed after direct reclaim, it could be because
@@ -2312,7 +2341,8 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
do {
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ NULL, NULL);

if (!page && gfp_mask & __GFP_NOFAIL)
wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2449,7 +2479,8 @@ rebalance:
/* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ NULL, NULL);
if (page)
goto got_pg;

@@ -2598,7 +2629,8 @@ got_pg:
*/
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask)
+ struct zonelist *zonelist, nodemask_t *nodemask,
+ unsigned long *nr_pages, struct page **pages)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
struct zone *preferred_zone;
@@ -2647,9 +2679,11 @@ retry_cpuset:
alloc_flags |= ALLOC_CMA;
#endif
/* First allocation attempt */
+ /* We only try to allocate nr_pages in first attempt */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, alloc_flags,
- preferred_zone, migratetype);
+ preferred_zone, migratetype,
+ nr_pages, pages);
if (unlikely(!page)) {
/*
* Runtime PM, block IO and its error handling path
--
1.7.9.5

2013-07-04 04:29:27

by Joonsoo Kim

Subject: Re: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

On Wed, Jul 03, 2013 at 03:57:45PM +0000, Christoph Lameter wrote:
> On Wed, 3 Jul 2013, Joonsoo Kim wrote:
>
> > @@ -298,13 +298,15 @@ static inline void arch_alloc_page(struct page *page, int order) { }
> >
> > struct page *
> > __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> > - struct zonelist *zonelist, nodemask_t *nodemask);
> > + struct zonelist *zonelist, nodemask_t *nodemask,
> > + unsigned long *nr_pages, struct page **pages);
> >
>
> Add a separate function for the allocation of multiple pages instead?

Hello.

I will try it, but I don't like implementing a separate function.
Please refer to my reply to [0/5].

Thanks.

2013-07-04 10:00:54

by Michal Hocko

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On Thu 04-07-13 13:24:50, Joonsoo Kim wrote:
> On Thu, Jul 04, 2013 at 12:01:43AM +0800, Zhang Yanfei wrote:
> > On 07/03/2013 11:51 PM, Zhang Yanfei wrote:
> > > On 07/03/2013 11:28 PM, Michal Hocko wrote:
> > >> On Wed 03-07-13 17:34:15, Joonsoo Kim wrote:
> > >> [...]
> > >>> For one page allocation at once, this patchset makes allocator slower than
> > >>> before (-5%).
> > >>
> > >> Slowing down the most used path is a no-go. Where does this slow down
> > >> come from?
> > >
> > > I guess, it might be: for one page allocation at once, comparing to the original
> > > code, this patch adds two parameters nr_pages and pages and will do extra checks
> > > for the parameter nr_pages in the allocation path.
> > >
> >
> > If so, adding a separate path for the multiple allocations seems better.
>
> Hello, all.
>
> I modify the code for optimizing one page allocation via likely macro.
> I attach a new one at the end of this mail.
>
> In this case, performance degradation for one page allocation at once is -2.5%.
> I guess, remained overhead comes from two added parameters.
> Is it unreasonable cost to support this new feature?

Which benchmark you are using for this testing?

> I think that readahead path is one of the most used path, so this penalty looks
> endurable. And after supporting this feature, we can find more use cases.

What about page faults? I would oppose that page faults are _much_ more
frequent than read ahead so you really cannot slow them down.

[...]
--
Michal Hocko
SUSE Labs

2013-07-10 00:31:46

by Joonsoo Kim

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On Thu, Jul 04, 2013 at 12:00:44PM +0200, Michal Hocko wrote:
> On Thu 04-07-13 13:24:50, Joonsoo Kim wrote:
> > On Thu, Jul 04, 2013 at 12:01:43AM +0800, Zhang Yanfei wrote:
> > > On 07/03/2013 11:51 PM, Zhang Yanfei wrote:
> > > > On 07/03/2013 11:28 PM, Michal Hocko wrote:
> > > >> On Wed 03-07-13 17:34:15, Joonsoo Kim wrote:
> > > >> [...]
> > > >>> For one page allocation at once, this patchset makes allocator slower than
> > > >>> before (-5%).
> > > >>
> > > >> Slowing down the most used path is a no-go. Where does this slow down
> > > >> come from?
> > > >
> > > > I guess, it might be: for one page allocation at once, comparing to the original
> > > > code, this patch adds two parameters nr_pages and pages and will do extra checks
> > > > for the parameter nr_pages in the allocation path.
> > > >
> > >
> > > If so, adding a separate path for the multiple allocations seems better.
> >
> > Hello, all.
> >
> > I modify the code for optimizing one page allocation via likely macro.
> > I attach a new one at the end of this mail.
> >
> > In this case, performance degradation for one page allocation at once is -2.5%.
> > I guess, remained overhead comes from two added parameters.
> > Is it unreasonable cost to support this new feature?
>
> Which benchmark you are using for this testing?

I use my own module which do allocation repeatedly.

>
> > I think that readahead path is one of the most used path, so this penalty looks
> > endurable. And after supporting this feature, we can find more use cases.
>
> What about page faults? I would oppose that page faults are _much_ more
> frequent than read ahead so you really cannot slow them down.

You mean page faults for anon?
Yes. I also think that it is much more frequent than read ahead.
Before further discussion, I will try to add a separate path
for the multiple allocations.

Thanks.

>
> [...]
> --
> Michal Hocko
> SUSE Labs
>

2013-07-10 01:20:37

by Zhang Yanfei

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On 2013/7/10 8:31, Joonsoo Kim wrote:
> On Thu, Jul 04, 2013 at 12:00:44PM +0200, Michal Hocko wrote:
>> On Thu 04-07-13 13:24:50, Joonsoo Kim wrote:
>>> On Thu, Jul 04, 2013 at 12:01:43AM +0800, Zhang Yanfei wrote:
>>>> On 07/03/2013 11:51 PM, Zhang Yanfei wrote:
>>>>> On 07/03/2013 11:28 PM, Michal Hocko wrote:
>>>>>> On Wed 03-07-13 17:34:15, Joonsoo Kim wrote:
>>>>>> [...]
>>>>>>> For one page allocation at once, this patchset makes allocator slower than
>>>>>>> before (-5%).
>>>>>>
>>>>>> Slowing down the most used path is a no-go. Where does this slow down
>>>>>> come from?
>>>>>
>>>>> I guess, it might be: for one page allocation at once, comparing to the original
>>>>> code, this patch adds two parameters nr_pages and pages and will do extra checks
>>>>> for the parameter nr_pages in the allocation path.
>>>>>
>>>>
>>>> If so, adding a separate path for the multiple allocations seems better.
>>>
>>> Hello, all.
>>>
>>> I modify the code for optimizing one page allocation via likely macro.
>>> I attach a new one at the end of this mail.
>>>
>>> In this case, performance degradation for one page allocation at once is -2.5%.
>>> I guess, remained overhead comes from two added parameters.
>>> Is it unreasonable cost to support this new feature?
>>
>> Which benchmark you are using for this testing?
>
> I use my own module which do allocation repeatedly.
>
>>
>>> I think that readahead path is one of the most used path, so this penalty looks
>>> endurable. And after supporting this feature, we can find more use cases.
>>
>> What about page faults? I would oppose that page faults are _much_ more
>> frequent than read ahead so you really cannot slow them down.
>
> You mean page faults for anon?
> Yes. I also think that it is much more frequent than read ahead.
> Before further discussion, I will try to add a separate path
> for the multiple allocations.

Some days ago, I was thinking that this multiple allocation behaviour
may be useful for vmalloc allocations. So I think it is worth trying.

>
> Thanks.
>
>>
>> [...]
>> --
>> Michal Hocko
>> SUSE Labs
>>
>


--
Thanks.
Zhang Yanfei

2013-07-10 09:17:10

by Michal Hocko

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On Wed 10-07-13 09:31:42, Joonsoo Kim wrote:
> On Thu, Jul 04, 2013 at 12:00:44PM +0200, Michal Hocko wrote:
> > On Thu 04-07-13 13:24:50, Joonsoo Kim wrote:
> > > On Thu, Jul 04, 2013 at 12:01:43AM +0800, Zhang Yanfei wrote:
> > > > On 07/03/2013 11:51 PM, Zhang Yanfei wrote:
> > > > > On 07/03/2013 11:28 PM, Michal Hocko wrote:
> > > > >> On Wed 03-07-13 17:34:15, Joonsoo Kim wrote:
> > > > >> [...]
> > > > >>> For one page allocation at once, this patchset makes allocator slower than
> > > > >>> before (-5%).
> > > > >>
> > > > >> Slowing down the most used path is a no-go. Where does this slow down
> > > > >> come from?
> > > > >
> > > > > I guess, it might be: for one page allocation at once, comparing to the original
> > > > > code, this patch adds two parameters nr_pages and pages and will do extra checks
> > > > > for the parameter nr_pages in the allocation path.
> > > > >
> > > >
> > > > If so, adding a separate path for the multiple allocations seems better.
> > >
> > > Hello, all.
> > >
> > > I modify the code for optimizing one page allocation via likely macro.
> > > I attach a new one at the end of this mail.
> > >
> > > In this case, performance degradation for one page allocation at once is -2.5%.
> > > I guess, remained overhead comes from two added parameters.
> > > Is it unreasonable cost to support this new feature?
> >
> > Which benchmark you are using for this testing?
>
> I use my own module which do allocation repeatedly.

I am not sure this microbenchmark will tell us much. Allocations are
usually not short lived so the longer time might get amortized.
If you want to use the multi page allocation for read ahead then try to
model your numbers on read-ahead workloads.
--
Michal Hocko
SUSE Labs

2013-07-10 09:55:38

by Joonsoo Kim

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On Wed, Jul 10, 2013 at 11:17:03AM +0200, Michal Hocko wrote:
> On Wed 10-07-13 09:31:42, Joonsoo Kim wrote:
> > On Thu, Jul 04, 2013 at 12:00:44PM +0200, Michal Hocko wrote:
> > > On Thu 04-07-13 13:24:50, Joonsoo Kim wrote:
> > > > On Thu, Jul 04, 2013 at 12:01:43AM +0800, Zhang Yanfei wrote:
> > > > > On 07/03/2013 11:51 PM, Zhang Yanfei wrote:
> > > > > > On 07/03/2013 11:28 PM, Michal Hocko wrote:
> > > > > >> On Wed 03-07-13 17:34:15, Joonsoo Kim wrote:
> > > > > >> [...]
> > > > > >>> For one page allocation at once, this patchset makes allocator slower than
> > > > > >>> before (-5%).
> > > > > >>
> > > > > >> Slowing down the most used path is a no-go. Where does this slow down
> > > > > >> come from?
> > > > > >
> > > > > > I guess, it might be: for one page allocation at once, comparing to the original
> > > > > > code, this patch adds two parameters nr_pages and pages and will do extra checks
> > > > > > for the parameter nr_pages in the allocation path.
> > > > > >
> > > > >
> > > > > If so, adding a separate path for the multiple allocations seems better.
> > > >
> > > > Hello, all.
> > > >
> > > > I modify the code for optimizing one page allocation via likely macro.
> > > > I attach a new one at the end of this mail.
> > > >
> > > > In this case, performance degradation for one page allocation at once is -2.5%.
> > > > I guess, remained overhead comes from two added parameters.
> > > > Is it unreasonable cost to support this new feature?
> > >
> > > Which benchmark you are using for this testing?
> >
> > I use my own module which do allocation repeatedly.
>
> I am not sure this microbenchmark will tell us much. Allocations are
> usually not short lived so the longer time might get amortized.
> If you want to use the multi page allocation for read ahead then try to
> model your numbers on read-ahead workloads.

Of course. Later, I will get the result on read-ahead workloads or
vmalloc workload which is recommended by Zhang.

I think, without this microbenchmark, we cannot know this modification's
performance effect to single page allocation accurately. Because the impact
to single page allocation is relatively small and it is easily hidden by
other factors.

Now, I have tried several implementations of this feature and found that
a separate path also makes single page allocation slower (-1.0~-1.5%).
I didn't find any reason for this except the fact that the text size of
page_alloc.o is 1500 bytes larger than before.

Before
text data bss dec hex filename
34466 1389 640 36495 8e8f mm/page_alloc.o

sep
text data bss dec hex filename
36074 1413 640 38127 94ef mm/page_alloc.o

A not-yet-posted implementation which passes two more arguments to
__alloc_pages_nodemask() also makes single page allocation slower
(-1.0~-1.5%). So later, I will work with this implementation, not the
separate path implementation.

Thanks for comment!

> --
> Michal Hocko
> SUSE Labs
>

2013-07-10 09:56:11

by Joonsoo Kim

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On Wed, Jul 10, 2013 at 09:20:27AM +0800, Zhang Yanfei wrote:
> On 2013/7/10 8:31, Joonsoo Kim wrote:
> > On Thu, Jul 04, 2013 at 12:00:44PM +0200, Michal Hocko wrote:
> >> On Thu 04-07-13 13:24:50, Joonsoo Kim wrote:
> >>> On Thu, Jul 04, 2013 at 12:01:43AM +0800, Zhang Yanfei wrote:
> >>>> On 07/03/2013 11:51 PM, Zhang Yanfei wrote:
> >>>>> On 07/03/2013 11:28 PM, Michal Hocko wrote:
> >>>>>> On Wed 03-07-13 17:34:15, Joonsoo Kim wrote:
> >>>>>> [...]
> >>>>>>> For one page allocation at once, this patchset makes allocator slower than
> >>>>>>> before (-5%).
> >>>>>>
> >>>>>> Slowing down the most used path is a no-go. Where does this slow down
> >>>>>> come from?
> >>>>>
> >>>>> I guess, it might be: for one page allocation at once, comparing to the original
> >>>>> code, this patch adds two parameters nr_pages and pages and will do extra checks
> >>>>> for the parameter nr_pages in the allocation path.
> >>>>>
> >>>>
> >>>> If so, adding a separate path for the multiple allocations seems better.
> >>>
> >>> Hello, all.
> >>>
> >>> I modify the code for optimizing one page allocation via likely macro.
> >>> I attach a new one at the end of this mail.
> >>>
> >>> In this case, performance degradation for one page allocation at once is -2.5%.
> >>> I guess, remained overhead comes from two added parameters.
> >>> Is it unreasonable cost to support this new feature?
> >>
> >> Which benchmark you are using for this testing?
> >
> > I use my own module which do allocation repeatedly.
> >
> >>
> >>> I think that readahead path is one of the most used path, so this penalty looks
> >>> endurable. And after supporting this feature, we can find more use cases.
> >>
> >> What about page faults? I would oppose that page faults are _much_ more
> >> frequent than read ahead so you really cannot slow them down.
> >
> > You mean page faults for anon?
> > Yes. I also think that it is much more frequent than read ahead.
> > Before further discussion, I will try to add a separate path
> > for the multiple allocations.
>
> Some days ago, I was thinking that this multiple allocation behaviour
> may be useful for vmalloc allocations. So I think it is worth trying.

Yeh! I think so!

Thanks.

>
> >
> > Thanks.
> >
> >>
> >> [...]
> >> --
> >> Michal Hocko
> >> SUSE Labs
> >>
> >
>
>
> --
> Thanks.
> Zhang Yanfei
>

2013-07-10 11:27:42

by Michal Hocko

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On Wed 10-07-13 18:55:33, Joonsoo Kim wrote:
> On Wed, Jul 10, 2013 at 11:17:03AM +0200, Michal Hocko wrote:
> > On Wed 10-07-13 09:31:42, Joonsoo Kim wrote:
> > > On Thu, Jul 04, 2013 at 12:00:44PM +0200, Michal Hocko wrote:
[...]
> > > > Which benchmark you are using for this testing?
> > >
> > > I use my own module which do allocation repeatedly.
> >
> > I am not sure this microbenchmark will tell us much. Allocations are
> > usually not short lived so the longer time might get amortized.
> > If you want to use the multi page allocation for read ahead then try to
> > model your numbers on read-ahead workloads.
>
> Of course. Later, I will get the result on read-ahead workloads or
> vmalloc workload which is recommended by Zhang.
>
> I think, without this microbenchmark, we cannot know this modification's
> performance effect to single page allocation accurately. Because the impact
> to single page allocation is relatively small and it is easily hidden by
> other factors.

The main thing is whether the numbers you get from an artificial
microbenchmark matter at all. You might see a regression which cannot be
hit in practice because other effects are of magnitude more significant.
--
Michal Hocko
SUSE Labs

2013-07-10 22:53:24

by Dave Hansen

Subject: Re: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

On 07/03/2013 01:34 AM, Joonsoo Kim wrote:
> - if (page)
> + do {
> + page = buffered_rmqueue(preferred_zone, zone, order,
> + gfp_mask, migratetype);
> + if (!page)
> + break;
> +
> + if (!nr_pages) {
> + count++;
> + break;
> + }
> +
> + pages[count++] = page;
> + if (count >= *nr_pages)
> + break;
> +
> + mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
> + if (!zone_watermark_ok(zone, order, mark,
> + classzone_idx, alloc_flags))
> + break;
> + } while (1);

I'm really surprised this works as well as it does. Calling
buffered_rmqueue() a bunch of times enables/disables interrupts a bunch
of times, and mucks with the percpu pages lists a whole bunch.
buffered_rmqueue() is really meant for _single_ pages, not to be called
a bunch of times in a row.

Why not just do a single rmqueue_bulk() call?

2013-07-11 01:02:49

by Joonsoo Kim

Subject: Re: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

On Wed, Jul 10, 2013 at 03:52:42PM -0700, Dave Hansen wrote:
> On 07/03/2013 01:34 AM, Joonsoo Kim wrote:
> > - if (page)
> > + do {
> > + page = buffered_rmqueue(preferred_zone, zone, order,
> > + gfp_mask, migratetype);
> > + if (!page)
> > + break;
> > +
> > + if (!nr_pages) {
> > + count++;
> > + break;
> > + }
> > +
> > + pages[count++] = page;
> > + if (count >= *nr_pages)
> > + break;
> > +
> > + mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
> > + if (!zone_watermark_ok(zone, order, mark,
> > + classzone_idx, alloc_flags))
> > + break;
> > + } while (1);
>
> I'm really surprised this works as well as it does. Calling
> buffered_rmqueue() a bunch of times enables/disables interrupts a bunch
> of times, and mucks with the percpu pages lists a whole bunch.
> buffered_rmqueue() is really meant for _single_ pages, not to be called
> a bunch of times in a row.
>
> Why not just do a single rmqueue_bulk() call?

Hello, Dave.

There are some reasons why I implement the feature in this way.

rmqueue_bulk() needs a zone lock. If we allocate not so many pages,
for example, 2 or 3 pages, it can have much more overhead rather than
allocating 1 page multiple times. So, IMHO, it is better that
multiple pages allocation is supported on top of percpu pages list.

And I think that enables/disables interrupts a bunch of times help
to avoid a latency problem. If we disable interrupts until the whole works
is finished, interrupts can be handled too lately.
free_hot_cold_page_list() already enables/disables interrupts a bunch of
times.

Thanks for helpful comment!

>

2013-07-11 01:05:11

by Joonsoo Kim

Subject: Re: [RFC PATCH 0/5] Support multiple pages allocation

On Wed, Jul 10, 2013 at 01:27:37PM +0200, Michal Hocko wrote:
> On Wed 10-07-13 18:55:33, Joonsoo Kim wrote:
> > On Wed, Jul 10, 2013 at 11:17:03AM +0200, Michal Hocko wrote:
> > > On Wed 10-07-13 09:31:42, Joonsoo Kim wrote:
> > > > On Thu, Jul 04, 2013 at 12:00:44PM +0200, Michal Hocko wrote:
> [...]
> > > > > Which benchmark you are using for this testing?
> > > >
> > > > I use my own module which do allocation repeatedly.
> > >
> > > I am not sure this microbenchmark will tell us much. Allocations are
> > > usually not short lived so the longer time might get amortized.
> > > If you want to use the multi page allocation for read ahead then try to
> > > model your numbers on read-ahead workloads.
> >
> > Of course. Later, I will get the result on read-ahead workloads or
> > vmalloc workload which is recommended by Zhang.
> >
> > I think, without this microbenchmark, we cannot know this modification's
> > performance effect to single page allocation accurately. Because the impact
> > to single page allocation is relatively small and it is easily hidden by
> > other factors.
>
> The main thing is whether the numbers you get from an artificial
> microbenchmark matter at all. You might see a regression which cannot be
> hit in practice because other effects are of magnitude more significant.

Okay. I will keep this in mind.

Thanks for your comment.

> --
> Michal Hocko
> SUSE Labs
>

2013-07-11 05:38:23

by Dave Hansen

Subject: Re: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

On 07/10/2013 06:02 PM, Joonsoo Kim wrote:
> On Wed, Jul 10, 2013 at 03:52:42PM -0700, Dave Hansen wrote:
>> On 07/03/2013 01:34 AM, Joonsoo Kim wrote:
>>> - if (page)
>>> + do {
>>> + page = buffered_rmqueue(preferred_zone, zone, order,
>>> + gfp_mask, migratetype);
>>> + if (!page)
>>> + break;
>>> +
>>> + if (!nr_pages) {
>>> + count++;
>>> + break;
>>> + }
>>> +
>>> + pages[count++] = page;
>>> + if (count >= *nr_pages)
>>> + break;
>>> +
>>> + mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
>>> + if (!zone_watermark_ok(zone, order, mark,
>>> + classzone_idx, alloc_flags))
>>> + break;
>>> + } while (1);
>>
>> I'm really surprised this works as well as it does. Calling
>> buffered_rmqueue() a bunch of times enables/disables interrupts a bunch
>> of times, and mucks with the percpu pages lists a whole bunch.
>> buffered_rmqueue() is really meant for _single_ pages, not to be called
>> a bunch of times in a row.
>>
>> Why not just do a single rmqueue_bulk() call?
>
> rmqueue_bulk() needs a zone lock. If we allocate not so many pages,
> for example, 2 or 3 pages, it can have much more overhead rather than
> allocating 1 page multiple times. So, IMHO, it is better that
> multiple pages allocation is supported on top of percpu pages list.

It makes _zero_ sense to be doing a number of buffered_rmqueue() calls
that are approaching the size of the percpu pages batches. If you end
up doing that, you pay both the overhead in manipulating both the percpu
page data _and_ the buddy lists.

You're probably right for small numbers of pages. But, if we're talking
about things that are more than, say, 100 pages (isn't the pcp batch
size clamped to 128 4k pages?) you surely don't want to be doing
buffered_rmqueue().

I'd also like to see some scalability numbers on this. How do your
tests look when all the CPUs on the system are hammering away?

> And I think that enables/disables interrupts a bunch of times help
> to avoid a latency problem. If we disable interrupts until the whole works
> is finished, interrupts can be handled too lately.
> free_hot_cold_page_list() already enables/disables interrupts a bunch of
> times.

I don't think interrupt latency is going to be a problem on the scale
we're talking about here. There are much, much, much worse offenders in
the kernel.

2013-07-11 06:12:07

by Joonsoo Kim

Subject: Re: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

On Wed, Jul 10, 2013 at 10:38:20PM -0700, Dave Hansen wrote:
> On 07/10/2013 06:02 PM, Joonsoo Kim wrote:
> > On Wed, Jul 10, 2013 at 03:52:42PM -0700, Dave Hansen wrote:
> >> On 07/03/2013 01:34 AM, Joonsoo Kim wrote:
> >>> - if (page)
> >>> + do {
> >>> + page = buffered_rmqueue(preferred_zone, zone, order,
> >>> + gfp_mask, migratetype);
> >>> + if (!page)
> >>> + break;
> >>> +
> >>> + if (!nr_pages) {
> >>> + count++;
> >>> + break;
> >>> + }
> >>> +
> >>> + pages[count++] = page;
> >>> + if (count >= *nr_pages)
> >>> + break;
> >>> +
> >>> + mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
> >>> + if (!zone_watermark_ok(zone, order, mark,
> >>> + classzone_idx, alloc_flags))
> >>> + break;
> >>> + } while (1);
> >>
> >> I'm really surprised this works as well as it does. Calling
> >> buffered_rmqueue() a bunch of times enables/disables interrupts a bunch
> >> of times, and mucks with the percpu pages lists a whole bunch.
> >> buffered_rmqueue() is really meant for _single_ pages, not to be called
> >> a bunch of times in a row.
> >>
> >> Why not just do a single rmqueue_bulk() call?
> >
> > rmqueue_bulk() needs a zone lock. If we allocate not so many pages,
> > for example, 2 or 3 pages, it can have much more overhead rather than
> > allocating 1 page multiple times. So, IMHO, it is better that
> > multiple pages allocation is supported on top of percpu pages list.
>
> It makes _zero_ sense to be doing a number of buffered_rmqueue() calls
> that are approaching the size of the percpu pages batches. If you end
> up doing that, you pay both the overhead in manipulating both the percpu
> page data _and_ the buddy lists.
>
> You're probably right for small numbers of pages. But, if we're talking
> about things that are more than, say, 100 pages (isn't the pcp batch
> size clamped to 128 4k pages?) you surely don't want to be doing
> buffered_rmqueue().

Yes, you are right.
Firstly, I thought that I can use this for readahead. On my machine,
readahead reads (maximum) 32 pages in advance if faulted. And batch size
of percpu pages list is close to or larger than 32 pages
on today's machine. So I didn't consider more than 32 pages before.
But to cope with a request for more pages, using rmqueue_bulk() is
a right way. How about using rmqueue_bulk() conditionally?
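
The bulk branch of such a conditional could look roughly like the sketch
below. It is only an illustration: it assumes v3.10's internal
rmqueue_bulk(zone, order, count, list, migratetype, cold), which takes
zone->lock itself and expects the caller to have interrupts disabled, and
the helper name is made up.

        /*
         * Sketch only: take 'count' order-0 pages straight from the buddy
         * lists, bypassing the percpu lists.
         */
        static int bulk_grab_pages(struct zone *zone, int migratetype,
                                   int cold, unsigned long count,
                                   struct page **pages)
        {
                LIST_HEAD(list);
                struct page *page, *next;
                unsigned long flags;
                int i = 0;

                local_irq_save(flags);
                rmqueue_bulk(zone, 0, count, &list, migratetype, cold);
                local_irq_restore(flags);

                list_for_each_entry_safe(page, next, &list, lru) {
                        list_del(&page->lru);
                        /* each page still needs the prep_new_page() step
                         * that buffered_rmqueue() would have done */
                        pages[i++] = page;
                }
                return i;
        }
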

>
> I'd also like to see some scalability numbers on this. How do your
> tests look when all the CPUs on the system are hammering away?

What test do you mean?
Please elaborate on this more.

> > And I think that enables/disables interrupts a bunch of times help
> > to avoid a latency problem. If we disable interrupts until the whole works
> > is finished, interrupts can be handled too lately.
> > free_hot_cold_page_list() already enables/disables interrupts a bunch of
> > times.
>
> I don't think interrupt latency is going to be a problem on the scale
> we're talking about here. There are much, much, much worse offenders in
> the kernel.

Hmm, rmqueue_bulk() doesn't stop until all requested pages are allocated.
If we request too many pages (1024 pages or more), interrupt latency can
be a problem.

Thanks!!


2013-07-11 15:51:29

by Dave Hansen

Subject: Re: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

On 07/10/2013 11:12 PM, Joonsoo Kim wrote:
>> > I'd also like to see some scalability numbers on this. How do your
>> > tests look when all the CPUs on the system are hammering away?
> What test do you mean?
> Please elaborate on this more

Your existing tests looked single-threaded. That's certainly part of
the problem. Will your patches have implications for larger systems,
though? How much do your patches speed up or slow things down if we
have many allocations proceeding on many CPUs in parallel?

2013-07-12 16:31:50

by Dave Hansen

Subject: Re: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

On 07/10/2013 11:12 PM, Joonsoo Kim wrote:
> On Wed, Jul 10, 2013 at 10:38:20PM -0700, Dave Hansen wrote:
>> You're probably right for small numbers of pages. But, if we're talking
>> about things that are more than, say, 100 pages (isn't the pcp batch
>> size clamped to 128 4k pages?) you surely don't want to be doing
>> buffered_rmqueue().
>
> Yes, you are right.
> Firstly, I thought that I can use this for readahead. On my machine,
> readahead reads (maximum) 32 pages in advance if faulted. And batch size
> of percpu pages list is close to or larger than 32 pages
> on today's machine. So I didn't consider more than 32 pages before.
> But to cope with a request for more pages, using rmqueue_bulk() is
> a right way. How about using rmqueue_bulk() conditionally?

How about you test it both ways and see what is faster?

> Hmm, rmqueue_bulk() doesn't stop until all requested pages are allocated.
> If we request too many pages (1024 pages or more), interrupt latency can
> be a problem.

OK, so only call it for the number of pages you believe allows it to
have acceptable interrupt latency. If you want 200 pages, and you can
only disable interrupts for 100 pages, then just do it in two batches.

The point is that you want to avoid messing with the buffering by the
percpu structures. They're just overhead in your case.

2013-07-16 00:26:54

by Joonsoo Kim

Subject: Re: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

On Thu, Jul 11, 2013 at 08:51:22AM -0700, Dave Hansen wrote:
> On 07/10/2013 11:12 PM, Joonsoo Kim wrote:
> >> > I'd also like to see some scalability numbers on this. How do your
> >> > tests look when all the CPUs on the system are hammering away?
> > What test do you mean?
> > Please elaborate on this more
>
> Your existing tests looked single-threaded. That's certainly part of
> the problem. Will your patches have implications for larger systems,
> though? How much do your patches speed up or slow things down if we
> have many allocations proceeding on many CPUs in parallel?

Hello.

Okay, I will do that and attach the result to v2.

Thanks.

>

2013-07-16 00:37:55

by Joonsoo Kim

Subject: Re: [RFC PATCH 1/5] mm, page_alloc: support multiple pages allocation

On Fri, Jul 12, 2013 at 09:31:42AM -0700, Dave Hansen wrote:
> On 07/10/2013 11:12 PM, Joonsoo Kim wrote:
> > On Wed, Jul 10, 2013 at 10:38:20PM -0700, Dave Hansen wrote:
> >> You're probably right for small numbers of pages. But, if we're talking
> >> about things that are more than, say, 100 pages (isn't the pcp batch
> >> size clamped to 128 4k pages?) you surely don't want to be doing
> >> buffered_rmqueue().
> >
> > Yes, you are right.
> > Firstly, I thought that I can use this for readahead. On my machine,
> > readahead reads (maximum) 32 pages in advance if faulted. And batch size
> > of percpu pages list is close to or larger than 32 pages
> > on today's machine. So I didn't consider more than 32 pages before.
> > But to cope with a request for more pages, using rmqueue_bulk() is
> > a right way. How about using rmqueue_bulk() conditionally?
>
> How about you test it both ways and see what is faster?

It is not easy to test which one is better, because a difference may
appear only under certain circumstances. Avoiding the global lock as much
as possible is the preferable approach to me.

>
> > Hmm, rmqueue_bulk() doesn't stop until all requested pages are allocated.
> > If we request too many pages (1024 pages or more), interrupt latency can
> > be a problem.
>
> OK, so only call it for the number of pages you believe allows it to
> have acceptable interrupt latency. If you want 200 pages, and you can
> only disable interrupts for 100 pages, then just do it in two batches.
>
> The point is that you want to avoid messing with the buffering by the
> percpu structures. They're just overhead in your case.

Okay.

Thanks.

>