Hi Andrew,
Here is a restacked version of the grouping pages by mobility patches
based on the patches currently in your tree. It should be a drop-in
replacement for what is in 2.6.23-rc4-mm1 and is what I propose for merging
to mainline. The change from what you have already is that the redundant
patches are removed. For example, the patches that made grouping pages by
mobility configurable and later removed that ability do not exist in this
set. Similarly, the patches for grouping high-order atomic allocations
together do not exist. Also note that the first patch related to IA-64 in
this set appears unrelated, but it is required by later patches and having
the change at the start makes the patchset's dependencies easier to follow. This
rebasing work is largely the work of Andy Whitcroft. Thanks Andy.
The patches replaced in -mm are as follows:
add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages.patch
split-the-free-lists-for-movable-and-unmovable-allocations.patch
choose-pages-from-the-per-cpu-list-based-on-migration-type.patch
add-a-configure-option-to-group-pages-by-mobility.patch
drain-per-cpu-lists-when-high-order-allocations-fail.patch
move-free-pages-between-lists-on-steal.patch
group-short-lived-and-reclaimable-kernel-allocations.patch
group-high-order-atomic-allocations.patch
do-not-group-pages-by-mobility-type-on-low-memory-systems.patch
bias-the-placement-of-kernel-pages-at-lower-pfns.patch
be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback.patch
fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2.patch
fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2-fix.patch
fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2-fix-fix.patch
bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.patch
remove-page_group_by_mobility.patch
dont-group-high-order-atomic-allocations.patch
fix-calculation-in-move_freepages_block-for-counting-pages.patch
do-not-depend-on-max_order-when-grouping-pages-by-mobility.patch
print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo.patch
Note that the patch
breakout-page_order-to-internalh-to-avoid-special-knowledge-of-the-buddy-allocator.patch
is not in the list and remains in -mm as part of page-owner tracking. In
the series file, the breakout patch is placed after this new patchset.
To refresh:
The objective of this patchset is to keep the system in a state where actions
such as page reclaim or memory compaction will reduce external fragmentation
in the system. It works by grouping pages of similar mobility together in
PAGEBLOCK_NR_PAGES areas. The types of mobility are
UNMOVABLE - Pages that cannot be trivially reclaimed or moved
MOVABLE - Pages that can be moved using the page migration mechanism
RECLAIMABLE - Pages that the kernel can often directly reclaim, such as
              those used for inode caches
RESERVE - The areas where min_free_kbytes-related pages should be stored
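As a rough sketch of how an allocation's gfp flags select one of these
groups (illustration only; the real helper, allocflags_to_migratetype(),
appears in the patches below, and the RESERVE type is assigned by the
allocator itself rather than requested via gfp flags):

	/* Sketch: map gfp flags to a mobility group */
	static inline int gfp_to_mobility_sketch(gfp_t gfp_flags)
	{
		if (gfp_flags & __GFP_MOVABLE)
			return MIGRATE_MOVABLE;		/* e.g. user pages, page cache */
		if (gfp_flags & __GFP_RECLAIMABLE)
			return MIGRATE_RECLAIMABLE;	/* e.g. inode caches */
		return MIGRATE_UNMOVABLE;		/* default for kernel allocations */
	}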
Instead of having a single free list in each MAX_ORDER-indexed struct
free_area, there is one free list for each type of mobility. Once a
2^pageblock_order (typically
the size of the system large page) area of pages is split for a type of
allocation, the remaining unused portion is placed on the free-lists for
that type prioritising its use for compatible mobility allocations. Hence,
over time, pages of the different types can be clustered together.
When the preferred freelists are exhausted, the largest possible block is
taken from an alternative list. Again, the unused portion is placed on the
free lists of the preferred allocation type.
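For example (an illustrative walk-through of the fallbacks[] array defined
in the patches below): if an UNMOVABLE allocation finds its own free lists
empty, the allocator searches the RECLAIMABLE and then the MOVABLE lists,
starting at the largest order available, so that as much as possible of the
foreign block can be repatriated to the UNMOVABLE free lists.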
This grouping clearly requires additional work in the page allocator.
kernbench shows effectively no performance difference, varying between -0.2%
and +1% on a variety of test machines. Success rates for huge page allocation
are dramatically increased. For example, on a ppc64 machine, the vanilla
kernel was only able to allocate 1% of memory as hugepages, and that was
due to a single hugepage reserved because of min_free_kbytes. With these
patches applied, 40% of memory was allocatable as hugepages.
These patches work in conjunction with the ZONE_MOVABLE patches that were
merged for 2.6.23-rc1, particularly the allocations that have already been
flagged as __GFP_MOVABLE.
Changelog Since V29
o Remove redundant patches
o Keep min_free_pages contiguous as much as possible
o Aggressively group RECLAIMABLE pages together
o Bug fixes that were applied during the time in -mm
Changelog Since V28
o Group high-order atomic allocations together
o It is no longer required to set min_free_kbytes to 10% of memory. A value
of 16384 in most cases will be sufficient
o Now applied with zone-based anti-fragmentation
o Fix incorrect VM_BUG_ON within buffered_rmqueue()
o Reorder the stack so later patches do not back out work from earlier patches
o Fix bug where journal pages were being treated as movable
o Bias placement of non-movable pages to lower PFNs
o More aggressive clustering of reclaimable pages in reaction to workloads
  like updatedb that flood the inode caches
Changelog Since V27
o Renamed anti-fragmentation to Page Clustering. Anti-fragmentation was giving
  the mistaken impression that it was a 100% solution for high-order
  allocations. Instead, it greatly increases the chances that high-order
allocations will succeed and lays the foundation for defragmentation and
memory hot-remove to work properly
o Redefine page groupings based on the ability to migrate or reclaim instead
  of on reclaimability alone
o Get rid of spurious inits
o Per-cpu lists are no longer split up per-type. Instead the per-cpu list is
searched for a page of the appropriate type
o Added more explanation commentary
o Fix up bug in pageblock code where bitmap was used before being initialised
Changelog Since V26
o Fix double init of lists in setup_pageset
Changelog Since V25
o Fix loop order of for_each_rclmtype_order so that order of loop matches args
o gfpflags_to_rclmtype uses gfp_t instead of unsigned long
o Rename get_pageblock_type() to get_page_rclmtype()
o Fix alignment problem in move_freepages()
o Add mechanism for assigning flags to blocks of pages instead of page->flags
o On fallback, do not examine the preferred list of free pages a second time
Following this email are 14 patches that implement the page grouping feature.
These apply to mainline but can also act as a drop-in replacement for the
patches that are in -mm.
The first patch changes how IA-64 parses the hugepagesz parameter so that
it occurs before memory initialisation. The second patch adds a bitmap that
stores flags for each PAGEBLOCK_NR_PAGES block in the system. The third
patch is a fix to the pageblock flags patch; it is kept as a separate patch
because it was developed by Bob Picco.
The fourth patch splits the free lists between movable and all other
allocations. Following that is a patch that deals with per-cpu pages so that
the free lists are not contaminated by pages of the wrong mobility type.
Next is a patch to group temporary and reclaimable pages together in the
same areas, and the last functionality patch drains the per-cpu lists when
a high-order allocation fails.
The remaining patches in the set deal with controlling the situations that
can lead to external fragmentation later. They include biasing the location of
unmovable pages to the lower PFNs and being more aggressive about clustering
reclaimable pages together rather than letting them get scattered throughout
the address space, as would happen during activities such as updatedb.
--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
Subject: ia64: parse kernel parameter hugepagesz= in early boot
Parse hugepagesz with early_param() instead of __setup(). __setup()
is called after the memory allocator has been initialised and after the
pageblock bitmaps have already been set up. In tests on one IA64 machine
there did not seem to be any problem with using early_param(), and it may
in fact be more correct as it guarantees the parameter is handled before
hugepages= is parsed.
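The general shape of the conversion is as follows (a sketch for
illustration only; the actual change is in the diff below, and the function
name here is hypothetical). Note that a __setup() handler returns 1 to
indicate the option was handled while an early_param() handler returns 0 on
success, which is why the return value changes:

	static int __init hugetlb_setup_sz_sketch(char *str)
	{
		/* parse str and record the new huge page shift */
		return 0;	/* early_param() convention: 0 on success */
	}
	early_param("hugepagesz", hugetlb_setup_sz_sketch);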
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Andy Whitcroft <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/ia64/Kconfig | 5 +++++
arch/ia64/mm/hugetlbpage.c | 4 ++--
2 files changed, 7 insertions(+), 2 deletions(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-clean/arch/ia64/Kconfig linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/arch/ia64/Kconfig
--- linux-2.6.23-rc5-clean/arch/ia64/Kconfig 2007-09-01 07:08:24.000000000 +0100
+++ linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/arch/ia64/Kconfig 2007-09-02 16:18:48.000000000 +0100
@@ -54,6 +54,11 @@ config ARCH_HAS_ILOG2_U64
bool
default n
+config HUGETLB_PAGE_SIZE_VARIABLE
+ bool
+ depends on HUGETLB_PAGE
+ default y
+
config GENERIC_FIND_NEXT_BIT
bool
default y
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-clean/arch/ia64/mm/hugetlbpage.c linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/arch/ia64/mm/hugetlbpage.c
--- linux-2.6.23-rc5-clean/arch/ia64/mm/hugetlbpage.c 2007-09-01 07:08:24.000000000 +0100
+++ linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/arch/ia64/mm/hugetlbpage.c 2007-09-02 16:18:48.000000000 +0100
@@ -194,6 +194,6 @@ static int __init hugetlb_setup_sz(char
* override here with new page shift.
*/
ia64_set_rr(HPAGE_REGION_BASE, hpage_shift << 2);
- return 1;
+ return 0;
}
-__setup("hugepagesz=", hugetlb_setup_sz);
+early_param("hugepagesz", hugetlb_setup_sz);
Subject: Add a bitmap that is used to track flags affecting a block of pages
The grouping pages by mobility patchset needs to track if pages within a block
can be moved or reclaimed so that pages are freed to the appropriate list.
This patch adds a bitmap for flags affecting a whole pageblock_nr_pages
block of pages.
In non-SPARSEMEM configurations, the bitmap is stored in the struct zone
and allocated during initialisation. SPARSEMEM dynamically allocates the
bitmap in a struct mem_section as required.
Additional credit to Andy Whitcroft, who reviewed an earlier implementation
of the mechanism and suggested how to make it a *lot* cleaner.
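As a usage sketch (hypothetical caller; the accessors themselves are added
to mm/page_alloc.c in the diff below):

	/* Sketch: round-trip every pageblock bit for the block holding @page */
	static void pageblock_flags_example(struct page *page)
	{
		unsigned long flags;

		flags = get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS - 1);
		set_pageblock_flags_group(page, flags, 0, NR_PAGEBLOCK_BITS - 1);
	}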
Signed-off-by: Mel Gorman <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/mmzone.h | 13 +++
include/linux/pageblock-flags.h | 74 ++++++++++++++++++
mm/page_alloc.c | 137 +++++++++++++++++++++++++++++++++++
3 files changed, 224 insertions(+)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/include/linux/mmzone.h linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/include/linux/mmzone.h
--- linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/include/linux/mmzone.h 2007-09-02 16:18:27.000000000 +0100
+++ linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/include/linux/mmzone.h 2007-09-02 16:19:05.000000000 +0100
@@ -13,6 +13,7 @@
#include <linux/init.h>
#include <linux/seqlock.h>
#include <linux/nodemask.h>
+#include <linux/pageblock-flags.h>
#include <asm/atomic.h>
#include <asm/page.h>
@@ -222,6 +223,14 @@ struct zone {
#endif
struct free_area free_area[MAX_ORDER];
+#ifndef CONFIG_SPARSEMEM
+ /*
+ * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
+ * In SPARSEMEM, this map is stored in struct mem_section
+ */
+ unsigned long *pageblock_flags;
+#endif /* CONFIG_SPARSEMEM */
+
ZONE_PADDING(_pad1_)
@@ -708,6 +717,9 @@ extern struct zone *next_zone(struct zon
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define PAGE_SECTION_MASK (~(PAGES_PER_SECTION-1))
+#define SECTION_BLOCKFLAGS_BITS \
+ ((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)
+
#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif
@@ -727,6 +739,7 @@ struct mem_section {
* before using it wrong.
*/
unsigned long section_mem_map;
+ DECLARE_BITMAP(pageblock_flags, SECTION_BLOCKFLAGS_BITS);
};
#ifdef CONFIG_SPARSEMEM_EXTREME
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/include/linux/pageblock-flags.h linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/include/linux/pageblock-flags.h
--- linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/include/linux/pageblock-flags.h 2007-09-02 16:18:27.000000000 +0100
+++ linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/include/linux/pageblock-flags.h 2007-09-02 16:19:05.000000000 +0100
@@ -0,0 +1,74 @@
+/*
+ * Macros for manipulating and testing flags related to a
+ * pageblock_nr_pages number of pages.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation version 2 of the License
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * Original author, Mel Gorman
+ * Major cleanups and reduction of bit operations, Andy Whitcroft
+ */
+#ifndef PAGEBLOCK_FLAGS_H
+#define PAGEBLOCK_FLAGS_H
+
+#include <linux/types.h>
+
+/* Macro to aid the definition of ranges of bits */
+#define PB_range(name, required_bits) \
+ name, name ## _end = (name + required_bits) - 1
+
+/* Bit indices that affect a whole block of pages */
+enum pageblock_bits {
+ NR_PAGEBLOCK_BITS
+};
+
+#ifdef CONFIG_HUGETLB_PAGE
+
+#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
+
+/* Huge page sizes are variable */
+extern int pageblock_order;
+
+#else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
+
+/* Huge pages are a constant size */
+#define pageblock_order HUGETLB_PAGE_ORDER
+
+#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
+
+#else /* CONFIG_HUGETLB_PAGE */
+
+/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
+#define pageblock_order (MAX_ORDER-1)
+
+#endif /* CONFIG_HUGETLB_PAGE */
+
+#define pageblock_nr_pages (1UL << pageblock_order)
+
+/* Forward declaration */
+struct page;
+
+/* Declarations for getting and setting flags. See mm/page_alloc.c */
+unsigned long get_pageblock_flags_group(struct page *page,
+ int start_bitidx, int end_bitidx);
+void set_pageblock_flags_group(struct page *page, unsigned long flags,
+ int start_bitidx, int end_bitidx);
+
+#define get_pageblock_flags(page) \
+ get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+#define set_pageblock_flags(page) \
+ set_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+
+#endif /* PAGEBLOCK_FLAGS_H */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/mm/page_alloc.c linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/mm/page_alloc.c
--- linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/mm/page_alloc.c 2007-09-02 16:18:31.000000000 +0100
+++ linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/mm/page_alloc.c 2007-09-02 16:19:05.000000000 +0100
@@ -59,6 +59,10 @@ unsigned long totalreserve_pages __read_
long nr_swap_pages;
int percpu_pagelist_fraction;
+#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
+int pageblock_order __read_mostly;
+#endif
+
static void __free_pages_ok(struct page *page, unsigned int order);
/*
@@ -2901,6 +2905,62 @@ static void __meminit calculate_node_tot
realtotalpages);
}
+#ifndef CONFIG_SPARSEMEM
+/*
+ * Calculate the size of the zone->blockflags rounded to an unsigned long
+ * Start by making sure zonesize is a multiple of pageblock_order by rounding
+ * up. Then use 1 NR_PAGEBLOCK_BITS worth of bits per pageblock, finally
+ * round what is now in bits to nearest long in bits, then return it in
+ * bytes.
+ */
+static unsigned long __init usemap_size(unsigned long zonesize)
+{
+ unsigned long usemapsize;
+
+ usemapsize = roundup(zonesize, pageblock_nr_pages);
+ usemapsize = usemapsize >> pageblock_order;
+ usemapsize *= NR_PAGEBLOCK_BITS;
+ usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
+
+ return usemapsize / 8;
+}
+
+static void __init setup_usemap(struct pglist_data *pgdat,
+ struct zone *zone, unsigned long zonesize)
+{
+ unsigned long usemapsize = usemap_size(zonesize);
+ zone->pageblock_flags = NULL;
+ if (usemapsize) {
+ zone->pageblock_flags = alloc_bootmem_node(pgdat, usemapsize);
+ memset(zone->pageblock_flags, 0, usemapsize);
+ }
+}
+#else
+static void inline setup_usemap(struct pglist_data *pgdat,
+ struct zone *zone, unsigned long zonesize) {}
+#endif /* CONFIG_SPARSEMEM */
+
+#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
+/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
+static inline void __init set_pageblock_order(unsigned int order)
+{
+ /* Check that pageblock_nr_pages has not already been setup */
+ if (pageblock_order)
+ return;
+
+ /*
+ * Assume the largest contiguous order of interest is a huge page.
+ * This value may be variable depending on boot parameters on IA64
+ */
+ pageblock_order = order;
+}
+#else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
+
+/* Defined this way to avoid accidently referencing HUGETLB_PAGE_ORDER */
+#define set_pageblock_order(x) do {} while (0)
+
+#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -2981,6 +3041,8 @@ static void __meminit free_area_init_cor
if (!size)
continue;
+ set_pageblock_order(HUGETLB_PAGE_ORDER);
+ setup_usemap(pgdat, zone, size);
ret = init_currently_empty_zone(zone, zone_start_pfn,
size, MEMMAP_EARLY);
BUG_ON(ret);
@@ -3934,4 +3996,79 @@ EXPORT_SYMBOL(pfn_to_page);
EXPORT_SYMBOL(page_to_pfn);
#endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */
+/* Return a pointer to the bitmap storing bits affecting a block of pages */
+static inline unsigned long *get_pageblock_bitmap(struct zone *zone,
+ unsigned long pfn)
+{
+#ifdef CONFIG_SPARSEMEM
+ return __pfn_to_section(pfn)->pageblock_flags;
+#else
+ return zone->pageblock_flags;
+#endif /* CONFIG_SPARSEMEM */
+}
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+#ifdef CONFIG_SPARSEMEM
+ pfn &= (PAGES_PER_SECTION-1);
+ return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
+#else
+ pfn = pfn - zone->zone_start_pfn;
+ return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
+#endif /* CONFIG_SPARSEMEM */
+}
+
+/**
+ * get_pageblock_flags_group - Return the requested group of flags for the pageblock_nr_pages block of pages
+ * @page: The page within the block of interest
+ * @start_bitidx: The first bit of interest to retrieve
+ * @end_bitidx: The last bit of interest
+ * returns pageblock_bits flags
+ */
+unsigned long get_pageblock_flags_group(struct page *page,
+ int start_bitidx, int end_bitidx)
+{
+ struct zone *zone;
+ unsigned long *bitmap;
+ unsigned long pfn, bitidx;
+ unsigned long flags = 0;
+ unsigned long value = 1;
+
+ zone = page_zone(page);
+ pfn = page_to_pfn(page);
+ bitmap = get_pageblock_bitmap(zone, pfn);
+ bitidx = pfn_to_bitidx(zone, pfn);
+
+ for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
+ if (test_bit(bitidx + start_bitidx, bitmap))
+ flags |= value;
+
+ return flags;
+}
+
+/**
+ * set_pageblock_flags_group - Set the requested group of flags for a pageblock_nr_pages block of pages
+ * @page: The page within the block of interest
+ * @start_bitidx: The first bit of interest
+ * @end_bitidx: The last bit of interest
+ * @flags: The flags to set
+ */
+void set_pageblock_flags_group(struct page *page, unsigned long flags,
+ int start_bitidx, int end_bitidx)
+{
+ struct zone *zone;
+ unsigned long *bitmap;
+ unsigned long pfn, bitidx;
+ unsigned long value = 1;
+
+ zone = page_zone(page);
+ pfn = page_to_pfn(page);
+ bitmap = get_pageblock_bitmap(zone, pfn);
+ bitidx = pfn_to_bitidx(zone, pfn);
+
+ for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
+ if (flags & value)
+ __set_bit(bitidx + start_bitidx, bitmap);
+ else
+ __clear_bit(bitidx + start_bitidx, bitmap);
+}
Subject: Fix corruption of memmap on ia64-sparsemem when mem_section is not a power of 2
The use of SPARSEMEM together with the pageblock flags causes problems
on ia64.
The first part of the problem is that the units are incorrect in the
SECTION_BLOCKFLAGS_BITS computation. This results in a mem_section's
section_mem_map being treated as part of a bitmap, which isn't good. This
was evident as an invalid virtual address when mem_init attempted to free
bootmem pages while relinquishing control from the bootmem allocator.
The second part of the problem occurs because the pageblock flags bitmap is
located within the mem_section. The SECTIONS_PER_ROOT computation using
sizeof(mem_section) may then not be a power of 2, depending on the size of
the bitmap, which breaks masks and other calculations that assume a
power-of-2 base.
This issue was seen with SPARSEMEM_EXTREME on ia64. This patch moves the
bitmap outside of mem_section, using a pointer in the mem_section instead.
The bitmaps are allocated when the section is being
initialised.
Note that sparse_early_usemap_alloc() does not use alloc_remap() like
sparse_early_mem_map_alloc() does. The allocation required for the bitmap on
x86, the only architecture that uses alloc_remap(), is typically smaller than
a cache line. alloc_remap() pads allocations out to the cache-line size,
which would be a needless waste.
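To put a number on it (purely illustrative values): assuming a section
covering 2^14 pages, a pageblock_order of 10 and NR_PAGEBLOCK_BITS of 4,
SECTION_BLOCKFLAGS_BITS is (1 << (14 - 10)) * 4 = 64 bits, so usemap_size()
rounds this to a single unsigned long, 8 bytes on a 64-bit machine, far
smaller than a typical cache line.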
Credit to Bob Picco for identifying the original problem and effecting a
fix for the SECTION_BLOCKFLAGS_BITS calculation. Credit to Andy Whitcroft
for devising the best way of allocating the bitmaps only when required for
the section.
From: Bob Picco <[email protected]>
[[email protected]: warning fix]
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Andy Whitcroft <[email protected]>
Cc: "Luck, Tony" <[email protected]>
Signed-off-by: William Irwin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/mmzone.h | 4 ++-
mm/sparse.c | 54 +++++++++++++++++++++++++++++++++++++++++---
2 files changed, 54 insertions(+), 4 deletions(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/include/linux/mmzone.h linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/include/linux/mmzone.h
--- linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/include/linux/mmzone.h 2007-09-02 16:19:05.000000000 +0100
+++ linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/include/linux/mmzone.h 2007-09-02 16:19:16.000000000 +0100
@@ -739,7 +739,9 @@ struct mem_section {
* before using it wrong.
*/
unsigned long section_mem_map;
- DECLARE_BITMAP(pageblock_flags, SECTION_BLOCKFLAGS_BITS);
+
+ /* See declaration of similar field in struct zone */
+ unsigned long *pageblock_flags;
};
#ifdef CONFIG_SPARSEMEM_EXTREME
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/mm/sparse.c linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/mm/sparse.c
--- linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/mm/sparse.c 2007-09-02 16:18:56.000000000 +0100
+++ linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/mm/sparse.c 2007-09-02 16:19:16.000000000 +0100
@@ -204,14 +204,16 @@ struct page *sparse_decode_mem_map(unsig
}
static int __meminit sparse_init_one_section(struct mem_section *ms,
- unsigned long pnum, struct page *mem_map)
+ unsigned long pnum, struct page *mem_map,
+ unsigned long *pageblock_bitmap)
{
if (!present_section(ms))
return -EINVAL;
ms->section_mem_map &= ~SECTION_MAP_MASK;
ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
SECTION_HAS_MEM_MAP;
+ ms->pageblock_flags = pageblock_bitmap;
return 1;
}
@@ -221,6 +223,38 @@ void *alloc_bootmem_high_node(pg_data_t
return NULL;
}
+static unsigned long usemap_size(void)
+{
+ unsigned long size_bytes;
+ size_bytes = roundup(SECTION_BLOCKFLAGS_BITS, 8) / 8;
+ size_bytes = roundup(size_bytes, sizeof(unsigned long));
+ return size_bytes;
+}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static unsigned long *__kmalloc_section_usemap(void)
+{
+ return kmalloc(usemap_size(), GFP_KERNEL);
+}
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+static unsigned long *sparse_early_usemap_alloc(unsigned long pnum)
+{
+ unsigned long *usemap;
+ struct mem_section *ms = __nr_to_section(pnum);
+ int nid = sparse_early_nid(ms);
+
+ usemap = alloc_bootmem_node(NODE_DATA(nid), usemap_size());
+ if (usemap)
+ return usemap;
+
+ /* Stupid: suppress gcc warning for SPARSEMEM && !NUMA */
+ nid = 0;
+
+ printk(KERN_WARNING "%s: allocation failed\n", __FUNCTION__);
+ return NULL;
+}
+
struct page __init *sparse_early_mem_map_alloc(unsigned long pnum)
{
struct page *map;
@@ -254,6 +288,7 @@ void __init sparse_init(void)
{
unsigned long pnum;
struct page *map;
+ unsigned long *usemap;
for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
if (!valid_section_nr(pnum))
@@ -262,7 +297,13 @@ void __init sparse_init(void)
map = sparse_early_mem_map_alloc(pnum);
if (!map)
continue;
- sparse_init_one_section(__nr_to_section(pnum), pnum, map);
+
+ usemap = sparse_early_usemap_alloc(pnum);
+ if (!usemap)
+ continue;
+
+ sparse_init_one_section(__nr_to_section(pnum), pnum, map,
+ usemap);
}
}
@@ -318,6 +359,7 @@ int sparse_add_one_section(struct zone *
struct pglist_data *pgdat = zone->zone_pgdat;
struct mem_section *ms;
struct page *memmap;
+ unsigned long *usemap;
unsigned long flags;
int ret;
@@ -327,6 +369,7 @@ int sparse_add_one_section(struct zone *
*/
sparse_index_init(section_nr, pgdat->node_id);
memmap = __kmalloc_section_memmap(nr_pages);
+ usemap = __kmalloc_section_usemap();
pgdat_resize_lock(pgdat, &flags);
@@ -335,9 +378,14 @@ int sparse_add_one_section(struct zone *
ret = -EEXIST;
goto out;
}
+
+ if (!usemap) {
+ ret = -ENOMEM;
+ goto out;
+ }
ms->section_mem_map |= SECTION_MARKED_PRESENT;
- ret = sparse_init_one_section(ms, section_nr, memmap);
+ ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
out:
pgdat_resize_unlock(pgdat, &flags);
Subject: Split the free lists for movable and unmovable allocations
This patch adds the core of the fragmentation reduction strategy. It works
by grouping pages together based on their ability to move, breaking the free
list in each zone->free_area into MIGRATE_TYPES separate lists.
Mobility grouping works at an arbitrary order less than or equal to
MAX_ORDER. Generally this is a fixed size defined at compile time. However,
on platforms like ia64 where the huge page size is runtime-configurable, it
is desirable to group at that order. On x86_64, and occasionally on x86,
the hugepage size may not always be MAX_ORDER_NR_PAGES.
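As a concrete example (for illustration): on x86_64 with 4K base pages,
HUGETLB_PAGE_ORDER is 9 (2MB huge pages) while MAX_ORDER-1 is typically 10
(4MB), so grouping at MAX_ORDER_NR_PAGES would be coarser than the huge
page size requires.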
This patch groups pages together based on the value of HUGETLB_PAGE_ORDER. It
uses a compile-time constant if possible and a variable where the huge page
size is runtime configurable.
It is assumed that grouping should be done at the lowest sensible order and
that the user would not want to override this. If this is not true,
pageblock_order could be forced to a variable initialised via a boot-time
kernel parameter.
Note that many allocations are already flagged as __GFP_MOVABLE which is
re-used by this patch to determine how pages should be grouped.
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Andy Whitcroft <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/mmzone.h | 10 ++
include/linux/pageblock-flags.h | 1
mm/page_alloc.c | 143 +++++++++++++++++++++++++++++------
3 files changed, 129 insertions(+), 25 deletions(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/include/linux/mmzone.h linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/include/linux/mmzone.h
--- linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/include/linux/mmzone.h 2007-09-02 16:19:16.000000000 +0100
+++ linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/include/linux/mmzone.h 2007-09-02 16:19:34.000000000 +0100
@@ -33,8 +33,16 @@
*/
#define PAGE_ALLOC_COSTLY_ORDER 3
+#define MIGRATE_UNMOVABLE 0
+#define MIGRATE_MOVABLE 1
+#define MIGRATE_TYPES 2
+
+#define for_each_migratetype_order(order, type) \
+ for (order = 0; order < MAX_ORDER; order++) \
+ for (type = 0; type < MIGRATE_TYPES; type++)
+
struct free_area {
- struct list_head free_list;
+ struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
};
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/include/linux/pageblock-flags.h linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/include/linux/pageblock-flags.h
--- linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/include/linux/pageblock-flags.h 2007-09-02 16:19:05.000000000 +0100
+++ linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/include/linux/pageblock-flags.h 2007-09-02 16:19:34.000000000 +0100
@@ -31,6 +31,7 @@
/* Bit indices that affect a whole block of pages */
enum pageblock_bits {
+ PB_range(PB_migrate, 1), /* 1 bit required for migrate types */
NR_PAGEBLOCK_BITS
};
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/mm/page_alloc.c linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/mm/page_alloc.c
--- linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/mm/page_alloc.c 2007-09-02 16:19:09.000000000 +0100
+++ linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/mm/page_alloc.c 2007-09-02 16:19:34.000000000 +0100
@@ -154,6 +154,22 @@ int nr_node_ids __read_mostly = MAX_NUMN
EXPORT_SYMBOL(nr_node_ids);
#endif
+static inline int get_pageblock_migratetype(struct page *page)
+{
+ return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
+}
+
+static void set_pageblock_migratetype(struct page *page, int migratetype)
+{
+ set_pageblock_flags_group(page, (unsigned long)migratetype,
+ PB_migrate, PB_migrate_end);
+}
+
+static inline int allocflags_to_migratetype(gfp_t gfp_flags)
+{
+ return ((gfp_flags & __GFP_MOVABLE) != 0);
+}
+
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -408,6 +424,7 @@ static inline void __free_one_page(struc
{
unsigned long page_idx;
int order_size = 1 << order;
+ int migratetype = get_pageblock_migratetype(page);
if (unlikely(PageCompound(page)))
destroy_compound_page(page, order);
@@ -420,7 +437,6 @@ static inline void __free_one_page(struc
__mod_zone_page_state(zone, NR_FREE_PAGES, order_size);
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
- struct free_area *area;
struct page *buddy;
buddy = __page_find_buddy(page, page_idx, order);
@@ -428,8 +444,7 @@ static inline void __free_one_page(struc
break; /* Move the buddy up one level. */
list_del(&buddy->lru);
- area = zone->free_area + order;
- area->nr_free--;
+ zone->free_area[order].nr_free--;
rmv_page_order(buddy);
combined_idx = __find_combined_index(page_idx, order);
page = page + (combined_idx - page_idx);
@@ -437,7 +452,8 @@ static inline void __free_one_page(struc
order++;
}
set_page_order(page, order);
- list_add(&page->lru, &zone->free_area[order].free_list);
+ list_add(&page->lru,
+ &zone->free_area[order].free_list[migratetype]);
zone->free_area[order].nr_free++;
}
@@ -571,7 +587,8 @@ void fastcall __init __free_pages_bootme
* -- wli
*/
static inline void expand(struct zone *zone, struct page *page,
- int low, int high, struct free_area *area)
+ int low, int high, struct free_area *area,
+ int migratetype)
{
unsigned long size = 1 << high;
@@ -580,7 +597,7 @@ static inline void expand(struct zone *z
high--;
size >>= 1;
VM_BUG_ON(bad_range(zone, &page[size]));
- list_add(&page[size].lru, &area->free_list);
+ list_add(&page[size].lru, &area->free_list[migratetype]);
area->nr_free++;
set_page_order(&page[size], high);
}
@@ -632,31 +649,96 @@ static int prep_new_page(struct page *pa
return 0;
}
+/*
+ * This array describes the order lists are fallen back to when
+ * the free lists for the desirable migrate type are depleted
+ */
+static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = {
+ [MIGRATE_UNMOVABLE] = { MIGRATE_MOVABLE },
+ [MIGRATE_MOVABLE] = { MIGRATE_UNMOVABLE },
+};
+
+/* Remove an element from the buddy allocator from the fallback list */
+static struct page *__rmqueue_fallback(struct zone *zone, int order,
+ int start_migratetype)
+{
+ struct free_area *area;
+ int current_order;
+ struct page *page;
+ int migratetype, i;
+
+ /* Find the largest possible block of pages in the other list */
+ for (current_order = MAX_ORDER-1; current_order >= order;
+ --current_order) {
+ for (i = 0; i < MIGRATE_TYPES - 1; i++) {
+ migratetype = fallbacks[start_migratetype][i];
+
+ area = &(zone->free_area[current_order]);
+ if (list_empty(&area->free_list[migratetype]))
+ continue;
+
+ page = list_entry(area->free_list[migratetype].next,
+ struct page, lru);
+ area->nr_free--;
+
+ /*
+ * If breaking a large block of pages, place the buddies
+ * on the preferred allocation list
+ */
+ if (unlikely(current_order >= (pageblock_order >> 1)))
+ migratetype = start_migratetype;
+
+ /* Remove the page from the freelists */
+ list_del(&page->lru);
+ rmv_page_order(page);
+ __mod_zone_page_state(zone, NR_FREE_PAGES,
+ -(1UL << order));
+
+ if (current_order == pageblock_order)
+ set_pageblock_migratetype(page,
+ start_migratetype);
+
+ expand(zone, page, order, current_order, area,
+ migratetype);
+ return page;
+ }
+ }
+
+ return NULL;
+}
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order,
+ int migratetype)
{
struct free_area * area;
unsigned int current_order;
struct page *page;
+ /* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
- if (list_empty(&area->free_list))
+ area = &(zone->free_area[current_order]);
+ if (list_empty(&area->free_list[migratetype]))
continue;
- page = list_entry(area->free_list.next, struct page, lru);
+ page = list_entry(area->free_list[migratetype].next,
+ struct page, lru);
list_del(&page->lru);
rmv_page_order(page);
area->nr_free--;
__mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order));
- expand(zone, page, order, current_order, area);
- return page;
+ expand(zone, page, order, current_order, area, migratetype);
+ goto got_page;
}
- return NULL;
+ page = __rmqueue_fallback(zone, order, migratetype);
+
+got_page:
+
+ return page;
}
/*
@@ -665,13 +747,14 @@ static struct page *__rmqueue(struct zon
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list,
+ int migratetype)
{
int i;
spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order);
+ struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
list_add_tail(&page->lru, list);
@@ -736,7 +819,7 @@ void mark_free_pages(struct zone *zone)
{
unsigned long pfn, max_zone_pfn;
unsigned long flags;
- int order;
+ int order, t;
struct list_head *curr;
if (!zone->spanned_pages)
@@ -753,15 +836,15 @@ void mark_free_pages(struct zone *zone)
swsusp_unset_page_free(page);
}
- for (order = MAX_ORDER - 1; order >= 0; --order)
- list_for_each(curr, &zone->free_area[order].free_list) {
+ for_each_migratetype_order(order, t) {
+ list_for_each(curr, &zone->free_area[order].free_list[t]) {
unsigned long i;
pfn = page_to_pfn(list_entry(curr, struct page, lru));
for (i = 0; i < (1UL << order); i++)
swsusp_set_page_free(pfn_to_page(pfn + i));
}
-
+ }
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -850,6 +933,7 @@ static struct page *buffered_rmqueue(str
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
int cpu;
+ int migratetype = allocflags_to_migratetype(gfp_flags);
again:
cpu = get_cpu();
@@ -860,7 +944,7 @@ again:
local_irq_save(flags);
if (!pcp->count) {
pcp->count = rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list, migratetype);
if (unlikely(!pcp->count))
goto failed;
}
@@ -869,7 +953,7 @@ again:
pcp->count--;
} else {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, migratetype);
spin_unlock(&zone->lock);
if (!page)
goto failed;
@@ -2208,6 +2292,17 @@ void __meminit memmap_init_zone(unsigned
init_page_count(page);
reset_page_mapcount(page);
SetPageReserved(page);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made
+ */
+ if ((pfn & (pageblock_nr_pages-1)))
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+
INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
@@ -2220,9 +2315,9 @@ void __meminit memmap_init_zone(unsigned
static void __meminit zone_init_free_lists(struct pglist_data *pgdat,
struct zone *zone, unsigned long size)
{
- int order;
- for (order = 0; order < MAX_ORDER ; order++) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
+ int order, t;
+ for_each_migratetype_order(order, t) {
+ INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
zone->free_area[order].nr_free = 0;
}
}
Subject: Choose pages from the per-cpu list based on migration type
The freelists for each migrate type can slowly become polluted due to the
per-cpu list. Consider what happens in the following scenario:
1. A 2^pageblock_order list is reserved for __GFP_MOVABLE pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation
This results in a kernel page sitting in the middle of a migratable region.
This patch prevents the leak from occurring by storing the MIGRATE_ type of
the page in page->private. On allocation, only a page of the desired type
will be returned, else more pages will be allocated. This may temporarily
allow a per-cpu list to go over the pcp->high limit, but it will be
corrected on the next free. Care is taken to preserve the hotness of pages
recently freed.
The additional code is not measurably slower for the workloads we've tested.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/page_alloc.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/mm/page_alloc.c linux-2.6.23-rc5-005-choose-pages-from-the-per-cpu-list-based-on-migration-type/mm/page_alloc.c
--- linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/mm/page_alloc.c 2007-09-02 16:19:34.000000000 +0100
+++ linux-2.6.23-rc5-005-choose-pages-from-the-per-cpu-list-based-on-migration-type/mm/page_alloc.c 2007-09-02 16:20:09.000000000 +0100
@@ -757,7 +757,8 @@ static int rmqueue_bulk(struct zone *zon
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
- list_add_tail(&page->lru, list);
+ list_add(&page->lru, list);
+ set_page_private(page, migratetype);
}
spin_unlock(&zone->lock);
return i;
@@ -884,6 +885,7 @@ static void fastcall free_hot_cold_page(
local_irq_save(flags);
__count_vm_event(PGFREE);
list_add(&page->lru, &pcp->list);
+ set_page_private(page, get_pageblock_migratetype(page));
pcp->count++;
if (pcp->count >= pcp->high) {
free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
@@ -948,7 +950,19 @@ again:
if (unlikely(!pcp->count))
goto failed;
}
- page = list_entry(pcp->list.next, struct page, lru);
+
+ /* Find a page of the appropriate migrate type */
+ list_for_each_entry(page, &pcp->list, lru)
+ if (page_private(page) == migratetype)
+ break;
+
+ /* Allocate more to the pcp list if necessary */
+ if (unlikely(&page->lru == &pcp->list)) {
+ pcp->count += rmqueue_bulk(zone, 0,
+ pcp->batch, &pcp->list, migratetype);
+ page = list_entry(pcp->list.next, struct page, lru);
+ }
+
list_del(&page->lru);
pcp->count--;
} else {
Subject: Group short-lived and reclaimable kernel allocations
This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations. When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space, which increases fragmentation.
This patch groups these allocations together as much as possible by adding a
new migrate type. The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved; i.e. they can be migrated by deleting
them and re-reading the information from elsewhere.
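As a usage sketch (mirroring the fs/buffer.c hunk below, with bh and
gfp_flags as in alloc_buffer_head()), a caller marks a short-lived
allocation like so:

	struct buffer_head *bh;

	bh = kmem_cache_zalloc(bh_cachep,
			set_migrateflags(gfp_flags, __GFP_RECLAIMABLE));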
Signed-off-by: Mel Gorman <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/buffer.c | 3 ++-
fs/jbd/journal.c | 4 ++--
fs/jbd/revoke.c | 6 ++++--
fs/proc/base.c | 13 +++++++------
fs/proc/generic.c | 2 +-
include/linux/gfp.h | 15 ++++++++++++---
include/linux/mmzone.h | 5 +++--
include/linux/pageblock-flags.h | 2 +-
include/linux/slab.h | 4 +++-
kernel/cpuset.c | 2 +-
lib/radix-tree.c | 6 ++++--
mm/page_alloc.c | 10 +++++++---
mm/shmem.c | 4 ++--
mm/slab.c | 2 ++
mm/slub.c | 3 +++
15 files changed, 54 insertions(+), 27 deletions(-)
Index: linux-2.6.23-rc4-mm1-redropped/fs/buffer.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/fs/buffer.c 2007-09-09 18:23:34.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/fs/buffer.c 2007-09-09 18:26:16.000000000 +0100
@@ -3100,7 +3100,8 @@
struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
- struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags);
+ struct buffer_head *ret = kmem_cache_zalloc(bh_cachep,
+ set_migrateflags(gfp_flags, __GFP_RECLAIMABLE));
if (ret) {
INIT_LIST_HEAD(&ret->b_assoc_buffers);
get_cpu_var(bh_accounting).nr++;
Index: linux-2.6.23-rc4-mm1-redropped/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/fs/jbd/journal.c 2007-08-28 02:32:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/fs/jbd/journal.c 2007-09-09 18:26:17.000000000 +0100
@@ -1710,7 +1710,7 @@
journal_head_cache = kmem_cache_create("journal_head",
sizeof(struct journal_head),
0, /* offset */
- 0, /* flags */
+ SLAB_TEMPORARY, /* flags */
NULL); /* ctor */
retval = 0;
if (journal_head_cache == 0) {
@@ -2006,7 +2006,7 @@
jbd_handle_cache = kmem_cache_create("journal_handle",
sizeof(handle_t),
0, /* offset */
- 0, /* flags */
+ SLAB_TEMPORARY, /* flags */
NULL); /* ctor */
if (jbd_handle_cache == NULL) {
printk(KERN_EMERG "JBD: failed to create handle cache\n");
Index: linux-2.6.23-rc4-mm1-redropped/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/fs/jbd/revoke.c 2007-08-28 02:32:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/fs/jbd/revoke.c 2007-09-09 18:23:35.000000000 +0100
@@ -170,13 +170,15 @@
{
revoke_record_cache = kmem_cache_create("revoke_record",
sizeof(struct jbd_revoke_record_s),
- 0, SLAB_HWCACHE_ALIGN, NULL);
+ 0,
+ SLAB_HWCACHE_ALIGN|SLAB_TEMPORARY,
+ NULL);
if (revoke_record_cache == 0)
return -ENOMEM;
revoke_table_cache = kmem_cache_create("revoke_table",
sizeof(struct jbd_revoke_table_s),
- 0, 0, NULL);
+ 0, SLAB_TEMPORARY, NULL);
if (revoke_table_cache == 0) {
kmem_cache_destroy(revoke_record_cache);
revoke_record_cache = NULL;
Index: linux-2.6.23-rc4-mm1-redropped/fs/proc/base.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/fs/proc/base.c 2007-08-28 02:32:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/fs/proc/base.c 2007-09-09 18:27:51.000000000 +0100
@@ -492,7 +492,7 @@
count = PROC_BLOCK_SIZE;
length = -ENOMEM;
- if (!(page = __get_free_page(GFP_KERNEL)))
+ if (!(page = __get_free_page(GFP_TEMPORARY)))
goto out;
length = PROC_I(inode)->op.proc_read(task, (char*)page);
@@ -532,7 +532,7 @@
goto out;
ret = -ENOMEM;
- page = (char *)__get_free_page(GFP_USER);
+ page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
goto out;
@@ -602,7 +602,7 @@
goto out;
copied = -ENOMEM;
- page = (char *)__get_free_page(GFP_USER);
+ page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
goto out;
@@ -788,7 +788,7 @@
/* No partial writes. */
return -EINVAL;
}
- page = (char*)__get_free_page(GFP_USER);
+ page = (char*)__get_free_page(GFP_TEMPORARY);
if (!page)
return -ENOMEM;
length = -EFAULT;
@@ -954,7 +954,8 @@
char __user *buffer, int buflen)
{
struct inode * inode;
- char *tmp = (char*)__get_free_page(GFP_KERNEL), *path;
+ char *tmp = (char*)__get_free_page(GFP_TEMPORARY);
+ char *path;
int len;
if (!tmp)
@@ -1726,7 +1727,7 @@
goto out;
length = -ENOMEM;
- page = (char*)__get_free_page(GFP_USER);
+ page = (char*)__get_free_page(GFP_TEMPORARY);
if (!page)
goto out;
Index: linux-2.6.23-rc4-mm1-redropped/fs/proc/generic.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/fs/proc/generic.c 2007-08-28 02:32:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/fs/proc/generic.c 2007-09-09 18:23:35.000000000 +0100
@@ -74,7 +74,7 @@
nbytes = MAX_NON_LFS - pos;
dp = PDE(inode);
- if (!(page = (char*) __get_free_page(GFP_KERNEL)))
+ if (!(page = (char*) __get_free_page(GFP_TEMPORARY)))
return -ENOMEM;
while ((nbytes > 0) && !eof) {
Index: linux-2.6.23-rc4-mm1-redropped/include/linux/gfp.h
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/include/linux/gfp.h 2007-09-09 18:23:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/include/linux/gfp.h 2007-09-09 18:28:08.000000000 +0100
@@ -48,9 +48,10 @@
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
-#define __GFP_MOVABLE ((__force gfp_t)0x80000u) /* Page is movable */
+#define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
+#define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
-#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
@@ -60,6 +61,8 @@
#define GFP_NOIO (__GFP_WAIT)
#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
+#define GFP_TEMPORARY (__GFP_WAIT | __GFP_IO | __GFP_FS | \
+ __GFP_RECLAIMABLE)
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
__GFP_HIGHMEM)
@@ -80,7 +83,7 @@
#endif
/* This mask makes up all the page movable related flags */
-#define GFP_MOVABLE_MASK (__GFP_MOVABLE)
+#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
/* Control page allocator reclaim behavior */
#define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
@@ -129,6 +132,12 @@
return base + ZONE_NORMAL;
}
+static inline gfp_t set_migrateflags(gfp_t gfp, gfp_t migrate_flags)
+{
+ BUG_ON((gfp & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+ return (gfp & ~(GFP_MOVABLE_MASK)) | migrate_flags;
+}
+
/*
* There is only one page-allocator function, and two main namespaces to
* it. The alloc_page*() variants return 'struct page *' and as such
Index: linux-2.6.23-rc4-mm1-redropped/include/linux/mmzone.h
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/include/linux/mmzone.h 2007-09-09 18:23:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/include/linux/mmzone.h 2007-09-09 18:27:53.000000000 +0100
@@ -34,8 +34,9 @@
#define PAGE_ALLOC_COSTLY_ORDER 3
#define MIGRATE_UNMOVABLE 0
-#define MIGRATE_MOVABLE 1
-#define MIGRATE_TYPES 2
+#define MIGRATE_RECLAIMABLE 1
+#define MIGRATE_MOVABLE 2
+#define MIGRATE_TYPES 3
#define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
Index: linux-2.6.23-rc4-mm1-redropped/include/linux/pageblock-flags.h
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/include/linux/pageblock-flags.h 2007-09-09 18:23:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/include/linux/pageblock-flags.h 2007-09-09 18:27:49.000000000 +0100
@@ -31,7 +31,7 @@
/* Bit indices that affect a whole block of pages */
enum pageblock_bits {
- PB_range(PB_migrate, 1), /* 1 bit required for migrate types */
+ PB_range(PB_migrate, 2), /* 2 bits required for migrate types */
NR_PAGEBLOCK_BITS
};
Index: linux-2.6.23-rc4-mm1-redropped/include/linux/slab.h
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/include/linux/slab.h 2007-08-28 02:32:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/include/linux/slab.h 2007-09-09 18:23:35.000000000 +0100
@@ -24,12 +24,14 @@
#define SLAB_HWCACHE_ALIGN 0x00002000UL /* Align objs on cache lines */
#define SLAB_CACHE_DMA 0x00004000UL /* Use GFP_DMA memory */
#define SLAB_STORE_USER 0x00010000UL /* DEBUG: Store the last owner for bug hunting */
-#define SLAB_RECLAIM_ACCOUNT 0x00020000UL /* Objects are reclaimable */
#define SLAB_PANIC 0x00040000UL /* Panic if kmem_cache_create() fails */
#define SLAB_DESTROY_BY_RCU 0x00080000UL /* Defer freeing slabs to RCU */
#define SLAB_MEM_SPREAD 0x00100000UL /* Spread some memory over cpuset */
#define SLAB_TRACE 0x00200000UL /* Trace allocations and frees */
+/* The following flags affect the page allocator grouping pages by mobility */
+#define SLAB_RECLAIM_ACCOUNT 0x00020000UL /* Objects are reclaimable */
+#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
Index: linux-2.6.23-rc4-mm1-redropped/kernel/cpuset.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/kernel/cpuset.c 2007-09-09 18:23:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/kernel/cpuset.c 2007-09-09 18:27:30.000000000 +0100
@@ -1463,7 +1463,7 @@
ssize_t retval = 0;
char *s;
- if (!(page = (char *)__get_free_page(GFP_KERNEL)))
+ if (!(page = (char *)__get_free_page(GFP_TEMPORARY)))
return -ENOMEM;
s = page;
Index: linux-2.6.23-rc4-mm1-redropped/lib/radix-tree.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/lib/radix-tree.c 2007-09-09 18:23:33.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/lib/radix-tree.c 2007-09-09 18:23:35.000000000 +0100
@@ -98,7 +98,8 @@
struct radix_tree_node *ret;
gfp_t gfp_mask = root_gfp_mask(root);
- ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+ ret = kmem_cache_alloc(radix_tree_node_cachep,
+ set_migrateflags(gfp_mask, __GFP_RECLAIMABLE));
if (ret == NULL && !(gfp_mask & __GFP_WAIT)) {
struct radix_tree_preload *rtp;
@@ -142,7 +143,8 @@
rtp = &__get_cpu_var(radix_tree_preloads);
while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
preempt_enable();
- node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+ node = kmem_cache_alloc(radix_tree_node_cachep,
+ set_migrateflags(gfp_mask, __GFP_RECLAIMABLE));
if (node == NULL)
goto out;
preempt_disable();
Index: linux-2.6.23-rc4-mm1-redropped/mm/page_alloc.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/mm/page_alloc.c 2007-09-09 18:23:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/mm/page_alloc.c 2007-09-09 18:27:53.000000000 +0100
@@ -175,7 +175,10 @@
static inline int allocflags_to_migratetype(gfp_t gfp_flags)
{
- return ((gfp_flags & __GFP_MOVABLE) != 0);
+ WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+
+ return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
+ ((gfp_flags & __GFP_RECLAIMABLE) != 0);
}
#ifdef CONFIG_DEBUG_VM
@@ -662,8 +665,9 @@
* the free lists for the desirable migrate type are depleted
*/
static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = {
- [MIGRATE_UNMOVABLE] = { MIGRATE_MOVABLE },
- [MIGRATE_MOVABLE] = { MIGRATE_UNMOVABLE },
+ [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE },
+ [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE },
+ [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE },
};
/* Remove an element from the buddy allocator from the fallback list */
Index: linux-2.6.23-rc4-mm1-redropped/mm/shmem.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/mm/shmem.c 2007-09-09 18:23:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/mm/shmem.c 2007-09-09 18:27:48.000000000 +0100
@@ -95,9 +95,9 @@
* BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
* might be reconsidered if it ever diverges from PAGE_SIZE.
*
- * __GFP_MOVABLE is masked out as swap vectors cannot move
+ * Mobility flags are masked out as swap vectors cannot move
*/
- return alloc_pages((gfp_mask & ~__GFP_MOVABLE) | __GFP_ZERO,
+ return alloc_pages((gfp_mask & ~GFP_MOVABLE_MASK) | __GFP_ZERO,
PAGE_CACHE_SHIFT-PAGE_SHIFT);
}
Index: linux-2.6.23-rc4-mm1-redropped/mm/slab.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/mm/slab.c 2007-09-09 18:23:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/mm/slab.c 2007-09-09 18:26:54.000000000 +0100
@@ -1643,6 +1643,8 @@
#endif
flags |= cachep->gfpflags;
+ if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
+ flags |= __GFP_RECLAIMABLE;
page = alloc_pages_node(nodeid, flags, cachep->gfporder);
if (!page)
Index: linux-2.6.23-rc4-mm1-redropped/mm/slub.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/mm/slub.c 2007-09-09 18:23:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/mm/slub.c 2007-09-09 18:27:50.000000000 +0100
@@ -1046,6 +1046,9 @@
if (s->flags & SLAB_CACHE_DMA)
flags |= SLUB_DMA;
+ if (s->flags & SLAB_RECLAIM_ACCOUNT)
+ flags |= __GFP_RECLAIMABLE;
+
if (node == -1)
page = alloc_pages(flags, s->order);
else
Subject: Drain per-cpu lists when high-order allocations fail
Per-cpu pages can accidentally cause fragmentation because they are free but
pinned in an otherwise contiguous block. When this patch is applied,
the per-cpu caches are drained after direct reclaim is entered if the
requested order is greater than 0. It simply reuses the code used by suspend
and hotplug.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/page_alloc.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-006-group-short-lived-and-reclaimable-kernel-allocations/mm/page_alloc.c linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c
--- linux-2.6.23-rc5-006-group-short-lived-and-reclaimable-kernel-allocations/mm/page_alloc.c 2007-09-02 16:20:31.000000000 +0100
+++ linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c 2007-09-02 16:20:48.000000000 +0100
@@ -852,6 +852,7 @@ void mark_free_pages(struct zone *zone)
}
spin_unlock_irqrestore(&zone->lock, flags);
}
+#endif /* CONFIG_PM */
/*
* Spill all of this CPU's per-cpu pages back into the buddy allocator.
@@ -864,7 +865,25 @@ void drain_local_pages(void)
__drain_pages(smp_processor_id());
local_irq_restore(flags);
}
-#endif /* CONFIG_HIBERNATION */
+
+void smp_drain_local_pages(void *arg)
+{
+ drain_local_pages();
+}
+
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator
+ */
+void drain_all_local_pages(void)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __drain_pages(smp_processor_id());
+ local_irq_restore(flags);
+
+ smp_call_function(smp_drain_local_pages, NULL, 0, 1);
+}
/*
* Free a 0-order page
@@ -1452,6 +1471,9 @@ nofail_alloc:
cond_resched();
+ if (order != 0)
+ drain_all_local_pages();
+
if (likely(did_some_progress)) {
page = get_page_from_freelist(gfp_mask, order,
zonelist, alloc_flags);
Subject: Move free pages between lists on steal
When a fallback is forced to steal a page from a block of a different
type and more than half of the block is free, reassign that block to the
new type and move the free pages over to the new type's free lists.
Signed-off-by: Mel Gorman <[email protected]>
[[email protected]: fix BUG_ON check at move_freepages()]
[[email protected]: Move to using pfn_valid_within()]
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Yasunori Goto <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Signed-off-by: Andy Whitcroft <[email protected]>
Cc: Bob Picco <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/page_alloc.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 70 insertions(+), 2 deletions(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c linux-2.6.23-rc5-008-move-free-pages-between-lists-on-steal/mm/page_alloc.c
--- linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c 2007-09-02 16:20:48.000000000 +0100
+++ linux-2.6.23-rc5-008-move-free-pages-between-lists-on-steal/mm/page_alloc.c 2007-09-02 16:21:09.000000000 +0100
@@ -662,6 +662,72 @@ static int fallbacks[MIGRATE_TYPES][MIGR
[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE },
};
+/*
+ * Move the free pages in a range to the free lists of the requested type.
+ * Note that start_page and end_page are not aligned on a pageblock
+ * boundary. If alignment is required, use move_freepages_block()
+ */
+int move_freepages(struct zone *zone,
+ struct page *start_page, struct page *end_page,
+ int migratetype)
+{
+ struct page *page;
+ unsigned long order;
+ int blocks_moved = 0;
+
+#ifndef CONFIG_HOLES_IN_ZONE
+ /*
+ * page_zone is not safe to call in this context when
+ * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
+ * anyway as we check zone boundaries in move_freepages_block().
+ * Remove at a later date when no bug reports exist related to
+ * grouping pages by mobility
+ */
+ BUG_ON(page_zone(start_page) != page_zone(end_page));
+#endif
+
+ for (page = start_page; page <= end_page;) {
+ if (!pfn_valid_within(page_to_pfn(page))) {
+ page++;
+ continue;
+ }
+
+ if (!PageBuddy(page)) {
+ page++;
+ continue;
+ }
+
+ order = page_order(page);
+ list_del(&page->lru);
+ list_add(&page->lru,
+ &zone->free_area[order].free_list[migratetype]);
+ page += 1 << order;
+ blocks_moved++;
+ }
+
+ return blocks_moved;
+}
+
+int move_freepages_block(struct zone *zone, struct page *page, int migratetype)
+{
+ unsigned long start_pfn, end_pfn;
+ struct page *start_page, *end_page;
+
+ start_pfn = page_to_pfn(page);
+ start_pfn = start_pfn & ~(pageblock_nr_pages-1);
+ start_page = pfn_to_page(start_pfn);
+ end_page = start_page + pageblock_nr_pages - 1;
+ end_pfn = start_pfn + pageblock_nr_pages - 1;
+
+ /* Do not cross zone boundaries */
+ if (start_pfn < zone->zone_start_pfn)
+ start_page = page;
+ if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
+ return 0;
+
+ return move_freepages(zone, start_page, end_page, migratetype);
+}
+
/* Remove an element from the buddy allocator from the fallback list */
static struct page *__rmqueue_fallback(struct zone *zone, int order,
int start_migratetype)
@@ -686,11 +752,13 @@ static struct page *__rmqueue_fallback(s
area->nr_free--;
/*
- * If breaking a large block of pages, place the buddies
- * on the preferred allocation list
+ * If breaking a large block of pages, move all free
+ * pages to the preferred allocation list
*/
- if (unlikely(current_order >= (pageblock_order >> 1)))
+ if (unlikely(current_order >= (pageblock_order >> 1))) {
migratetype = start_migratetype;
+ move_freepages_block(zone, page, migratetype);
+ }
/* Remove the page from the freelists */
list_del(&page->lru);
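As an aside, the alignment arithmetic in move_freepages_block() above is easy to check standalone; a small sketch, assuming pageblock_nr_pages = 1024 (an order-10 pageblock) purely for illustration:
#include <stdio.h>

int main(void)
{
	unsigned long pageblock_nr_pages = 1024; /* assumed example value */
	unsigned long pfn = 5000;                /* arbitrary page in a block */

	/* round down to the pageblock boundary, as move_freepages_block() does */
	unsigned long start_pfn = pfn & ~(pageblock_nr_pages - 1);
	unsigned long end_pfn = start_pfn + pageblock_nr_pages - 1;

	printf("pfn %lu lies in block %lu..%lu\n", pfn, start_pfn, end_pfn);
	/* prints: pfn 5000 lies in block 4096..5119 */
	return 0;
}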
Subject: Do not group pages by mobility type on low memory systems
Where there is less than one pageblock per mobility type in the system,
mixing is inevitable and any attempt to prevent it will fail in a costly
manner. This patch checks the size of vm_total_pages in
build_all_zonelists(). If there are not enough areas, mobility is effectively
disabled by considering all allocations to be of the same type (UNMOVABLE).
This is achieved via a __read_mostly flag.
This patch removes any need to disable grouping pages by mobility at
compile time.
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Andy Whitcroft <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/page_alloc.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-008-move-free-pages-between-lists-on-steal/mm/page_alloc.c linux-2.6.23-rc5-009-do-not-group-pages-by-mobility-type-on-low-memory-systems/mm/page_alloc.c
--- linux-2.6.23-rc5-008-move-free-pages-between-lists-on-steal/mm/page_alloc.c 2007-09-02 16:21:09.000000000 +0100
+++ linux-2.6.23-rc5-009-do-not-group-pages-by-mobility-type-on-low-memory-systems/mm/page_alloc.c 2007-09-02 16:21:30.000000000 +0100
@@ -154,8 +154,13 @@ int nr_node_ids __read_mostly = MAX_NUMN
EXPORT_SYMBOL(nr_node_ids);
#endif
+int page_group_by_mobility_disabled __read_mostly;
+
static inline int get_pageblock_migratetype(struct page *page)
{
+ if (unlikely(page_group_by_mobility_disabled))
+ return MIGRATE_UNMOVABLE;
+
return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
}
@@ -169,6 +174,10 @@ static inline int allocflags_to_migratet
{
WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+ if (unlikely(page_group_by_mobility_disabled))
+ return MIGRATE_UNMOVABLE;
+
+ /* Cluster based on mobility */
return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
((gfp_flags & __GFP_RECLAIMABLE) != 0);
}
@@ -2294,9 +2303,23 @@ void build_all_zonelists(void)
/* cpuset refresh routine should be here */
}
vm_total_pages = nr_free_pagecache_pages();
- printk("Built %i zonelists in %s order. Total pages: %ld\n",
+ /*
+ * Disable grouping by mobility if the number of pages in the
+ * system is too low to allow the mechanism to work. It would be
+ * more accurate, but expensive to check per-zone. This check is
+ * made on memory-hotadd so a system can start with mobility
+ * disabled and enable it later
+ */
+ if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
+ page_group_by_mobility_disabled = 1;
+ else
+ page_group_by_mobility_disabled = 0;
+
+ printk("Built %i zonelists in %s order, mobility grouping %s. "
+ "Total pages: %ld\n",
num_online_nodes(),
zonelist_order_name[current_zonelist_order],
+ page_group_by_mobility_disabled ? "off" : "on",
vm_total_pages);
#ifdef CONFIG_NUMA
printk("Policy zone: %s\n", zone_names[policy_zone]);
Subject: Bias the location of pages freed for min_free_kbytes in the same pageblock_nr_pages areas
The standard buddy allocator always favours splitting the smallest block of
pages. The effect of this is that the pages kept free to satisfy
min_free_kbytes tend to be preserved at the same location in memory since
boot time, remaining contiguous for a very long time. When an administrator
sets min_free_kbytes to 16384 at boot time, it tends to be the same MAX_ORDER
blocks that remain free. This allows the occasional high-order atomic
allocation to succeed up until the point the blocks are split. In practice,
it is difficult to split these blocks but when they do split, the benefit of
having min_free_kbytes worth of contiguous blocks disappears. Additionally,
increasing min_free_kbytes once the system has been running for some time
gives no guarantee of creating contiguous blocks.
On the other hand, grouping pages by mobility favours the splitting of large
blocks when there are no free pages of the appropriate type available. A
side-effect of this is that all blocks in memory tend to be used up and
the contiguous free blocks from boot time are not preserved as they are in
the vanilla allocator. This can cause a problem if a new caller is unwilling
to reclaim or does not reclaim for long enough.
A failure scenario was found for a wireless network device making order-1
atomic allocations, but the allocations were not intense or frequent enough
for a whole block of pages to be preserved for MIGRATE_HIGHALLOC.
This was reproduced on a desktop by booting with mem=256mb, forcing the
driver to allocate at order-1, running a BitTorrent client (downloading a
Debian ISO) and building a kernel with -j2.
This patch addresses the problem on the desktop machine booted with mem=256mb.
It works by setting aside a reserve of pageblock_nr_pages blocks, the
number of which depends on the value of min_free_kbytes. These blocks are
only fallen back to when there are no other free pages. Allocations from
the reserve then take the smallest possible page, just like the normal buddy
allocator, instead of the largest possible page, so that contiguous pages
are preserved. The pages on the free lists of the reserve blocks are never
taken for another migrate type. The result is that even if min_free_kbytes
is set to a low value, contiguous blocks will be preserved in the
MIGRATE_RESERVE blocks as the pages will become contiguous again on free.
This works better than the vanilla allocator because if min_free_kbytes is
increased, a new reserve block will be chosen based on the location of
reclaimable pages and the block will free up as contiguous pages. In the
vanilla allocator, no effort is made to target a block of pages to free as
contiguous pages and min_free_kbytes pages are scattered randomly.
This effect has been observed on the test machine. min_free_kbytes was
initially set low and the free pages were kept as a contiguous block within
MIGRATE_RESERVE. min_free_kbytes was then set to a higher value and, over a
period of time, the free contiguous memory appeared within the reserve
blocks. How long this takes depends on how quickly the LRU is rotating.
Amusingly, this means that more activity frees the blocks faster.
Credit to Mariusz Kozlowski for discovering the problem, describing the
failure scenario and testing patches and scenarios.
Signed-off-by: Mel Gorman <[email protected]>
[[email protected]: cleanups]
Acked-by: Andy Whitcroft <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/mmzone.h | 3 -
mm/page_alloc.c | 129 +++++++++++++++++++++++++++++++++++---------
2 files changed, 105 insertions(+), 27 deletions(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-009-do-not-group-pages-by-mobility-type-on-low-memory-systems/include/linux/mmzone.h linux-2.6.23-rc5-010-bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks/include/linux/mmzone.h
--- linux-2.6.23-rc5-009-do-not-group-pages-by-mobility-type-on-low-memory-systems/include/linux/mmzone.h 2007-09-02 16:21:10.000000000 +0100
+++ linux-2.6.23-rc5-010-bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks/include/linux/mmzone.h 2007-09-02 16:22:04.000000000 +0100
@@ -36,7 +36,8 @@
#define MIGRATE_UNMOVABLE 0
#define MIGRATE_RECLAIMABLE 1
#define MIGRATE_MOVABLE 2
-#define MIGRATE_TYPES 3
+#define MIGRATE_RESERVE 3
+#define MIGRATE_TYPES 4
#define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-009-do-not-group-pages-by-mobility-type-on-low-memory-systems/mm/page_alloc.c linux-2.6.23-rc5-010-bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks/mm/page_alloc.c
--- linux-2.6.23-rc5-009-do-not-group-pages-by-mobility-type-on-low-memory-systems/mm/page_alloc.c 2007-09-02 16:21:30.000000000 +0100
+++ linux-2.6.23-rc5-010-bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks/mm/page_alloc.c 2007-09-02 16:22:04.000000000 +0100
@@ -662,13 +662,44 @@ static int prep_new_page(struct page *pa
}
/*
+ * Go through the free lists for the given migratetype and remove
+ * the smallest available page from the freelists
+ */
+static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ unsigned int current_order;
+ struct free_area *area;
+ struct page *page;
+
+ /* Find a page of the appropriate size in the preferred list */
+ for (current_order = order; current_order < MAX_ORDER; ++current_order) {
+ area = &(zone->free_area[current_order]);
+ if (list_empty(&area->free_list[migratetype]))
+ continue;
+
+ page = list_entry(area->free_list[migratetype].next,
+ struct page, lru);
+ list_del(&page->lru);
+ rmv_page_order(page);
+ area->nr_free--;
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+ expand(zone, page, order, current_order, area, migratetype);
+ return page;
+ }
+
+ return NULL;
+}
+
+/*
* This array describes the order lists are fallen back to when
* the free lists for the desirable migrate type are depleted
*/
static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = {
- [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE },
- [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE },
- [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE },
+ [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
+ [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
+ [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
+ [MIGRATE_RESERVE] = { MIGRATE_RESERVE, MIGRATE_RESERVE, MIGRATE_RESERVE }, /* Never used */
};
/*
@@ -752,6 +783,10 @@ static struct page *__rmqueue_fallback(s
for (i = 0; i < MIGRATE_TYPES - 1; i++) {
migratetype = fallbacks[start_migratetype][i];
+ /* MIGRATE_RESERVE handled later if necessary */
+ if (migratetype == MIGRATE_RESERVE)
+ continue;
+
area = &(zone->free_area[current_order]);
if (list_empty(&area->free_list[migratetype]))
continue;
@@ -785,39 +820,23 @@ static struct page *__rmqueue_fallback(s
}
}
- return NULL;
+ /* Use MIGRATE_RESERVE rather than fail an allocation */
+ return __rmqueue_smallest(zone, order, MIGRATE_RESERVE);
}
-/*
+/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
static struct page *__rmqueue(struct zone *zone, unsigned int order,
int migratetype)
{
- struct free_area * area;
- unsigned int current_order;
struct page *page;
- /* Find a page of the appropriate size in the preferred list */
- for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = &(zone->free_area[current_order]);
- if (list_empty(&area->free_list[migratetype]))
- continue;
-
- page = list_entry(area->free_list[migratetype].next,
- struct page, lru);
- list_del(&page->lru);
- rmv_page_order(page);
- area->nr_free--;
- __mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order));
- expand(zone, page, order, current_order, area, migratetype);
- goto got_page;
- }
-
- page = __rmqueue_fallback(zone, order, migratetype);
+ page = __rmqueue_smallest(zone, order, migratetype);
-got_page:
+ if (unlikely(!page))
+ page = __rmqueue_fallback(zone, order, migratetype);
return page;
}
@@ -2395,6 +2414,61 @@ static inline unsigned long wait_table_b
#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
/*
+ * Mark a number of pageblocks as MIGRATE_RESERVE. The number
+ * of blocks reserved is based on zone->pages_min. The memory within the
+ * reserve will tend to store contiguous free pages. Setting min_free_kbytes
+ * higher will lead to a bigger reserve which will get freed as contiguous
+ * blocks as reclaim kicks in
+ */
+static void setup_zone_migrate_reserve(struct zone *zone)
+{
+ unsigned long start_pfn, pfn, end_pfn;
+ struct page *page;
+ unsigned long reserve, block_migratetype;
+
+ /* Get the start pfn, end pfn and the number of blocks to reserve */
+ start_pfn = zone->zone_start_pfn;
+ end_pfn = start_pfn + zone->spanned_pages;
+ reserve = roundup(zone->pages_min, pageblock_nr_pages) >>
+ pageblock_order;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ if (!pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+
+ /* Blocks with reserved pages will never free, skip them. */
+ if (PageReserved(page))
+ continue;
+
+ block_migratetype = get_pageblock_migratetype(page);
+
+ /* If this block is reserved, account for it */
+ if (reserve > 0 && block_migratetype == MIGRATE_RESERVE) {
+ reserve--;
+ continue;
+ }
+
+ /* Suitable for reserving if this block is movable */
+ if (reserve > 0 && block_migratetype == MIGRATE_MOVABLE) {
+ set_pageblock_migratetype(page, MIGRATE_RESERVE);
+ move_freepages_block(zone, page, MIGRATE_RESERVE);
+ reserve--;
+ continue;
+ }
+
+ /*
+ * If the reserve is met and this is a previous reserved block,
+ * take it back
+ */
+ if (block_migratetype == MIGRATE_RESERVE) {
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ move_freepages_block(zone, page, MIGRATE_MOVABLE);
+ }
+ }
+}
+
+/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
@@ -2429,7 +2503,9 @@ void __meminit memmap_init_zone(unsigned
* movable at startup. This will force kernel allocations
* to reserve their blocks rather than leaking throughout
* the address space during boot when many long-lived
- * kernel allocations are made
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
*/
if ((pfn & (pageblock_nr_pages-1)))
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -3961,6 +4037,7 @@ void setup_per_zone_pages_min(void)
zone->pages_low = zone->pages_min + (tmp >> 2);
zone->pages_high = zone->pages_min + (tmp >> 1);
+ setup_zone_migrate_reserve(zone);
spin_unlock_irqrestore(&zone->lru_lock, flags);
}
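A rough worked example of the reserve sizing, with assumed numbers and ignoring how pages_min is divided between zones: min_free_kbytes=16384 with 4K pages gives about 4096 minimum free pages; with pageblock_order = 10, roundup(4096, 1024) >> 10 = 4, so a zone holding all of those min pages would have four pageblocks marked MIGRATE_RESERVE.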
Subject: Bias the placement of kernel pages at lower pfns
This patch chooses blocks with lower PFNs when placing kernel allocations.
This is particularly important during fallback in low memory situations to
stop unmovable pages being placed throughout the entire address space.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/page_alloc.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-010-bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks/mm/page_alloc.c linux-2.6.23-rc5-011-bias-the-placement-of-kernel-pages-at-lower-pfns/mm/page_alloc.c
--- linux-2.6.23-rc5-010-bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks/mm/page_alloc.c 2007-09-02 16:22:04.000000000 +0100
+++ linux-2.6.23-rc5-011-bias-the-placement-of-kernel-pages-at-lower-pfns/mm/page_alloc.c 2007-09-02 16:22:27.000000000 +0100
@@ -768,6 +768,23 @@ int move_freepages_block(struct zone *zo
return move_freepages(zone, start_page, end_page, migratetype);
}
+/* Return the page with the lowest PFN in the list */
+static struct page *min_page(struct list_head *list)
+{
+ unsigned long min_pfn = -1UL;
+ struct page *min_page = NULL, *page;
+
+ list_for_each_entry(page, list, lru) {
+ unsigned long pfn = page_to_pfn(page);
+ if (pfn < min_pfn) {
+ min_pfn = pfn;
+ min_page = page;
+ }
+ }
+
+ return min_page;
+}
+
/* Remove an element from the buddy allocator from the fallback list */
static struct page *__rmqueue_fallback(struct zone *zone, int order,
int start_migratetype)
@@ -791,8 +808,11 @@ static struct page *__rmqueue_fallback(s
if (list_empty(&area->free_list[migratetype]))
continue;
+ /* Bias kernel allocations towards low pfns */
page = list_entry(area->free_list[migratetype].next,
struct page, lru);
+ if (unlikely(start_migratetype != MIGRATE_MOVABLE))
+ page = min_page(&area->free_list[migratetype]);
area->nr_free--;
/*
Subject: Be more aggressive about stealing when MIGRATE_RECLAIMABLE allocations fallback
MIGRATE_RECLAIMABLE allocations tend to be very bursty in nature, like when
updatedb starts. It is likely this will occur in situations where MAX_ORDER
blocks of pages are not free. This means that updatedb can scatter
MIGRATE_RECLAIMABLE pages throughout the address space. This patch is more
aggressive about stealing blocks of pages for MIGRATE_RECLAIMABLE.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/page_alloc.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-011-bias-the-placement-of-kernel-pages-at-lower-pfns/mm/page_alloc.c linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/mm/page_alloc.c
--- linux-2.6.23-rc5-011-bias-the-placement-of-kernel-pages-at-lower-pfns/mm/page_alloc.c 2007-09-02 16:22:27.000000000 +0100
+++ linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/mm/page_alloc.c 2007-09-02 16:22:47.000000000 +0100
@@ -713,7 +713,7 @@ int move_freepages(struct zone *zone,
{
struct page *page;
unsigned long order;
- int blocks_moved = 0;
+ int pages_moved = 0;
#ifndef CONFIG_HOLES_IN_ZONE
/*
@@ -742,10 +742,10 @@ int move_freepages(struct zone *zone,
list_add(&page->lru,
&zone->free_area[order].free_list[migratetype]);
page += 1 << order;
- blocks_moved++;
+ pages_moved += 1 << order;
}
- return blocks_moved;
+ return pages_moved;
}
int move_freepages_block(struct zone *zone, struct page *page, int migratetype)
@@ -817,11 +817,22 @@ static struct page *__rmqueue_fallback(s
/*
* If breaking a large block of pages, move all free
- * pages to the preferred allocation list
+ * pages to the preferred allocation list. If falling
+ * back for a reclaimable kernel allocation, be more
+ * aggressive about taking ownership of free pages
*/
- if (unlikely(current_order >= (pageblock_order >> 1))) {
+ if (unlikely(current_order >= (pageblock_order >> 1)) ||
+ start_migratetype == MIGRATE_RECLAIMABLE) {
+ unsigned long pages;
+ pages = move_freepages_block(zone, page,
+ start_migratetype);
+
+ /* Claim the whole block if over half of it is free */
+ if (pages >= (1 << (pageblock_order-1)))
+ set_pageblock_migratetype(page,
+ start_migratetype);
+
migratetype = start_migratetype;
- move_freepages_block(zone, page, migratetype);
}
/* Remove the page from the freelists */
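For the claim threshold above, assuming pageblock_order = 10 for illustration: 1 << (pageblock_order - 1) is 512, so ownership of the whole 1024-page block is transferred once at least half of its pages have been moved to the new type's free lists.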
Subject: Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo
This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
The information is collected only on request so there is no runtime overhead.
The statistics are in three parts:
The first part prints information on the size of blocks that pages are
being grouped on and looks like
Page block order: 10
Pages per block: 1024
The second part is a more detailed version of /proc/buddyinfo and looks like
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Reclaimable 1 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Reserve 0 4 4 0 0 0 0 1 0 1 0
Node 0, zone Normal, type Unmovable 111 8 4 4 2 3 1 0 0 0 0
Node 0, zone Normal, type Reclaimable 293 89 8 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Movable 1 6 13 9 7 6 3 0 0 0 0
Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 4
The third part looks like
Number of blocks type Unmovable Reclaimable Movable Reserve
Node 0, zone DMA 0 1 2 1
Node 0, zone Normal 3 17 94 4
To walk the zones within a node with interrupts disabled, walk_zones_in_node()
is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
/proc/pagetypeinfo to reduce code duplication. It seems specific to what
vmstat.c requires but could be broken out as a general utility function in
mmzone.c if there were other potential users.
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Andy Whitcroft <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/proc/proc_misc.c | 14 ++
include/linux/gfp.h | 12 +
include/linux/mmzone.h | 10 +
mm/page_alloc.c | 20 ---
mm/vmstat.c | 284 +++++++++++++++++++++++++++++++-------------
5 files changed, 240 insertions(+), 100 deletions(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/fs/proc/proc_misc.c linux-2.6.23-rc5-013-print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo/fs/proc/proc_misc.c
--- linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/fs/proc/proc_misc.c 2007-09-02 16:22:31.000000000 +0100
+++ linux-2.6.23-rc5-013-print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo/fs/proc/proc_misc.c 2007-09-02 16:23:11.000000000 +0100
@@ -230,6 +230,19 @@ static const struct file_operations frag
.release = seq_release,
};
+extern struct seq_operations pagetypeinfo_op;
+static int pagetypeinfo_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &pagetypeinfo_op);
+}
+
+static const struct file_operations pagetypeinfo_file_ops = {
+ .open = pagetypeinfo_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
extern struct seq_operations zoneinfo_op;
static int zoneinfo_open(struct inode *inode, struct file *file)
{
@@ -716,6 +729,7 @@ void __init proc_misc_init(void)
#endif
#endif
create_seq_entry("buddyinfo",S_IRUGO, &fragmentation_file_operations);
+ create_seq_entry("pagetypeinfo", S_IRUGO, &pagetypeinfo_file_ops);
create_seq_entry("vmstat",S_IRUGO, &proc_vmstat_file_operations);
create_seq_entry("zoneinfo",S_IRUGO, &proc_zoneinfo_file_operations);
#ifdef CONFIG_BLOCK
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/include/linux/gfp.h linux-2.6.23-rc5-013-print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo/include/linux/gfp.h
--- linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/include/linux/gfp.h 2007-09-02 16:22:29.000000000 +0100
+++ linux-2.6.23-rc5-013-print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo/include/linux/gfp.h 2007-09-02 16:23:11.000000000 +0100
@@ -100,6 +100,18 @@ struct vm_area_struct;
/* 4GB DMA on some platforms */
#define GFP_DMA32 __GFP_DMA32
+/* Convert GFP flags to their corresponding migrate type */
+static inline int allocflags_to_migratetype(gfp_t gfp_flags)
+{
+ WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+
+ if (unlikely(page_group_by_mobility_disabled))
+ return MIGRATE_UNMOVABLE;
+
+ /* Group based on mobility */
+ return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
+ ((gfp_flags & __GFP_RECLAIMABLE) != 0);
+}
static inline enum zone_type gfp_zone(gfp_t flags)
{
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/include/linux/mmzone.h linux-2.6.23-rc5-013-print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo/include/linux/mmzone.h
--- linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/include/linux/mmzone.h 2007-09-02 16:22:29.000000000 +0100
+++ linux-2.6.23-rc5-013-print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo/include/linux/mmzone.h 2007-09-02 16:23:11.000000000 +0100
@@ -43,6 +43,16 @@
for (order = 0; order < MAX_ORDER; order++) \
for (type = 0; type < MIGRATE_TYPES; type++)
+extern int page_group_by_mobility_disabled;
+
+static inline int get_pageblock_migratetype(struct page *page)
+{
+ if (unlikely(page_group_by_mobility_disabled))
+ return MIGRATE_UNMOVABLE;
+
+ return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
+}
+
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/mm/page_alloc.c linux-2.6.23-rc5-013-print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo/mm/page_alloc.c
--- linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/mm/page_alloc.c 2007-09-02 16:22:47.000000000 +0100
+++ linux-2.6.23-rc5-013-print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo/mm/page_alloc.c 2007-09-02 16:23:11.000000000 +0100
@@ -156,32 +156,12 @@ EXPORT_SYMBOL(nr_node_ids);
int page_group_by_mobility_disabled __read_mostly;
-static inline int get_pageblock_migratetype(struct page *page)
-{
- if (unlikely(page_group_by_mobility_disabled))
- return MIGRATE_UNMOVABLE;
-
- return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
-}
-
static void set_pageblock_migratetype(struct page *page, int migratetype)
{
set_pageblock_flags_group(page, (unsigned long)migratetype,
PB_migrate, PB_migrate_end);
}
-static inline int allocflags_to_migratetype(gfp_t gfp_flags)
-{
- WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
-
- if (unlikely(page_group_by_mobility_disabled))
- return MIGRATE_UNMOVABLE;
-
- /* Cluster based on mobility */
- return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
- ((gfp_flags & __GFP_RECLAIMABLE) != 0);
-}
-
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/mm/vmstat.c linux-2.6.23-rc5-013-print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo/mm/vmstat.c
--- linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/mm/vmstat.c 2007-09-02 16:22:34.000000000 +0100
+++ linux-2.6.23-rc5-013-print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo/mm/vmstat.c 2007-09-02 16:23:11.000000000 +0100
@@ -398,6 +398,13 @@ void zone_statistics(struct zonelist *zo
#include <linux/seq_file.h>
+static char * const migratetype_names[MIGRATE_TYPES] = {
+ "Unmovable",
+ "Reclaimable",
+ "Movable",
+ "Reserve",
+};
+
static void *frag_start(struct seq_file *m, loff_t *pos)
{
pg_data_t *pgdat;
@@ -422,28 +429,144 @@ static void frag_stop(struct seq_file *m
{
}
-/*
- * This walks the free areas for each zone.
- */
-static int frag_show(struct seq_file *m, void *arg)
+/* Walk all the zones in a node and print using a callback */
+static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
+ void (*print)(struct seq_file *m, pg_data_t *, struct zone *))
{
- pg_data_t *pgdat = (pg_data_t *)arg;
struct zone *zone;
struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
- int order;
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!populated_zone(zone))
continue;
spin_lock_irqsave(&zone->lock, flags);
- seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
- for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ print(m, pgdat, zone);
spin_unlock_irqrestore(&zone->lock, flags);
+ }
+}
+
+static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
+ struct zone *zone)
+{
+ int order;
+
+ seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+ for (order = 0; order < MAX_ORDER; ++order)
+ seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ seq_putc(m, '\n');
+}
+
+/*
+ * This walks the free areas for each zone.
+ */
+static int frag_show(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+ walk_zones_in_node(m, pgdat, frag_show_print);
+ return 0;
+}
+
+static void pagetypeinfo_showfree_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ int order, mtype;
+
+ for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
+ seq_printf(m, "Node %4d, zone %8s, type %12s ",
+ pgdat->node_id,
+ zone->name,
+ migratetype_names[mtype]);
+ for (order = 0; order < MAX_ORDER; ++order) {
+ unsigned long freecount = 0;
+ struct free_area *area;
+ struct list_head *curr;
+
+ area = &(zone->free_area[order]);
+
+ list_for_each(curr, &area->free_list[mtype])
+ freecount++;
+ seq_printf(m, "%6lu ", freecount);
+ }
seq_putc(m, '\n');
}
+}
+
+/* Print out the free pages at each order for each migratetype */
+static int pagetypeinfo_showfree(struct seq_file *m, void *arg)
+{
+ int order;
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ /* Print header */
+ seq_printf(m, "%-43s ", "Free pages count per migrate type at order");
+ for (order = 0; order < MAX_ORDER; ++order)
+ seq_printf(m, "%6d ", order);
+ seq_putc(m, '\n');
+
+ walk_zones_in_node(m, pgdat, pagetypeinfo_showfree_print);
+
+ return 0;
+}
+
+static void pagetypeinfo_showblockcount_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ int mtype;
+ unsigned long pfn;
+ unsigned long start_pfn = zone->zone_start_pfn;
+ unsigned long end_pfn = start_pfn + zone->spanned_pages;
+ unsigned long count[MIGRATE_TYPES] = { 0, };
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+ mtype = get_pageblock_migratetype(page);
+
+ count[mtype]++;
+ }
+
+ /* Print counts */
+ seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+ for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+ seq_printf(m, "%12lu ", count[mtype]);
+ seq_putc(m, '\n');
+}
+
+/* Print out the number of pageblocks for each migratetype */
+static int pagetypeinfo_showblockcount(struct seq_file *m, void *arg)
+{
+ int mtype;
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ seq_printf(m, "\n%-23s", "Number of blocks type ");
+ for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+ seq_printf(m, "%12s ", migratetype_names[mtype]);
+ seq_putc(m, '\n');
+ walk_zones_in_node(m, pgdat, pagetypeinfo_showblockcount_print);
+
+ return 0;
+}
+
+/*
+ * This prints out statistics in relation to grouping pages by mobility.
+ * It is expensive to collect so do not constantly read the file.
+ */
+static int pagetypeinfo_show(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ seq_printf(m, "Page block order: %d\n", pageblock_order);
+ seq_printf(m, "Pages per block: %lu\n", pageblock_nr_pages);
+ seq_putc(m, '\n');
+ pagetypeinfo_showfree(m, pgdat);
+ pagetypeinfo_showblockcount(m, pgdat);
+
return 0;
}
@@ -454,6 +577,13 @@ const struct seq_operations fragmentatio
.show = frag_show,
};
+const struct seq_operations pagetypeinfo_op = {
+ .start = frag_start,
+ .next = frag_next,
+ .stop = frag_stop,
+ .show = pagetypeinfo_show,
+};
+
#ifdef CONFIG_ZONE_DMA
#define TEXT_FOR_DMA(xx) xx "_dma",
#else
@@ -532,84 +662,78 @@ static const char * const vmstat_text[]
#endif
};
-/*
- * Output information about zones in @pgdat.
- */
-static int zoneinfo_show(struct seq_file *m, void *arg)
+static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
+ struct zone *zone)
{
- pg_data_t *pgdat = arg;
- struct zone *zone;
- struct zone *node_zones = pgdat->node_zones;
- unsigned long flags;
-
- for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; zone++) {
- int i;
+ int i;
+ seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+ seq_printf(m,
+ "\n pages free %lu"
+ "\n min %lu"
+ "\n low %lu"
+ "\n high %lu"
+ "\n scanned %lu (a: %lu i: %lu)"
+ "\n spanned %lu"
+ "\n present %lu",
+ zone_page_state(zone, NR_FREE_PAGES),
+ zone->pages_min,
+ zone->pages_low,
+ zone->pages_high,
+ zone->pages_scanned,
+ zone->nr_scan_active, zone->nr_scan_inactive,
+ zone->spanned_pages,
+ zone->present_pages);
- if (!populated_zone(zone))
- continue;
+ for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+ seq_printf(m, "\n %-12s %lu", vmstat_text[i],
+ zone_page_state(zone, i));
- spin_lock_irqsave(&zone->lock, flags);
- seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
- seq_printf(m,
- "\n pages free %lu"
- "\n min %lu"
- "\n low %lu"
- "\n high %lu"
- "\n scanned %lu (a: %lu i: %lu)"
- "\n spanned %lu"
- "\n present %lu",
- zone_page_state(zone, NR_FREE_PAGES),
- zone->pages_min,
- zone->pages_low,
- zone->pages_high,
- zone->pages_scanned,
- zone->nr_scan_active, zone->nr_scan_inactive,
- zone->spanned_pages,
- zone->present_pages);
-
- for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
- seq_printf(m, "\n %-12s %lu", vmstat_text[i],
- zone_page_state(zone, i));
-
- seq_printf(m,
- "\n protection: (%lu",
- zone->lowmem_reserve[0]);
- for (i = 1; i < ARRAY_SIZE(zone->lowmem_reserve); i++)
- seq_printf(m, ", %lu", zone->lowmem_reserve[i]);
- seq_printf(m,
- ")"
- "\n pagesets");
- for_each_online_cpu(i) {
- struct per_cpu_pageset *pageset;
- int j;
-
- pageset = zone_pcp(zone, i);
- for (j = 0; j < ARRAY_SIZE(pageset->pcp); j++) {
- seq_printf(m,
- "\n cpu: %i pcp: %i"
- "\n count: %i"
- "\n high: %i"
- "\n batch: %i",
- i, j,
- pageset->pcp[j].count,
- pageset->pcp[j].high,
- pageset->pcp[j].batch);
+ seq_printf(m,
+ "\n protection: (%lu",
+ zone->lowmem_reserve[0]);
+ for (i = 1; i < ARRAY_SIZE(zone->lowmem_reserve); i++)
+ seq_printf(m, ", %lu", zone->lowmem_reserve[i]);
+ seq_printf(m,
+ ")"
+ "\n pagesets");
+ for_each_online_cpu(i) {
+ struct per_cpu_pageset *pageset;
+ int j;
+
+ pageset = zone_pcp(zone, i);
+ for (j = 0; j < ARRAY_SIZE(pageset->pcp); j++) {
+ seq_printf(m,
+ "\n cpu: %i pcp: %i"
+ "\n count: %i"
+ "\n high: %i"
+ "\n batch: %i",
+ i, j,
+ pageset->pcp[j].count,
+ pageset->pcp[j].high,
+ pageset->pcp[j].batch);
}
#ifdef CONFIG_SMP
- seq_printf(m, "\n vm stats threshold: %d",
- pageset->stat_threshold);
+ seq_printf(m, "\n vm stats threshold: %d",
+ pageset->stat_threshold);
#endif
- }
- seq_printf(m,
- "\n all_unreclaimable: %u"
- "\n prev_priority: %i"
- "\n start_pfn: %lu",
- zone->all_unreclaimable,
- zone->prev_priority,
- zone->zone_start_pfn);
- spin_unlock_irqrestore(&zone->lock, flags);
- seq_putc(m, '\n');
}
+ seq_printf(m,
+ "\n all_unreclaimable: %u"
+ "\n prev_priority: %i"
+ "\n start_pfn: %lu",
+ zone->all_unreclaimable,
+ zone->prev_priority,
+ zone->zone_start_pfn);
+ seq_putc(m, '\n');
+}
+
+/*
+ * Output information about zones in @pgdat.
+ */
+static int zoneinfo_show(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+ walk_zones_in_node(m, pgdat, zoneinfo_show_print);
return 0;
}
Minor nit, Mel.
It's easier to read patches if you use the diff -p option:
-p --show-c-function
Show which C function each change is in.
Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On Mon, 2007-09-10 at 12:44 -0700, Paul Jackson wrote:
> Minor nit, Mel.
>
> It's easier to read patches if you use the diff -p option:
>
> -p --show-c-function
> Show which C function each change is in.
>
That's a fair comment. I normally make sure it's there but it got missed
in a few patches in this set which is awkward. Sorry about that.
--
Mel Gorman
On Monday 10 September 2007 21:22, Mel Gorman wrote:
> Per-cpu pages can accidentally cause fragmentation because they are free,
> but pinned pages in an otherwise contiguous block. When this patch is
> applied, the per-cpu caches are drained after the direct-reclaim is entered
> if the requested order is greater than 0. It simply reuses the code used
> by suspend and hotplug.
Does this help? I have a more general version which could go in
instead (independently of the anti-fragmentation patches).
> Signed-off-by: Mel Gorman <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> mm/page_alloc.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-006-group-short-lived-and-reclaimable-kernel-allocations/mm/page_alloc.c linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c
> --- linux-2.6.23-rc5-006-group-short-lived-and-reclaimable-kernel-allocations/mm/page_alloc.c 2007-09-02 16:20:31.000000000 +0100
> +++ linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c 2007-09-02 16:20:48.000000000 +0100
> @@ -852,6 +852,7 @@ void mark_free_pages(struct zone *zone)
> }
> spin_unlock_irqrestore(&zone->lock, flags);
> }
> +#endif /* CONFIG_PM */
>
> /*
> * Spill all of this CPU's per-cpu pages back into the buddy allocator.
> @@ -864,7 +865,25 @@ void drain_local_pages(void)
> __drain_pages(smp_processor_id());
> local_irq_restore(flags);
> }
> -#endif /* CONFIG_HIBERNATION */
> +
> +void smp_drain_local_pages(void *arg)
> +{
> + drain_local_pages();
> +}
> +
> +/*
> + * Spill all the per-cpu pages from all CPUs back into the buddy allocator
> + */
> +void drain_all_local_pages(void)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + __drain_pages(smp_processor_id());
> + local_irq_restore(flags);
> +
> + smp_call_function(smp_drain_local_pages, NULL, 0, 1);
> +}
>
> /*
> * Free a 0-order page
> @@ -1452,6 +1471,9 @@ nofail_alloc:
>
> cond_resched();
>
> + if (order != 0)
> + drain_all_local_pages();
> +
> if (likely(did_some_progress)) {
> page = get_page_from_freelist(gfp_mask, order,
> zonelist, alloc_flags);
>
On Tue, 2007-09-11 at 01:05 +1000, Nick Piggin wrote:
> On Monday 10 September 2007 21:22, Mel Gorman wrote:
> > Per-cpu pages can accidentally cause fragmentation because they are free,
> > but pinned pages in an otherwise contiguous block. When this patch is
> > applied, the per-cpu caches are drained after the direct-reclaim is entered
> > if the requested order is greater than 0. It simply reuses the code used
> > by suspend and hotplug.
>
> Does this help? I have a more general version which could go in
> instead (independently of the anti fragmentation patches).
Yes, it does help. It's noticeable when one is trying to get as much
memory into hugepages as possible. It reaches a certain point where
hugepages are free but pinned due to per-cpu pages. This "certain point"
depends on the ratio of the number of CPUs to the size of physical memory,
as well as a certain degree of randomness, as the location of per-cpu
pages is not predictable. The worst case is not being able to allocate
something like (NR_CPUS * pcp->high * 2) hugepages even if they are
otherwise free.
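To put assumed numbers on that worst case: with NR_CPUS = 4 and pcp->high = 186 for each of the hot and cold lists, NR_CPUS * pcp->high * 2 = 1488, so each of up to 1488 per-cpu pages could pin a distinct hugepage-sized block.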
By all means if you have a general version, send it and I'll take a
look. If it's more general and nicer but still can be used to drain the
per-cpu lists when high-order allocations fail, I'm all for it.
Thanks Nick
> > Signed-off-by: Mel Gorman <[email protected]>
> > Signed-off-by: Andrew Morton <[email protected]>
> > ---
> >
> > mm/page_alloc.c | 24 +++++++++++++++++++++++-
> > 1 file changed, 23 insertions(+), 1 deletion(-)
> >
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-006-group-short-lived-and-reclaimable-kernel-allocations/mm/page_alloc.c linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c
> > --- linux-2.6.23-rc5-006-group-short-lived-and-reclaimable-kernel-allocations/mm/page_alloc.c 2007-09-02 16:20:31.000000000 +0100
> > +++ linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c 2007-09-02 16:20:48.000000000 +0100
> > @@ -852,6 +852,7 @@ void mark_free_pages(struct zone *zone)
> > }
> > spin_unlock_irqrestore(&zone->lock, flags);
> > }
> > +#endif /* CONFIG_PM */
> >
> > /*
> > * Spill all of this CPU's per-cpu pages back into the buddy allocator.
> > @@ -864,7 +865,25 @@ void drain_local_pages(void)
> > __drain_pages(smp_processor_id());
> > local_irq_restore(flags);
> > }
> > -#endif /* CONFIG_HIBERNATION */
> > +
> > +void smp_drain_local_pages(void *arg)
> > +{
> > + drain_local_pages();
> > +}
> > +
> > +/*
> > + * Spill all the per-cpu pages from all CPUs back into the buddy allocator
> > + */
> > +void drain_all_local_pages(void)
> > +{
> > + unsigned long flags;
> > +
> > + local_irq_save(flags);
> > + __drain_pages(smp_processor_id());
> > + local_irq_restore(flags);
> > +
> > + smp_call_function(smp_drain_local_pages, NULL, 0, 1);
> > +}
> >
> > /*
> > * Free a 0-order page
> > @@ -1452,6 +1471,9 @@ nofail_alloc:
> >
> > cond_resched();
> >
> > + if (order != 0)
> > + drain_all_local_pages();
> > +
> > if (likely(did_some_progress)) {
> > page = get_page_from_freelist(gfp_mask, order,
> > zonelist, alloc_flags);
On Mon, 10 Sep 2007 12:20:11 +0100 (IST) Mel Gorman <[email protected]> wrote:
> Here is a restacked version of the grouping pages by mobility patches
> based on the patches currently in your tree. It should be a drop-in
> replacement for what is in 2.6.23-rc4-mm1 and is what I propose for merging
> to mainline.
It really gives me the creeps to throw away a large set of large patches
and to then introduce a new set.
What would go wrong if we just merged the patches I already have?
On (13/09/07 18:01), Andrew Morton didst pronounce:
> On Mon, 10 Sep 2007 12:20:11 +0100 (IST) Mel Gorman <[email protected]> wrote:
>
> > Here is a restacked version of the grouping pages by mobility patches
> > based on the patches currently in your tree. It should be a drop-in
> > replacement for what is in 2.6.23-rc4-mm1 and is what I propose for merging
> > to mainline.
>
> It really gives me the creeps to throw away a large set of large patches
> and to then introduce a new set.
>
I can understand that logic.
> What would go wrong if we just merged the patches I already have?
>
Nothing, the end result is more or less the same. There are three style
cleanups in the restack and for some reason, one of the functions moved
but otherwise they are identical.
The restacked version was provided to illustrate what the final stack really
looks like and because I thought you would prefer it over a stack that had
one patch introducing a change and a later patch removing it (like making
it configurable for example). It also allowed us to test against mainline
to make sure everything was ok prior to the merge.
Go ahead with the patches you already
have if you prefer. Just make sure not to include
breakout-page_order-to-internalh-to-avoid-special-knowledge-of-the-buddy-allocator.patch
as it's only required for page-owner-tracking.
Thanks Andrew.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Fri, 14 Sep 2007 15:33:55 +0100 [email protected] (Mel Gorman) wrote:
> Go ahead with the patches you already
> have if you prefer. Just make sure not to include
> breakout-page_order-to-internalh-to-avoid-special-knowledge-of-the-buddy-allocator.patch
> as it's only required for page-owner-tracking.
memory-unplug-v7-page-isolation.patch uses page_order() also, so I brought this patch back.
On Mon, 10 Sep 2007 12:21:51 +0100 (IST)
Mel Gorman <[email protected]> wrote:
>
A somewhat belated review comment.
> The freelists for each migrate type can slowly become polluted due to the
> per-cpu list. Consider what happens when the following happens
>
> 1. A 2^pageblock_order list is reserved for __GFP_MOVABLE pages
> 2. An order-0 page is allocated from the newly reserved block
> 3. The page is freed and placed on the per-cpu list
> 4. alloc_page() is called with GFP_KERNEL as the gfp_mask
> 5. The per-cpu list is used to satisfy the allocation
>
> This results in a kernel page being in the middle of a migratable region. This
> patch prevents this leak occurring by storing the MIGRATE_ type of the page in
> page->private. On allocation, a page will only be returned of the desired type,
> else more pages will be allocated. This may temporarily allow a per-cpu list
> to go over the pcp->high limit but it'll be corrected on the next free. Care
> is taken to preserve the hotness of pages recently freed.
>
> The additional code is not measurably slower for the workloads we've tested.
It sure looks slower.
> Signed-off-by: Mel Gorman <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> mm/page_alloc.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/mm/page_alloc.c linux-2.6.23-rc5-005-choose-pages-from-the-per-cpu-list-based-on-migration-type/mm/page_alloc.c
> --- linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/mm/page_alloc.c 2007-09-02 16:19:34.000000000 +0100
> +++ linux-2.6.23-rc5-005-choose-pages-from-the-per-cpu-list-based-on-migration-type/mm/page_alloc.c 2007-09-02 16:20:09.000000000 +0100
> @@ -757,7 +757,8 @@ static int rmqueue_bulk(struct zone *zon
> struct page *page = __rmqueue(zone, order, migratetype);
> if (unlikely(page == NULL))
> break;
> - list_add_tail(&page->lru, list);
> + list_add(&page->lru, list);
> + set_page_private(page, migratetype);
> }
> spin_unlock(&zone->lock);
> return i;
> @@ -884,6 +885,7 @@ static void fastcall free_hot_cold_page(
> local_irq_save(flags);
> __count_vm_event(PGFREE);
> list_add(&page->lru, &pcp->list);
> + set_page_private(page, get_pageblock_migratetype(page));
> pcp->count++;
> if (pcp->count >= pcp->high) {
> free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
> @@ -948,7 +950,19 @@ again:
> if (unlikely(!pcp->count))
> goto failed;
> }
> - page = list_entry(pcp->list.next, struct page, lru);
> +
> + /* Find a page of the appropriate migrate type */
> + list_for_each_entry(page, &pcp->list, lru)
> + if (page_private(page) == migratetype)
> + break;
We're doing a linear search through the per-cpu magazines right there
in the page allocator hot path. Even if the search matches the first
element, the setup costs will matter.
Surely we can make this search go away with a better choice of data
structures?
> + /* Allocate more to the pcp list if necessary */
> + if (unlikely(&page->lru == &pcp->list)) {
> + pcp->count += rmqueue_bulk(zone, 0,
> + pcp->batch, &pcp->list, migratetype);
> + page = list_entry(pcp->list.next, struct page, lru);
> + }
> +
> list_del(&page->lru);
> pcp->count--;
> } else {
On Mon, Jul 13, 2009 at 12:16:28PM -0700, Andrew Morton wrote:
> On Mon, 10 Sep 2007 12:21:51 +0100 (IST)
> Mel Gorman <[email protected]> wrote:
> >
>
> A somewhat belated review comment.
>
> > The freelists for each migrate type can slowly become polluted due to the
> > per-cpu list. Consider what happens when the following happens
> >
> > 1. A 2^pageblock_order list is reserved for __GFP_MOVABLE pages
> > 2. An order-0 page is allocated from the newly reserved block
> > 3. The page is freed and placed on the per-cpu list
> > 4. alloc_page() is called with GFP_KERNEL as the gfp_mask
> > 5. The per-cpu list is used to satisfy the allocation
> >
> > This results in a kernel page being in the middle of a migratable region. This
> > patch prevents this leak occurring by storing the MIGRATE_ type of the page in
> > page->private. On allocation, a page will only be returned of the desired type,
> > else more pages will be allocated. This may temporarily allow a per-cpu list
> > to go over the pcp->high limit but it'll be corrected on the next free. Care
> > is taken to preserve the hotness of pages recently freed.
> >
> > The additional code is not measurably slower for the workloads we've tested.
>
> It sure looks slower.
>
> > Signed-off-by: Mel Gorman <[email protected]>
> > Signed-off-by: Andrew Morton <[email protected]>
> > ---
> >
> > mm/page_alloc.c | 18 ++++++++++++++++--
> > 1 file changed, 16 insertions(+), 2 deletions(-)
> >
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/mm/page_alloc.c linux-2.6.23-rc5-005-choose-pages-from-the-per-cpu-list-based-on-migration-type/mm/page_alloc.c
> > --- linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/mm/page_alloc.c 2007-09-02 16:19:34.000000000 +0100
> > +++ linux-2.6.23-rc5-005-choose-pages-from-the-per-cpu-list-based-on-migration-type/mm/page_alloc.c 2007-09-02 16:20:09.000000000 +0100
> > @@ -757,7 +757,8 @@ static int rmqueue_bulk(struct zone *zon
> > struct page *page = __rmqueue(zone, order, migratetype);
> > if (unlikely(page == NULL))
> > break;
> > - list_add_tail(&page->lru, list);
> > + list_add(&page->lru, list);
> > + set_page_private(page, migratetype);
> > }
> > spin_unlock(&zone->lock);
> > return i;
> > @@ -884,6 +885,7 @@ static void fastcall free_hot_cold_page(
> > local_irq_save(flags);
> > __count_vm_event(PGFREE);
> > list_add(&page->lru, &pcp->list);
> > + set_page_private(page, get_pageblock_migratetype(page));
> > pcp->count++;
> > if (pcp->count >= pcp->high) {
> > free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
> > @@ -948,7 +950,19 @@ again:
> > if (unlikely(!pcp->count))
> > goto failed;
> > }
> > - page = list_entry(pcp->list.next, struct page, lru);
> > +
> > + /* Find a page of the appropriate migrate type */
> > + list_for_each_entry(page, &pcp->list, lru)
> > + if (page_private(page) == migratetype)
> > + break;
>
> We're doing a linear search through the per-cpu magaznines right there
> in the page allocator hot path. Even if the search matches the first
> element, the setup costs will matter.
>
> Surely we can make this search go away with a better choice of data
> structures?
>
I have a patch that expands the per-cpu structure and eliminates the search,
and I made various attempts at reducing the setup cost (e.g. checking if
the first element suited before starting the search). However, I wasn't
able to show definitively that it made anything faster, but it did increase
the size of the per-cpu structure.
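For reference, one shape such a patch could take (a sketch under assumptions, not the actual patch being referred to) is one per-cpu list per migrate type, so the hot path indexes a list instead of searching one:
/* Sketch: per-migratetype pcp lists; all names and sizes are assumptions */
struct list_head {
	struct list_head *next, *prev;
};

#define MIGRATE_PCPTYPES 3 /* unmovable, reclaimable, movable */

struct per_cpu_pages_sketch {
	int count;	/* total pages across all lists */
	int high;	/* high watermark, drain above this */
	int batch;	/* chunk size for buddy refills */
	struct list_head lists[MIGRATE_PCPTYPES];
};

/*
 * The allocation side then becomes an O(1) index instead of a scan:
 *	list = &pcp->lists[migratetype];
 *	if (list_empty(list))
 *		refill the list from the buddy allocator;
 *	take the first page on the list;
 */
The trade-off is exactly the one described: the search disappears, at the cost of a larger per-cpu structure.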
>
> > + /* Allocate more to the pcp list if necessary */
> > + if (unlikely(&page->lru == &pcp->list)) {
> > + pcp->count += rmqueue_bulk(zone, 0,
> > + pcp->batch, &pcp->list, migratetype);
> > + page = list_entry(pcp->list.next, struct page, lru);
> > + }
> > +
> > list_del(&page->lru);
> > pcp->count--;
> > } else {
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab