Here comes a batch of fifteen against 2.6.14-mm1. The first eight tie
up various loose ends of the page fault scalability patches (but I've
not yet got to Documentation/vm/locking). The last seven are about page
count overflow, an issue made more immediate by the refcounting of
the ZERO_PAGE. So I'd like to see these as 2.6.15 material: mostly
the patches do apply to 2.6.14-git12, but there are two reiser4 patches I
separated out, an assumption that Andi's x86_64-page-flags-cleanup.patch
will go in ahead, and an obvious reject in vmscan.c. You may see them
rather differently! I think there could be two views on the overflow
patches: you might find them neat, you might find them too oblique; but
they do avoid extra tests in the hotter paths. In some cases I've put
an additional comment below the --- cutoff.
Hugh
arch/frv/mm/pgalloc.c | 4
arch/i386/mm/pgtable.c | 8
arch/powerpc/mm/4xx_mmu.c | 4
arch/powerpc/mm/hugetlbpage.c | 4
arch/powerpc/mm/mem.c | 2
arch/powerpc/mm/tlb_32.c | 6
arch/powerpc/mm/tlb_64.c | 4
arch/ppc/mm/pgtable.c | 13 -
arch/ppc64/kernel/vdso.c | 6
arch/sh64/lib/dbg.c | 2
drivers/char/drm/drm_vm.c | 2
fs/afs/file.c | 4
fs/buffer.c | 2
fs/jfs/jfs_metapage.c | 12 -
fs/proc/task_mmu.c | 2
fs/reiser4/flush_queue.c | 2
fs/reiser4/jnode.c | 10 -
fs/reiser4/page_cache.c | 4
fs/reiser4/page_cache.h | 2
fs/reiser4/plugin/file/tail_conversion.c | 2
fs/reiser4/txnmgr.c | 6
fs/xfs/linux-2.6/xfs_buf.c | 7
include/asm-alpha/atomic.h | 7
include/asm-ppc/pgtable.h | 10 -
include/asm-s390/atomic.h | 10 -
include/asm-sparc64/atomic.h | 1
include/asm-x86_64/atomic.h | 49 ++++-
include/linux/buffer_head.h | 6
include/linux/mm.h | 262 +++++++++++++++++++++++--------
include/linux/page-flags.h | 1
include/linux/rmap.h | 12 -
kernel/futex.c | 15 -
kernel/kexec.c | 4
mm/Kconfig | 6
mm/filemap.c | 2
mm/fremap.c | 3
mm/memory.c | 50 +++--
mm/page_alloc.c | 113 ++++++++++++-
mm/page_io.c | 6
mm/rmap.c | 29 +--
mm/shmem.c | 22 +-
mm/swap.c | 2
mm/swap_state.c | 8
mm/swapfile.c | 14 -
mm/vmscan.c | 2
45 files changed, 485 insertions(+), 257 deletions(-)
The split ptlock patch enlarged the default SMP PREEMPT struct page from
32 to 36 bytes on most 32-bit platforms, from 32 to 44 bytes on PA-RISC
7xxx (without PREEMPT). That was not my intention, and I don't believe
that split ptlock deserves any such slice of the user's memory.
Could we please try this patch, or something like it? It again overlays
the spinlock_t from &page->private onwards, with a corrected BUILD_BUG_ON
to ensure we don't go too far; with poisoning of the overlaid fields, and
the unsplit SMP config verifying that the split config is safe to use them.
The previous attempt at this patch broke ppc64, which uses slab for its
page tables - and slab saves vital info in page->lru: but no config of
spinlock_t needs to overwrite lru on 64-bit anyway; and the only 32-bit
user of slab for page tables is arm26, which is never SMP. In other
words, all the problems came from the "safety" checks, not from what's
actually needed. So the previous checks are refined with #ifdefs, and a
BUG_ON(PageSlab) is added.
This overlaying is unlikely to be portable forever: but the added checks
should warn developers when it's going to break, long before any users.
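For illustration only - not part of the patch, which expresses the same
check with pointer arithmetic inside pte_lock_init below - the constraint
being enforced amounts to something like:

	/*
	 * Sketch of the limit checked in pte_lock_init() below: the
	 * spinlock placed at page->private must stop short of page->lru
	 * on 64-bit (where slab may be using lru for page table pages),
	 * and may reach the end of lru, but no further, on 32-bit.
	 */
	BUILD_BUG_ON(offsetof(struct page, private) + sizeof(spinlock_t) >
		     offsetof(struct page, lru) +
		     (BITS_PER_LONG == 32) * sizeof(struct list_head));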
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/mm.h | 78 +++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 64 insertions(+), 14 deletions(-)
--- 2.6.14-mm1/include/linux/mm.h 2005-11-07 07:39:57.000000000 +0000
+++ mm01/include/linux/mm.h 2005-11-09 14:37:47.000000000 +0000
@@ -224,18 +224,19 @@ struct page {
* to show when page is mapped
* & limit reverse map searches.
*/
- union {
- unsigned long private; /* Mapping-private opaque data:
+ unsigned long private; /* Mapping-private opaque data:
* usually used for buffer_heads
* if PagePrivate set; used for
* swp_entry_t if PageSwapCache
* When page is free, this indicates
* order in the buddy system.
*/
-#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
- spinlock_t ptl;
-#endif
- } u;
+ /*
+ * Depending on config, along with private, subsequent fields of
+ * a page table page's struct page may be overlaid by a spinlock
+ * for pte locking: see comment on "split ptlock" below. Please
+ * do not rearrange these fields without considering that usage.
+ */
struct address_space *mapping; /* If low bit clear, points to
* inode address_space, or NULL.
* If page mapped as anonymous
@@ -268,8 +269,8 @@ struct page {
#endif
};
-#define page_private(page) ((page)->u.private)
-#define set_page_private(page, v) ((page)->u.private = (v))
+#define page_private(page) ((page)->private)
+#define set_page_private(page, v) ((page)->private = (v))
/*
* FIXME: take this include out, include page-flags.h in
@@ -827,25 +828,74 @@ static inline pmd_t *pmd_alloc(struct mm
}
#endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
+/*
+ * In the split ptlock case, we shall be overlaying the struct page
+ * of a page table page with a spinlock starting at &page->private:
+ * ending dependent on architecture and config, but never beyond lru.
+ *
+ * So poison page table struct page in all SMP cases (in part to assert
+ * our territory: that pte locking owns these fields of a page table's
+ * struct page); and verify it when freeing in the unsplit ptlock case,
+ * when none of these fields should have been touched. So that now the
+ * common config checks also for the larger PREEMPT DEBUG_SPINLOCK lock.
+ *
+ * Poison lru only in the 32-bit configs, since no 64-bit spinlock_t
+ * extends that far - and ppc64 allocates from slab, which saves info in
+ * these lru fields (arm26 also allocates from slab, but is never SMP).
+ * Poison lru back-to-front, to make sure that list_del was not used.
+ */
+static inline void poison_struct_page(struct page *page)
+{
+#ifdef CONFIG_SMP
+ page->private = (unsigned long) page;
+ page->mapping = (struct address_space *) page;
+ page->index = (pgoff_t) page;
+#if BITS_PER_LONG == 32
+ BUG_ON(PageSlab(page));
+ page->lru.next = LIST_POISON2;
+ page->lru.prev = LIST_POISON1;
+#endif
+#endif
+}
+
+static inline void verify_struct_page(struct page *page)
+{
+#ifdef CONFIG_SMP
+ BUG_ON(page->private != (unsigned long) page);
+ BUG_ON(page->mapping != (struct address_space *) page);
+ BUG_ON(page->index != (pgoff_t) page);
+ page->mapping = NULL;
+#if BITS_PER_LONG == 32
+ BUG_ON(page->lru.next != LIST_POISON2);
+ BUG_ON(page->lru.prev != LIST_POISON1);
+#endif
+#endif
+}
+
#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
/*
* We tuck a spinlock to guard each pagetable page into its struct page,
* at page->private, with BUILD_BUG_ON to make sure that this will not
- * overflow into the next struct page (as it might with DEBUG_SPINLOCK).
- * When freeing, reset page->mapping so free_pages_check won't complain.
+ * extend further than expected.
*/
-#define __pte_lockptr(page) &((page)->u.ptl)
+#define __pte_lockptr(page) ((spinlock_t *)&((page)->private))
#define pte_lock_init(_page) do { \
+ BUILD_BUG_ON(__pte_lockptr((struct page *)0) + 1 > (spinlock_t*)\
+ (&((struct page *)0)->lru + (BITS_PER_LONG == 32))); \
+ poison_struct_page(_page); \
spin_lock_init(__pte_lockptr(_page)); \
} while (0)
+/*
+ * When freeing, reset page->mapping so free_pages_check won't complain.
+ */
#define pte_lock_deinit(page) ((page)->mapping = NULL)
#define pte_lockptr(mm, pmd) ({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
-#else
+#else /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
/*
* We use mm->page_table_lock to guard all pagetable pages of the mm.
*/
-#define pte_lock_init(page) do {} while (0)
-#define pte_lock_deinit(page) do {} while (0)
+#define pte_lock_init(page) poison_struct_page(page)
+#define pte_lock_deinit(page) verify_struct_page(page)
#define pte_lockptr(mm, pmd) ({(void)(pmd); &(mm)->page_table_lock;})
#endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
The page_private macro serves no purpose while spinlock_t is overlaid
in struct page: a purpose may be found for it later on, but until then
revert the abstraction, restoring various files to their previous state.
Signed-off-by: Hugh Dickins <[email protected]>
---
Note: the mm/vmscan.c mod gives an easily resolved reject on 2.6.14-git.
arch/frv/mm/pgalloc.c | 4 ++--
arch/i386/mm/pgtable.c | 8 ++++----
fs/afs/file.c | 4 ++--
fs/buffer.c | 2 +-
fs/jfs/jfs_metapage.c | 12 ++++++------
fs/xfs/linux-2.6/xfs_buf.c | 7 +++----
include/linux/buffer_head.h | 6 +++---
include/linux/mm.h | 9 +++------
kernel/kexec.c | 4 ++--
mm/filemap.c | 2 +-
mm/page_alloc.c | 16 ++++++++--------
mm/page_io.c | 6 ++----
mm/rmap.c | 2 +-
mm/shmem.c | 22 ++++++++++++----------
mm/swap.c | 2 +-
mm/swap_state.c | 8 ++++----
mm/swapfile.c | 12 ++++++------
mm/vmscan.c | 2 +-
18 files changed, 62 insertions(+), 66 deletions(-)
--- mm01/arch/frv/mm/pgalloc.c 2005-11-07 07:39:00.000000000 +0000
+++ mm02/arch/frv/mm/pgalloc.c 2005-11-09 14:38:02.000000000 +0000
@@ -87,14 +87,14 @@ static inline void pgd_list_add(pgd_t *p
if (pgd_list)
pgd_list->private = (unsigned long) &page->index;
pgd_list = page;
- set_page_private(page, (unsigned long)&pgd_list);
+ page->private = (unsigned long) &pgd_list;
}
static inline void pgd_list_del(pgd_t *pgd)
{
struct page *next, **pprev, *page = virt_to_page(pgd);
next = (struct page *) page->index;
- pprev = (struct page **)page_private(page);
+ pprev = (struct page **) page->private;
*pprev = next;
if (next)
next->private = (unsigned long) pprev;
--- mm01/arch/i386/mm/pgtable.c 2005-11-07 07:39:01.000000000 +0000
+++ mm02/arch/i386/mm/pgtable.c 2005-11-09 14:38:03.000000000 +0000
@@ -191,19 +191,19 @@ static inline void pgd_list_add(pgd_t *p
struct page *page = virt_to_page(pgd);
page->index = (unsigned long)pgd_list;
if (pgd_list)
- set_page_private(pgd_list, (unsigned long)&page->index);
+ pgd_list->private = (unsigned long)&page->index;
pgd_list = page;
- set_page_private(page, (unsigned long)&pgd_list);
+ page->private = (unsigned long)&pgd_list;
}
static inline void pgd_list_del(pgd_t *pgd)
{
struct page *next, **pprev, *page = virt_to_page(pgd);
next = (struct page *)page->index;
- pprev = (struct page **)page_private(page);
+ pprev = (struct page **)page->private;
*pprev = next;
if (next)
- set_page_private(next, (unsigned long)pprev);
+ next->private = (unsigned long)pprev;
}
void pgd_ctor(void *pgd, kmem_cache_t *cache, unsigned long unused)
--- mm01/fs/afs/file.c 2005-11-07 07:39:42.000000000 +0000
+++ mm02/fs/afs/file.c 2005-11-09 14:38:03.000000000 +0000
@@ -261,8 +261,8 @@ static int afs_file_releasepage(struct p
cachefs_uncache_page(vnode->cache, page);
#endif
- pageio = (struct cachefs_page *) page_private(page);
- set_page_private(page, 0);
+ pageio = (struct cachefs_page *) page->private;
+ page->private = 0;
ClearPagePrivate(page);
kfree(pageio);
--- mm01/fs/buffer.c 2005-11-07 07:39:43.000000000 +0000
+++ mm02/fs/buffer.c 2005-11-09 14:38:03.000000000 +0000
@@ -96,7 +96,7 @@ static void
__clear_page_buffers(struct page *page)
{
ClearPagePrivate(page);
- set_page_private(page, 0);
+ page->private = 0;
page_cache_release(page);
}
--- mm01/fs/jfs/jfs_metapage.c 2005-11-07 07:39:44.000000000 +0000
+++ mm02/fs/jfs/jfs_metapage.c 2005-11-09 14:38:03.000000000 +0000
@@ -86,7 +86,7 @@ struct meta_anchor {
atomic_t io_count;
struct metapage *mp[MPS_PER_PAGE];
};
-#define mp_anchor(page) ((struct meta_anchor *)page_private(page))
+#define mp_anchor(page) ((struct meta_anchor *)page->private)
static inline struct metapage *page_to_mp(struct page *page, uint offset)
{
@@ -108,7 +108,7 @@ static inline int insert_metapage(struct
if (!a)
return -ENOMEM;
memset(a, 0, sizeof(struct meta_anchor));
- set_page_private(page, (unsigned long)a);
+ page->private = (unsigned long)a;
SetPagePrivate(page);
kmap(page);
}
@@ -136,7 +136,7 @@ static inline void remove_metapage(struc
a->mp[index] = NULL;
if (--a->mp_count == 0) {
kfree(a);
- set_page_private(page, 0);
+ page->private = 0;
ClearPagePrivate(page);
kunmap(page);
}
@@ -156,13 +156,13 @@ static inline void dec_io(struct page *p
#else
static inline struct metapage *page_to_mp(struct page *page, uint offset)
{
- return PagePrivate(page) ? (struct metapage *)page_private(page) : NULL;
+ return PagePrivate(page) ? (struct metapage *)page->private : NULL;
}
static inline int insert_metapage(struct page *page, struct metapage *mp)
{
if (mp) {
- set_page_private(page, (unsigned long)mp);
+ page->private = (unsigned long)mp;
SetPagePrivate(page);
kmap(page);
}
@@ -171,7 +171,7 @@ static inline int insert_metapage(struct
static inline void remove_metapage(struct page *page, struct metapage *mp)
{
- set_page_private(page, 0);
+ page->private = 0;
ClearPagePrivate(page);
kunmap(page);
}
--- mm01/fs/xfs/linux-2.6/xfs_buf.c 2005-11-07 07:39:47.000000000 +0000
+++ mm02/fs/xfs/linux-2.6/xfs_buf.c 2005-11-09 14:38:03.000000000 +0000
@@ -141,9 +141,8 @@ set_page_region(
size_t offset,
size_t length)
{
- set_page_private(page,
- page_private(page) | page_region_mask(offset, length));
- if (page_private(page) == ~0UL)
+ page->private |= page_region_mask(offset, length);
+ if (page->private == ~0UL)
SetPageUptodate(page);
}
@@ -155,7 +154,7 @@ test_page_region(
{
unsigned long mask = page_region_mask(offset, length);
- return (mask && (page_private(page) & mask) == mask);
+ return (mask && (page->private & mask) == mask);
}
/*
--- mm01/include/linux/buffer_head.h 2005-11-07 07:39:56.000000000 +0000
+++ mm02/include/linux/buffer_head.h 2005-11-09 14:38:03.000000000 +0000
@@ -126,8 +126,8 @@ BUFFER_FNS(Eopnotsupp, eopnotsupp)
/* If we *know* page->private refers to buffer_heads */
#define page_buffers(page) \
({ \
- BUG_ON(!PagePrivate(page)); \
- ((struct buffer_head *)page_private(page)); \
+ BUG_ON(!PagePrivate(page)); \
+ ((struct buffer_head *)(page)->private); \
})
#define page_has_buffers(page) PagePrivate(page)
@@ -220,7 +220,7 @@ static inline void attach_page_buffers(s
{
page_cache_get(page);
SetPagePrivate(page);
- set_page_private(page, (unsigned long)head);
+ page->private = (unsigned long)head;
}
static inline void get_bh(struct buffer_head *bh)
--- mm01/include/linux/mm.h 2005-11-09 14:37:47.000000000 +0000
+++ mm02/include/linux/mm.h 2005-11-09 14:38:03.000000000 +0000
@@ -269,9 +269,6 @@ struct page {
#endif
};
-#define page_private(page) ((page)->private)
-#define set_page_private(page, v) ((page)->private = (v))
-
/*
* FIXME: take this include out, include page-flags.h in
* files which need it (119 of them)
@@ -326,14 +323,14 @@ extern void FASTCALL(__page_cache_releas
static inline int page_count(struct page *page)
{
if (PageCompound(page))
- page = (struct page *)page_private(page);
+ page = (struct page *)page->private;
return atomic_read(&page->_count) + 1;
}
static inline void get_page(struct page *page)
{
if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
+ page = (struct page *)page->private;
atomic_inc(&page->_count);
}
@@ -599,7 +596,7 @@ static inline int PageAnon(struct page *
static inline pgoff_t page_index(struct page *page)
{
if (unlikely(PageSwapCache(page)))
- return page_private(page);
+ return page->private;
return page->index;
}
--- mm01/kernel/kexec.c 2005-11-07 07:39:59.000000000 +0000
+++ mm02/kernel/kexec.c 2005-11-09 14:38:03.000000000 +0000
@@ -334,7 +334,7 @@ static struct page *kimage_alloc_pages(g
if (pages) {
unsigned int count, i;
pages->mapping = NULL;
- set_page_private(pages, order);
+ pages->private = order;
count = 1 << order;
for (i = 0; i < count; i++)
SetPageReserved(pages + i);
@@ -347,7 +347,7 @@ static void kimage_free_pages(struct pag
{
unsigned int order, count, i;
- order = page_private(page);
+ order = page->private;
count = 1 << order;
for (i = 0; i < count; i++)
ClearPageReserved(page + i);
--- mm01/mm/filemap.c 2005-11-07 07:39:59.000000000 +0000
+++ mm02/mm/filemap.c 2005-11-09 14:38:03.000000000 +0000
@@ -154,7 +154,7 @@ static int sync_page(void *word)
* in the ->sync_page() methods make essential use of the
* page_mapping(), merely passing the page down to the backing
* device's unplug functions when it's non-NULL, which in turn
- * ignore it for all cases but swap, where only page_private(page) is
+ * ignore it for all cases but swap, where only page->private is
* of interest. When page_mapping() does go NULL, the entire
* call stack gracefully ignores the page and returns.
* -- wli
--- mm01/mm/page_alloc.c 2005-11-07 07:39:59.000000000 +0000
+++ mm02/mm/page_alloc.c 2005-11-09 14:38:03.000000000 +0000
@@ -180,7 +180,7 @@ static void prep_compound_page(struct pa
struct page *p = page + i;
SetPageCompound(p);
- set_page_private(p, (unsigned long)page);
+ p->private = (unsigned long)page;
}
}
@@ -200,7 +200,7 @@ static void destroy_compound_page(struct
if (!PageCompound(p))
bad_page(__FUNCTION__, page);
- if (page_private(p) != (unsigned long)page)
+ if (p->private != (unsigned long)page)
bad_page(__FUNCTION__, page);
ClearPageCompound(p);
}
@@ -213,18 +213,18 @@ static void destroy_compound_page(struct
* So, we don't need atomic page->flags operations here.
*/
static inline unsigned long page_order(struct page *page) {
- return page_private(page);
+ return page->private;
}
static inline void set_page_order(struct page *page, int order) {
- set_page_private(page, order);
+ page->private = order;
__SetPagePrivate(page);
}
static inline void rmv_page_order(struct page *page)
{
__ClearPagePrivate(page);
- set_page_private(page, 0);
+ page->private = 0;
}
/*
@@ -264,7 +264,7 @@ __find_combined_index(unsigned long page
* (a) the buddy is free &&
* (b) the buddy is on the buddy system &&
* (c) a page and its buddy have the same order.
- * for recording page's order, we use page_private(page) and PG_private.
+ * for recording page's order, we use page->private and PG_private.
*
*/
static inline int page_is_buddy(struct page *page, int order)
@@ -290,7 +290,7 @@ static inline int page_is_buddy(struct p
* parts of the VM system.
* At each level, we keep a list of pages, which are heads of continuous
* free pages of length of (1 << order) and marked with PG_Private.Page's
- * order is recorded in page_private(page) field.
+ * order is recorded in page->private field.
* So when we are allocating or freeing one, we can derive the state of the
* other. That is, if we allocate a small block, and both were
* free, the remainder of the region must be split into blocks.
@@ -489,7 +489,7 @@ static void prep_new_page(struct page *p
page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_checked | 1 << PG_mappedtodisk);
- set_page_private(page, 0);
+ page->private = 0;
set_page_refs(page, order);
kernel_map_pages(page, 1 << order, 1);
}
--- mm01/mm/page_io.c 2005-11-07 07:39:59.000000000 +0000
+++ mm02/mm/page_io.c 2005-11-09 14:38:03.000000000 +0000
@@ -91,8 +91,7 @@ int swap_writepage(struct page *page, st
unlock_page(page);
goto out;
}
- bio = get_swap_bio(GFP_NOIO, page_private(page), page,
- end_swap_bio_write);
+ bio = get_swap_bio(GFP_NOIO, page->private, page, end_swap_bio_write);
if (bio == NULL) {
set_page_dirty(page);
unlock_page(page);
@@ -116,8 +115,7 @@ int swap_readpage(struct file *file, str
BUG_ON(!PageLocked(page));
ClearPageUptodate(page);
- bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
- end_swap_bio_read);
+ bio = get_swap_bio(GFP_KERNEL, page->private, page, end_swap_bio_read);
if (bio == NULL) {
unlock_page(page);
ret = -ENOMEM;
--- mm01/mm/rmap.c 2005-11-07 07:40:00.000000000 +0000
+++ mm02/mm/rmap.c 2005-11-09 14:38:03.000000000 +0000
@@ -550,7 +550,7 @@ static int try_to_unmap_one(struct page
update_hiwater_rss(mm);
if (PageAnon(page)) {
- swp_entry_t entry = { .val = page_private(page) };
+ swp_entry_t entry = { .val = page->private };
/*
* Store the swap location in the pte.
* See handle_pte_fault() ...
--- mm01/mm/shmem.c 2005-11-07 07:40:00.000000000 +0000
+++ mm02/mm/shmem.c 2005-11-09 14:38:03.000000000 +0000
@@ -71,6 +71,9 @@
/* Pretend that each entry is of this size in directory's i_size */
#define BOGO_DIRENT_SIZE 20
+/* Keep swapped page count in private field of indirect struct page */
+#define nr_swapped private
+
/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
enum sgp_type {
SGP_QUICK, /* don't try more than file page cache lookup */
@@ -321,10 +324,8 @@ static void shmem_swp_set(struct shmem_i
entry->val = value;
info->swapped += incdec;
- if ((unsigned long)(entry - info->i_direct) >= SHMEM_NR_DIRECT) {
- struct page *page = kmap_atomic_to_page(entry);
- set_page_private(page, page_private(page) + incdec);
- }
+ if ((unsigned long)(entry - info->i_direct) >= SHMEM_NR_DIRECT)
+ kmap_atomic_to_page(entry)->nr_swapped += incdec;
}
/*
@@ -367,8 +368,9 @@ static swp_entry_t *shmem_swp_alloc(stru
spin_unlock(&info->lock);
page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
- if (page)
- set_page_private(page, 0);
+ if (page) {
+ page->nr_swapped = 0;
+ }
spin_lock(&info->lock);
if (!page) {
@@ -559,7 +561,7 @@ static void shmem_truncate(struct inode
diroff = 0;
}
subdir = dir[diroff];
- if (subdir && page_private(subdir)) {
+ if (subdir && subdir->nr_swapped) {
size = limit - idx;
if (size > ENTRIES_PER_PAGE)
size = ENTRIES_PER_PAGE;
@@ -570,10 +572,10 @@ static void shmem_truncate(struct inode
nr_swaps_freed += freed;
if (offset)
spin_lock(&info->lock);
- set_page_private(subdir, page_private(subdir) - freed);
+ subdir->nr_swapped -= freed;
if (offset)
spin_unlock(&info->lock);
- BUG_ON(page_private(subdir) > offset);
+ BUG_ON(subdir->nr_swapped > offset);
}
if (offset)
offset = 0;
@@ -741,7 +743,7 @@ static int shmem_unuse_inode(struct shme
dir = shmem_dir_map(subdir);
}
subdir = *dir;
- if (subdir && page_private(subdir)) {
+ if (subdir && subdir->nr_swapped) {
ptr = shmem_swp_map(subdir);
size = limit - idx;
if (size > ENTRIES_PER_PAGE)
--- mm01/mm/swap.c 2005-11-07 07:40:00.000000000 +0000
+++ mm02/mm/swap.c 2005-11-09 14:38:03.000000000 +0000
@@ -39,7 +39,7 @@ int page_cluster;
void put_page(struct page *page)
{
if (unlikely(PageCompound(page))) {
- page = (struct page *)page_private(page);
+ page = (struct page *)page->private;
if (put_page_testzero(page)) {
void (*dtor)(struct page *page);
--- mm01/mm/swap_state.c 2005-11-07 07:40:00.000000000 +0000
+++ mm02/mm/swap_state.c 2005-11-09 14:38:03.000000000 +0000
@@ -83,7 +83,7 @@ static int __add_to_swap_cache(struct pa
page_cache_get(page);
SetPageLocked(page);
SetPageSwapCache(page);
- set_page_private(page, entry.val);
+ page->private = entry.val;
total_swapcache_pages++;
pagecache_acct(1);
}
@@ -127,8 +127,8 @@ void __delete_from_swap_cache(struct pag
BUG_ON(PageWriteback(page));
BUG_ON(PagePrivate(page));
- radix_tree_delete(&swapper_space.page_tree, page_private(page));
- set_page_private(page, 0);
+ radix_tree_delete(&swapper_space.page_tree, page->private);
+ page->private = 0;
ClearPageSwapCache(page);
total_swapcache_pages--;
pagecache_acct(-1);
@@ -201,7 +201,7 @@ void delete_from_swap_cache(struct page
{
swp_entry_t entry;
- entry.val = page_private(page);
+ entry.val = page->private;
write_lock_irq(&swapper_space.tree_lock);
__delete_from_swap_cache(page);
--- mm01/mm/swapfile.c 2005-11-07 07:40:00.000000000 +0000
+++ mm02/mm/swapfile.c 2005-11-09 14:38:03.000000000 +0000
@@ -59,7 +59,7 @@ void swap_unplug_io_fn(struct backing_de
swp_entry_t entry;
down_read(&swap_unplug_sem);
- entry.val = page_private(page);
+ entry.val = page->private;
if (PageSwapCache(page)) {
struct block_device *bdev = swap_info[swp_type(entry)].bdev;
struct backing_dev_info *bdi;
@@ -67,8 +67,8 @@ void swap_unplug_io_fn(struct backing_de
/*
* If the page is removed from swapcache from under us (with a
* racy try_to_unuse/swapoff) we need an additional reference
- * count to avoid reading garbage from page_private(page) above.
- * If the WARN_ON triggers during a swapoff it maybe the race
+ * count to avoid reading garbage from page->private above. If
+ * the WARN_ON triggers during a swapoff it maybe the race
* condition and it's harmless. However if it triggers without
* swapoff it signals a problem.
*/
@@ -292,7 +292,7 @@ static inline int page_swapcount(struct
struct swap_info_struct *p;
swp_entry_t entry;
- entry.val = page_private(page);
+ entry.val = page->private;
p = swap_info_get(entry);
if (p) {
/* Subtract the 1 for the swap cache itself */
@@ -337,7 +337,7 @@ int remove_exclusive_swap_page(struct pa
if (page_count(page) != 2) /* 2: us + cache */
return 0;
- entry.val = page_private(page);
+ entry.val = page->private;
p = swap_info_get(entry);
if (!p)
return 0;
@@ -1040,7 +1040,7 @@ int page_queue_congested(struct page *pa
BUG_ON(!PageLocked(page)); /* It pins the swap_info_struct */
if (PageSwapCache(page)) {
- swp_entry_t entry = { .val = page_private(page) };
+ swp_entry_t entry = { .val = page->private };
struct swap_info_struct *sis;
sis = get_swap_info_struct(swp_type(entry));
--- mm01/mm/vmscan.c 2005-11-07 07:40:01.000000000 +0000
+++ mm02/mm/vmscan.c 2005-11-09 14:38:03.000000000 +0000
@@ -387,7 +387,7 @@ static inline int remove_mapping(struct
goto cannot_free;
if (PageSwapCache(page)) {
- swp_entry_t swap = { .val = page_private(page) };
+ swp_entry_t swap = { .val = page->private };
add_to_swapped_list(swap.val);
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
Revert page_private: remove -mm tree's reiser4-page-private-fixes.patch
Signed-off-by: Hugh Dickins <[email protected]>
---
fs/reiser4/flush_queue.c | 2 +-
fs/reiser4/jnode.c | 10 +++++-----
fs/reiser4/page_cache.c | 2 +-
fs/reiser4/page_cache.h | 2 +-
fs/reiser4/plugin/file/tail_conversion.c | 2 +-
fs/reiser4/txnmgr.c | 6 +++---
6 files changed, 12 insertions(+), 12 deletions(-)
--- mm02/fs/reiser4/flush_queue.c 2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/flush_queue.c 2005-11-09 14:38:18.000000000 +0000
@@ -427,7 +427,7 @@ end_io_handler(struct bio *bio, unsigned
assert("zam-736", pg != NULL);
assert("zam-736", PagePrivate(pg));
- node = jprivate(pg);
+ node = (jnode *) (pg->private);
JF_CLR(node, JNODE_WRITEBACK);
}
--- mm02/fs/reiser4/jnode.c 2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/jnode.c 2005-11-09 14:38:18.000000000 +0000
@@ -38,12 +38,12 @@
* page_address().
*
* jnode and page are attached to each other by jnode_attach_page(). This
- * function places pointer to jnode in page_private(), sets PG_private flag
+ * function places pointer to jnode in page->private, sets PG_private flag
* and increments page counter.
*
* Opposite operation is performed by page_clear_jnode().
*
- * jnode->pg is protected by jnode spin lock, and page_private() is
+ * jnode->pg is protected by jnode spin lock, and page->private is
* protected by page lock. See comment at the top of page_cache.c for
* more.
*
@@ -664,7 +664,7 @@ void jnode_attach_page(jnode * node, str
assert("nikita-2060", node != NULL);
assert("nikita-2061", pg != NULL);
- assert("nikita-2050", page_private(pg) == 0ul);
+ assert("nikita-2050", pg->private == 0ul);
assert("nikita-2393", !PagePrivate(pg));
assert("vs-1741", node->pg == NULL);
@@ -672,7 +672,7 @@ void jnode_attach_page(jnode * node, str
assert("nikita-2397", spin_jnode_is_locked(node));
page_cache_get(pg);
- set_page_private(pg, (unsigned long)node);
+ pg->private = (unsigned long)node;
node->pg = pg;
SetPagePrivate(pg);
}
@@ -689,7 +689,7 @@ void page_clear_jnode(struct page *page,
assert("nikita-3551", !PageWriteback(page));
JF_CLR(node, JNODE_PARSED);
- set_page_private(page, 0ul);
+ page->private = 0ul;
ClearPagePrivate(page);
node->pg = NULL;
page_cache_release(page);
--- mm02/fs/reiser4/page_cache.c 2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/page_cache.c 2005-11-09 14:38:18.000000000 +0000
@@ -753,7 +753,7 @@ void print_page(const char *prefix, stru
}
printk("%s: page index: %lu mapping: %p count: %i private: %lx\n",
prefix, page->index, page->mapping, page_count(page),
- page_private(page));
+ page->private);
printk("\tflags: %s%s%s%s %s%s%s %s%s%s %s%s\n",
page_flag_name(page, PG_locked), page_flag_name(page, PG_error),
page_flag_name(page, PG_referenced), page_flag_name(page,
--- mm02/fs/reiser4/page_cache.h 2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/page_cache.h 2005-11-09 14:38:18.000000000 +0000
@@ -30,7 +30,7 @@ static inline void lock_and_wait_page_wr
reiser4_wait_page_writeback(page);
}
-#define jprivate(page) ((jnode *) page_private(page))
+#define jprivate(page) ((jnode *) (page)->private)
extern int page_io(struct page *page, jnode * node, int rw, int gfp);
extern void drop_page(struct page *page);
--- mm02/fs/reiser4/plugin/file/tail_conversion.c 2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/plugin/file/tail_conversion.c 2005-11-09 14:38:18.000000000 +0000
@@ -670,7 +670,7 @@ int extent2tail(unix_file_info_t * uf_in
/* page is already detached from jnode and mapping. */
assert("vs-1086", page->mapping == NULL);
assert("nikita-2690",
- (!PagePrivate(page) && page_private(page) == 0));
+ (!PagePrivate(page) && page->private == 0));
/* waiting for writeback completion with page lock held is
* perfectly valid. */
wait_on_page_writeback(page);
--- mm02/fs/reiser4/txnmgr.c 2005-11-07 07:39:46.000000000 +0000
+++ mm03/fs/reiser4/txnmgr.c 2005-11-09 14:38:18.000000000 +0000
@@ -2331,7 +2331,7 @@ void uncapture_page(struct page *pg)
reiser4_wait_page_writeback(pg);
- node = jprivate(pg);
+ node = (jnode *) (pg->private);
BUG_ON(node == NULL);
LOCK_JNODE(node);
@@ -3603,14 +3603,14 @@ static void swap_jnode_pages(jnode * nod
copy->pg = node->pg;
copy->data = page_address(copy->pg);
jnode_set_block(copy, jnode_get_block(node));
- set_page_private(copy->pg, (unsigned long)copy);
+ copy->pg->private = (unsigned long)copy;
/* attach new page to jnode */
assert("vs-1412", !PagePrivate(new_page));
page_cache_get(new_page);
node->pg = new_page;
node->data = page_address(new_page);
- set_page_private(new_page, (unsigned long)node);
+ new_page->private = (unsigned long)node;
SetPagePrivate(new_page);
{
Closer attention to the arithmetic shows that neither ppc64 nor sparc
really uses one page for multiple page tables: how on earth could they,
while pte_alloc_one returns just a struct page pointer, with no offset?
Gasp, splutter... arm26 manages it by returning a pte_t pointer cast to
a struct page pointer, then compensating in its pmd_populate. That's
almost as evil as overlaying a struct page with a spinlock_t. But arm26
is never SMP, and we now only poison when SMP, so it's not a problem.
And the PA-RISC situation has been recently improved: CONFIG_PA20
works without the 16-byte alignment which inflated its spinlock_t.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/Kconfig | 6 ++----
1 files changed, 2 insertions(+), 4 deletions(-)
--- mm03/mm/Kconfig 2005-11-07 07:39:59.000000000 +0000
+++ mm04/mm/Kconfig 2005-11-09 14:38:32.000000000 +0000
@@ -125,14 +125,12 @@ comment "Memory hotplug is currently inc
# space can be handled with less contention: split it at this NR_CPUS.
# Default to 4 for wider testing, though 8 might be more appropriate.
# ARM's adjust_pte (unused if VIPT) depends on mm-wide page_table_lock.
-# PA-RISC's debug spinlock_t is too large for the 32-bit struct page.
-# ARM26 and SPARC32 and PPC64 may use one page for multiple page tables.
+# PA-RISC 7xxx's debug spinlock_t is too large for 32-bit struct page.
#
config SPLIT_PTLOCK_CPUS
int
default "4096" if ARM && !CPU_CACHE_VIPT
- default "4096" if PARISC && DEBUG_SPINLOCK && !64BIT
- default "4096" if ARM26 || SPARC32 || PPC64
+ default "4096" if PARISC && DEBUG_SPINLOCK && !PA20
default "4"
#
The follow_page changes in get_futex_key have left it with two almost
identical blocks, when handling the rare case of a futex in a nonlinear
vma. get_user_pages will itself do that follow_page, and its additional
find_extend_vma is hardly any overhead since the vma is already cached.
Let's just delete the follow_page block and let get_user_pages do it.
Signed-off-by: Hugh Dickins <[email protected]>
---
kernel/futex.c | 15 ---------------
1 files changed, 15 deletions(-)
--- mm04/kernel/futex.c 2005-11-07 07:39:59.000000000 +0000
+++ mm05/kernel/futex.c 2005-11-09 14:38:47.000000000 +0000
@@ -201,21 +201,6 @@ static int get_futex_key(unsigned long u
* from swap. But that's a lot of code to duplicate here
* for a rare case, so we simply fetch the page.
*/
-
- /*
- * Do a quick atomic lookup first - this is the fastpath.
- */
- page = follow_page(mm, uaddr, FOLL_TOUCH|FOLL_GET);
- if (likely(page != NULL)) {
- key->shared.pgoff =
- page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
- put_page(page);
- return 0;
- }
-
- /*
- * Do it the general way.
- */
err = get_user_pages(current, mm, uaddr, 1, 0, 0, &page, NULL);
if (err >= 0) {
key->shared.pgoff =
ppc's HIGHPTE config option was removed in 2.5.28, and nobody seems to
have wanted it enough to restore it: so remove its traces from pgtable.h
and pte_alloc_one. Or supply an alternative patch to config it back?
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/ppc/mm/pgtable.c | 13 +------------
include/asm-ppc/pgtable.h | 10 ++++------
2 files changed, 5 insertions(+), 18 deletions(-)
--- mm05/arch/ppc/mm/pgtable.c 2005-11-07 07:39:08.000000000 +0000
+++ mm06/arch/ppc/mm/pgtable.c 2005-11-09 14:39:02.000000000 +0000
@@ -111,18 +111,7 @@ pte_t *pte_alloc_one_kernel(struct mm_st
struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
- struct page *ptepage;
-
-#ifdef CONFIG_HIGHPTE
- gfp_t flags = GFP_KERNEL | __GFP_HIGHMEM | __GFP_REPEAT;
-#else
- gfp_t flags = GFP_KERNEL | __GFP_REPEAT;
-#endif
-
- ptepage = alloc_pages(flags, 0);
- if (ptepage)
- clear_highpage(ptepage);
- return ptepage;
+ return alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
}
void pte_free_kernel(pte_t *pte)
--- mm05/include/asm-ppc/pgtable.h 2005-11-07 07:39:55.000000000 +0000
+++ mm06/include/asm-ppc/pgtable.h 2005-11-09 14:39:02.000000000 +0000
@@ -750,13 +750,11 @@ static inline pmd_t * pmd_offset(pgd_t *
(((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
#define pte_offset_kernel(dir, addr) \
((pte_t *) pmd_page_kernel(*(dir)) + pte_index(addr))
-#define pte_offset_map(dir, addr) \
- ((pte_t *) kmap_atomic(pmd_page(*(dir)), KM_PTE0) + pte_index(addr))
-#define pte_offset_map_nested(dir, addr) \
- ((pte_t *) kmap_atomic(pmd_page(*(dir)), KM_PTE1) + pte_index(addr))
-#define pte_unmap(pte) kunmap_atomic(pte, KM_PTE0)
-#define pte_unmap_nested(pte) kunmap_atomic(pte, KM_PTE1)
+#define pte_offset_map(dir,addr) pte_offset_kernel(dir,addr)
+#define pte_offset_map_nested(dir,addr) pte_offset_kernel(dir,addr)
+#define pte_unmap(pte) do { } while(0)
+#define pte_unmap_nested(pte) do { } while(0)
extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
Update comments (only) on page_table_lock and mmap_sem in arch/powerpc.
Remove the comment on page_table_lock from hash_huge_page: since it's
no longer taking page_table_lock itself, it's irrelevant whether others
are; but how it is safe (even against huge file truncation?) I can't say.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/mm/hugetlbpage.c | 4 +---
arch/powerpc/mm/mem.c | 2 +-
arch/powerpc/mm/tlb_32.c | 6 ++++++
arch/powerpc/mm/tlb_64.c | 4 ++--
4 files changed, 10 insertions(+), 6 deletions(-)
--- mm06/arch/powerpc/mm/hugetlbpage.c 2005-11-07 07:39:05.000000000 +0000
+++ mm07/arch/powerpc/mm/hugetlbpage.c 2005-11-09 14:39:16.000000000 +0000
@@ -754,9 +754,7 @@ repeat:
}
/*
- * No need to use ldarx/stdcx here because all who
- * might be updating the pte will hold the
- * page_table_lock
+ * No need to use ldarx/stdcx here
*/
*ptep = __pte(new_pte & ~_PAGE_BUSY);
--- mm06/arch/powerpc/mm/mem.c 2005-11-07 07:39:05.000000000 +0000
+++ mm07/arch/powerpc/mm/mem.c 2005-11-09 14:39:16.000000000 +0000
@@ -491,7 +491,7 @@ EXPORT_SYMBOL(flush_icache_user_range);
* We use it to preload an HPTE into the hash table corresponding to
* the updated linux PTE.
*
- * This must always be called with the mm->page_table_lock held
+ * This must always be called with the pte lock held.
*/
void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
pte_t pte)
--- mm06/arch/powerpc/mm/tlb_32.c 2005-11-07 07:39:05.000000000 +0000
+++ mm07/arch/powerpc/mm/tlb_32.c 2005-11-09 14:39:16.000000000 +0000
@@ -149,6 +149,12 @@ void flush_tlb_mm(struct mm_struct *mm)
return;
}
+ /*
+ * It is safe to go down the mm's list of vmas when called
+ * from dup_mmap, holding mmap_sem. It would also be safe from
+ * unmap_region or exit_mmap, but not from vmtruncate on SMP -
+ * but it seems dup_mmap is the only SMP case which gets here.
+ */
for (mp = mm->mmap; mp != NULL; mp = mp->vm_next)
flush_range(mp->vm_mm, mp->vm_start, mp->vm_end);
FINISH_FLUSH;
--- mm06/arch/powerpc/mm/tlb_64.c 2005-11-07 07:39:05.000000000 +0000
+++ mm07/arch/powerpc/mm/tlb_64.c 2005-11-09 14:39:16.000000000 +0000
@@ -95,7 +95,7 @@ static void pte_free_submit(struct pte_f
void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
{
- /* This is safe as we are holding page_table_lock */
+ /* This is safe since tlb_gather_mmu has disabled preemption */
cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id());
struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
@@ -206,7 +206,7 @@ void __flush_tlb_pending(struct ppc64_tl
void pte_free_finish(void)
{
- /* This is safe as we are holding page_table_lock */
+ /* This is safe since tlb_gather_mmu has disabled preemption */
struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
if (*batchp == NULL)
On Thu, 2005-11-10 at 01:50 +0000, Hugh Dickins wrote:
> ppc's HIGHPTE config option was removed in 2.5.28, and nobody seems to
> have wanted it enough to restore it: so remove its traces from pgtable.h
> and pte_alloc_one. Or supply an alternative patch to config it back?
Hrm... ppc32 does have fully working kmap & kmap atomic so I would
rather put back the CONFIG option...
Ben.
Restore an earlier mod which went missing in the powerpc reshuffle:
the 4xx mmu_mapin_ram does not need to take init_mm.page_table_lock.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/mm/4xx_mmu.c | 4 ----
1 files changed, 4 deletions(-)
--- mm07/arch/powerpc/mm/4xx_mmu.c 2005-11-07 07:39:05.000000000 +0000
+++ mm08/arch/powerpc/mm/4xx_mmu.c 2005-11-09 14:39:29.000000000 +0000
@@ -110,13 +110,11 @@ unsigned long __init mmu_mapin_ram(void)
pmd_t *pmdp;
unsigned long val = p | _PMD_SIZE_16M | _PAGE_HWEXEC | _PAGE_HWWRITE;
- spin_lock(&init_mm.page_table_lock);
pmdp = pmd_offset(pgd_offset_k(v), v);
pmd_val(*pmdp++) = val;
pmd_val(*pmdp++) = val;
pmd_val(*pmdp++) = val;
pmd_val(*pmdp++) = val;
- spin_unlock(&init_mm.page_table_lock);
v += LARGE_PAGE_SIZE_16M;
p += LARGE_PAGE_SIZE_16M;
@@ -127,10 +125,8 @@ unsigned long __init mmu_mapin_ram(void)
pmd_t *pmdp;
unsigned long val = p | _PMD_SIZE_4M | _PAGE_HWEXEC | _PAGE_HWWRITE;
- spin_lock(&init_mm.page_table_lock);
pmdp = pmd_offset(pgd_offset_k(v), v);
pmd_val(*pmdp) = val;
- spin_unlock(&init_mm.page_table_lock);
v += LARGE_PAGE_SIZE_4M;
p += LARGE_PAGE_SIZE_4M;
Hugh Dickins writes:
> ppc's HIGHPTE config option was removed in 2.5.28, and nobody seems to
> have wanted it enough to restore it: so remove its traces from pgtable.h
> and pte_alloc_one. Or supply an alternative patch to config it back?
I'm staggered. We do want to be able to have pte pages in highmem.
I would rather just have it always enabled if CONFIG_HIGHMEM=y, rather
than putting the config option back. I think that should just involve
adding __GFP_HIGHMEM to the flags for alloc_pages in pte_alloc_one
unconditionally, no?
Paul.
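For what it's worth, a minimal sketch of what Paul suggests, assuming the
simplified pte_alloc_one from the patch above; pte_offset_map and
pte_offset_map_nested would then want their kmap_atomic forms back, as in
the asm-ppc/pgtable.h hunk removed there:

	/* Sketch only: allow pte pages in highmem whenever it exists. */
	struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
	{
		return alloc_pages(GFP_KERNEL | __GFP_HIGHMEM |
				   __GFP_REPEAT | __GFP_ZERO, 0);
	}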
alpha, s390, sparc64, x86_64 are each missing some primitives from their
atomic64 support: fill in the gaps I've noticed by extrapolating asm,
follow the groupings in each file, and say "int" for the booleans rather
than long or long long. But powerpc and parisc still have no atomic64.
Signed-off-by: Hugh Dickins <[email protected]>
---
include/asm-alpha/atomic.h | 7 ++++--
include/asm-s390/atomic.h | 10 ++++++--
include/asm-sparc64/atomic.h | 1
include/asm-x86_64/atomic.h | 49 ++++++++++++++++++++++++++++++++-----------
4 files changed, 50 insertions(+), 17 deletions(-)
--- mm08/include/asm-alpha/atomic.h 2005-10-28 01:02:08.000000000 +0100
+++ mm09/include/asm-alpha/atomic.h 2005-11-09 14:39:48.000000000 +0000
@@ -118,8 +118,6 @@ static __inline__ long atomic_add_return
return result;
}
-#define atomic_add_negative(a, v) (atomic_add_return((a), (v)) < 0)
-
static __inline__ long atomic64_add_return(long i, atomic64_t * v)
{
long temp, result;
@@ -177,6 +175,9 @@ static __inline__ long atomic64_sub_retu
return result;
}
+#define atomic_add_negative(a, v) (atomic_add_return((a), (v)) < 0)
+#define atomic64_add_negative(a, v) (atomic64_add_return((a), (v)) < 0)
+
#define atomic_dec_return(v) atomic_sub_return(1,(v))
#define atomic64_dec_return(v) atomic64_sub_return(1,(v))
@@ -187,6 +188,8 @@ static __inline__ long atomic64_sub_retu
#define atomic64_sub_and_test(i,v) (atomic64_sub_return((i), (v)) == 0)
#define atomic_inc_and_test(v) (atomic_add_return(1, (v)) == 0)
+#define atomic64_inc_and_test(v) (atomic64_add_return(1, (v)) == 0)
+
#define atomic_dec_and_test(v) (atomic_sub_return(1, (v)) == 0)
#define atomic64_dec_and_test(v) (atomic64_sub_return(1, (v)) == 0)
--- mm08/include/asm-s390/atomic.h 2005-08-29 00:41:01.000000000 +0100
+++ mm09/include/asm-s390/atomic.h 2005-11-09 14:39:48.000000000 +0000
@@ -131,7 +131,7 @@ static __inline__ long long atomic64_add
{
return __CSG_LOOP(v, i, "agr");
}
-static __inline__ long long atomic64_add_negative(long long i, atomic64_t * v)
+static __inline__ int atomic64_add_negative(long long i, atomic64_t * v)
{
return __CSG_LOOP(v, i, "agr") < 0;
}
@@ -139,6 +139,10 @@ static __inline__ void atomic64_sub(long
{
__CSG_LOOP(v, i, "sgr");
}
+static __inline__ long long atomic64_sub_return(long long i, atomic64_t * v)
+{
+ return __CSG_LOOP(v, i, "sgr");
+}
static __inline__ void atomic64_inc(volatile atomic64_t * v)
{
__CSG_LOOP(v, 1, "agr");
@@ -147,7 +151,7 @@ static __inline__ long long atomic64_inc
{
return __CSG_LOOP(v, 1, "agr");
}
-static __inline__ long long atomic64_inc_and_test(volatile atomic64_t * v)
+static __inline__ int atomic64_inc_and_test(volatile atomic64_t * v)
{
return __CSG_LOOP(v, 1, "agr") == 0;
}
@@ -159,7 +163,7 @@ static __inline__ long long atomic64_dec
{
return __CSG_LOOP(v, 1, "sgr");
}
-static __inline__ long long atomic64_dec_and_test(volatile atomic64_t * v)
+static __inline__ int atomic64_dec_and_test(volatile atomic64_t * v)
{
return __CSG_LOOP(v, 1, "sgr") == 0;
}
--- mm08/include/asm-sparc64/atomic.h 2005-10-28 01:02:08.000000000 +0100
+++ mm09/include/asm-sparc64/atomic.h 2005-11-09 14:39:48.000000000 +0000
@@ -54,6 +54,7 @@ extern int atomic64_sub_ret(int, atomic6
* other cases.
*/
#define atomic_inc_and_test(v) (atomic_inc_return(v) == 0)
+#define atomic64_inc_and_test(v) (atomic64_inc_return(v) == 0)
#define atomic_sub_and_test(i, v) (atomic_sub_ret(i, v) == 0)
#define atomic64_sub_and_test(i, v) (atomic64_sub_ret(i, v) == 0)
--- mm08/include/asm-x86_64/atomic.h 2004-12-24 21:37:29.000000000 +0000
+++ mm09/include/asm-x86_64/atomic.h 2005-11-09 14:39:48.000000000 +0000
@@ -160,8 +160,8 @@ static __inline__ int atomic_inc_and_tes
/**
* atomic_add_negative - add and test if negative
- * @v: pointer of type atomic_t
* @i: integer value to add
+ * @v: pointer of type atomic_t
*
* Atomically adds @i to @v and returns true
* if the result is negative, or false when
@@ -178,6 +178,31 @@ static __inline__ int atomic_add_negativ
return c;
}
+/**
+ * atomic_add_return - add and return
+ * @i: integer value to add
+ * @v: pointer of type atomic_t
+ *
+ * Atomically adds @i to @v and returns @i + @v
+ */
+static __inline__ int atomic_add_return(int i, atomic_t *v)
+{
+ int __i = i;
+ __asm__ __volatile__(
+ LOCK "xaddl %0, %1;"
+ :"=r"(i)
+ :"m"(v->counter), "0"(i));
+ return i + __i;
+}
+
+static __inline__ int atomic_sub_return(int i, atomic_t *v)
+{
+ return atomic_add_return(-i,v);
+}
+
+#define atomic_inc_return(v) (atomic_add_return(1,v))
+#define atomic_dec_return(v) (atomic_sub_return(1,v))
+
/* An 64bit atomic type */
typedef struct { volatile long counter; } atomic64_t;
@@ -320,14 +345,14 @@ static __inline__ int atomic64_inc_and_t
/**
* atomic64_add_negative - add and test if negative
- * @v: pointer to atomic64_t
* @i: integer value to add
+ * @v: pointer to atomic64_t
*
* Atomically adds @i to @v and returns true
* if the result is negative, or false when
* result is greater than or equal to zero.
*/
-static __inline__ long atomic64_add_negative(long i, atomic64_t *v)
+static __inline__ int atomic64_add_negative(long i, atomic64_t *v)
{
unsigned char c;
@@ -339,29 +364,29 @@ static __inline__ long atomic64_add_nega
}
/**
- * atomic_add_return - add and return
- * @v: pointer of type atomic_t
+ * atomic64_add_return - add and return
* @i: integer value to add
+ * @v: pointer to type atomic64_t
*
* Atomically adds @i to @v and returns @i + @v
*/
-static __inline__ int atomic_add_return(int i, atomic_t *v)
+static __inline__ long atomic64_add_return(long i, atomic64_t *v)
{
- int __i = i;
+ long __i = i;
__asm__ __volatile__(
- LOCK "xaddl %0, %1;"
+ LOCK "xaddq %0, %1;"
:"=r"(i)
:"m"(v->counter), "0"(i));
return i + __i;
}
-static __inline__ int atomic_sub_return(int i, atomic_t *v)
+static __inline__ long atomic64_sub_return(long i, atomic64_t *v)
{
- return atomic_add_return(-i,v);
+ return atomic64_add_return(-i,v);
}
-#define atomic_inc_return(v) (atomic_add_return(1,v))
-#define atomic_dec_return(v) (atomic_sub_return(1,v))
+#define atomic64_inc_return(v) (atomic64_add_return(1,v))
+#define atomic64_dec_return(v) (atomic64_sub_return(1,v))
/* These are x86-specific, used by some header files */
#define atomic_clear_mask(mask, addr) \
Page count and page mapcount might overflow their 31 bits on 64-bit
architectures, especially now we're refcounting the ZERO_PAGE. We could
quite easily avoid counting it, but shared file pages may also overflow.
Prefer not to enlarge struct page: don't assign separate atomic64_ts to
count and mapcount, instead keep them both in one atomic64_t - the count
in the low 23 bits and the mapcount in the high 41 bits. But of course
that can only work if we don't duplicate mapcount in count in this case.
The low 23 bits can accommodate 0x7fffff, that's 2 * PID_MAX_LIMIT - 1,
which seems adequate for tasks with a transient hold on pages; and the
high 41 bits would use 16TB of page table space to back max mapcount.
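For illustration, the arithmetic behind those limits (assuming 8-byte
ptes, and the 64-bit PID_MAX_LIMIT of 4*1024*1024 = 0x400000):

	low 23 bits:  (1 << 23) - 1 = 0x7fffff = 2 * 0x400000 - 1
	                            = 2 * PID_MAX_LIMIT - 1
	high 41 bits: 1 << 41 mappings, each needing an 8-byte pte:
	              (1 << 41) * 8 bytes = 1 << 44 bytes = 16TB of page tables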
Signed-off-by: Hugh Dickins <[email protected]>
---
I think Nick should have a veto on this patch if he wishes: it's not my
intention to make his lockless pagecache impossible. But I doubt he'll
have to veto it; I expect it just needs more macros: whereas page_count
carefully assembles grabcount and mapcount into the familiar page count,
he'll probably want something giving the raw atomic32 or atomic64 value.
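A hypothetical example of such a macro (the name page_pcount_raw is not
anything in the patch), should the combined value be wanted directly:

	/* hypothetical: the raw combined word, for lockless pagecache use */
	#define page_pcount_raw(page)	atomic64_read(&(page)->_pcount)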
include/linux/mm.h | 163 +++++++++++++++++++++++++++++++++++++--------------
include/linux/rmap.h | 12 ---
mm/memory.c | 3
mm/rmap.c | 9 +-
4 files changed, 124 insertions(+), 63 deletions(-)
--- mm09/include/linux/mm.h 2005-11-09 14:38:03.000000000 +0000
+++ mm10/include/linux/mm.h 2005-11-09 14:40:00.000000000 +0000
@@ -219,11 +219,15 @@ struct inode;
struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
+#ifdef ATOMIC64_INIT
+ atomic64_t _pcount; /* Both _count and _mapcount. */
+#else
atomic_t _count; /* Usage count, see below. */
atomic_t _mapcount; /* Count of ptes mapped in mms,
* to show when page is mapped
* & limit reverse map searches.
*/
+#endif
unsigned long private; /* Mapping-private opaque data:
* usually used for buffer_heads
* if PagePrivate set; used for
@@ -297,30 +301,112 @@ struct page {
* macros which retain the old rules: page_count(page) == 0 is a free page.
*/
+#ifdef ATOMIC64_INIT
+/*
+ * We squeeze _count and _mapcount into a single atomic64 _pcount:
+ * keeping the mapcount in the upper 41 bits, and the grabcount in
+ * the lower 23 bits: then page_count is the total of these two.
+ * When adding a mapping, get_page_mapped_testone transfers one from
+ * grabcount to mapcount; which put_page_mapped_testzero reverses.
+ * Each stored count is based from 0 in this case, not from -1.
+ */
+#define PCOUNT_SHIFT 23
+#define PCOUNT_MASK ((1UL << PCOUNT_SHIFT) - 1)
+
/*
- * Drop a ref, return true if the logical refcount fell to zero (the page has
- * no users)
+ * Drop a ref, return true if the logical refcount fell to zero
+ * (the page has no users)
*/
-#define put_page_testzero(p) \
- ({ \
- BUG_ON(page_count(p) == 0); \
- atomic_add_negative(-1, &(p)->_count); \
- })
+static inline int put_page_testzero(struct page *page)
+{
+ BUILD_BUG_ON(PCOUNT_MASK+1 < 2*PID_MAX_LIMIT);
+ BUG_ON(!(atomic64_read(&page->_pcount) & PCOUNT_MASK));
+ return atomic64_dec_and_test(&page->_pcount);
+}
+
+static inline int put_page_mapped_testzero(struct page *page)
+{
+ return (unsigned long)atomic64_sub_return(PCOUNT_MASK, &page->_pcount)
+ <= PCOUNT_MASK;
+}
/*
* Grab a ref, return true if the page previously had a logical refcount of
* zero. ie: returns true if we just grabbed an already-deemed-to-be-free page
*/
-#define get_page_testone(p) atomic_inc_and_test(&(p)->_count)
+#define get_page_testone(page) \
+ (atomic64_add_return(1, &(page)->_pcount) == 1)
+#define get_page_mapped_testone(page) \
+ ((atomic64_add_return(PCOUNT_MASK,&(page)->_pcount)>>PCOUNT_SHIFT)==1)
+
+#define set_page_count(page, val) atomic64_set(&(page)->_pcount, val)
+#define reset_page_mapcount(page) do { } while (0)
+#define __put_page(page) atomic64_dec(&(page)->_pcount)
-#define set_page_count(p,v) atomic_set(&(p)->_count, v - 1)
-#define __put_page(p) atomic_dec(&(p)->_count)
+static inline long page_count(struct page *page)
+{
+ unsigned long pcount;
-extern void FASTCALL(__page_cache_release(struct page *));
+ if (PageCompound(page))
+ page = (struct page *)page->private;
+ pcount = (unsigned long)atomic64_read(&page->_pcount);
+ /* Return total of grabcount and mapcount */
+ return (pcount & PCOUNT_MASK) + (pcount >> PCOUNT_SHIFT);
+}
-#ifdef CONFIG_HUGETLB_PAGE
+static inline void get_page(struct page *page)
+{
+ if (unlikely(PageCompound(page)))
+ page = (struct page *)page->private;
+ atomic64_inc(&page->_pcount);
+}
+
+static inline void get_page_dup_rmap(struct page *page)
+{
+ /* copy_one_pte increment mapcount */
+ atomic64_add(PCOUNT_MASK + 1, &page->_pcount);
+}
+
+static inline long page_mapcount(struct page *page)
+{
+ return (unsigned long)atomic64_read(&page->_pcount) >> PCOUNT_SHIFT;
+}
+
+static inline int page_mapped(struct page *page)
+{
+ return (unsigned long)atomic64_read(&page->_pcount) > PCOUNT_MASK;
+}
+
+#else /* !ATOMIC64_INIT */
+
+/*
+ * Drop a ref, return true if the logical refcount fell to zero
+ * (the page has no users)
+ */
+static inline int put_page_testzero(struct page *page)
+{
+ BUG_ON(atomic_read(&page->_count) == -1);
+ return atomic_add_negative(-1, &page->_count);
+}
+
+static inline int put_page_mapped_testzero(struct page *page)
+{
+ return atomic_add_negative(-1, &page->_mapcount);
+}
+
+/*
+ * Grab a ref, return true if the page previously had a logical refcount of
+ * zero. ie: returns true if we just grabbed an already-deemed-to-be-free page
+ */
+#define get_page_testone(page) atomic_inc_and_test(&(page)->_count)
+#define get_page_mapped_testone(page) \
+ atomic_inc_and_test(&(page)->_mapcount)
+
+#define set_page_count(page, val) atomic_set(&(page)->_count, val - 1)
+#define reset_page_mapcount(page) atomic_set(&(page)->_mapcount, -1)
+#define __put_page(page) atomic_dec(&(page)->_count)
-static inline int page_count(struct page *page)
+static inline long page_count(struct page *page)
{
if (PageCompound(page))
page = (struct page *)page->private;
@@ -334,24 +420,36 @@ static inline void get_page(struct page
atomic_inc(&page->_count);
}
-void put_page(struct page *page);
-
-#else /* CONFIG_HUGETLB_PAGE */
+static inline void get_page_dup_rmap(struct page *page)
+{
+ /* copy_one_pte increment total count and mapcount */
+ atomic_inc(&page->_count);
+ atomic_inc(&page->_mapcount);
+}
-#define page_count(p) (atomic_read(&(p)->_count) + 1)
+static inline long page_mapcount(struct page *page)
+{
+ return atomic_read(&page->_mapcount) + 1;
+}
-static inline void get_page(struct page *page)
+static inline int page_mapped(struct page *page)
{
- atomic_inc(&page->_count);
+ return atomic_read(&page->_mapcount) >= 0;
}
+#endif /* !ATOMIC64_INIT */
+
+void FASTCALL(__page_cache_release(struct page *));
+
+#ifdef CONFIG_HUGETLB_PAGE
+void put_page(struct page *page);
+#else
static inline void put_page(struct page *page)
{
if (put_page_testzero(page))
__page_cache_release(page);
}
-
-#endif /* CONFIG_HUGETLB_PAGE */
+#endif /* !CONFIG_HUGETLB_PAGE */
/*
* Multiple processes may "see" the same page. E.g. for untouched
@@ -601,29 +699,6 @@ static inline pgoff_t page_index(struct
}
/*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
- */
-static inline void reset_page_mapcount(struct page *page)
-{
- atomic_set(&(page)->_mapcount, -1);
-}
-
-static inline int page_mapcount(struct page *page)
-{
- return atomic_read(&(page)->_mapcount) + 1;
-}
-
-/*
- * Return true if this page is mapped into pagetables.
- */
-static inline int page_mapped(struct page *page)
-{
- return atomic_read(&(page)->_mapcount) >= 0;
-}
-
-/*
* Error return values for the *_nopage functions
*/
#define NOPAGE_SIGBUS (NULL)
--- mm09/include/linux/rmap.h 2005-11-07 07:39:58.000000000 +0000
+++ mm10/include/linux/rmap.h 2005-11-09 14:40:00.000000000 +0000
@@ -74,18 +74,6 @@ void page_add_anon_rmap(struct page *, s
void page_add_file_rmap(struct page *);
void page_remove_rmap(struct page *);
-/**
- * page_dup_rmap - duplicate pte mapping to a page
- * @page: the page to add the mapping to
- *
- * For copy_page_range only: minimal extract from page_add_rmap,
- * avoiding unnecessary tests (already checked) so it's quicker.
- */
-static inline void page_dup_rmap(struct page *page)
-{
- atomic_inc(&page->_mapcount);
-}
-
/*
* Called from mm/vmscan.c to handle paging out
*/
--- mm09/mm/memory.c 2005-11-07 07:39:59.000000000 +0000
+++ mm10/mm/memory.c 2005-11-09 14:40:00.000000000 +0000
@@ -414,8 +414,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
if (vm_flags & VM_SHARED)
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
- get_page(page);
- page_dup_rmap(page);
+ get_page_dup_rmap(page);
rss[!!PageAnon(page)]++;
out_set_pte:
--- mm09/mm/rmap.c 2005-11-09 14:38:03.000000000 +0000
+++ mm10/mm/rmap.c 2005-11-09 14:40:00.000000000 +0000
@@ -450,7 +450,7 @@ int page_referenced(struct page *page, i
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
- if (atomic_inc_and_test(&page->_mapcount)) {
+ if (get_page_mapped_testone(page)) {
struct anon_vma *anon_vma = vma->anon_vma;
BUG_ON(!anon_vma);
@@ -474,8 +474,7 @@ void page_add_file_rmap(struct page *pag
{
BUG_ON(PageAnon(page));
BUG_ON(!pfn_valid(page_to_pfn(page)));
-
- if (atomic_inc_and_test(&page->_mapcount))
+ if (get_page_mapped_testone(page))
inc_page_state(nr_mapped);
}
@@ -487,8 +486,8 @@ void page_add_file_rmap(struct page *pag
*/
void page_remove_rmap(struct page *page)
{
- if (atomic_add_negative(-1, &page->_mapcount)) {
- BUG_ON(page_mapcount(page) < 0);
+ BUG_ON(!page_mapcount(page));
+ if (put_page_mapped_testzero(page)) {
/*
* It would be tidy to reset the PageAnon mapping here,
* but that might overwrite a racing page_add_anon_rmap
The type of the page_count and page_mapcount functions has changed from
int to long. Update those places which give warnings (mostly debug
printks), or where the count might significantly overflow.
Don't bother with the arches' show_mem functions for now (some say int
shared, some long): they don't cause warnings, the truncation wouldn't
matter much, and we'll want to visit them all (perhaps bring them into
common code) in a later phase of PageReserved removal.
The thought of page_referenced on a page whose mapcount exceeds an int
is rather disturbing: it should probably skip high mapcounts unless the
memory pressure is high; but that's a different problem, ignore for now.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/ppc64/kernel/vdso.c | 6 +++---
arch/sh64/lib/dbg.c | 2 +-
drivers/char/drm/drm_vm.c | 2 +-
fs/proc/task_mmu.c | 2 +-
mm/page_alloc.c | 2 +-
mm/rmap.c | 18 +++++++++---------
mm/swapfile.c | 2 +-
7 files changed, 17 insertions(+), 17 deletions(-)
--- mm10/arch/ppc64/kernel/vdso.c 2005-11-07 07:39:07.000000000 +0000
+++ mm11/arch/ppc64/kernel/vdso.c 2005-11-09 14:40:00.000000000 +0000
@@ -112,11 +112,11 @@ struct lib64_elfinfo
#ifdef __DEBUG
static void dump_one_vdso_page(struct page *pg, struct page *upg)
{
- printk("kpg: %p (c:%d,f:%08lx)", __va(page_to_pfn(pg) << PAGE_SHIFT),
+ printk("kpg: %p (c:%ld,f:%08lx)", __va(page_to_pfn(pg) << PAGE_SHIFT),
page_count(pg),
pg->flags);
if (upg/* && pg != upg*/) {
- printk(" upg: %p (c:%d,f:%08lx)", __va(page_to_pfn(upg) << PAGE_SHIFT),
+ printk(" upg: %p (c:%ld,f:%08lx)", __va(page_to_pfn(upg) << PAGE_SHIFT),
page_count(upg),
upg->flags);
}
@@ -184,7 +184,7 @@ static struct page * vdso_vma_nopage(str
pg = virt_to_page(vbase + offset);
get_page(pg);
- DBG(" ->page count: %d\n", page_count(pg));
+ DBG(" ->page count: %ld\n", page_count(pg));
return pg;
}
--- mm10/arch/sh64/lib/dbg.c 2005-06-17 20:48:29.000000000 +0100
+++ mm11/arch/sh64/lib/dbg.c 2005-11-09 14:40:00.000000000 +0000
@@ -422,7 +422,7 @@ unsigned long lookup_itlb(unsigned long
void print_page(struct page *page)
{
- printk(" page[%p] -> index 0x%lx, count 0x%x, flags 0x%lx\n",
+ printk(" page[%p] -> index 0x%lx, count 0x%lx, flags 0x%lx\n",
page, page->index, page_count(page), page->flags);
printk(" address_space = %p, pages =%ld\n", page->mapping,
page->mapping->nrpages);
--- mm10/drivers/char/drm/drm_vm.c 2005-11-07 07:39:15.000000000 +0000
+++ mm11/drivers/char/drm/drm_vm.c 2005-11-09 14:40:00.000000000 +0000
@@ -112,7 +112,7 @@ static __inline__ struct page *drm_do_vm
get_page(page);
DRM_DEBUG
- ("baddr = 0x%lx page = 0x%p, offset = 0x%lx, count=%d\n",
+ ("baddr = 0x%lx page = 0x%p, offset = 0x%lx, count=%ld\n",
baddr, __va(agpmem->memory->memory[offset]), offset,
page_count(page));
--- mm10/fs/proc/task_mmu.c 2005-11-07 07:39:46.000000000 +0000
+++ mm11/fs/proc/task_mmu.c 2005-11-09 14:40:00.000000000 +0000
@@ -422,7 +422,7 @@ static struct numa_maps *get_numa_maps(c
for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
page = follow_page(mm, vaddr, 0);
if (page) {
- int count = page_mapcount(page);
+ long count = page_mapcount(page);
if (count)
md->mapped++;
--- mm10/mm/page_alloc.c 2005-11-09 14:38:03.000000000 +0000
+++ mm11/mm/page_alloc.c 2005-11-09 14:40:00.000000000 +0000
@@ -126,7 +126,7 @@ static void bad_page(const char *functio
{
printk(KERN_EMERG "Bad page state at %s (in process '%s', page %p)\n",
function, current->comm, page);
- printk(KERN_EMERG "flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
+ printk(KERN_EMERG "flags:0x%0*lx mapping:%p mapcount:%ld count:%ld\n",
(int)(2*sizeof(unsigned long)), (unsigned long)page->flags,
page->mapping, page_mapcount(page), page_count(page));
printk(KERN_EMERG "Backtrace:\n");
--- mm10/mm/rmap.c 2005-11-09 14:40:00.000000000 +0000
+++ mm11/mm/rmap.c 2005-11-09 14:40:00.000000000 +0000
@@ -64,12 +64,12 @@ static inline void validate_anon_vma(str
#ifdef RMAP_DEBUG
struct anon_vma *anon_vma = find_vma->anon_vma;
struct vm_area_struct *vma;
- unsigned int mapcount = 0;
+ unsigned int vmacount = 0;
int found = 0;
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- mapcount++;
- BUG_ON(mapcount > 100000);
+ vmacount++;
+ BUG_ON(vmacount > 100000);
if (vma == find_vma)
found = 1;
}
@@ -289,7 +289,7 @@ pte_t *page_check_address(struct page *p
* repeatedly from either page_referenced_anon or page_referenced_file.
*/
static int page_referenced_one(struct page *page,
- struct vm_area_struct *vma, unsigned int *mapcount, int ignore_token)
+ struct vm_area_struct *vma, long *mapcount, int ignore_token)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -322,7 +322,7 @@ out:
static int page_referenced_anon(struct page *page, int ignore_token)
{
- unsigned int mapcount;
+ long mapcount;
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
int referenced = 0;
@@ -355,7 +355,7 @@ static int page_referenced_anon(struct p
*/
static int page_referenced_file(struct page *page, int ignore_token)
{
- unsigned int mapcount;
+ long mapcount;
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
struct vm_area_struct *vma;
@@ -600,7 +600,7 @@ out:
#define CLUSTER_MASK (~(CLUSTER_SIZE - 1))
static void try_to_unmap_cluster(unsigned long cursor,
- unsigned int *mapcount, struct vm_area_struct *vma)
+ long *mapcount, struct vm_area_struct *vma)
{
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
@@ -712,7 +712,7 @@ static int try_to_unmap_file(struct page
unsigned long cursor;
unsigned long max_nl_cursor = 0;
unsigned long max_nl_size = 0;
- unsigned int mapcount;
+ long mapcount;
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
@@ -768,7 +768,7 @@ static int try_to_unmap_file(struct page
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;
vma->vm_private_data = (void *) cursor;
- if ((int)mapcount <= 0)
+ if (mapcount <= 0)
goto out;
}
vma->vm_private_data = (void *) max_nl_cursor;
--- mm10/mm/swapfile.c 2005-11-09 14:38:03.000000000 +0000
+++ mm11/mm/swapfile.c 2005-11-09 14:40:00.000000000 +0000
@@ -308,7 +308,7 @@ static inline int page_swapcount(struct
*/
int can_share_swap_page(struct page *page)
{
- int count;
+ long count;
BUG_ON(!PageLocked(page));
count = page_mapcount(page);
Update -mm tree's reiser4 source for the long page_count.
Signed-off-by: Hugh Dickins <[email protected]>
---
fs/reiser4/page_cache.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)
--- mm11/fs/reiser4/page_cache.c 2005-11-09 14:38:18.000000000 +0000
+++ mm12/fs/reiser4/page_cache.c 2005-11-09 14:40:00.000000000 +0000
@@ -751,7 +751,7 @@ void print_page(const char *prefix, stru
printk("null page\n");
return;
}
- printk("%s: page index: %lu mapping: %p count: %i private: %lx\n",
+ printk("%s: page index: %lu mapping: %p count: %li private: %lx\n",
prefix, page->index, page->mapping, page_count(page),
page->private);
printk("\tflags: %s%s%s%s %s%s%s %s%s%s %s%s\n",
Most calls to get_user_pages soon release the pages and then return to
user space, but some are long term - notably Infiniband's ib_umem_get.
That's blessed with good locked_vm checking, but if the system were
misconfigured, then it might be possible to build up a huge page_count.
Guard against overflow by page_count_too_high checks in get_user_pages:
refuse with -ENOMEM once the page count enters its final PID_MAX_LIMIT of
headroom (which is why we allowed for 2*PID_MAX_LIMIT in the 23 bits of count).
Sorry, can't touch get_user_pages without giving it more cleanup.
Signed-off-by: Hugh Dickins <[email protected]>
---
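For illustration, a standalone userspace sketch of the headroom arithmetic -
not part of the patch, and assuming PID_MAX_LIMIT's 64-bit value of 0x400000,
consistent with allowing 2*PID_MAX_LIMIT in the 23 bits of count above:

#include <stdio.h>

#define PID_MAX_LIMIT	0x400000UL	/* assumed 64-bit value */
#define PCOUNT_SHIFT	23
#define PCOUNT_MASK	((1UL << PCOUNT_SHIFT) - 1)
#define PCOUNT_HIGH	(PCOUNT_MASK - PID_MAX_LIMIT)

int main(void)
{
	/* get_user_pages refuses with -ENOMEM once the count reaches
	 * PCOUNT_HIGH, leaving PID_MAX_LIMIT transient references of
	 * headroom before the 23-bit count field could wrap. */
	printf("mask %#lx high %#lx headroom %lu\n",
	       PCOUNT_MASK, PCOUNT_HIGH, PCOUNT_MASK - PCOUNT_HIGH);
	return 0;
}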
include/linux/mm.h | 16 ++++++++++++++++
mm/memory.c | 40 +++++++++++++++++++++++-----------------
2 files changed, 39 insertions(+), 17 deletions(-)
--- mm12/include/linux/mm.h 2005-11-09 14:40:00.000000000 +0000
+++ mm13/include/linux/mm.h 2005-11-09 14:40:17.000000000 +0000
@@ -312,6 +312,7 @@ struct page {
*/
#define PCOUNT_SHIFT 23
#define PCOUNT_MASK ((1UL << PCOUNT_SHIFT) - 1)
+#define PCOUNT_HIGH (PCOUNT_MASK - PID_MAX_LIMIT)
/*
* Drop a ref, return true if the logical refcount fell to zero
@@ -377,8 +378,17 @@ static inline int page_mapped(struct pag
return (unsigned long)atomic64_read(&page->_pcount) > PCOUNT_MASK;
}
+static inline int page_count_too_high(struct page *page)
+{
+ /* get_user_pages check when nearing overflow */
+ return ((unsigned long)atomic64_read(&page->_pcount) & PCOUNT_MASK)
+ >= PCOUNT_HIGH;
+}
+
#else /* !ATOMIC64_INIT */
+#define PCOUNT_HIGH (INT_MAX - PID_MAX_LIMIT)
+
/*
* Drop a ref, return true if the logical refcount fell to zero
* (the page has no users)
@@ -437,6 +447,12 @@ static inline int page_mapped(struct pag
return atomic_read(&page->_mapcount) >= 0;
}
+static inline int page_count_too_high(struct page *page)
+{
+ /* get_user_pages check when nearing overflow */
+ return atomic_read(&page->_count) >= PCOUNT_HIGH;
+}
+
#endif /* !ATOMIC64_INIT */
void FASTCALL(__page_cache_release(struct page *));
--- mm12/mm/memory.c 2005-11-09 14:40:00.000000000 +0000
+++ mm13/mm/memory.c 2005-11-09 14:40:17.000000000 +0000
@@ -928,39 +928,43 @@ int get_user_pages(struct task_struct *t
do {
struct vm_area_struct *vma;
unsigned int foll_flags;
+ struct page *page;
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(tsk, start)) {
- unsigned long pg = start & PAGE_MASK;
- struct vm_area_struct *gate_vma = get_gate_vma(tsk);
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ pte_t ptent;
+
if (write) /* user gate pages are read-only */
return i ? : -EFAULT;
- if (pg > TASK_SIZE)
- pgd = pgd_offset_k(pg);
+ start &= PAGE_MASK; /* what needs that? */
+ if (start >= TASK_SIZE)
+ pgd = pgd_offset_k(start);
else
- pgd = pgd_offset_gate(mm, pg);
+ pgd = pgd_offset_gate(mm, start);
BUG_ON(pgd_none(*pgd));
- pud = pud_offset(pgd, pg);
+ pud = pud_offset(pgd, start);
BUG_ON(pud_none(*pud));
- pmd = pmd_offset(pud, pg);
+ pmd = pmd_offset(pud, start);
if (pmd_none(*pmd))
return i ? : -EFAULT;
- pte = pte_offset_map(pmd, pg);
- if (pte_none(*pte)) {
- pte_unmap(pte);
+ pte = pte_offset_map(pmd, start);
+ ptent = *pte;
+ pte_unmap(pte);
+ if (pte_none(ptent))
return i ? : -EFAULT;
- }
if (pages) {
- pages[i] = pte_page(*pte);
- get_page(pages[i]);
+ page = pte_page(ptent);
+ if (page_count_too_high(page))
+ return i ? : -ENOMEM;
+ get_page(page);
+ pages[i] = page;
}
- pte_unmap(pte);
if (vmas)
- vmas[i] = gate_vma;
+ vmas[i] = get_gate_vma(tsk);
i++;
start += PAGE_SIZE;
len--;
@@ -985,8 +989,6 @@ int get_user_pages(struct task_struct *t
foll_flags |= FOLL_ANON;
do {
- struct page *page;
-
if (write)
foll_flags |= FOLL_WRITE;
@@ -1020,6 +1022,10 @@ int get_user_pages(struct task_struct *t
}
}
if (pages) {
+ if (page_count_too_high(page)) {
+ put_page(page);
+ return i ? : -ENOMEM;
+ }
pages[i] = page;
flush_dcache_page(page);
}
A 32-bit machine needs 16GB of page table space to back its max mapcount;
a 64-bit machine needs 16TB of page table space to back its max mapcount.
But there are certainly 32-bit machines with 64GB physical - and more? -
and even if 16TB were a 64-bit limit today, it would not be tomorrow.
Yet it'll be some time (I hope) before such machines need to spend that
much of their memory just on the page tables.
Therefore, guard against mapcount overflow on such extreme machines, by
limiting the number of page table pages with a check before incrementing
nr_page_table_pages. That's a per_cpu variable, so add a per_cpu max,
and avoid extra locking (except at startup and rare occasions after).
Of course, normally those page tables would be filled with entries for
different pages, and no mapcount would remotely approach the limit; but
this check avoids checks on hotter paths, without being too restrictive.
Signed-off-by: Hugh Dickins <[email protected]>
---
But 64-bit powerpc and parisc are limited to 16GB of page table space by
this, because they don't yet provide the atomic64_t type and operations.
Is there someone who could provide the necessary atomic64_t for them?
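For illustration only, a userspace sketch of where those 16GB and 16TB
figures come from - assuming 4K pages with 8-byte ptes (PAE on 32-bit,
as on the 64GB machines in question, and all 64-bit), and approximating
MAPCOUNT_MAX by INT_MAX on 32-bit and by 2^41 - 1 on 64-bit:

#include <stdio.h>

int main(void)
{
	unsigned long long max32 = 0x7fffffffULL;	/* ~MAPCOUNT_MAX, 32-bit */
	unsigned long long max64 = ~0ULL >> 23;		/* MAPCOUNT_MAX, 64-bit */

	/* worst case: every one of those mappings is a distinct 8-byte pte */
	printf("32-bit: %.1f GB of ptes\n", max32 * 8 / (double)(1ULL << 30));
	printf("64-bit: %.1f TB of ptes\n", max64 * 8 / (double)(1ULL << 40));
	return 0;
}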
include/linux/mm.h | 2
include/linux/page-flags.h | 1
mm/memory.c | 7 +--
mm/page_alloc.c | 95 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 102 insertions(+), 3 deletions(-)
--- mm13/include/linux/mm.h 2005-11-09 14:40:17.000000000 +0000
+++ mm14/include/linux/mm.h 2005-11-09 14:40:32.000000000 +0000
@@ -313,6 +313,7 @@ struct page {
#define PCOUNT_SHIFT 23
#define PCOUNT_MASK ((1UL << PCOUNT_SHIFT) - 1)
#define PCOUNT_HIGH (PCOUNT_MASK - PID_MAX_LIMIT)
+#define MAPCOUNT_MAX (-1UL >> PCOUNT_SHIFT)
/*
* Drop a ref, return true if the logical refcount fell to zero
@@ -388,6 +389,7 @@ static inline int page_count_too_high(st
#else /* !ATOMIC64_INIT */
#define PCOUNT_HIGH (INT_MAX - PID_MAX_LIMIT)
+#define MAPCOUNT_MAX (INT_MAX - 2*PID_MAX_LIMIT)
/*
* Drop a ref, return true if the logical refcount fell to zero
--- mm13/include/linux/page-flags.h 2005-11-07 07:39:57.000000000 +0000
+++ mm14/include/linux/page-flags.h 2005-11-09 14:40:32.000000000 +0000
@@ -139,6 +139,7 @@ extern void get_page_state_node(struct p
extern void get_full_page_state(struct page_state *ret);
extern unsigned long __read_page_state(unsigned long offset);
extern void __mod_page_state(unsigned long offset, unsigned long delta);
+extern int inc_page_table_pages(void);
#define read_page_state(member) \
__read_page_state(offsetof(struct page_state, member))
--- mm13/mm/memory.c 2005-11-09 14:40:17.000000000 +0000
+++ mm14/mm/memory.c 2005-11-09 14:40:32.000000000 +0000
@@ -291,22 +291,23 @@ void free_pgtables(struct mmu_gather **t
int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
{
+ int ret = 0;
struct page *new = pte_alloc_one(mm, address);
if (!new)
return -ENOMEM;
pte_lock_init(new);
spin_lock(&mm->page_table_lock);
- if (pmd_present(*pmd)) { /* Another has populated it */
+ if (pmd_present(*pmd) || /* Another has populated it */
+ (ret = inc_page_table_pages())) {
pte_lock_deinit(new);
pte_free(new);
} else {
mm->nr_ptes++;
- inc_page_state(nr_page_table_pages);
pmd_populate(mm, pmd, new);
}
spin_unlock(&mm->page_table_lock);
- return 0;
+ return ret;
}
int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
--- mm13/mm/page_alloc.c 2005-11-09 14:40:00.000000000 +0000
+++ mm14/mm/page_alloc.c 2005-11-09 14:40:32.000000000 +0000
@@ -1217,6 +1217,14 @@ static void show_node(struct zone *zone)
#endif
/*
+ * Declarations for inc_page_table_pages(): placed here in the hope
+ * that max_page_table_pages will share cacheline with page_states.
+ */
+#define MAX_PAGE_TABLE_PAGES (MAPCOUNT_MAX / PTRS_PER_PTE)
+static DEFINE_SPINLOCK(max_page_tables_lock);
+static DEFINE_PER_CPU(unsigned long, max_page_table_pages) = {0};
+
+/*
* Accumulate the page_state information across all CPUs.
* The result is unavoidably approximate - it can change
* during and after execution of this function.
@@ -1309,6 +1317,90 @@ void __mod_page_state(unsigned long offs
EXPORT_SYMBOL(__mod_page_state);
+/*
+ * inc_page_table_pages() increments cpu page_state.nr_page_table_pages
+ * after checking against the MAX_PAGE_TABLE_PAGES limit: which ensures
+ * mapcount cannot wrap even if _every_ page table entry points to the
+ * same page. Absurdly draconian, yet no serious practical limitation -
+ * limits 32-bit to 16GB in page tables, 64-bit to 16TB in page tables.
+ */
+int inc_page_table_pages(void)
+{
+ unsigned long offset;
+ unsigned long *max;
+ unsigned long *ptr;
+ unsigned long nr_ptps;
+ int nr_cpus;
+ long delta;
+ int cpu;
+
+ offset = offsetof(struct page_state, nr_page_table_pages);
+again:
+ ptr = (void *) &__get_cpu_var(page_states) + offset;
+ max = &__get_cpu_var(max_page_table_pages);
+ /*
+ * Beware, *ptr and *max may go "negative" if more page
+ * tables happen to be freed on this cpu than allocated.
+ * We avoid the need for barriers by keeping max 1 low.
+ */
+ if (likely((long)(*max - *ptr) > 0)) {
+ (*ptr)++;
+ return 0;
+ }
+
+ spin_lock(&max_page_tables_lock);
+ /*
+ * Below, we drop *max on each cpu to stop racing allocations
+ * while we're updating. But perhaps another cpu just did the
+ * update while we were waiting for the lock: don't do it again.
+ */
+ if ((long)(*max - *ptr) > 0) {
+ (*ptr)++;
+ spin_unlock(&max_page_tables_lock);
+ return 0;
+ }
+
+ /*
+ * Find how much is allocated and how many online cpus.
+ * Stop racing allocations by dropping *max temporarily.
+ */
+ nr_cpus = 0;
+ nr_ptps = 0;
+ for_each_online_cpu(cpu) {
+ ptr = (void *) &per_cpu(page_states, cpu) + offset;
+ max = &per_cpu(max_page_table_pages, cpu);
+ *max = *ptr;
+ nr_ptps += *max;
+ nr_cpus++;
+ }
+
+ /*
+ * Allow each cpu the same quota. Subtract 1 to avoid the need
+ * for barriers above: each racing cpu might allocate one table
+ * too many, but will meet a barrier before it can get another.
+ */
+ delta = ((MAX_PAGE_TABLE_PAGES - nr_ptps) / nr_cpus) - 1;
+ if (delta <= 0) {
+ spin_unlock(&max_page_tables_lock);
+ return -ENOMEM;
+ }
+
+ /*
+ * Redistribute new maxima amongst the online cpus.
+ * Don't allow too much if a new cpu has come online; don't
+ * worry if a cpu went offline, it'll get sorted eventually.
+ */
+ for_each_online_cpu(cpu) {
+ max = &per_cpu(max_page_table_pages, cpu);
+ *max += delta;
+ --nr_cpus;
+ if (!nr_cpus)
+ break;
+ }
+ spin_unlock(&max_page_tables_lock);
+ goto again;
+}
+
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
unsigned long *free, struct pglist_data *pgdat)
{
@@ -2431,6 +2523,9 @@ static int page_alloc_cpu_notify(struct
src[i] = 0;
}
+ src = (unsigned long *)&per_cpu(max_page_table_pages, cpu);
+ *src = 0;
+
local_irq_enable();
}
return NOTIFY_OK;
Remove the INT_MAX/2 limit on remap_file_pages mapcount, which we added
in late 2.6.14: both 32-bit and 64-bit can now handle more, and are now
limited by their page tables.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/fremap.c | 3 ---
1 files changed, 3 deletions(-)
--- mm14/mm/fremap.c 2005-11-07 07:39:59.000000000 +0000
+++ mm15/mm/fremap.c 2005-11-09 14:40:44.000000000 +0000
@@ -87,9 +87,6 @@ int install_page(struct mm_struct *mm, s
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (!page->mapping || page->index >= size)
goto unlock;
- err = -ENOMEM;
- if (page_mapcount(page) > INT_MAX/2)
- goto unlock;
if (pte_none(*pte) || !zap_pte(mm, vma, addr, pte))
inc_mm_counter(mm, file_rss);
Hugh Dickins <[email protected]> wrote:
>
> The split ptlock patch enlarged the default SMP PREEMPT struct page from
> 32 to 36 bytes on most 32-bit platforms, from 32 to 44 bytes on PA-RISC
> 7xxx (without PREEMPT). That was not my intention, and I don't believe
> that split ptlock deserves any such slice of the user's memory.
>
> Could we please try this patch, or something like it? Again to overlay
> the spinlock_t from &page->private onwards, with corrected BUILD_BUG_ON
> that we don't go too far; with poisoning of the fields overlaid, and
> unsplit SMP config verifying that the split config is safe to use them.
>
> The previous attempt at this patch broke ppc64, which uses slab for its
> page tables - and slab saves vital info in page->lru: but no config of
> spinlock_t needs to overwrite lru on 64-bit anyway; and the only 32-bit
> user of slab for page tables is arm26, which is never SMP i.e. all the
> problems came from the "safety" checks, not from what's actually needed.
> So previous checks refined with #ifdefs, and a BUG_ON(PageSlab) added.
>
> This overlaying is unlikely to be portable forever: but the added checks
> should warn developers when it's going to break, long before any users.
argh.
Really, I'd prefer to abandon gcc-2.95.x rather than this. Look:
struct a
{
	union {
		struct {
			int b;
			int c;
		};
		int d;
	};
} z;

main()
{
	z.b = 1;
	z.d = 1;
}
It does everything we want.
Of course, it would be nice to retain 2.95.x support. The reviled
page_private() would help us do that. But the now-to-be-reviled
page_mapping() does extraneous stuff, and we'd need a ton of page_lru()'s.
So it'd be a big patch, converting page->lru to page->u.s.lru in lots of
places.
But I think either a big patch or 2.95.x abandonment is preferable to this
approach.
Hugh Dickins <[email protected]> wrote:
>
> Page count and page mapcount might overflow their 31 bits on 64-bit
> architectures, especially now we're refcounting the ZERO_PAGE. We could
> quite easily avoid counting it, but shared file pages may also overflow.
>
> Prefer not to enlarge struct page: don't assign separate atomic64_ts to
> count and mapcount, instead keep them both in one atomic64_t - the count
> in the low 23 bits and the mapcount in the high 41 bits. But of course
> that can only work if we don't duplicate mapcount in count in this case.
>
> The low 23 bits can accommodate 0x7fffff, that's 2 * PID_MAX_LIMIT - 1,
> which seems adequate for tasks with a transient hold on pages; and the
> high 41 bits would use 16TB of page table space to back max mapcount.
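A minimal sketch of the packed layout being described - count in the low
23 bits, mapcount in the 41 bits above - with illustrative names, not the
patch's own code; since mapcount is no longer duplicated into count, a
new pte mapping bumps only the high field:

#include <stdint.h>

#define PCOUNT_SHIFT	23
#define PCOUNT_MASK	((1UL << PCOUNT_SHIFT) - 1)

/* transient references live in the low 23 bits */
static inline unsigned long sketch_page_count(uint64_t pcount)
{
	return pcount & PCOUNT_MASK;
}

/* pte mappings live in the 41 bits above */
static inline unsigned long sketch_page_mapcount(uint64_t pcount)
{
	return pcount >> PCOUNT_SHIFT;
}

/* page_mapped() then reduces to: is anything set above the low bits? */
static inline int sketch_page_mapped(uint64_t pcount)
{
	return pcount > PCOUNT_MASK;
}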
hm. I thought we were going to address this by
a) checking for an insane number of mappings of the same page in the
pagefault handler(s) and
b) special-casing ZERO_PAGEs on the page unallocator path.
That'll generate faster and smaller code than adding an extra shift and
possibly larger constants in lots of places.
It's cleaner, too.
On Wed, 9 Nov 2005, Andrew Morton wrote:
>
> It does everything we want.
I don't think so: the union leaves us just as vulnerable to some
subsystem using fields of the other half of the union, doesn't it?
Which is not really a problem, but is a part of what's worrying you.
> Of course, it would be nice to retain 2.95.x support. The reviled
> page_private() would help us do that. But the now-to-be-reviled
> page_mapping() does extraneous stuff, and we'd need a ton of page_lru()'s.
>
> So it'd be a big patch, converting page->lru to page->u.s.lru in lots of
> places.
>
> But I think either a big patch or 2.95.x abandonment is preferable to this
> approach.
Hmm, that's a pity.
Hugh
On Wed, 9 Nov 2005, Andrew Morton wrote:
>
> hm. I thought we were going to address this by
>
> a) checking for an insane number of mappings of the same page in the
> pagefault handler(s) and
>
> b) special-casing ZERO_PAGEs on the page unallocator path.
>
> That'll generate faster and smaller code than adding an extra shift and
> possibly larger constants in lots of places.
I think we all had different ideas of how to deal with it.
When I outlined this method to you, you remarked "Hmm" or something
like that, which I agree didn't indicate great enthusiasm!
I'm quite pleased with the way it's worked out, but you were intending
that the 64-bit arches should get along with 32-bit counts? Maybe.
Hugh
On Thu, 10 Nov 2005, Paul Mackerras wrote:
> Hugh Dickins writes:
>
> > ppc's HIGHPTE config option was removed in 2.5.28, and nobody seems to
> > have wanted it enough to restore it: so remove its traces from pgtable.h
> > and pte_alloc_one. Or supply an alternative patch to config it back?
>
> I'm staggered. We do want to be able to have pte pages in highmem.
> I would rather just have it always enabled if CONFIG_HIGHMEM=y, rather
> than putting the config option back. I think that should just involve
> adding __GFP_HIGHMEM to the flags for alloc_pages in pte_alloc_one
> unconditionally, no?
I'm glad the patch has struck a spark! Yes, there doesn't appear to
be anything else that you'd need to do. Except, why was it disabled
in the first place? I think you'll need to reassure yourselves that
whatever the reason was has gone away, before reenabling.
Andrew, please drop 06/15, I don't think anything after depends on it
(but it came about when I was looking at uninlining pte_offset_map_lock).
Hugh
Hugh Dickins <[email protected]> wrote:
>
> On Wed, 9 Nov 2005, Andrew Morton wrote:
> >
> > It does everything we want.
>
> I don't think so: the union leaves us just as vulnerable to some
> subsystem using fields of the other half of the union, doesn't it?
Yes, spose so.
> Which is not really a problem, but is a part of what's worrying you.
yes, with either approach it remains the case that someone could write some
code which works just fine on their setup but explodes horridly for
someone else who has a different config/arch with a larger spinlock_t.
> > Of course, it would be nice to retain 2.95.x support. The reviled
> > page_private() would help us do that. But the now-to-be-reviled
> > page_mapping() does extraneous stuff, and we'd need a ton of page_lru()'s.
> >
> > So it'd be a big patch, converting page->lru to page->u.s.lru in lots of
> > places.
> >
> > But I think either a big patch or 2.95.x abandonment is preferable to this
> > approach.
>
> Hmm, that's a pity.
Well plan B is to kill spinlock_t.break_lock. That fixes everything and
has obvious beneficial side-effects.
a) x86 spinlock_t is but one byte. Can we stuff break_lock into the
same word?
(Yes, there's also a >128 CPUs spinlock overflow problem to solve,
but perhaps we can use lock;addw?)
b) Redesign the code somehow. Currently break_lock means "there's
someone waiting for this lock".
But if we were to leave the lock in a decremented state while
spinning (as we've always done), that info is still present via the
value of spinlock_t.slock. Hence: bye-bye break_lock.
c) Make the break_lock stuff a new config option,
CONFIG_SUPER_LOW_LATENCY_BLOATS_STRUCT_PAGE.
d) Revert it wholesale, have sucky SMP worst-case latency ;)
Andrew Morton <[email protected]> wrote:
>
> Well plan B is to kill spinlock_t.break_lock.
And plan C is to use a bit_spinlock() against page->flags.
Hugh Dickins <[email protected]> wrote:
>
> ...
>
> I'm quite pleased with the way it's worked out, but you were intending
> that the 64-bit arches should get along with 32-bit counts? Maybe.
That seems reasonable for file pages. For the ZERO_PAGE the count can do
whatever it wants, because we'd never free them up.
* Andrew Morton <[email protected]> wrote:
> Andrew Morton <[email protected]> wrote:
> >
> > Well plan B is to kill spinlock_t.break_lock.
>
> And plan C is to use a bit_spinlock() against page->flags.
please don't. Bit-spinlock is a PITA because it is non-debuggable and
non-preempt-friendly. (In fact in -rt I have completely eliminated
bit_spinlock(), because it's also such a PITA to type-encapsulate
sanely.)
Ingo
* Andrew Morton <[email protected]> wrote:
> > > But I think either a big patch or 2.95.x abandonment is preferable to this
> > > approach.
> >
> > Hmm, that's a pity.
>
> Well plan B is to kill spinlock_t.break_lock. That fixes everything
> and has obvious beneficial side-effects.
I'd really, really love to have this solved without restricting the type
flexibility of spinlocks.
do we really need 2.95.x support? gcc now produces smaller code with -Os.
> a) x86 spinlock_t is but one byte. Can we stuff break_lock into the
> same word?
>
> (Yes, there's also a >128 CPUs spinlock overflow problem to solve,
> but perhaps we can use lock;addw?)
the >128 CPUs spinlock overflow problem is solved by going to 32-bit ops
(a patch has been posted to lkml recently). 16-bit ops are out of the
question. While byte ops are frequently optimized for (avoiding partial
register access related stalls), the same is not true for 16-bit
prefixed instructions! Mixing 32-bit with 16-bit code is going to suck
on a good number of x86 CPUs. It also bloats the instruction size of
spinlocks, due to the 16-bit prefix (while byte access has its own
opcode).
also, there are arches where the only atomic op available is a 32-bit
one. So trying to squeeze the break_lock flag into the spinlock field is,
I think, not robust.
> b) Redesign the code somehow. Currently break_lock means "there's
> someone waiting for this lock".
>
> But if we were to leave the lock in a decremented state while
> spinning (as we've always done), that info is still present via the
> value of spinlock_t.slock. Hence: bye-bye break_lock.
this won't work on arches that don't have atomic-decrement based
spinlocks (some arches don't even have such an instruction). This means
those arches would have to implement an "I'm spinning" flag in the word,
which might or might not work (if the arch doesn't have an atomic
instruction that works on the owner bit only, then it becomes impossible)
- but in any case it would need very fragile per-arch assembly work to
pull off.
> c) Make the break_lock stuff a new config option,
> CONFIG_SUPER_LOW_LATENCY_BLOATS_STRUCT_PAGE.
>
> d) Revert it wholesale, have sucky SMP worst-case latency ;)
yuck. What is the real problem btw? AFAICS there's enough space for a
2-word spinlock in struct page for pagetables. We really don't want to
rewrite spinlocks (or remove features) just to keep gcc 2.95 supported
for some more time. In fact, is there any 2.6-based distro that uses gcc
2.95?
Ingo
Ingo Molnar <[email protected]> wrote:
>
> yuck. What is the real problem btw?
Well. One problem is that spinlocks now take two words...
But apart from that, the problem at hand is that we want to embed a
spinlock in struct page, and the size of the spinlock varies a lot according
to config. The only >wordsize version we really care about is
CONFIG_PREEMPT, NR_CPUS >= 4. (which distros don't ship...)
> AFAICS there's enough space for a
> 2-word spinlock in struct page for pagetables.
spinlocks get a lot bigger than that with CONFIG_DEBUG_SPINLOCK.
> We really don't want to
> rewrite spinlocks (or remove features) just to keep gcc 2.95 supported
> for some more time. In fact, is there any 2.6-based distro that uses gcc
> 2.95?
I think some of the Debian derivatives might. But who knows?
On Thu, 10 Nov 2005, Ingo Molnar wrote:
>
> yuck. What is the real problem btw? AFAICS there's enough space for a
> 2-word spinlock in struct page for pagetables.
Yes. There is no real problem. But my patch offends good taste.
Hugh
Hugh Dickins <[email protected]> wrote:
>
> On Thu, 10 Nov 2005, Ingo Molnar wrote:
> >
> > yuck. What is the real problem btw? AFAICS there's enough space for a
> > 2-word spinlock in struct page for pagetables.
>
> Yes. There is no real problem. But my patch offends good taste.
>
Isn't it going to overrun page.lru with CONFIG_DEBUG_SPINLOCK?
On Thu, 10 Nov 2005, Andrew Morton wrote:
> Hugh Dickins <[email protected]> wrote:
> > On Thu, 10 Nov 2005, Ingo Molnar wrote:
> > >
> > > yuck. What is the real problem btw? AFAICS there's enough space for a
> > > 2-word spinlock in struct page for pagetables.
> >
> > Yes. There is no real problem. But my patch offends good taste.
>
> Isn't it going to overrun page.lru with CONFIG_DEBUG_SPINLOCK?
No. There is just one case where it would,
so in that case split ptlock is disabled by mm/Kconfig's
# PA-RISC 7xxx's debug spinlock_t is too large for 32-bit struct page.
default "4096" if PARISC && DEBUG_SPINLOCK && !PA20
Of course, someone may extend spinlock debugging info tomorrow; but
when they do, presumably they'll try it out, and hit the BUILD_BUG_ON.
They'll then probably want to extend the suppression in mm/Kconfig.
Hugh
On Thursday 10 November 2005 02:56, Hugh Dickins wrote:
> alpha, s390, sparc64, x86_64 are each missing some primitives from their
> atomic64 support: fill in the gaps I've noticed by extrapolating asm,
> follow the groupings in each file, and say "int" for the booleans rather
> than long or long long. But powerpc and parisc still have no atomic64.
x86-64 part looks ok thanks. I assume you will take care of the merge
or do you want me to take that hunk?
-Andi
* Hugh Dickins <[email protected]> wrote:
> On Thu, 10 Nov 2005, Andrew Morton wrote:
> > Hugh Dickins <[email protected]> wrote:
> > > On Thu, 10 Nov 2005, Ingo Molnar wrote:
> > > >
> > > > yuck. What is the real problem btw? AFAICS there's enough space for a
> > > > 2-word spinlock in struct page for pagetables.
> > >
> > > Yes. There is no real problem. But my patch offends good taste.
> >
> > Isn't it going to overrun page.lru with CONFIG_DEBUG_SPINLOCK?
>
> No. There is just one case where it would,
> so in that case split ptlock is disabled by mm/Kconfig's
> # PA-RISC 7xxx's debug spinlock_t is too large for 32-bit struct page.
>
> default "4096" if PARISC && DEBUG_SPINLOCK && !PA20
>
> Of course, someone may extend spinlock debugging info tomorrow; but
> when they do, presumably they'll try it out, and hit the BUILD_BUG_ON.
> They'll then probably want to extend the suppression in mm/Kconfig.
why not do the union thing so that struct page grows automatically as
new fields are added? It is quite bad design to introduce a hard limit
like that. The only sizing concern is to make sure that the common
.configs don't increase the size of struct page, but otherwise why not
allow a larger struct page - it's for debugging only.
Ingo
On Thu, 10 Nov 2005, Andi Kleen wrote:
> On Thursday 10 November 2005 02:56, Hugh Dickins wrote:
> > alpha, s390, sparc64, x86_64 are each missing some primitives from their
> > atomic64 support: fill in the gaps I've noticed by extrapolating asm,
> > follow the groupings in each file, and say "int" for the booleans rather
> > than long or long long. But powerpc and parisc still have no atomic64.
>
> x86-64 part looks ok thanks. I assume you will take care of the merge
> or do you want me to take that hunk?
Thanks for checking, Andi. Don't worry about it for now. That patch
series is currently sitting somewhere well away from Andrew's breakfast.
I'll wait and see where it goes.
Hugh
On Thu, 10 Nov 2005, Ingo Molnar wrote:
> * Hugh Dickins <[email protected]> wrote:
> > Of course, someone may extend spinlock debugging info tomorrow; but
> > when they do, presumably they'll try it out, and hit the BUILD_BUG_ON.
> > They'll then probably want to extend the suppression in mm/Kconfig.
>
> why not do the union thing so that struct page grows automatically as
> new fields are added? It is quite bad design to introduce a hard limit
> like that. The only sizing concern is to make sure that the common
> .configs don't increase the size of struct page, but otherwise why not
> allow a larger struct page - it's for debugging only.
Yes, we wouldn't be worrying much about DEBUG_SPINLOCK enlarging struct
page (unnecessary as that currently is): it's the PREEMPT case adding
break_lock that's of concern (and only on 32-bit, I think: on all the
64-bits we'd have two unsigned ints in unsigned long private).
Going the union way doesn't give any more guarantee that another level
isn't using the fields, than the overlay way I did. Going the union
way, without enlarging the common struct page, seems to involve either
lots of abstraction edits all over the tree (break_lock coinciding with
page->mapping on 32-bit), or ruling out gcc 2.95. Either seems to
me like a lose with no real win.
I'm certainly not arguing against break_lock itself: it's one of the
reasons why I went for a proper spinlock_t there; and agree with you
that trying to squeeze it in with the lock would likely go deep into
architectural complications.
Hugh
Hugh Dickins <[email protected]> wrote:
>
> On Thu, 10 Nov 2005, Andrew Morton wrote:
> > Hugh Dickins <[email protected]> wrote:
> > > On Thu, 10 Nov 2005, Ingo Molnar wrote:
> > > >
> > > > yuck. What is the real problem btw? AFAICS there's enough space for a
> > > > 2-word spinlock in struct page for pagetables.
> > >
> > > Yes. There is no real problem. But my patch offends good taste.
> >
> > Isn't it going to overrun page.lru with CONFIG_DEBUG_SPINLOCK?
>
> No.
!no, methinks.
On 32-bit architectures with CONFIG_PREEMPT, CONFIG_DEBUG_SPINLOCK,
CONFIG_NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS we have a 32-byte page, a
20-byte spinlock and offsetof(page, private) == 12.
IOW we're assuming that no 32-bit architectures will obtain pagetables from
slab?
On Thu, 10 Nov 2005, Andrew Morton wrote:
>
> IOW we're assuming that no 32-bit architectures will obtain pagetables from
> slab?
I thought ARM does?
The ARM page tables are something strange (I think they have 1024-byte
page tables and 4kB pages or something like that?). So they'll not only
obtain the page tables from slab, they have four of them per page.
Linus
On Thu, 10 Nov 2005, Andrew Morton wrote:
> spinlock in struct page, and the size of the spinlock varies a lot according
> to config. The only >wordsize version we really care about is
> CONFIG_PREEMPT, NR_CPUS >= 4. (which distros don't ship...)
Suse, Debian and Redhat ship such kernels.
On Wed, 9 Nov 2005, Andrew Morton wrote:
> > I'm quite pleased with the way it's worked out, but you were intending
> > that the 64-bit arches should get along with 32-bit counts? Maybe.
>
> That seems reasonable for file pages. For the ZERO_PAGE the count can do
> whatever it wants, because we'd never free them up.
Frequent increments and decrements on the zero page count can cause a
bouncing cacheline that may limit performance.
On Thu, Nov 10, 2005 at 01:37:19PM -0800, Christoph Lameter wrote:
> On Thu, 10 Nov 2005, Andrew Morton wrote:
>
> > spinlock in struct page, and the size of the spinlock varies a lot according
> > to config. The only >wordsize version we really care about is
> > CONFIG_PREEMPT, NR_CPUS >= 4. (which distros don't ship...)
>
> Suse, Debian and Redhat ship such kernels.
No. SuSE and Redhat have always been smart enough to avoid CONFIG_PREEMPT
like the plague, and even Debian finally noticed this a few months ago.
Christoph Lameter <[email protected]> wrote:
>
> On Wed, 9 Nov 2005, Andrew Morton wrote:
>
> > > I'm quite pleased with the way it's worked out, but you were intending
> > > that the 64-bit arches should get along with 32-bit counts? Maybe.
> >
> > That seems reasonable for file pages. For the ZERO_PAGE the count can do
> > whatever it wants, because we'd never free them up.
>
> Frequent increments and decrements on the zero page count can cause a
> bouncing cacheline that may limit performance.
I think Hugh did some instrumentation on that and decided that problems
were unlikely?
On Thu, Nov 10, 2005 at 11:56:52AM -0800, Linus Torvalds wrote:
> On Thu, 10 Nov 2005, Andrew Morton wrote:
> >
> > IOW we're assuming that no 32-bit architectures will obtain pagetables from
> > slab?
>
> I thought ARM does?
ARM26 does. On ARM, we play some games to weld two page tables together.
(We also duplicate them to give us a level of independence from the MMU
architecture and to give us space for things like young and dirty bits.)
As far as Linux is concerned, a 2nd level page table is one struct page
and L1 entries are two words in size.
I wrote up some docs on this and threw them into a comment in
include/asm-arm/pgtable.h...
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
* Christoph Hellwig <[email protected]> wrote:
> On Thu, Nov 10, 2005 at 01:37:19PM -0800, Christoph Lameter wrote:
> > On Thu, 10 Nov 2005, Andrew Morton wrote:
> >
> > > spinlock in struct page, and the size of the spinlock varies a lot according
> > > to config. The only >wordsize version we really care about is
> > > CONFIG_PREEMPT, NR_CPUS >= 4. (which distros don't ship...)
> >
> > Suse, Debian and Redhat ship such kernels.
>
> No. SuSE and Redhat have always been smart enough to avoid
> CONFIG_PREEMPT like the plague, and even Debian finally noticed this a
> few months ago.
while you are right about CONFIG_PREEMPT, there is some movement in this
area: CONFIG_PREEMPT_VOLUNTARY is now turned on in Fedora kernels, and
PREEMPT_BKL is turned on for SMP kernels.
Ingo
On Thu, 10 Nov 2005, Andrew Morton wrote:
> Hugh Dickins <[email protected]> wrote:
> > On Thu, 10 Nov 2005, Andrew Morton wrote:
> > > Isn't it going to overrun page.lru with CONFIG_DEBUG_SPINLOCK?
> >
> > No.
>
> !no, methinks.
>
> On 32-bit architectures with CONFIG_PREEMPT, CONFIG_DEBUG_SPINLOCK,
> CONFIG_NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS we have a 32-byte page, a
> 20-byte spinlock and offsetof(page, private) == 12.
>
> IOW we're assuming that no 32-bit architectures will obtain pagetables from
> slab?
Sorry, I misunderstood "overrun". Yes, you're right, the 32-bit
DEBUG_SPINLOCK spinlock_t runs into page.lru but does not run beyond it.
Here's what I already said about that in the comment on the patch:
The previous attempt at this patch broke ppc64, which uses slab for its
page tables - and slab saves vital info in page->lru: but no config of
spinlock_t needs to overwrite lru on 64-bit anyway; and the only 32-bit
user of slab for page tables is arm26, which is never SMP i.e. all the
problems came from the "safety" checks, not from what's actually needed.
So previous checks refined with #ifdefs, and a BUG_ON(PageSlab) added.
Hugh
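For reference, the overlay arithmetic: with the usual 32-bit layout -
flags, _count, _mapcount, private, mapping, index, lru - struct page is
32 bytes and private sits at offset 12, so a 20-byte PREEMPT+DEBUG
spinlock overlaid there ends exactly at byte 32. A standalone sketch,
using fixed-width stand-in fields rather than the real struct page:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* stand-in for the common 32-bit struct page layout */
struct sketch_page {
	uint32_t flags;		/* offset  0 */
	int32_t _count;		/* offset  4 */
	int32_t _mapcount;	/* offset  8 */
	uint32_t private;	/* offset 12 - overlay starts here */
	uint32_t mapping;	/* offset 16 */
	uint32_t index;		/* offset 20 */
	uint32_t lru_next;	/* offset 24 */
	uint32_t lru_prev;	/* offset 28 - struct ends at 32 */
};

int main(void)
{
	/* a 20-byte spinlock_t at offset 12 reaches byte 32:
	 * into lru, but not past the end of struct page */
	printf("private at %zu, overlay ends at %zu, page is %zu bytes\n",
	       offsetof(struct sketch_page, private),
	       offsetof(struct sketch_page, private) + 20,
	       sizeof(struct sketch_page));
	return 0;
}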
On Thu, 10 Nov 2005, Andrew Morton wrote:
> Christoph Lameter <[email protected]> wrote:
> >
> > Frequent increments and decrements on the zero page count can cause a
> > bouncing cacheline that may limit performance.
>
> I think Hugh did some instrumentation on that and decided that problems
> were unlikely?
"Instrumentation" is rather too dignified a word for the counting I added
at one point, to see what proportion of faults were using the zero page,
before running some poor variety of workloads.
It came out, rather as I expected, as not so many as to cause immediate
concern, not so few that the issue could definitely be dismissed. So I
persuaded Nick to separate out the zero page refcounting as a separate
patch, and when it came to submission he was happy enough with his patch
without it, that he didn't feel much like adding it at that stage.
I did try the two SGI tests I'd seen (memscale and pft, the latter I
think my abbreviation for something with a longer name that Christoph pointed
us to, page-fault-tsomething), and they both showed negligible use of
the zero page (i.e. the program startup did a few such faults, nothing
to compare with all the work that the actual testing then did).
I've nothing against zero page refcounting avoidance, just wanted
numbers to show it's more worth doing than not doing. And I'm not
the only one to have wondered, if it is an issue, wouldn't big NUMA
benefit more from per-node zero pages anyway? (Though of course
the pages themselves should stay clean, so won't be bouncing.)
Hugh
On Fri, 11 Nov 2005, Hugh Dickins wrote:
> I've nothing against zero page refcounting avoidance, just wanted
> numbers to show it's more worth doing than not doing. And I'm not
> the only one to have wondered, if it is an issue, wouldn't big NUMA
> benefit more from per-node zero pages anyway? (Though of course
> the pages themselves should stay clean, so won't be bouncing.)
Per-node zero pages sound good. The page structs are placed on the
respective nodes and therefore would not bounce.
On Thu, 2005-11-10 at 11:56 -0800, Linus Torvalds wrote:
>
> On Thu, 10 Nov 2005, Andrew Morton wrote:
> >
> > IOW we're assuming that no 32-bit architectures will obtain pagetables from
> > slab?
>
> I thought ARM does?
>
> The ARM page tables are something strange (I think they have 1024-byte
> page tables and 4kB pages or something like that?). So they'll not only
> obtain the page tables from slab, they have four of them per page.
Just catching up on old mails... ppc64 also gets page tables from kmem
caches (though they are currently page sized & aligned). We really want
to be able to have more freedom and get away from the current idea that
1 PTE page == one page. Our intermediary levels (PMDs, PUDs, PGDs)
already have various random sizes.
Ben.
On Thu, Nov 10, 2005 at 04:26:13AM -0800, Andrew Morton wrote:
> Ingo Molnar <[email protected]> wrote:
>...
> > We really don't want to
> > rewrite spinlocks (or remove features) just to keep gcc 2.95 supported
> > for some more time. In fact, is there any 2.6-based distro that uses gcc
> > 2.95?
>
> I think some of the Debian derivatives might. But who knows?
Debian 3.1 still ships a gcc 2.95, but the default compiler is gcc 3.3.
Even Debian 3.0 (released in July 2002 and not supporting kernel 2.6)
ships a gcc 3.0.4 (although it isn't the default compiler on !hppa).
Considering that kernel 2.6.0 was released two and a half years _after_
gcc 3.0, I heavily doubt that there is any distribution that both
supports kernel 2.6 and does not ship any gcc >= 3.0.
cu
Adrian
Andrew Morton <[email protected]> wrote:
>
> Hugh Dickins <[email protected]> wrote:
> >
> > The split ptlock patch enlarged the default SMP PREEMPT struct page from
> > 32 to 36 bytes on most 32-bit platforms, from 32 to 44 bytes on PA-RISC
> > 7xxx (without PREEMPT). That was not my intention, and I don't believe
> > that split ptlock deserves any such slice of the user's memory.
> >
> > Could we please try this patch, or something like it? Again to overlay
> > the spinlock_t from &page->private onwards, with corrected BUILD_BUG_ON
> > that we don't go too far; with poisoning of the fields overlaid, and
> > unsplit SMP config verifying that the split config is safe to use them.
> >
> > The previous attempt at this patch broke ppc64, which uses slab for its
> > page tables - and slab saves vital info in page->lru: but no config of
> > spinlock_t needs to overwrite lru on 64-bit anyway; and the only 32-bit
> > user of slab for page tables is arm26, which is never SMP i.e. all the
> > problems came from the "safety" checks, not from what's actually needed.
> > So previous checks refined with #ifdefs, and a BUG_ON(PageSlab) added.
> >
> > This overlaying is unlikely to be portable forever: but the added checks
> > should warn developers when it's going to break, long before any users.
>
> argh.
>
> Really, I'd prefer to abandon gcc-2.95.x rather than this. Look:
>
> struct a
> {
> union {
> struct {
> int b;
> int c;
> };
> int d;
> };
> } z;
>
> main()
> {
> z.b = 1;
> z.d = 1;
> }
>
> It does everything we want.
It occurs to me that we can do the above if (__GNUC__ > 2), or whatever.
That way, the only people who have a 4-byte-larger pageframe are those who
use CONFIG_PREEMPT, NR_CPUS>=4 and gcc-2.x.y. An acceptably small
community, I suspect.
On Tue, 15 Nov 2005, Andrew Morton wrote:
>
> It occurs to me that we can do the above if (__GNUC__ > 2), or whatever.
>
> That way, the only people who have a 4-byte-larger pageframe are those who
> use CONFIG_PREEMPT, NR_CPUS>=4 and gcc-2.x.y. An acceptably small
> community, I suspect.
I can't really think about this at the moment (though the PageReserved
fixups are going more smoothly this evening). Acceptably small community,
yes. But wouldn't it plunge us into the very mess of wrappers we were
trying to avoid with anonymous structs/unions, to handle the __GNUC__
differences?
Hugh
Hugh Dickins <[email protected]> wrote:
>
> On Tue, 15 Nov 2005, Andrew Morton wrote:
> >
> > It occurs to me that we can do the above if (__GNUC__ > 2), or whatever.
> >
> > That way, the only people who have a 4-byte-larger pageframe are those who
> > use CONFIG_PREEMPT, NR_CPUS>=4 and gcc-2.x.y. An acceptably small
> > community, I suspect.
>
> I can't really think about this at the moment (though the PageReserved
> fixups are going more smoothly this evening). Acceptably small community,
> yes. But wouldn't it plunge us into the very mess of wrappers we were
> trying to avoid with anonymous structs/unions, to handle the __GNUC__
> differences?
Nope, all the changes would be constrained to the definition of struct
page, and struct page is special.
struct page {
	...
#if __GNUC__ > 2
	union {
		spinlock_t ptl;
		struct {
			unsigned long private;
			struct address_space *mapping;
		};
	};
#else
	union {
		unsigned long private;
		spinlock_t ptl;
	} u;
	struct address_space *mapping;
#endif
and
#if __GNUC__ > 2
#define page_private(page) ((page)->private)
#define set_page_private(page, v) ((page)->private = (v))
#else
#define page_private(page) ((page)->u.private)
#define set_page_private(page, v) ((page)->u.private = (v))
#endif
Of course, adding "u." and "u.s." all over the place would be a sane
solution, but we can do that later - I'm sure we'll be changing struct page
again.
View the above as "a space optimisation made possible by gcc-3.x".