2007-05-10 20:04:18

by Hugh Dickins

Subject: slub-i386-support.patch

Just want to report that I've been running SLUB on i386, both in -mm
and with slub-i386-support.patch applied to 2.6.21-git, and observed
no problems with it. I'm anxious that it (or an equivalent) go into
2.6.22-rc1, i386 being now the last ARCH_USES_SLAB_PAGE_STRUCT holdout.

In the frenzy over PowerPC, maybe Linus overlooked that i386
was still not supporting SLUB; and I fear that once people get
"# CONFIG_SLUB is not set" into their .config, they're less likely to
switch over to testing CONFIG_SLUB=y - I remain anxious that it see
as much testing as possible (but under EXPERIMENTAL for 2.6.22, yes).

Though when I look at the patchset (copied below), I do wonder why
it puts a quicklist_trim() into i386's cpu_idle() and flush_tlb_mm():
neither is where I'd expect us to be secretly freeing pages. Ah,
several arches do it in cpu_idle(): how odd, oh well.

Hugh

> From: Christoph Lameter <[email protected]>
>
> SLUB cannot run on i386 at this point because i386 uses the page->private and
> page->index field of slab pages for the pgd cache.
>
> Make SLUB run on i386 by replacing the pgd slab cache with a quicklist.
> Limit the changes as much as possible. Leave the improvised linked list in place
> etc etc. This has been working here for a couple of weeks now.
>
> Acked-by: William Lee Irwin III <[email protected]>
> Signed-off-by: Christoph Lameter <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> arch/i386/Kconfig | 8 ++++----
> arch/i386/kernel/process.c | 1 +
> arch/i386/kernel/smp.c | 2 +-
> arch/i386/mm/init.c | 7 -------
> arch/i386/mm/pgtable.c | 26 +++++++++++++++++---------
> include/asm-i386/pgalloc.h | 2 --
> include/asm-i386/pgtable.h | 5 ++---
> 7 files changed, 25 insertions(+), 26 deletions(-)
>
> diff -puN arch/i386/Kconfig~slub-i386-support arch/i386/Kconfig
> --- a/arch/i386/Kconfig~slub-i386-support
> +++ a/arch/i386/Kconfig
> @@ -55,6 +55,10 @@ config ZONE_DMA
> bool
> default y
>
> +config QUICKLIST
> + bool
> + default y
> +
> config SBUS
> bool
>
> @@ -79,10 +83,6 @@ config ARCH_MAY_HAVE_PC_FDC
> bool
> default y
>
> -config ARCH_USES_SLAB_PAGE_STRUCT
> - bool
> - default y
> -
> config DMI
> bool
> default y
> diff -puN arch/i386/kernel/process.c~slub-i386-support arch/i386/kernel/process.c
> --- a/arch/i386/kernel/process.c~slub-i386-support
> +++ a/arch/i386/kernel/process.c
> @@ -187,6 +187,7 @@ void cpu_idle(void)
> if (__get_cpu_var(cpu_idle_state))
> __get_cpu_var(cpu_idle_state) = 0;
>
> + check_pgt_cache();
> rmb();
> idle = pm_idle;
>
> diff -puN arch/i386/kernel/smp.c~slub-i386-support arch/i386/kernel/smp.c
> --- a/arch/i386/kernel/smp.c~slub-i386-support
> +++ a/arch/i386/kernel/smp.c
> @@ -422,7 +422,7 @@ void flush_tlb_mm (struct mm_struct * mm
> }
> if (!cpus_empty(cpu_mask))
> flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
> -
> + check_pgt_cache();
> preempt_enable();
> }
>
> diff -puN arch/i386/mm/init.c~slub-i386-support arch/i386/mm/init.c
> --- a/arch/i386/mm/init.c~slub-i386-support
> +++ a/arch/i386/mm/init.c
> @@ -752,7 +752,6 @@ int remove_memory(u64 start, u64 size)
> EXPORT_SYMBOL_GPL(remove_memory);
> #endif
>
> -struct kmem_cache *pgd_cache;
> struct kmem_cache *pmd_cache;
>
> void __init pgtable_cache_init(void)
> @@ -776,12 +775,6 @@ void __init pgtable_cache_init(void)
> pgd_size = PAGE_SIZE;
> }
> }
> - pgd_cache = kmem_cache_create("pgd",
> - pgd_size,
> - pgd_size,
> - SLAB_PANIC,
> - pgd_ctor,
> - (!SHARED_KERNEL_PMD) ? pgd_dtor : NULL);
> }
>
> /*
> diff -puN arch/i386/mm/pgtable.c~slub-i386-support arch/i386/mm/pgtable.c
> --- a/arch/i386/mm/pgtable.c~slub-i386-support
> +++ a/arch/i386/mm/pgtable.c
> @@ -13,6 +13,7 @@
> #include <linux/pagemap.h>
> #include <linux/spinlock.h>
> #include <linux/module.h>
> +#include <linux/quicklist.h>
>
> #include <asm/system.h>
> #include <asm/pgtable.h>
> @@ -205,8 +206,6 @@ void pmd_ctor(void *pmd, struct kmem_cac
> * against pageattr.c; it is the unique case in which a valid change
> * of kernel pagetables can't be lazily synchronized by vmalloc faults.
> * vmalloc faults work because attached pagetables are never freed.
> - * The locking scheme was chosen on the basis of manfred's
> - * recommendations and having no core impact whatsoever.
> * -- wli
> */
> DEFINE_SPINLOCK(pgd_lock);
> @@ -232,9 +231,11 @@ static inline void pgd_list_del(pgd_t *p
> set_page_private(next, (unsigned long)pprev);
> }
>
> +
> +
> #if (PTRS_PER_PMD == 1)
> /* Non-PAE pgd constructor */
> -void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
> +void pgd_ctor(void *pgd)
> {
> unsigned long flags;
>
> @@ -256,7 +257,7 @@ void pgd_ctor(void *pgd, struct kmem_cac
> }
> #else /* PTRS_PER_PMD > 1 */
> /* PAE pgd constructor */
> -void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
> +void pgd_ctor(void *pgd)
> {
> /* PAE, kernel PMD may be shared */
>
> @@ -275,11 +276,12 @@ void pgd_ctor(void *pgd, struct kmem_cac
> }
> #endif /* PTRS_PER_PMD */
>
> -void pgd_dtor(void *pgd, struct kmem_cache *cache, unsigned long unused)
> +void pgd_dtor(void *pgd)
> {
> unsigned long flags; /* can be called from interrupt context */
>
> - BUG_ON(SHARED_KERNEL_PMD);
> + if (SHARED_KERNEL_PMD)
> + return;
>
> paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
> spin_lock_irqsave(&pgd_lock, flags);
> @@ -321,7 +323,7 @@ static void pmd_cache_free(pmd_t *pmd, i
> pgd_t *pgd_alloc(struct mm_struct *mm)
> {
> int i;
> - pgd_t *pgd = kmem_cache_alloc(pgd_cache, GFP_KERNEL);
> + pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);
>
> if (PTRS_PER_PMD == 1 || !pgd)
> return pgd;
> @@ -344,7 +346,7 @@ out_oom:
> paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
> pmd_cache_free(pmd, i);
> }
> - kmem_cache_free(pgd_cache, pgd);
> + quicklist_free(0, pgd_dtor, pgd);
> return NULL;
> }
>
> @@ -361,5 +363,11 @@ void pgd_free(pgd_t *pgd)
> pmd_cache_free(pmd, i);
> }
> /* in the non-PAE case, free_pgtables() clears user pgd entries */
> - kmem_cache_free(pgd_cache, pgd);
> + quicklist_free(0, pgd_dtor, pgd);
> }
> +
> +void check_pgt_cache(void)
> +{
> + quicklist_trim(0, pgd_dtor, 25, 16);
> +}
> +
> diff -puN include/asm-i386/pgalloc.h~slub-i386-support include/asm-i386/pgalloc.h
> --- a/include/asm-i386/pgalloc.h~slub-i386-support
> +++ a/include/asm-i386/pgalloc.h
> @@ -65,6 +65,4 @@ do { \
> #define pud_populate(mm, pmd, pte) BUG()
> #endif
>
> -#define check_pgt_cache() do { } while (0)
> -
> #endif /* _I386_PGALLOC_H */
> diff -puN include/asm-i386/pgtable.h~slub-i386-support include/asm-i386/pgtable.h
> --- a/include/asm-i386/pgtable.h~slub-i386-support
> +++ a/include/asm-i386/pgtable.h
> @@ -35,17 +35,16 @@ struct vm_area_struct;
> #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
> extern unsigned long empty_zero_page[1024];
> extern pgd_t swapper_pg_dir[1024];
> -extern struct kmem_cache *pgd_cache;
> extern struct kmem_cache *pmd_cache;
> extern spinlock_t pgd_lock;
> extern struct page *pgd_list;
> +void check_pgt_cache(void);
>
> void pmd_ctor(void *, struct kmem_cache *, unsigned long);
> -void pgd_ctor(void *, struct kmem_cache *, unsigned long);
> -void pgd_dtor(void *, struct kmem_cache *, unsigned long);
> void pgtable_cache_init(void);
> void paging_init(void);
>
> +
> /*
> * The Linux x86 paging architecture is 'compile-time dual-mode', it
> * implements both the traditional 2-level x86 page tables and the
> _


2007-05-10 20:17:51

by Andrew Morton

Subject: Re: slub-i386-support.patch

On Thu, 10 May 2007 21:03:39 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> Just want to report that I've been running SLUB on i386, both in -mm
> and with slub-i386-support.patch applied to 2.6.21-git, and observed
> no problems with it. I'm anxious that it (or an equivalent) go into
> 2.6.22-rc1, i386 being now the last ARCH_USES_SLAB_PAGE_STRUCT holdout.
>
> In the frenzy over PowerPC, maybe Linus overlooked that i386
> was still not supporting SLUB; and I fear that once people get
> "# CONFIG_SLUB is not set" into their .config, they're less likely to
> switch over to testing CONFIG_SLUB=y - I remain anxious that it see
> as much testing as possible (but under EXPERIMENTAL for 2.6.22, yes).

ok..

> Though when I look at the patchset (copied below), I do wonder why
> it puts a quicklist_trim() into i386's cpu_idle() and flush_tlb_mm():
> neither is where I'd expect us to be secretly freeing pages. Ah,
> several arches do it in cpu_idle(): how odd, oh well.

Christoph, could you please check that this is justified?

2007-05-10 20:30:31

by William Lee Irwin III

Subject: Re: slub-i386-support.patch

On Thu, May 10, 2007 at 09:03:39PM +0100, Hugh Dickins wrote:
> Just want to report that I've been running SLUB on i386, both in -mm
> and with slub-i386-support.patch applied to 2.6.21-git, and observed
> no problems with it. I'm anxious that it (or an equivalent) go into
> 2.6.22-rc1, i386 being now the last ARCH_USES_SLAB_PAGE_STRUCT holdout.
> In the frenzy over PowerPC, maybe Linus overlooked that i386
> was still not supporting SLUB; and I fear that once people get
> "# CONFIG_SLUB is not set" into their .config, they're less likely to
> switch over to testing CONFIG_SLUB=y - I remain anxious that it see
> as much testing as possible (but under EXPERIMENTAL for 2.6.22, yes).
> Though when I look at the patchset (copied below), I do wonder why
> it puts a quicklist_trim() into i386's cpu_idle() and flush_tlb_mm():
> neither is where I'd expect us to be secretly freeing pages. Ah,
> several arches do it in cpu_idle(): how odd, oh well.

I need to fix this up a bit. I'll point out the issues in the sequel.

So now quicklist semantics vs. TLB flushing are the motive behind the
odd flush_tlb_mm() affair. The real trick with it is that flushing
must never occur until the TLB flush. Any change to the core quicklist
code that retires pages back to the page allocator earlier (e.g. based
on some limit) will break things badly.


On Thu, May 10, 2007 at 09:03:39PM +0100, Hugh Dickins wrote:
>> -struct kmem_cache *pgd_cache;
>> struct kmem_cache *pmd_cache;
>>
>> void __init pgtable_cache_init(void)
>> @@ -776,12 +775,6 @@ void __init pgtable_cache_init(void)
>> pgd_size = PAGE_SIZE;
>> }
>> }
>> - pgd_cache = kmem_cache_create("pgd",
>> - pgd_size,
>> - pgd_size,
>> - SLAB_PANIC,
>> - pgd_ctor,
>> - (!SHARED_KERNEL_PMD) ? pgd_dtor : NULL);
>> }

This is wrong; pgd's are smaller than PAGE_SIZE on PAE. Burning lowmem
like this is very, very bad for such systems. pmd_cache is rather
trivial to convert to quicklists, since all it does is zero pages. I
still don't approve of even the !SHARED_KERNEL_PMD case using PAGE_SIZE
-sized pgd's. Xen should really be fixed to avoid requiring guests to
have recursive pagetables or whatever it's doing.


>> @@ -205,8 +206,6 @@ void pmd_ctor(void *pmd, struct kmem_cac
>> * against pageattr.c; it is the unique case in which a valid change
>> * of kernel pagetables can't be lazily synchronized by vmalloc faults.
>> * vmalloc faults work because attached pagetables are never freed.
>> - * The locking scheme was chosen on the basis of manfred's
>> - * recommendations and having no core impact whatsoever.
>> * -- wli
>> */
>> DEFINE_SPINLOCK(pgd_lock);

This comment deletion is bogus, as the locking scheme has not changed.


-- wli

2007-05-10 20:31:31

by Christoph Lameter

Subject: Re: slub-i386-support.patch

On Thu, 10 May 2007, Andrew Morton wrote:

> > Though when I look at the patchset (copied below), I do wonder why
> > it puts a quicklist_trim() into i386's cpu_idle() and flush_tlb_mm():
> > neither is where I'd expect us to be secretly freeing pages. Ah,
> > several arches do it in cpu_idle(): how odd, oh well.
>
> Christoph, could you please check that this is justified?

This has been that way for years for ia64 and sparc64. quicklist page
freeing needs to be synced with the tlb flushing since page table pages
may be flushed with the pages gathered in the tlb_mm struct. Dave Miller
set this up initially as far as I can tell. He could probably give more
details on why this was done historically.

ccing Dave.

2007-05-10 20:35:33

by Christoph Lameter

Subject: Re: slub-i386-support.patch

On Thu, 10 May 2007, William Lee Irwin III wrote:

> So now quicklist semantics vs. TLB flushing are the motive behind the
> odd flush_tlb_mm() affair. The real trick with it is that flushing
> must never occur until the TLB flush. Any change to the core quicklist
> code that retires pages back to the page allocator earlier (e.g. based
> on some limit) will break things badly.

Right.

> On Thu, May 10, 2007 at 09:03:39PM +0100, Hugh Dickins wrote:
> >> -struct kmem_cache *pgd_cache;
> >> struct kmem_cache *pmd_cache;
> >>
> >> void __init pgtable_cache_init(void)
> >> @@ -776,12 +775,6 @@ void __init pgtable_cache_init(void)
> >> pgd_size = PAGE_SIZE;
> >> }
> >> }
> >> - pgd_cache = kmem_cache_create("pgd",
> >> - pgd_size,
> >> - pgd_size,
> >> - SLAB_PANIC,
> >> - pgd_ctor,
> >> - (!SHARED_KERNEL_PMD) ? pgd_dtor : NULL);
> >> }
>
> This is wrong; pgd's are smaller than PAGE_SIZE on PAE. Burning lowmem
> like this is very, very bad for such systems. pmd_cache is rather
> trivial to convert to quicklists, since all it does is zero pages. I
> still don't approve of even the !SHARED_KERNEL_PMD case using PAGE_SIZE
> -sized pgd's. Xen should really be fixed to avoid requiring guests to
> have recursive pagetables or whatever it's doing.

But the Xen guys need the full sized pgd?

>
> >> @@ -205,8 +206,6 @@ void pmd_ctor(void *pmd, struct kmem_cac
> >> * against pageattr.c; it is the unique case in which a valid change
> >> * of kernel pagetables can't be lazily synchronized by vmalloc faults.
> >> * vmalloc faults work because attached pagetables are never freed.
> >> - * The locking scheme was chosen on the basis of manfred's
> >> - * recommendations and having no core impact whatsoever.
> >> * -- wli
> >> */
> >> DEFINE_SPINLOCK(pgd_lock);
>
> This comment deletion is bogus, as the locking scheme has not changed.

The interaction with SLAB has changed with the conversion to quicklists.
AFAIK Manfred's suggestions were due to the use of the slab allocator to manage
pgds.

2007-05-10 20:50:37

by David Miller

Subject: Re: slub-i386-support.patch

From: Christoph Lameter <[email protected]>
Date: Thu, 10 May 2007 13:31:21 -0700 (PDT)

> On Thu, 10 May 2007, Andrew Morton wrote:
>
> > > Though when I look at the patchset (copied below), I do wonder why
> > > it puts a quicklist_trim() into i386's cpu_idle() and flush_tlb_mm():
> > > neither is where I'd expect us to be secretly freeing pages. Ah,
> > > several arches do it in cpu_idle(): how odd, oh well.
> >
> > Christoph, could you please check that this is justified?
>
> This has been that way for years for ia64 and sparc64. quicklist page
> freeing needs to be synced with the tlb flushing since page table pages
> may be flushed with the pages gathered in the tlb_mm struct. Dave Miller
> set this up initially as far as I can tell. He could probably give more
> details on why this was done historically.
>
> ccing Dave.

I think it was just a convenient place to get it trimmed.

There could be some implications for platforms using virtual
page tables like IA64 et al. (sparc64 is no longer using
virtual page table mappings for TLB misses although it used to).

But I can't figure out a reason in that area for this stuff
either.

2007-05-10 21:09:11

by William Lee Irwin III

Subject: Re: slub-i386-support.patch

On Thu, 10 May 2007, William Lee Irwin III wrote:
>> This is wrong; pgd's are smaller than PAGE_SIZE on PAE. Burning lowmem
>> like this is very, very bad for such systems. pmd_cache is rather
>> trivial to convert to quicklists, since all it does is zero pages. I
>> still don't approve of even the !SHARED_KERNEL_PMD case using PAGE_SIZE
>> -sized pgd's. Xen should really be fixed to avoid requiring guests to
>> have recursive pagetables or whatever it's doing.

On Thu, May 10, 2007 at 01:35:25PM -0700, Christoph Lameter wrote:
> But the Xen guys need the full sized pgd?

Xen is not mandatory as it now stands. Also, I intend to fix up Xen
at some point so it doesn't need this.


On Thu, 10 May 2007, William Lee Irwin III wrote:
>> This comment deletion is bogus, as the locking scheme has not changed.

On Thu, May 10, 2007 at 01:35:25PM -0700, Christoph Lameter wrote:
> The interaction with SLAB has changed with the conversion to
> quicklists. AFAIK Manfred's suggestions were due to the use of the
> slab allocator to manage pgds.

The suggestion was to keep cached preconstructed pgd's on a list for
pageattr.c to traverse instead of the mmlist. The quicklist code
retains that list (in fact, even for the PAE shared kernel pmd case
where it's not necessary) despite it being a different allocator, so
manfred's suggestion remains implemented.

The alternative was 64-bit generation numbers incremented at the time
of change_page_attr(). If generation numbers were used, it would be
possible to dispose of the list altogether. Given the awkwardness of
the list maintenance for Xen, it may be worth using them now. PAE
pgd's could merely double in size to maintain those for the unshared
kernel pmd case, and remain 32B otherwise. Full PAGE_SIZE -sized pgd's
for 2-level pagetables could distribute the generation number across
page->index and page->private, or any other fields available.


-- wli

2007-05-10 21:22:31

by Jeremy Fitzhardinge

Subject: Re: slub-i386-support.patch

Christoph Lameter wrote:
>> This is wrong; pgd's are smaller than PAGE_SIZE on PAE. Burning lowmem
>> like this is very, very bad for such systems. pmd_cache is rather
>> trivial to convert to quicklists, since all it does is zero pages. I
>> still don't approve of even the !SHARED_KERNEL_PMD case using PAGE_SIZE
>> -sized pgd's. Xen should really be fixed to avoid requiring guests to
>> have recursive pagetables or whatever it's doing.
>>
>
> But the Xen guys need the full sized pgd?
>

Yes, Xen needs a page-sized PGD. There are two reasons:

1. Xen inherently needs a whole page for the pgd, because it
classifies pages by type, and the pgd is a pagetable-typed page.
2. When the kernel mappings are not being shared between processes,
as in non-PAE or PAE with !SHARED_KERNEL_PMD, it maintains a
pgd_list linked through the page structure's index field. For
this to work, you can't have more than one pgd per page. The
pgd_list is needed to sync the vmalloc mappings
across all the pagetables.

Xen needs to set !SHARED_KERNEL_PMD for PAE when running a 32-bit kernel
under a 32-bit hypervisor; when running under a 64-bit hypervisor,
there's no need to steal any guest address space for the hypervisor. (I
haven't implemented this on the Linux side yet.)

J

2007-05-10 21:28:20

by Jeremy Fitzhardinge

Subject: Re: slub-i386-support.patch

William Lee Irwin III wrote:
> On Thu, 10 May 2007, William Lee Irwin III wrote:
>
>>> This is wrong; pgd's are smaller than PAGE_SIZE on PAE. Burning lowmem
>>> like this is very, very bad for such systems. pmd_cache is rather
>>> trivial to convert to quicklists, since all it does is zero pages. I
>>> still don't approve of even the !SHARED_KERNEL_PMD case using PAGE_SIZE
>>> -sized pgd's. Xen should really be fixed to avoid requiring guests to
>>> have recursive pagetables or whatever it's doing.
>>>
>
> On Thu, May 10, 2007 at 01:35:25PM -0700, Christoph Lameter wrote:
>
>> But the Xen guys need the full sized pgd?
>>
>
> Xen is not mandatory as it now stands.
? I'm hoping to merge the Xen code in the next couple of days, so I'd
appreciate it if we don't break the foundations just before building the
building.

> Also, I intend to fix up Xen
> at some point so it doesn't need this.
>

As I mentioned in the previous mail, it's only really necessary for a
32-bit guest under a 32-bit hypervisor. While that's going to be a
supported configuration for a long time, we expect that people will
increasingly use 64-bit hypervisors on new machines, so this will become
less of an issue.

We're also looking at shadowing the 4 top-level PAE entries rather than
using them directly, since the shadows only need to be updated when
reloading cr3. This would allow us to use compact pgds, so long as
there's some other way to maintain the pgd list (ideally, something that
can be shared with non-PAE).

Or did I miss something? Is pgd_list being maintained some other way
with slub/quicklists?

> The alternative was 64-bit generation numbers incremented at the time
> of change_page_attr(). If generation numbers were used, it would be
> possible to dispose of the list altogether. Given the awkwardness of
> the list maintenance for Xen, it may be worth using them now. PAE
> pgd's could merely double in size to maintain those for the unshared
> kernel pmd case, and remain 32B otherwise. Full PAGE_SIZE -sized pgd's
> for 2-level pagetables could distribute the generation number across
> page->index and page->private, or any other fields available.
>

If you use page->index for that, how does pgd_list get linked together
for vmalloc syncing?

J

2007-05-10 23:14:45

by Hugh Dickins

Subject: Re: slub-i386-support.patch

On Thu, 10 May 2007, William Lee Irwin III wrote:
> On Thu, May 10, 2007 at 09:03:39PM +0100, Hugh Dickins wrote:
> > Though when I look at the patchset (copied below), I do wonder why
> > it puts a quicklist_trim() into i386's cpu_idle() and flush_tlb_mm():
> > neither is where I'd expect us to be secretly freeing pages. Ah,
> > several arches do it in cpu_idle(): how odd, oh well.
>
> So now quicklist semantics vs. TLB flushing are the motive behind the
> odd flush_tlb_mm() affair. The real trick with it is that flushing
> must never occur until the TLB flush. Any change to the core quicklist
> code that retires pages back to the page allocator earlier (e.g. based
> on some limit) will break things badly.

I don't think that's right. It's vital that TLB (of an active mm)
be flushed before freeing its page back to the quicklist, before it's
recycled to another mm (or elsewhere in this mm); but having done that,
it really doesn't matter much when quicklist_trim() (check_pgt_cache)
is called to free surplus pages from quicklist back to page_alloc.c.

tlb_finish_mmu() happens to be the traditional place it's done, and
that's where we expect it. flush_tlb_mm() avoids flushing TLB unless
it's actually required for the mm in question: so wouldn't be a good
place to rely on flushing TLB for pages freed earlier from other mms
(but we'd already be in trouble to be leaving them that late).

I'm guessing (haven't rechecked source) that the cpu_idle() call comes
about because the top level pgd of a process gets freed very late in
its exit, and after a great flurry of processes have just exited,
perhaps there was nothing to free up the accumulation. Though
it still strikes me as an odd place to do it.

Hugh

2007-05-10 23:34:42

by William Lee Irwin III

Subject: Re: slub-i386-support.patch

William Lee Irwin III wrote:
>> Xen is not mandatory as it now stands.

On Thu, May 10, 2007 at 02:28:05PM -0700, Jeremy Fitzhardinge wrote:
> ? I'm hoping to merge the Xen code in the next couple of days, so I'd
> appreciate it if we don't break the foundations just before building the
> building.

CONFIG_X86_PAE without CONFIG_PARAVIRT is the case in question here.
What's done in that case can't break Xen because it doesn't run under
Xen.


William Lee Irwin III wrote:
>> Also, I intend to fix up Xen
>> at some point so it doesn't need this.

On Thu, May 10, 2007 at 02:28:05PM -0700, Jeremy Fitzhardinge wrote:
> As I mentioned in the previous mail, it's only really necessary for a
> 32-bit guest under a 32-bit hypervisor. While that's going to be a
> supported configuration for a long time, we expect that people will
> increasingly use 64-bit hypervisors on new machines, so this will become
> less of an issue.
> We're also looking at shadowing the 4 top-level PAE entries rather than
> using them directly, since the shadows only need to be updated when
> reloading cr3. This would allow us to use compact pgds, so long as
> there's some other way to maintain the pgd list (ideally, something that
> can be shared with non-PAE).

ISTR you describing this method earlier. This is what I had in mind
for fixing up Xen not to need full PAGE_SIZE-sized pgd's.


On Thu, May 10, 2007 at 02:28:05PM -0700, Jeremy Fitzhardinge wrote:
> Or did I miss something? Is pgd_list being maintained some other way
> with slub/quicklists?

No, it's identical. clameter's code makes PAGE_SIZE-sized pgd's
unconditional for CONFIG_X86_PAE, which is what bothered me.


William Lee Irwin III wrote:
>> The alternative was 64-bit generation numbers incremented at the time
>> of change_page_attr(). If generation numbers were used, it would be
>> possible to dispose of the list altogether. Given the awkwardness of
>> the list maintenance for Xen, it may be worth using them now. PAE
>> pgd's could merely double in size to maintain those for the unshared
>> kernel pmd case, and remain 32B otherwise. Full PAGE_SIZE -sized pgd's
>> for 2-level pagetables could distribute the generation number across
>> page->index and page->private, or any other fields available.

On Thu, May 10, 2007 at 02:28:05PM -0700, Jeremy Fitzhardinge wrote:
> If you use page->index for that, how does pgd_list get linked together
> for vmalloc syncing?

It doesn't need to be linked together for vmalloc_sync(). Just increment
the generation number and walk the mmlist the same as for pageattr.c.


-- wli

2007-05-11 00:06:32

by William Lee Irwin III

Subject: Re: slub-i386-support.patch

On Thu, 10 May 2007, William Lee Irwin III wrote:
>> So now quicklist semantics vs. TLB flushing are the motive behind the
>> odd flush_tlb_mm() affair. The real trick with it is that flushing
>> must never occur until the TLB flush. Any change to the core quicklist
>> code that retires pages back to the page allocator earlier (e.g. based
>> on some limit) will break things badly.

On Fri, May 11, 2007 at 12:14:14AM +0100, Hugh Dickins wrote:
> I don't think that's right. It's vital that TLB (of an active mm)
> be flushed before freeing its page back to the quicklist, before it's
> recycled to another mm (or elsewhere in this mm); but having done that,
> it really doesn't matter much when quicklist_trim() (check_pgt_cache)
> is called to free surplus pages from quicklist back to page_alloc.c.

What I was really going on about was that quicklist freeing can't
enforce any high watermarks in the future because it must wait until
the TLB flush unless it's guaranteed that TLB flushes are done prior
to quicklist freeing (which is furthermore required for other reasons,
to be described in the sequel).


On Fri, May 11, 2007 at 12:14:14AM +0100, Hugh Dickins wrote:
> tlb_finish_mmu() happens to be the traditional place it's done, and
> that's where we expect it. flush_tlb_mm() avoids flushing TLB unless
> it's actually required for the mm in question: so wouldn't be a good
> place to rely on flushing TLB for pages freed earlier from other mms
> (but we'd already be in trouble to be leaving them that late).
> I'm guessing (haven't rechecked source) that the cpu_idle() call comes
> about because the top level pgd of a process gets freed very late in
> its exit, and after a great flurry of processes have just exited,
> perhaps there was nothing to free up the accumulation. Though
> it still strikes me as an odd place to do it.

I described it as motivated by such, not really correctly handling it.
I didn't bother analyzing it for correctness. I'm not surprised at all
that the TLB flush can be missed where it now stands in the patch. I
wanted to move it to tlb_finish_mmu() all along, along with quicklist
management of lower levels of hierarchy.

quicklist_free() with unflushed TLB entries admits speculation through
the pagetable entries corresponding to the list links. So tlb_finish_mmu()
is the place to call quicklist_free() on pagetables. This requires
distinguishing preconstructed pagetables from freed user pages, which
is not done in include/asm-generic/tlb.h (and core callers may need
to be adjusted, pending the results of audits).

To clarify, upper levels of pagetables are indeed cached by x86 TLB's.
The same kind of deferral of freeing until the TLB is flushed required
for leaf pagetables is required for the upper levels as well.


-- wli

2007-05-11 01:08:16

by William Lee Irwin III

Subject: Re: slub-i386-support.patch

On Thu, May 10, 2007 at 05:07:02PM -0700, William Lee Irwin III wrote:
> quicklist_free() with unflushed TLB entries admits speculation through
> the pagetable entries corresponding to the list links. So tlb_finish_mmu()
> is the place to call quicklist_free() on pagetables. This requires
> distinguishing preconstructed pagetables from freed user pages, which
> is not done in include/asm-generic/tlb.h (and core callers may need
> to be adjusted, pending the results of audits).
> To clarify, upper levels of pagetables are indeed cached by x86 TLB's.
> The same kind of deferral of freeing until the TLB is flushed required
> for leaf pagetables is required for the upper levels as well.

Looking more closely at it, the entire attempt to avoid struct page
pointers is far beyond pointless. The freeing functions unconditionally
require struct page pointers to either be passed or computed and the
allocation function's virtual address it returns as a result is not
directly usable. The callers all have to do arithmetic on the result.
One might as well stash precomputed pfn's (if not paddrs) and vaddrs in
page->private and page->mapping, chain them with ->lru (use only .next
if you care to stay singly-linked), and handle struct page pointers
throughout. At that point quicklists not only become directly callable
for pagetable freeing (including upper levels) instead of needing calls
to quicklist freeing staged to occur at the time of tlb_finish_mmu(),
but also become usable for the highpte case.

The computations this is trying to save on are computing the virtual
and physical addresses (pfn's modulo a cheap shift; besides, all the
API's work on pfn's) of a page from the pointer to the struct page.
Chaining through the memory for the page incurs the cost of having to
stage freeing through tlb_finish_mmu() instead of using the quicklist
as a staging arena directly. So the translation from a struct page
pointer is not saving work. It's not saving cache, either. The page's
memory is no more likely to be hot than its struct page.

In the course of freeing the pointer to the struct page is computed
whether by the caller or the API function. So the translation to a
struct page pointer is done during freeing regardless.

A better solution would be to precompute those results and store
them in various fields of the struct page. i386 can move to using
generation numbers (->_mapcount and ->index are still available
for 64 bits there even after quicklists use ->lru, ->mapping, and
->private, and quicklists really only need half of ->lru) to handle
change_page_attr() and vmalloc_sync().


-- wli

2007-05-11 01:41:39

by William Lee Irwin III

Subject: Re: slub-i386-support.patch

On Thu, May 10, 2007 at 05:07:02PM -0700, William Lee Irwin III wrote:
> I described it as motivated by such, not really correctly handling it.
> I didn't bother analyzing it for correctness. I'm not surprised at all
> that the TLB flush can be missed where it now stands in the patch. I
> wanted to move it to tlb_finish_mmu() all along, along with quicklist
> management of lower levels of hierarchy.
> quicklist_free() with unflushed TLB entries admits speculation through
> the pagetable entries corresponding to the list links. So tlb_finish_mmu()
> is the place to call quicklist_free() on pagetables. This requires
> distinguishing preconstructed pagetables from freed user pages, which
> is not done in include/asm-generic/tlb.h (and core callers may need
> to be adjusted, pending the results of audits).
> To clarify, upper levels of pagetables are indeed cached by x86 TLB's.
> The same kind of deferral of freeing until the TLB is flushed required
> for leaf pagetables is required for the upper levels as well.

Never mind. The present bit ends up unset because all the vaddrs are
page-aligned, and PDPTE entries (which lack present bits) aren't ever
internally updated until explicit reloads. I'm still not wild about it,
but can't be arsed to deal with it unless it actually breaks.


-- wli

2007-05-11 05:09:41

by Christoph Lameter

Subject: Re: slub-i386-support.patch

On Thu, 10 May 2007, William Lee Irwin III wrote:

> Looking more closely at it, the entire attempt to avoid struct page
> pointers is far beyond pointless. The freeing functions unconditionally
> require struct page pointers to either be passed or computed and the
> allocation function's virtual address it returns as a result is not
> directly usable. The callers all have to do arithmetic on the result.
> One might as well stash precomputed pfn's (if not paddrs) and vaddrs in
> page->private and page->mapping, chain them with ->lru (use only .next
> if you care to stay singly-linked), and handle struct page pointers

Well, then you'd have to rewrite the existing ways of fiddling with page
structs. This way everything is clear and you can fiddle as you want. It
just works. Could we get this in? You acked it once already.

2007-05-11 07:32:26

by Andi Kleen

Subject: Re: slub-i386-support.patch

> I'm guessing (haven't rechecked source) that the cpu_idle() call comes
> about because the top level pgd of a process gets freed very late in
> its exit, and after a great flurry of processes have just exited,
> perhaps there was nothing to free up the accumulation. Though
> it still strikes me as an odd place to do it.

I always found it odd and probably the wrong place too.
-Andi

2007-05-11 07:42:20

by Andrew Morton

Subject: Re: slub-i386-support.patch

On Fri, 11 May 2007 10:29:30 +0200 Andi Kleen <[email protected]> wrote:

> > I'm guessing (haven't rechecked source) that the cpu_idle() call comes
> > about because the top level pgd of a process gets freed very late in
> > its exit, and after a great flurry of processes have just exited,
> > perhaps there was nothing to free up the accumulation. Though
> > it still strikes me as an odd place to do it.
>
> I always found it odd and probably the wrong place too.

so... what's the bottom line here, guys? Should we change that patch?

2007-05-11 07:43:08

by William Lee Irwin III

Subject: Re: slub-i386-support.patch

On Thu, 10 May 2007, William Lee Irwin III wrote:
>> Looking more closely at it, the entire attempt to avoid struct page
>> pointers is far beyond pointless. The freeing functions unconditionally
>> require struct page pointers to either be passed or computed and the
>> allocation function's virtual address it returns as a result is not
>> directly usable. The callers all have to do arithmetic on the result.
>> One might as well stash precomputed pfn's (if not paddrs) and vaddrs in
>> page->private and page->mapping, chain them with ->lru (use only .next
>> if you care to stay singly-linked), and handle struct page pointers

On Thu, May 10, 2007 at 10:09:32PM -0700, Christoph Lameter wrote:
> Well then you'd have to rewrite the existing ways of fiddling with page
> structs. This way all is clear and you fiddle as you want. It just works.
> Could we get this in? You acked it once already?

I guess I can say I'm microoptimizing things by getting rid of the
lock. I can also do without mucking with fields in the page or
generation numbers given a method of dumping the entire cache for such
catastrophic reorganizations, which is actually better because it makes
change_page_attr() and vmalloc_sync() bear the entire cost, apart from
the loss of the cache in their aftermath. I'll post one of those two in
a follow-up.


-- wli

2007-05-11 07:53:38

by William Lee Irwin III

Subject: Re: slub-i386-support.patch

At some point in the past, Hugh Dickins wrote:
>>> I'm guessing (haven't rechecked source) that the cpu_idle() call comes
>>> about because the top level pgd of a process gets freed very late in
>>> its exit, and after a great flurry of processes have just exited,
>>> perhaps there was nothing to free up the accumulation. Though
>>> it still strikes me as an odd place to do it.

On Fri, 11 May 2007 10:29:30 +0200 Andi Kleen <[email protected]> wrote:
>> I always found it odd and probably the wrong place too.

On Fri, May 11, 2007 at 12:42:00AM -0700, Andrew Morton wrote:
> so... what's the bottom line here, guys? Should we change that patch?

My summary:
(1) The patch is safe as-is, though it should move its hook elsewhere.
(2) Scribbling on the memory still irks me but is safe.
(3) I should redo the i386 pgd updates to suck less.
(4) Christoph might let me change quicklists to use struct page pointers.

For (3) I can post the update any time; it won't conflict with the
quicklist code, apart from textually, and I can handle the mergework.
The only case where merging the quicklist code should be held off is if
Christoph wants to update the core quicklist code, or have me update
it according to the cached precomputed-value scheme I outlined in a
prior post, before its initial merge. I'll prepare all the patches for
comment regardless.


-- wli

2007-05-11 09:28:26

by Hugh Dickins

Subject: Re: slub-i386-support.patch

On Fri, 11 May 2007, Andrew Morton wrote:
> On Fri, 11 May 2007 10:29:30 +0200 Andi Kleen <[email protected]> wrote:
>
> > > I'm guessing (haven't rechecked source) that the cpu_idle() call comes
> > > about because the top level pgd of a process gets freed very late in
> > > its exit, and after a great flurry of processes have just exited,
> > > perhaps there was nothing to free up the accumulation. Though
> > > it still strikes me as an odd place to do it.
> >
> > I always found it odd and probably the wrong place too.
>
> so... what's the bottom line here, guys? Should we change that patch?

The bottom line... I can see why you're asking for that ;)

I'd say delete the change to arch/i386/kernel/smp.c - contrary to
what Christoph says, no other arch buries a check_pgt_cache() call
in flush_tlb_mm(), that just seems to be a thinko: i386 has the usual
call to it from tlb_finish_mmu() - _that_ is the one which he and
David were talking about.

I'm just worried that there might somewhere be a call to flush_tlb_mm()
which would now be surprised to be freeing pages: almost certainly not,
but why raise that concern? It's just not flush_tlb_mm()'s business.

The cpu_idle() call should stay for now: we're agreed that it's odd,
but there's plenty of precedent for it, and it's easier to imagine it
serves a real purpose, and shouldn't be removed without replacement.

Bill raised a real concern about unnecessary PAE pgd memory usage,
but let's get the patch into -rc1 to enable the wider SLUB testing,
before coming back to fix that up. His micro-optimizations can wait.

IMHO
Hugh

2007-05-11 16:16:12

by Christoph Lameter

Subject: Re: slub-i386-support.patch

On Fri, 11 May 2007, William Lee Irwin III wrote:

> My summary:
> (1) The patch is safe as-is, though it should move its hook elsewhere.
> (2) Scribbling on the memory still irks me but is safe.
> (3) I should redo the i386 pgd updates to suck less.
> (4) Christoph might let me change quicklists to use struct page pointers.

I am fine with you working on it. You just need to coordinate with the 7
or so arches that are now using quicklists.

> Christoph wants to update the core quicklist code or have me update
> it according to the cached precomputed value scheme I outlined in a
> prior post before its initial merge. I'll prepare all the patches for
> comment regardless.

I am fine with you updating the quicklist code since I currently have no
plans to do more there. Just be sure to coordinate with all quicklist
users, please.

2007-05-11 20:47:20

by William Lee Irwin III

Subject: Re: slub-i386-support.patch

On Fri, 11 May 2007, William Lee Irwin III wrote:
>> Christoph wants to update the core quicklist code or have me update
>> it according to the cached precomputed value scheme I outlined in a
>> prior post before its initial merge. I'll prepare all the patches for
>> comment regardless.

On Fri, May 11, 2007 at 09:15:58AM -0700, Christoph Lameter wrote:
> I am fine with you updating the quicklist code since I have currently no
> plans on doing more there. Just be sure to coordinate with all quicklists
> users please.

No problem. I do sweep all arches as a matter of course.


-- wli