On Mon, Jul 23, 2007 at 12:40:09PM -0700, Andrew Morton wrote:
> On Mon, 23 Jul 2007 22:24:57 +0800
> Fengguang Wu <[email protected]> wrote:
>
> > On Mon, Jul 23, 2007 at 07:00:59PM +1000, Nick Piggin wrote:
> > > Rusty Russell wrote:
> > > >On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> > >
> > > >>So I opt for it being made tunable, safe, and turned off by default.
> > >
> > > I hate tunables :) Unless we have workload A that gets a reasonable
> > > benefit from something and workload B that gets a significant regression,
> > > and no clear way to reconcile them...
> >
> > Me too ;)
> >
> > But sometimes we really want to avoid flushing the cache.
> > Andrew's user space LD_PRELOAD+fadvise based tool fit nicely here.
>
> It's the only way to go in some situations. Sometimes the kernel just
> cannot predict the future sufficiently well, and the costs of making a
> mistake are terribly high. We need human help. And it should be
> administration-time help, not programming-time help.
Agreed. I feel that drop behind is not a universal applicable.
Cost based reclaim sounds better, but who knows before field use ;)
> > > >I'd like to see it turned on by default in -mm, and try to come up with
> > > >some server-like workload to measure the effect. Should be easy to
> > > >simulate something (eg. apache server, where clients grab some files in
> > > >preference, and apache server where clients grab different files).
> > >
> > > I don't like this kind of conditional information going from something
> > > like readahead into page reclaim. Unless it is for readahead _specific_
> > > data such as "I got these all wrong, so you can reclaim them" (which
> > > this isn't).
> > >
> > > Possibly it makes sense to realise that the given pages are cheaper
> > > to read back in as they are apparently being read-ahead very nicely.
> >
> > In fact I have talked to Jens about it in last year's kernel summit.
> > The patch below explains itself.
> > ---
> > Subject: cost based page reclaim
> >
> > Cost based page reclaim - a minimalist implementation.
> >
> > Suppose we cached 32 small files each with 1 page, and one 32-page chunk from a
> > large file. Should we first drop the 32-pages which are read in one I/O, or
> > drop the 32 distinct pages, each costs one I/O? (Given that the files are of
> > equal hotness.)
> >
> > Page replacement algorithms should be designed to minimize the number of I/Os,
> > instead of the number of page faults. Dividing the cost of I/O by the number of
> > pages it bring in, we get the cost of the page. The bigger page cost, the more
> > 'lives/bloods' the page should have.
> >
> > This patch adds life to costly pages by pretending that they are
> > referenced more times. Possible downsides:
> > - burdens the pressure of vmscan
> > - active pages are no longer that 'active'
> >
>
> This is all fun stuff, but how do we find out that changes like this are
> good ones, apart from shipping it and seeing who gets hurt 12 months later?
One thing I can imagine now is that the first pages may get more life
because of the conservative initial readahead size.
Generally file servers use sendfile(wholefile), so not a problem.
> > +#define log2(n) fls(n)
>
> <look at include/linux/log2.h>
Thank you, this comment lead to another patch :)
---
Subject: convert ill defined log2() to ilog2()
It's *wrong* to have
#define log2(n) ffz(~(n))
It should be *reversed*:
#define log2(n) flz(~(n))
or
#define log2(n) fls(n)
or just use
ilog2(n) defined in linux/log2.h.
This patch follows the last solution, recommended by Andrew Morton.
//Or are they simply the wrong naming, and is in fact correct in behavior?
Cc: [email protected]
Cc: Mingming Cao <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Chris Ahna <[email protected]>
Cc: David Mosberger-Tang <[email protected]>
Cc: Kyle McMartin <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
---
drivers/char/agp/hp-agp.c | 9 +++------
drivers/char/agp/i460-agp.c | 5 ++---
drivers/char/agp/parisc-agp.c | 7 ++-----
fs/ext4/super.c | 6 ++----
4 files changed, 9 insertions(+), 18 deletions(-)
--- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/hp-agp.c
+++ linux-2.6.22-rc6-mm1/drivers/char/agp/hp-agp.c
@@ -14,15 +14,12 @@
#include <linux/pci.h>
#include <linux/init.h>
#include <linux/agp_backend.h>
+#include <linux/log2.h>
#include <asm/acpi-ext.h>
#include "agp.h"
-#ifndef log2
-#define log2(x) ffz(~(x))
-#endif
-
#define HP_ZX1_IOC_OFFSET 0x1000 /* ACPI reports SBA, we want IOC */
/* HP ZX1 IOC registers */
@@ -256,7 +253,7 @@ hp_zx1_configure (void)
readl(hp->ioc_regs+HP_ZX1_IMASK);
writel(hp->iova_base|1, hp->ioc_regs+HP_ZX1_IBASE);
readl(hp->ioc_regs+HP_ZX1_IBASE);
- writel(hp->iova_base|log2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM);
+ writel(hp->iova_base|ilog2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM);
readl(hp->ioc_regs+HP_ZX1_PCOM);
}
@@ -284,7 +281,7 @@ hp_zx1_tlbflush (struct agp_memory *mem)
{
struct _hp_private *hp = &hp_private;
- writeq(hp->gart_base | log2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
+ writeq(hp->gart_base | ilog2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
readq(hp->ioc_regs+HP_ZX1_PCOM);
}
--- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/i460-agp.c
+++ linux-2.6.22-rc6-mm1/drivers/char/agp/i460-agp.c
@@ -13,6 +13,7 @@
#include <linux/string.h>
#include <linux/slab.h>
#include <linux/agp_backend.h>
+#include <linux/log2.h>
#include "agp.h"
@@ -59,8 +60,6 @@
*/
#define WR_FLUSH_GATT(index) RD_GATT(index)
-#define log2(x) ffz(~(x))
-
static struct {
void *gatt; /* ioremap'd GATT area */
@@ -148,7 +147,7 @@ static int i460_fetch_size (void)
* values[i].size.
*/
values[i].num_entries = (values[i].size << 8) >> (I460_IO_PAGE_SHIFT - 12);
- values[i].page_order = log2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
+ values[i].page_order = ilog2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
}
for (i = 0; i < agp_bridge->driver->num_aperture_sizes; i++) {
--- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/parisc-agp.c
+++ linux-2.6.22-rc6-mm1/drivers/char/agp/parisc-agp.c
@@ -18,6 +18,7 @@
#include <linux/init.h>
#include <linux/klist.h>
#include <linux/agp_backend.h>
+#include <linux/log2.h>
#include <asm-parisc/parisc-device.h>
#include <asm-parisc/ropes.h>
@@ -27,10 +28,6 @@
#define DRVNAME "quicksilver"
#define DRVPFX DRVNAME ": "
-#ifndef log2
-#define log2(x) ffz(~(x))
-#endif
-
#define AGP8X_MODE_BIT 3
#define AGP8X_MODE (1 << AGP8X_MODE_BIT)
@@ -92,7 +89,7 @@ parisc_agp_tlbflush(struct agp_memory *m
{
struct _parisc_agp_info *info = &parisc_agp_info;
- writeq(info->gart_base | log2(info->gart_size), info->ioc_regs+IOC_PCOM);
+ writeq(info->gart_base | ilog2(info->gart_size), info->ioc_regs+IOC_PCOM);
readq(info->ioc_regs+IOC_PCOM); /* flush */
}
--- linux-2.6.22-rc6-mm1.orig/fs/ext4/super.c
+++ linux-2.6.22-rc6-mm1/fs/ext4/super.c
@@ -1433,8 +1433,6 @@ static void ext4_orphan_cleanup (struct
sb->s_flags = s_flags; /* Restore MS_RDONLY status */
}
-#define log2(n) ffz(~(n))
-
/*
* Maximal file size. There is a direct, and {,double-,triple-}indirect
* block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks.
@@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super
sbi->s_desc_per_block = blocksize / EXT4_DESC_SIZE(sb);
sbi->s_sbh = bh;
sbi->s_mount_state = le16_to_cpu(es->s_state);
- sbi->s_addr_per_block_bits = log2(EXT4_ADDR_PER_BLOCK(sb));
- sbi->s_desc_per_block_bits = log2(EXT4_DESC_PER_BLOCK(sb));
+ sbi->s_addr_per_block_bits = ilog2(EXT4_ADDR_PER_BLOCK(sb));
+ sbi->s_desc_per_block_bits = ilog2(EXT4_DESC_PER_BLOCK(sb));
for (i=0; i < 4; i++)
sbi->s_hash_seed[i] = le32_to_cpu(es->s_hash_seed[i]);
sbi->s_def_hash_version = es->s_def_hash_version;
On Tue, 24 Jul 2007 08:47:28 +0800
Fengguang Wu <[email protected]> wrote:
> Subject: convert ill defined log2() to ilog2()
>
> It's *wrong* to have
> #define log2(n) ffz(~(n))
> It should be *reversed*:
> #define log2(n) flz(~(n))
> or
> #define log2(n) fls(n)
> or just use
> ilog2(n) defined in linux/log2.h.
>
> This patch follows the last solution, recommended by Andrew Morton.
>
> //Or are they simply the wrong naming, and is in fact correct in behavior?
>
> Cc: [email protected]
> Cc: Mingming Cao <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: Chris Ahna <[email protected]>
> Cc: David Mosberger-Tang <[email protected]>
> Cc: Kyle McMartin <[email protected]>
> Signed-off-by: Fengguang Wu <[email protected]>
> ---
> drivers/char/agp/hp-agp.c | 9 +++------
> drivers/char/agp/i460-agp.c | 5 ++---
> drivers/char/agp/parisc-agp.c | 7 ++-----
> fs/ext4/super.c | 6 ++----
> 4 files changed, 9 insertions(+), 18 deletions(-)
hm, yes, there is a risk that the code was accidentally correct. Or the
code has only ever dealt with power-of-2 inputs, in which case it happens
to work either way.
David(s) and ext4-people: could we please have a close review of these
changes?
Thanks.
> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/hp-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/hp-agp.c
> @@ -14,15 +14,12 @@
> #include <linux/pci.h>
> #include <linux/init.h>
> #include <linux/agp_backend.h>
> +#include <linux/log2.h>
>
> #include <asm/acpi-ext.h>
>
> #include "agp.h"
>
> -#ifndef log2
> -#define log2(x) ffz(~(x))
> -#endif
> -
> #define HP_ZX1_IOC_OFFSET 0x1000 /* ACPI reports SBA, we want IOC */
>
> /* HP ZX1 IOC registers */
> @@ -256,7 +253,7 @@ hp_zx1_configure (void)
> readl(hp->ioc_regs+HP_ZX1_IMASK);
> writel(hp->iova_base|1, hp->ioc_regs+HP_ZX1_IBASE);
> readl(hp->ioc_regs+HP_ZX1_IBASE);
> - writel(hp->iova_base|log2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM);
> + writel(hp->iova_base|ilog2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM);
> readl(hp->ioc_regs+HP_ZX1_PCOM);
> }
>
> @@ -284,7 +281,7 @@ hp_zx1_tlbflush (struct agp_memory *mem)
> {
> struct _hp_private *hp = &hp_private;
>
> - writeq(hp->gart_base | log2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
> + writeq(hp->gart_base | ilog2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
> readq(hp->ioc_regs+HP_ZX1_PCOM);
> }
>
> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/i460-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/i460-agp.c
> @@ -13,6 +13,7 @@
> #include <linux/string.h>
> #include <linux/slab.h>
> #include <linux/agp_backend.h>
> +#include <linux/log2.h>
>
> #include "agp.h"
>
> @@ -59,8 +60,6 @@
> */
> #define WR_FLUSH_GATT(index) RD_GATT(index)
>
> -#define log2(x) ffz(~(x))
> -
> static struct {
> void *gatt; /* ioremap'd GATT area */
>
> @@ -148,7 +147,7 @@ static int i460_fetch_size (void)
> * values[i].size.
> */
> values[i].num_entries = (values[i].size << 8) >> (I460_IO_PAGE_SHIFT - 12);
> - values[i].page_order = log2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
> + values[i].page_order = ilog2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
> }
>
> for (i = 0; i < agp_bridge->driver->num_aperture_sizes; i++) {
> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/parisc-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/parisc-agp.c
> @@ -18,6 +18,7 @@
> #include <linux/init.h>
> #include <linux/klist.h>
> #include <linux/agp_backend.h>
> +#include <linux/log2.h>
>
> #include <asm-parisc/parisc-device.h>
> #include <asm-parisc/ropes.h>
> @@ -27,10 +28,6 @@
> #define DRVNAME "quicksilver"
> #define DRVPFX DRVNAME ": "
>
> -#ifndef log2
> -#define log2(x) ffz(~(x))
> -#endif
> -
> #define AGP8X_MODE_BIT 3
> #define AGP8X_MODE (1 << AGP8X_MODE_BIT)
>
> @@ -92,7 +89,7 @@ parisc_agp_tlbflush(struct agp_memory *m
> {
> struct _parisc_agp_info *info = &parisc_agp_info;
>
> - writeq(info->gart_base | log2(info->gart_size), info->ioc_regs+IOC_PCOM);
> + writeq(info->gart_base | ilog2(info->gart_size), info->ioc_regs+IOC_PCOM);
> readq(info->ioc_regs+IOC_PCOM); /* flush */
> }
>
> --- linux-2.6.22-rc6-mm1.orig/fs/ext4/super.c
> +++ linux-2.6.22-rc6-mm1/fs/ext4/super.c
> @@ -1433,8 +1433,6 @@ static void ext4_orphan_cleanup (struct
> sb->s_flags = s_flags; /* Restore MS_RDONLY status */
> }
>
> -#define log2(n) ffz(~(n))
> -
> /*
> * Maximal file size. There is a direct, and {,double-,triple-}indirect
> * block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks.
> @@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super
> sbi->s_desc_per_block = blocksize / EXT4_DESC_SIZE(sb);
> sbi->s_sbh = bh;
> sbi->s_mount_state = le16_to_cpu(es->s_state);
> - sbi->s_addr_per_block_bits = log2(EXT4_ADDR_PER_BLOCK(sb));
> - sbi->s_desc_per_block_bits = log2(EXT4_DESC_PER_BLOCK(sb));
> + sbi->s_addr_per_block_bits = ilog2(EXT4_ADDR_PER_BLOCK(sb));
> + sbi->s_desc_per_block_bits = ilog2(EXT4_DESC_PER_BLOCK(sb));
> for (i=0; i < 4; i++)
> sbi->s_hash_seed[i] = le32_to_cpu(es->s_hash_seed[i]);
> sbi->s_def_hash_version = es->s_def_hash_version;
Fengguang Wu wrote:
> On Mon, Jul 23, 2007 at 12:40:09PM -0700, Andrew Morton wrote:
>>This is all fun stuff, but how do we find out that changes like this are
>>good ones, apart from shipping it and seeing who gets hurt 12 months later?
>
>
> One thing I can imagine now is that the first pages may get more life
> because of the conservative initial readahead size.
Yeah I don't think it is really worth the complexity and corner cases
it would introduce at this stage.
People are still complaining about their nightly cron job *swapping*
stuff out of their 1-2GB desktop systems. There must still be some low
hanging fruit or obvious problems with page reclaim so I'd like to try
working out exactly what is going wrong and fix it in page reclaim
rather than adding more complexity in the hope that it might help.
--
SUSE Labs, Novell Inc.
On Jul 23, 2007 18:17 -0700, Andrew Morton wrote:
> hm, yes, there is a risk that the code was accidentally correct. Or the
> code has only ever dealt with power-of-2 inputs, in which case it happens
> to work either way.
>
> David(s) and ext4-people: could we please have a close review of these
> changes?
> > @@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super
> > - sbi->s_addr_per_block_bits = log2(EXT4_ADDR_PER_BLOCK(sb));
> > - sbi->s_desc_per_block_bits = log2(EXT4_DESC_PER_BLOCK(sb));
> > + sbi->s_addr_per_block_bits = ilog2(EXT4_ADDR_PER_BLOCK(sb));
> > + sbi->s_desc_per_block_bits = ilog2(EXT4_DESC_PER_BLOCK(sb));
For the ext[234] code there has only ever been power-of-two values for
ADDR_PER_BLOCK() and DESC_PER_BLOCK().
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.