2008-12-30 23:12:40

by Peter W. Morreale

Subject: [PATCH 0/2] pdflush fix and enhancement

The following series implements a fix for pdflush thread creation and an
enhancement for allowing the setting of the minimum and maximum number
of pdflush threads.

These patches apply against 2.6.27.

Best,
-PWM

---

Peter W Morreale (2):
Add /proc controls for pdflush threads
Fix pdflush thread creation upper bound.


include/linux/sysctl.h | 2 ++
include/linux/writeback.h | 2 ++
kernel/sysctl.c | 16 ++++++++++++++++
mm/pdflush.c | 42 +++++++++++++++++++++++++++++++++---------
4 files changed, 53 insertions(+), 9 deletions(-)


2008-12-30 23:13:18

by Peter W. Morreale

Subject: [PATCH 2/2] Add /proc controls for pdflush threads

From: "Peter W. Morreale" <[email protected]>

This patch adds /proc entries to give the admin the ability to
control the minimum and maximum number of pdflush threads. This allows
finer control of pdflush on both large and small machines.

The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
'/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.
---

Signed-off-by: Peter W Morreale <[email protected]>

include/linux/sysctl.h | 2 ++
include/linux/writeback.h | 2 ++
kernel/sysctl.c | 16 ++++++++++++++++
mm/pdflush.c | 19 ++++++++++++++-----
4 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index d0437f3..9921e62 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -205,6 +205,8 @@ enum
VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
+ VM_NR_PDFLUSH_THREADS_MAX=36, /* nr_pdflush_threads_max */
+ VM_NR_PDFLUSH_THREADS_MIN=37, /* nr_pdflush_threads_min */
};


diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 12b15c5..ee566a0 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -150,6 +150,8 @@ void writeback_set_ratelimit(void);
/* pdflush.c */
extern int nr_pdflush_threads; /* Global so it can be exported to sysctl
read-only. */
+extern int nr_pdflush_threads_max; /* Global so it can be exported to sysctl */
+extern int nr_pdflush_threads_min; /* Global so it can be exported to sysctl */


#endif /* WRITEBACK_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 50ec088..6dae777 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -948,6 +948,22 @@ static struct ctl_table vm_table[] = {
.proc_handler = &proc_dointvec,
},
{
+ .ctl_name = VM_NR_PDFLUSH_THREADS_MIN,
+ .procname = "nr_pdflush_threads_min",
+ .data = &nr_pdflush_threads_min,
+ .maxlen = sizeof nr_pdflush_threads_min,
+ .mode = 0644 /* read-only*/,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = VM_NR_PDFLUSH_THREADS_MAX,
+ .procname = "nr_pdflush_threads_max",
+ .data = &nr_pdflush_threads_max,
+ .maxlen = sizeof nr_pdflush_threads_max,
+ .mode = 0644 /* read-only*/,
+ .proc_handler = &proc_dointvec,
+ },
+ {
.ctl_name = VM_SWAPPINESS,
.procname = "swappiness",
.data = &vm_swappiness,
diff --git a/mm/pdflush.c b/mm/pdflush.c
index 481680f..9a6f835 100644
--- a/mm/pdflush.c
+++ b/mm/pdflush.c
@@ -58,6 +58,14 @@ static DEFINE_SPINLOCK(pdflush_lock);
int nr_pdflush_threads = 0;

/*
+ * The max/min number of pdflush threads. R/W by sysctl at
+ * /proc/sys/vm/nr_pdflush_threads_max
+ */
+int nr_pdflush_threads_max = MAX_PDFLUSH_THREADS;
+int nr_pdflush_threads_min = MIN_PDFLUSH_THREADS;
+
+
+/*
* The time at which the pdflush thread pool last went empty
*/
static unsigned long last_empty_jifs;
@@ -68,7 +76,7 @@ static unsigned long last_empty_jifs;
* Thread pool management algorithm:
*
* - The minimum and maximum number of pdflush instances are bound
- * by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
+ * by nr_pdflush_threads_min and nr_pdflush_threads_max.
*
* - If there have been no idle pdflush instances for 1 second, create
* a new one.
@@ -133,7 +141,8 @@ static int __pdflush(struct pdflush_work *my_work)
*/
if (time_after(jiffies, last_empty_jifs + 1 * HZ)) {
if (list_empty(&pdflush_list)) {
- if (nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
+ if (nr_pdflush_threads <
+ nr_pdflush_threads_max) {
nr_pdflush_threads++;
spin_unlock_irq(&pdflush_lock);
start_one_pdflush_thread();
@@ -150,7 +159,7 @@ static int __pdflush(struct pdflush_work *my_work)
*/
if (list_empty(&pdflush_list))
continue;
- if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
+ if (nr_pdflush_threads <= nr_pdflush_threads_min)
continue;
pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
if (time_after(jiffies, pdf->when_i_went_to_sleep + 1 * HZ)) {
@@ -246,9 +255,9 @@ static int __init pdflush_init(void)
* Pre-set nr_pdflush_threads... If we fail to create,
* the count will be decremented.
*/
- nr_pdflush_threads = MIN_PDFLUSH_THREADS;
+ nr_pdflush_threads = nr_pdflush_threads_min;

- for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
+ for (i = 0; i < nr_pdflush_threads_min; i++)
start_one_pdflush_thread();
return 0;
}

2008-12-30 23:12:56

by Peter W. Morreale

Subject: [PATCH 1/2] Fix pdflush thread creation upper bound.

From: "Peter W. Morreale" <[email protected]>

This patch fixes a race on creating pdflush threads. Without the patch,
it is possible to create more than MAX_PDFLUSH_THREADS threads, and this
has been observed in practice on I/O-loaded SMP machines.

The fix involves moving the lock so that it protects the check of the
thread count, and handling thread-creation failure correctly.
---

Signed-off-by: Peter W Morreale <[email protected]>

mm/pdflush.c | 27 +++++++++++++++++++++------
1 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/pdflush.c b/mm/pdflush.c
index 0cbe0c6..481680f 100644
--- a/mm/pdflush.c
+++ b/mm/pdflush.c
@@ -98,7 +98,6 @@ static int __pdflush(struct pdflush_work *my_work)
INIT_LIST_HEAD(&my_work->list);

spin_lock_irq(&pdflush_lock);
- nr_pdflush_threads++;
for ( ; ; ) {
struct pdflush_work *pdf;

@@ -126,20 +125,23 @@ static int __pdflush(struct pdflush_work *my_work)

(*my_work->fn)(my_work->arg0);

+ spin_lock_irq(&pdflush_lock);
+
/*
* Thread creation: For how long have there been zero
* available threads?
*/
if (time_after(jiffies, last_empty_jifs + 1 * HZ)) {
- /* unlocked list_empty() test is OK here */
if (list_empty(&pdflush_list)) {
- /* unlocked test is OK here */
- if (nr_pdflush_threads < MAX_PDFLUSH_THREADS)
+ if (nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
+ nr_pdflush_threads++;
+ spin_unlock_irq(&pdflush_lock);
start_one_pdflush_thread();
+ spin_lock_irq(&pdflush_lock);
+ }
}
}

- spin_lock_irq(&pdflush_lock);
my_work->fn = NULL;

/*
@@ -226,13 +228,26 @@ int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)

static void start_one_pdflush_thread(void)
{
- kthread_run(pdflush, NULL, "pdflush");
+ struct task_struct *k;
+
+ k = kthread_run(pdflush, NULL, "pdflush");
+ if (unlikely(IS_ERR(k))) {
+ spin_lock_irq(&pdflush_lock);
+ nr_pdflush_threads--;
+ spin_unlock_irq(&pdflush_lock);
+ }
}

static int __init pdflush_init(void)
{
int i;

+ /*
+ * Pre-set nr_pdflush_threads... If we fail to create,
+ * the count will be decremented.
+ */
+ nr_pdflush_threads = MIN_PDFLUSH_THREADS;
+
for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
start_one_pdflush_thread();
return 0;

2008-12-30 23:59:42

by Randy Dunlap

Subject: Re: [PATCH 2/2] Add /proc controls for pdflush threads

Peter W Morreale wrote:
> From: "Peter W. Morreale" <[email protected]>
>
> This patch adds /proc entries to give the admin the ability to
> control the minimum and maximum number of pdflush threads. This allows
> finer control of pdflush on both large and small machines.
>
> The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
> '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.

These need to be documented in (ugh) either
Documentation/filesystems/proc.txt or
Documentation/sysctl/vm.txt. Hard to say which one, but I'd
recommend the latter (vm.txt) since they are sysctls.


It looks like there is a lot of duplication between them.
That's not good IMO.

> ---
>
> Signed-off-by: Peter W Morreale <[email protected]>
>
> include/linux/sysctl.h | 2 ++
> include/linux/writeback.h | 2 ++
> kernel/sysctl.c | 16 ++++++++++++++++
> mm/pdflush.c | 19 ++++++++++++++-----
> 4 files changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index d0437f3..9921e62 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -205,6 +205,8 @@ enum
> VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
> VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
> VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
> + VM_NR_PDFLUSH_THREADS_MAX=36, /* nr_pdflush_threads_max */
> + VM_NR_PDFLUSH_THREADS_MIN=37, /* nr_pdflush_threads_min */
> };
>
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 12b15c5..ee566a0 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -150,6 +150,8 @@ void writeback_set_ratelimit(void);
> /* pdflush.c */
> extern int nr_pdflush_threads; /* Global so it can be exported to sysctl
> read-only. */
> +extern int nr_pdflush_threads_max; /* Global so it can be exported to sysctl */
> +extern int nr_pdflush_threads_min; /* Global so it can be exported to sysctl */
>
>
> #endif /* WRITEBACK_H */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 50ec088..6dae777 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -948,6 +948,22 @@ static struct ctl_table vm_table[] = {
> .proc_handler = &proc_dointvec,
> },
> {
> + .ctl_name = VM_NR_PDFLUSH_THREADS_MIN,
> + .procname = "nr_pdflush_threads_min",
> + .data = &nr_pdflush_threads_min,
> + .maxlen = sizeof nr_pdflush_threads_min,
> + .mode = 0644 /* read-only*/,
> + .proc_handler = &proc_dointvec,
> + },
> + {
> + .ctl_name = VM_NR_PDFLUSH_THREADS_MAX,
> + .procname = "nr_pdflush_threads_max",
> + .data = &nr_pdflush_threads_max,
> + .maxlen = sizeof nr_pdflush_threads_max,
> + .mode = 0644 /* read-only*/,
> + .proc_handler = &proc_dointvec,
> + },
> + {
> .ctl_name = VM_SWAPPINESS,
> .procname = "swappiness",
> .data = &vm_swappiness,
> diff --git a/mm/pdflush.c b/mm/pdflush.c
> index 481680f..9a6f835 100644
> --- a/mm/pdflush.c
> +++ b/mm/pdflush.c
> @@ -58,6 +58,14 @@ static DEFINE_SPINLOCK(pdflush_lock);
> int nr_pdflush_threads = 0;
>
> /*
> + * The max/min number of pdflush threads. R/W by sysctl at
> + * /proc/sys/vm/nr_pdflush_threads_max
> + */
> +int nr_pdflush_threads_max = MAX_PDFLUSH_THREADS;
> +int nr_pdflush_threads_min = MIN_PDFLUSH_THREADS;
> +
> +
> +/*
> * The time at which the pdflush thread pool last went empty
> */
> static unsigned long last_empty_jifs;
> @@ -68,7 +76,7 @@ static unsigned long last_empty_jifs;
> * Thread pool management algorithm:
> *
> * - The minimum and maximum number of pdflush instances are bound
> - * by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
> + * by nr_pdflush_threads_min and nr_pdflush_threads_max.
> *
> * - If there have been no idle pdflush instances for 1 second, create
> * a new one.
> @@ -133,7 +141,8 @@ static int __pdflush(struct pdflush_work *my_work)
> */
> if (time_after(jiffies, last_empty_jifs + 1 * HZ)) {
> if (list_empty(&pdflush_list)) {
> - if (nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
> + if (nr_pdflush_threads <
> + nr_pdflush_threads_max) {
> nr_pdflush_threads++;
> spin_unlock_irq(&pdflush_lock);
> start_one_pdflush_thread();
> @@ -150,7 +159,7 @@ static int __pdflush(struct pdflush_work *my_work)
> */
> if (list_empty(&pdflush_list))
> continue;
> - if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
> + if (nr_pdflush_threads <= nr_pdflush_threads_min)
> continue;
> pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
> if (time_after(jiffies, pdf->when_i_went_to_sleep + 1 * HZ)) {
> @@ -246,9 +255,9 @@ static int __init pdflush_init(void)
> * Pre-set nr_pdflush_threads... If we fail to create,
> * the count will be decremented.
> */
> - nr_pdflush_threads = MIN_PDFLUSH_THREADS;
> + nr_pdflush_threads = nr_pdflush_threads_min;
>
> - for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
> + for (i = 0; i < nr_pdflush_threads_min; i++)
> start_one_pdflush_thread();
> return 0;
> }


--
~Randy

2008-12-31 00:15:29

by Peter W. Morreale

Subject: Re: [PATCH 2/2] Add /proc controls for pdflush threads

On Tue, 2008-12-30 at 15:59 -0800, Randy Dunlap wrote:
> Peter W Morreale wrote:
> > From: "Peter W. Morreale" <[email protected]>
> >
> > This patch adds /proc entries to give the admin the ability to
> > control the minimum and maximum number of pdflush threads. This allows
> > finer control of pdflush on both large and small machines.
> >
> > The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
> > '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.
>
> These need to be documented in (ugh) either
> Documentation/filesystems/proc.txt or
> Documentation/sysctl/vm.txt. Hard to say which one, but I'd
> recommend the latter (vm.txt) since they are sysctls.
>

Excellent point. Will re-spin.

Thx,
-PWM

>
> It looks like there is a lot of duplication between them.
> That's not good IMO.
>
> > ---
> >
> > Signed-off-by: Peter W Morreale <[email protected]>
> >
> > include/linux/sysctl.h | 2 ++
> > include/linux/writeback.h | 2 ++
> > kernel/sysctl.c | 16 ++++++++++++++++
> > mm/pdflush.c | 19 ++++++++++++++-----
> > 4 files changed, 34 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> > index d0437f3..9921e62 100644
> > --- a/include/linux/sysctl.h
> > +++ b/include/linux/sysctl.h
> > @@ -205,6 +205,8 @@ enum
> > VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
> > VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
> > VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
> > + VM_NR_PDFLUSH_THREADS_MAX=36, /* nr_pdflush_threads_max */
> > + VM_NR_PDFLUSH_THREADS_MIN=37, /* nr_pdflush_threads_min */
> > };
> >
> >
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index 12b15c5..ee566a0 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -150,6 +150,8 @@ void writeback_set_ratelimit(void);
> > /* pdflush.c */
> > extern int nr_pdflush_threads; /* Global so it can be exported to sysctl
> > read-only. */
> > +extern int nr_pdflush_threads_max; /* Global so it can be exported to sysctl */
> > +extern int nr_pdflush_threads_min; /* Global so it can be exported to sysctl */
> >
> >
> > #endif /* WRITEBACK_H */
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 50ec088..6dae777 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -948,6 +948,22 @@ static struct ctl_table vm_table[] = {
> > .proc_handler = &proc_dointvec,
> > },
> > {
> > + .ctl_name = VM_NR_PDFLUSH_THREADS_MIN,
> > + .procname = "nr_pdflush_threads_min",
> > + .data = &nr_pdflush_threads_min,
> > + .maxlen = sizeof nr_pdflush_threads_min,
> > + .mode = 0644 /* read-only*/,
> > + .proc_handler = &proc_dointvec,
> > + },
> > + {
> > + .ctl_name = VM_NR_PDFLUSH_THREADS_MAX,
> > + .procname = "nr_pdflush_threads_max",
> > + .data = &nr_pdflush_threads_max,
> > + .maxlen = sizeof nr_pdflush_threads_max,
> > + .mode = 0644 /* read-only*/,
> > + .proc_handler = &proc_dointvec,
> > + },
> > + {
> > .ctl_name = VM_SWAPPINESS,
> > .procname = "swappiness",
> > .data = &vm_swappiness,
> > diff --git a/mm/pdflush.c b/mm/pdflush.c
> > index 481680f..9a6f835 100644
> > --- a/mm/pdflush.c
> > +++ b/mm/pdflush.c
> > @@ -58,6 +58,14 @@ static DEFINE_SPINLOCK(pdflush_lock);
> > int nr_pdflush_threads = 0;
> >
> > /*
> > + * The max/min number of pdflush threads. R/W by sysctl at
> > + * /proc/sys/vm/nr_pdflush_threads_max
> > + */
> > +int nr_pdflush_threads_max = MAX_PDFLUSH_THREADS;
> > +int nr_pdflush_threads_min = MIN_PDFLUSH_THREADS;
> > +
> > +
> > +/*
> > * The time at which the pdflush thread pool last went empty
> > */
> > static unsigned long last_empty_jifs;
> > @@ -68,7 +76,7 @@ static unsigned long last_empty_jifs;
> > * Thread pool management algorithm:
> > *
> > * - The minimum and maximum number of pdflush instances are bound
> > - * by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
> > + * by nr_pdflush_threads_min and nr_pdflush_threads_max.
> > *
> > * - If there have been no idle pdflush instances for 1 second, create
> > * a new one.
> > @@ -133,7 +141,8 @@ static int __pdflush(struct pdflush_work *my_work)
> > */
> > if (time_after(jiffies, last_empty_jifs + 1 * HZ)) {
> > if (list_empty(&pdflush_list)) {
> > - if (nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
> > + if (nr_pdflush_threads <
> > + nr_pdflush_threads_max) {
> > nr_pdflush_threads++;
> > spin_unlock_irq(&pdflush_lock);
> > start_one_pdflush_thread();
> > @@ -150,7 +159,7 @@ static int __pdflush(struct pdflush_work *my_work)
> > */
> > if (list_empty(&pdflush_list))
> > continue;
> > - if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
> > + if (nr_pdflush_threads <= nr_pdflush_threads_min)
> > continue;
> > pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
> > if (time_after(jiffies, pdf->when_i_went_to_sleep + 1 * HZ)) {
> > @@ -246,9 +255,9 @@ static int __init pdflush_init(void)
> > * Pre-set nr_pdflush_threads... If we fail to create,
> > * the count will be decremented.
> > */
> > - nr_pdflush_threads = MIN_PDFLUSH_THREADS;
> > + nr_pdflush_threads = nr_pdflush_threads_min;
> >
> > - for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
> > + for (i = 0; i < nr_pdflush_threads_min; i++)
> > start_one_pdflush_thread();
> > return 0;
> > }
>
>

2008-12-31 00:27:19

by Andi Kleen

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

Peter W Morreale <[email protected]> writes:

> The following series implements a fix for pdflush thread creation and an
> enhancement for allowing the setting of the minimum and maximum number
> of pdflush threads.

You forgot to state why the admin should control that?

Each sysctl essentially presents a failure of the automatic tuning
algorithms of the kernel (kernel hackers admitting defeat).

So a patch adding new ones at least needs a clear rationale what
problem it is trying to fix and why the automatic tuning cannot be
fixed instead to address this case without new knobs.

-Andi

--
[email protected]

2008-12-31 01:56:47

by Peter W. Morreale

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

On Wed, 2008-12-31 at 01:28 +0100, Andi Kleen wrote:
> Peter W Morreale <[email protected]> writes:
>
> > The following series implements a fix for pdflush thread creation and an
> > enhancement for allowing the setting of the minimum and maximum number
> > of pdflush threads.
>
> You forgot to state why the admin should control that?
>
> Each sysctl essentially presents a failure of the automatic tuning
> algorithms of the kernel (kernel hackers admitting defeat).
>
> So a patch adding new ones at least needs a clear rationale what
> problem it is trying to fix and why the automatic tuning cannot be
> fixed instead to address this case without new knobs.
>

Understood.

The rationale is simply that one size may not fit all. Currently
there is minimalistic tuning wrt pdflush thread creation.

Each pdflush thread arbitrarily decides to create another thread solely
on the basis of all currently existing threads being busy for >= one
second. So this can result in NCPUS pdflush threads being created
'concurrently' should all existing threads wind up in the idle check at
the same time (I have seen this). Or it can result in a thread being
created every second, up to the max. That seems non-deterministic at
best.

Note that this patch doesn't change that algorithm at all, it merely
allows the admin to control the bounds without having to recompile the
kernel.

More to the point, on small systems with few file systems, what is the
point of having 8 (the current max) threads competing with each other on
a single disk? Likewise, on a 64-way, or larger system with dozens of
filesystems/disks, why wouldn't I want more background flushing?

I actually think the question is: Why not allow the admin to control
this? Since it seems like this is a matter of policy based on machine
configuration. I don't look at this as a matter of defeat, rather, a
matter of policy. Note that the current defaults remain the same.

So that's the rationale. It's not an attempt to fix background flushing,
and regardless, I would still argue that any 'fix' to background
flushing would involve a number of threads that were bounded by some
min/max and that min/max would still need to be set on some admin
policy based on intended machine size/use.

And btw Andi, I should have cc'ed you on this, I know from the flush
code that you were heavily involved. I only cc'ed Peter Z. since his
name was in pdflush.c. My apologies.

Best,
-PWM



> -Andi
>

2008-12-31 02:33:00

by Andi Kleen

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

On Tue, Dec 30, 2008 at 06:56:29PM -0700, Peter W. Morreale wrote:
> The rationale is simply that one size may not fit all. Currently
> there is minimalistic tuning wrt pdflush thread creation.

Thanks. Please put this into the commit description for future
reference.

Some comments below.

> Each pdflush thread arbitrarily decides to create another thread solely
> on the basis of all currently existing threads being busy for >= one
> second. So this can result in NCPUS pdflush threads being created
> 'concurrently' should all existing threads wind up in the idle check at
> the same time (I have seen this). Or it can result in a thread being

So perhaps a simple lock serializing creation of new pdflush threads
would avoid the problem?

> More to the point, on small systems with few file systems, what is the
> point of having 8 (the current max) threads competing with each other on
> a single disk? Likewise, on a 64-way, or larger system with dozens of
> filesystems/disks, why wouldn't I want more background flushing?

That makes some sense, but perhaps it would be better to base the default
size on the number of underlying block devices then?

Ok one issue there is that there are lots of different types of
block devices, e.g. a big raid array may look like a single disk.
Still I suspect defaults based on the block devices would do reasonably
well.

>
> I actually think the question is: Why not allow the admin to control
> this? Since it seems like this is a matter of policy based on machine
> configuration.

The kernel should know the current machine config and most
admins don't really want to do very fine grained configuration;
they expect the system to perform well out of the box. That is
why it is advantageous to try to come up with good auto tuning.

> And btw Andi, I should have cc'ed you on this, I know from the flush
> code that you were heavily involved. I only cc'ed Peter Z. since his
> name was in pdflush.c. My apologies.

No problem, but I think you're confusing me with someone else :). I don't
remember ever hacking on this. Perhaps you thought of Andrew Morton
who did pdflush originally.

-Andi

--
[email protected]

2008-12-31 02:38:47

by Peter W. Morreale

Subject: Re: [PATCH 2/2] Add /proc controls for pdflush threads

On Tue, 2008-12-30 at 15:59 -0800, Randy Dunlap wrote:
> Peter W Morreale wrote:
> > From: "Peter W. Morreale" <[email protected]>
> >
> > This patch adds /proc entries to give the admin the ability to
> > control the minimum and maximum number of pdflush threads. This allows
> > finer control of pdflush on both large and small machines.
> >
> > The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
> > '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.
>
> These need to be documented in (ugh) either
> Documentation/filesystems/proc.txt or
> Documentation/sysctl/vm.txt. Hard to say which one, but I'd
> recommend the latter (vm.txt) since they are sysctls.
>
>
> It looks like there is a lot of duplication between them.
> That's not good IMO.
>

Hummm... worse than that. Documentation/sysctl/vm.txt is missing 9
proc/sys/vm entries and hasn't been updated since (ahem) 2.2.10.

Documentation/filesystems/proc.txt is much more recent -
2.4.0-test11-pre4. My how time flies... :-)

I can take a stab at updating vm.txt. Would it be ok to remove the vm
verbiage from proc.txt and replace it with a pointer to vm.txt?

-PWM

> > ---
> >
> > Signed-off-by: Peter W Morreale <[email protected]>
> >
> > include/linux/sysctl.h | 2 ++
> > include/linux/writeback.h | 2 ++
> > kernel/sysctl.c | 16 ++++++++++++++++
> > mm/pdflush.c | 19 ++++++++++++++-----
> > 4 files changed, 34 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> > index d0437f3..9921e62 100644
> > --- a/include/linux/sysctl.h
> > +++ b/include/linux/sysctl.h
> > @@ -205,6 +205,8 @@ enum
> > VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
> > VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
> > VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
> > + VM_NR_PDFLUSH_THREADS_MAX=36, /* nr_pdflush_threads_max */
> > + VM_NR_PDFLUSH_THREADS_MIN=37, /* nr_pdflush_threads_min */
> > };
> >
> >
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index 12b15c5..ee566a0 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -150,6 +150,8 @@ void writeback_set_ratelimit(void);
> > /* pdflush.c */
> > extern int nr_pdflush_threads; /* Global so it can be exported to sysctl
> > read-only. */
> > +extern int nr_pdflush_threads_max; /* Global so it can be exported to sysctl */
> > +extern int nr_pdflush_threads_min; /* Global so it can be exported to sysctl */
> >
> >
> > #endif /* WRITEBACK_H */
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 50ec088..6dae777 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -948,6 +948,22 @@ static struct ctl_table vm_table[] = {
> > .proc_handler = &proc_dointvec,
> > },
> > {
> > + .ctl_name = VM_NR_PDFLUSH_THREADS_MIN,
> > + .procname = "nr_pdflush_threads_min",
> > + .data = &nr_pdflush_threads_min,
> > + .maxlen = sizeof nr_pdflush_threads_min,
> > + .mode = 0644 /* read-only*/,
> > + .proc_handler = &proc_dointvec,
> > + },
> > + {
> > + .ctl_name = VM_NR_PDFLUSH_THREADS_MAX,
> > + .procname = "nr_pdflush_threads_max",
> > + .data = &nr_pdflush_threads_max,
> > + .maxlen = sizeof nr_pdflush_threads_max,
> > + .mode = 0644 /* read-only*/,
> > + .proc_handler = &proc_dointvec,
> > + },
> > + {
> > .ctl_name = VM_SWAPPINESS,
> > .procname = "swappiness",
> > .data = &vm_swappiness,
> > diff --git a/mm/pdflush.c b/mm/pdflush.c
> > index 481680f..9a6f835 100644
> > --- a/mm/pdflush.c
> > +++ b/mm/pdflush.c
> > @@ -58,6 +58,14 @@ static DEFINE_SPINLOCK(pdflush_lock);
> > int nr_pdflush_threads = 0;
> >
> > /*
> > + * The max/min number of pdflush threads. R/W by sysctl at
> > + * /proc/sys/vm/nr_pdflush_threads_max
> > + */
> > +int nr_pdflush_threads_max = MAX_PDFLUSH_THREADS;
> > +int nr_pdflush_threads_min = MIN_PDFLUSH_THREADS;
> > +
> > +
> > +/*
> > * The time at which the pdflush thread pool last went empty
> > */
> > static unsigned long last_empty_jifs;
> > @@ -68,7 +76,7 @@ static unsigned long last_empty_jifs;
> > * Thread pool management algorithm:
> > *
> > * - The minimum and maximum number of pdflush instances are bound
> > - * by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
> > + * by nr_pdflush_threads_min and nr_pdflush_threads_max.
> > *
> > * - If there have been no idle pdflush instances for 1 second, create
> > * a new one.
> > @@ -133,7 +141,8 @@ static int __pdflush(struct pdflush_work *my_work)
> > */
> > if (time_after(jiffies, last_empty_jifs + 1 * HZ)) {
> > if (list_empty(&pdflush_list)) {
> > - if (nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
> > + if (nr_pdflush_threads <
> > + nr_pdflush_threads_max) {
> > nr_pdflush_threads++;
> > spin_unlock_irq(&pdflush_lock);
> > start_one_pdflush_thread();
> > @@ -150,7 +159,7 @@ static int __pdflush(struct pdflush_work *my_work)
> > */
> > if (list_empty(&pdflush_list))
> > continue;
> > - if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
> > + if (nr_pdflush_threads <= nr_pdflush_threads_min)
> > continue;
> > pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
> > if (time_after(jiffies, pdf->when_i_went_to_sleep + 1 * HZ)) {
> > @@ -246,9 +255,9 @@ static int __init pdflush_init(void)
> > * Pre-set nr_pdflush_threads... If we fail to create,
> > * the count will be decremented.
> > */
> > - nr_pdflush_threads = MIN_PDFLUSH_THREADS;
> > + nr_pdflush_threads = nr_pdflush_threads_min;
> >
> > - for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
> > + for (i = 0; i < nr_pdflush_threads_min; i++)
> > start_one_pdflush_thread();
> > return 0;
> > }
>
>

2008-12-31 03:31:15

by Randy Dunlap

Subject: Re: [PATCH 2/2] Add /proc controls for pdflush threads

Peter W. Morreale wrote:
> On Tue, 2008-12-30 at 15:59 -0800, Randy Dunlap wrote:
>> Peter W Morreale wrote:
>>> From: "Peter W. Morreale" <[email protected]>
>>>
>>> This patch adds /proc entries to give the admin the ability to
>>> control the minimum and maximum number of pdflush threads. This allows
>>> finer control of pdflush on both large and small machines.
>>>
>>> The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
>>> '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.
>> These need to be documented in (ugh) either
>> Documentation/filesystems/proc.txt or
>> Documentation/sysctl/vm.txt. Hard to say which one, but I'd
>> recommend the latter (vm.txt) since they are sysctls.
>>
>>
>> It looks like there is a lot of duplication between them.
>> That's not good IMO.
>>
>
> Hummm... worse than that. Documentation/sysctl/vm.txt is missing 9
> proc/sys/vm entries and hasn't been updated since (ahem) 2.2.10.
>
> Documentation/filesystems/proc.txt is much more recent -
> 2.4.0-test11-pre4. My how time flies... :-)

Thanks for checking on those.

> I can take a stab at updating vm.txt. Would it be ok to remove the vm
> verbiage from proc.txt and replace it with a pointer to vm.txt?

Yes, sounds good to me (for sysctls). Thanks again.

>
> -PWM
>
>>> ---
>>>
>>> Signed-off-by: Peter W Morreale <[email protected]>
>>>
>>> include/linux/sysctl.h | 2 ++
>>> include/linux/writeback.h | 2 ++
>>> kernel/sysctl.c | 16 ++++++++++++++++
>>> mm/pdflush.c | 19 ++++++++++++++-----
>>> 4 files changed, 34 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
>>> index d0437f3..9921e62 100644
>>> --- a/include/linux/sysctl.h
>>> +++ b/include/linux/sysctl.h
>>> @@ -205,6 +205,8 @@ enum
>>> VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
>>> VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
>>> VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
>>> + VM_NR_PDFLUSH_THREADS_MAX=36, /* nr_pdflush_threads_max */
>>> + VM_NR_PDFLUSH_THREADS_MIN=37, /* nr_pdflush_threads_min */
>>> };
>>>
>>>
>>> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>>> index 12b15c5..ee566a0 100644
>>> --- a/include/linux/writeback.h
>>> +++ b/include/linux/writeback.h
>>> @@ -150,6 +150,8 @@ void writeback_set_ratelimit(void);
>>> /* pdflush.c */
>>> extern int nr_pdflush_threads; /* Global so it can be exported to sysctl
>>> read-only. */
>>> +extern int nr_pdflush_threads_max; /* Global so it can be exported to sysctl */
>>> +extern int nr_pdflush_threads_min; /* Global so it can be exported to sysctl */
>>>
>>>
>>> #endif /* WRITEBACK_H */
>>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>>> index 50ec088..6dae777 100644
>>> --- a/kernel/sysctl.c
>>> +++ b/kernel/sysctl.c
>>> @@ -948,6 +948,22 @@ static struct ctl_table vm_table[] = {
>>> .proc_handler = &proc_dointvec,
>>> },
>>> {
>>> + .ctl_name = VM_NR_PDFLUSH_THREADS_MIN,
>>> + .procname = "nr_pdflush_threads_min",
>>> + .data = &nr_pdflush_threads_min,
>>> + .maxlen = sizeof nr_pdflush_threads_min,
>>> + .mode = 0644 /* read-only*/,
>>> + .proc_handler = &proc_dointvec,
>>> + },
>>> + {
>>> + .ctl_name = VM_NR_PDFLUSH_THREADS_MAX,
>>> + .procname = "nr_pdflush_threads_max",
>>> + .data = &nr_pdflush_threads_max,
>>> + .maxlen = sizeof nr_pdflush_threads_max,
>>> + .mode = 0644 /* read-only*/,
>>> + .proc_handler = &proc_dointvec,
>>> + },
>>> + {
>>> .ctl_name = VM_SWAPPINESS,
>>> .procname = "swappiness",
>>> .data = &vm_swappiness,


--
~Randy

2008-12-31 04:11:20

by Peter W. Morreale

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

On Wed, 2008-12-31 at 03:46 +0100, Andi Kleen wrote:
> On Tue, Dec 30, 2008 at 06:56:29PM -0700, Peter W. Morreale wrote:
> > The rationale is simply that one size may not fit all. Currently
> > there is minimalistic tuning wrt pdflush thread creation.
>
> Thanks. Please put this into the commit description for future
> reference.
>

Will do.

> Some comments below.
>
> > Each pdflush thread arbitrarily decides to create another thread solely
> > on the basis of all currently existing threads being busy for >= one
> > second. So this can result in NCPUS pdflush threads being created
> > 'concurrently' should all existing threads wind up in the idle check at
> > the same time (I have seen this). Or it can result in a thread being
>
> So perhaps a simple lock serializing creation of new pdflush threads
> would avoid the problem?

nod. It's probably even simpler than that. By merely updating
'last_empty_jifs' when we decide to create the next pdflush thread, we
could easily account for _most_ of the races without adding significant
overhead.

I say most because the assumption would be that we will be successful in
creating the new thread. Not that bad an assumption I think. Besides,
the consequences of a miss are not harmful.
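
To illustrate, here is an untested sketch against the creation check
from patch 1/2; the only change is bumping last_empty_jifs under
pdflush_lock as soon as we commit to spawning a thread:

        if (time_after(jiffies, last_empty_jifs + 1 * HZ)) {
                if (list_empty(&pdflush_list) &&
                    nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
                        /* assume the creation succeeds; a miss is harmless */
                        last_empty_jifs = jiffies;
                        nr_pdflush_threads++;
                        spin_unlock_irq(&pdflush_lock);
                        start_one_pdflush_thread();
                        spin_lock_irq(&pdflush_lock);
                }
        }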

>
> > More to the point, on small systems with few file systems, what is the
> > point of having 8 (the current max) threads competing with each other on
> > a single disk? Likewise, on a 64-way, or larger system with dozens of
> > filesystems/disks, why wouldn't I want more background flushing?
>
> That makes some sense, but perhaps it would be better to base the default
> size on the number of underlying block devices then?
>
> Ok one issue there is that there are lots of different types of
> block devices, e.g. a big raid array may look like a single disk.
> Still I suspect defaults based on the block devices would do reasonably
> well.

Could be... However bear in mind that we traverse *filesystems*, not
block devs with background_writeout() (the pdflush work function).

But even if we did block devices, consider that we still don't know the
speed of those devices (consider SSD v. raid v. disk) and consequently,
we don't know how many threads to throw at the device before it becomes
congested and we're merely spinning our wheels. I mean, an SSD at
500MB/s (or greater) certainly could handle more pages being thrown at
it than an IDE drive...

And this ties back to MAX_WRITEBACK_PAGES (currently 1k), which is the
chunk we attempt to write out in one pass in order to not "hold the
inode lock too long".

What is the right magic number for the various types of block devs? 1k
for all? for all time? :-)

Anyway, back to the traversal of filesystems. In writeback_inodes(), we
currently traverse the super block list in reverse. I don't quite
understand why we do this, but <shrug>.

What this does mean is that we unfairly penalize certain file systems when
attempting to clean dirty pages. If I have 5 filesystems, all getting
hit on, then the last one in will always be the 'cleanest'. Not sure
that makes sense.

I was thinking about a patch that would go both directions - forward and
reverse depending upon, say, a bit in jiffies... Certainly not perfect,
but a bit more fair.
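
As an untested sketch of that idea (writeback_one_sb() is just a
made-up stand-in for the existing per-superblock body of
writeback_inodes()):

        /*
         * Grossly simplified: the real loop pins each sb and drops
         * sb_lock around the actual writeback; writeback_one_sb()
         * stands in for that existing body.
         */
        spin_lock(&sb_lock);
        if (jiffies & 1) {
                list_for_each_entry(sb, &super_blocks, s_list)
                        writeback_one_sb(sb, wbc);
        } else {
                list_for_each_entry_reverse(sb, &super_blocks, s_list)
                        writeback_one_sb(sb, wbc);
        }
        spin_unlock(&sb_lock);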

Actually, it seems to me that we need to look at a radically different
approach. What about making background writes a property of the super
block? (which implies the file system) Has that been discussed before?


Best,
-PWM



>
> >
> > I actually think the question is: Why not allow the admin to control
> > this? Since it seems like this is a matter of policy based on machine
> > configuration.
>
> The kernel should know the current machine config and most
> admins don't really want to do very fine grained configuration;
> they expect the system to perform well out of the box. That is
> why it is advantageous to try to come up with good auto tuning.
>
> > And btw Andi, I should have cc'ed you on this, I know from the flush
> > code that you were heavily involved. I only cc'ed Peter Z. since his
> > name was in pdflush.c. My apologies.
>
> No problem, but I think you're confusing me with someone else :). I don't
> remember ever hacking on this. Perhaps you thought of Andrew Morton
> who did pdflush originally.
>
> -Andi
>

2008-12-31 07:08:18

by Dave Chinner

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

On Tue, Dec 30, 2008 at 09:11:04PM -0700, Peter W. Morreale wrote:
> Actually, it seems to me that we need to look at a radically different
> approach. What about making background writes a property of the super
> block? (which implies the file system) Has that been discussed before?

Sure - there was a recent discussion in the context of how broken the
sync(2) syscall is.

That is, some filesystems (e.g. XFS) have certain requirements
to ensure sync actually works in all circumstances and the current
methods that sync employs make it impossible to sync correctly.

It's made worse by the fact that XFS already has internal methods
to completely and cleanly sync the filesystem (i.e. the freeze
code) but that can't be called in the sync() context because
writeback is a property of the VFS and not the filesystem.

It was discussed on linux-fsdevel. <rummage>

http://marc.info/?l=linux-fsdevel&m=122535673232638&w=2

I simply haven't had time to dig into this far enough to
come up with a patchset to fix the problem....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-12-31 08:02:01

by Andrew Morton

Subject: Re: [PATCH 2/2] Add /proc controls for pdflush threads

On Tue, 30 Dec 2008 16:12:33 -0700 Peter W Morreale <[email protected]> wrote:

> From: "Peter W. Morreale" <[email protected]>
>
> This patch adds /proc entries to give the admin the ability to
> control the minimum and maximum number of pdflush threads. This allows
> finer control of pdflush on both large and small machines.
>
> The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
> '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.

Why is this needed? Where's the benefit? What observations led you to
develop this patch? etc.

> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -205,6 +205,8 @@ enum
> VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
> VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
> VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
> + VM_NR_PDFLUSH_THREADS_MAX=36, /* nr_pdflush_threads_max */
> + VM_NR_PDFLUSH_THREADS_MIN=37, /* nr_pdflush_threads_min */
> };

We don't do this any more...

> .proc_handler = &proc_dointvec,
> },
> {
> + .ctl_name = VM_NR_PDFLUSH_THREADS_MIN,

please just use CTL_UNNUMBERED here.

2008-12-31 11:47:41

by Martin Knoblauch

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

please CC me on replies, I am not subscribed to LKML.


----- Original Message ----
> From: Andi Kleen <[email protected]>
> To: Peter W. Morreale <[email protected]>
> Cc: Andi Kleen <[email protected]>; [email protected]
> Sent: Wednesday, December 31, 2008 3:46:09 AM
> Subject: Re: [PATCH 0/2] pdflush fix and enhancement
>

snip

> > I actually think the question is: Why not allow the admin to control
> > this? Since it seems like this is a matter of policy based on machine
> > configuration.
>
> The kernel should know the current machine config and most
> admins don't really want to do very fine grained configuration;
> they expect the system to perform well out of the box. That is
> why it is advantageous to try to come up with good auto tuning.
>

Independent of the patch in question, the problem with this seems to me to be that [some/many of] the kernel developers [seem to] try to get it right for 100% of all thinkable use-cases. But this fails to take into account that:

- you cannot think of every single use-case. And not only because predicting future use-cases is difficult
- getting it right for every case very often creates complexity that leads to subtle problems that are hard to analyse and fix
- it may waste developer resources
- and, think about it, do we really want the kernel to be smarter than ourselves? :-)

You are right that the kernel should work out of the box most of the time. And it usually is pretty good at that. But there are corner cases where more flexibility for the admins is desirable - if only to debug a problem without doing deep code hacking. So we should be careful adding tuning knobs, but we should also admit that sometimes they are useful.


Happy New Year
Martin

2008-12-31 13:14:25

by Andi Kleen

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

> I say most because the assumption would be that we will be successful in
> creating the new thread. Not that bad an assumption I think. Besides,

And that the memory read is not reordered (rmb()).

> the consequences of a miss are not harmful.

Nod. Sounds reasonable.

>
> >
> > > More to the point, on small systems with few file systems, what is the
> > > point of having 8 (the current max) threads competing with each other on
> > > a single disk? Likewise, on a 64-way, or larger system with dozens of
> > > filesystems/disks, why wouldn't I want more background flushing?
> >
> > That makes some sense, but perhaps it would be better to base the default
> > size on the number of underlying block devices then?
> >
> > Ok one issue there is that there are lots of different types of
> > block devices, e.g. a big raid array may look like a single disk.
> > Still I suspect defaults based on the block devices would do reasonably
> > well.
>
> Could be... However bear in mind that we traverse *filesystems*, not
> block devs with background_writeout() (the pdflush work function).

My thinking was that on traditional block devices you roughly want only
N flushers per spindle (N a small number), because otherwise they will
just seek too much.

Anyway, iirc there's a way now to distinguish SSDs from normal
block devices based on hints from the block layer, but that still
doesn't handle the big RAID array case well.

>
> But even if we did block devices, consider that we still don't know the
> speed of those devices (consider SSD v. raid v. disk) and consequently,
> we don't know how many threads to throw at the device before it becomes
> congested and we're merely spinning our wheels. I mean, an SSD at
> 500MB/s (or greater) certainly could handle more pages being thrown at
> it than an IDE drive...

I was thinking just of the initial default, but you're right
it really needs to tune the upper limit.

>
> And this ties back to MAX_WRITEBACK_PAGES (currently 1k), which is the
> chunk we attempt to write out in one pass in order to not "hold the
> inode lock too long".
>
> What is the right magic number for the various types of block devs? 1k
> for all? for all time? :-)

Ok it probably needs some kind of feedback mechanism.

Perhaps keep an estimate of the average IO time for a single
flush and, when it reaches some threshold, start more threads?
Or have feedback from the elevators about how busy they are.

Of course it would still need an upper limit to prevent
a thread explosion in case IO suddenly becomes very slow
(e.g. in an error recovery case), but it could be much
higher than today.
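
Something along these lines perhaps (untested, all names made up);
note_flush_time() would wrap the call to my_work->fn() in __pdflush()
and the result could gate the existing creation check:

static unsigned long flush_avg_jifs;    /* EWMA of one flush pass */
#define FLUSH_SLOW      (HZ / 2)        /* "device looks congested" */

static void note_flush_time(unsigned long start)
{
        /* simple 7/8 exponentially weighted moving average */
        flush_avg_jifs = (7 * flush_avg_jifs + (jiffies - start)) / 8;
}

/* caller holds pdflush_lock */
static int want_more_pdflush(void)
{
        return flush_avg_jifs > FLUSH_SLOW &&
                nr_pdflush_threads < MAX_PDFLUSH_THREADS;
}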

>
> Anyway, back to the traversal of filesystems. In writeback_inodes(), we
> currently traverse the super block list in reverse. I don't quite
> understand why we do this, but <shrug>.
>
> What this does mean is that we unfairly penalize certain file systems when
> attempting to clean dirty pages. If I have 5 filesystems, all getting
> hit on, then the last one in will always be the 'cleanest'. Not sure
> that makes sense.

Probably not.

>
> I was thinking about a patch that would go both directions - forward and
> reverse depending upon, say, a bit in jiffies... Certainly not perfect,
> but a bit more fair.

Better a real RNG. But such probabilistic schemes unfortunately tend to drive
benchmarkers crazy, which is why it is better to avoid them.

I suppose you could just keep some state per fs to ensure fairness.

-Andi

2008-12-31 14:54:22

by Peter W. Morreale

Subject: Re: [PATCH 2/2] Add /proc controls for pdflush threads

On Wed, 2008-12-31 at 00:01 -0800, Andrew Morton wrote:
> On Tue, 30 Dec 2008 16:12:33 -0700 Peter W Morreale <[email protected]> wrote:
>
> > From: "Peter W. Morreale" <[email protected]>
> >
> > This patch adds /proc entries to give the admin the ability to
> > control the minimum and maximum number of pdflush threads. This allows
> > finer control of pdflush on both large and small machines.
> >
> > The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
> > '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.
>
> Why is this needed? Where's the benefit? What observations led you to
> develop this patch? etc.

I assume you are caught up on this thread, Andrew? (A later post
outlined the rationale.)

>
> > --- a/include/linux/sysctl.h
> > +++ b/include/linux/sysctl.h
> > @@ -205,6 +205,8 @@ enum
> > VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
> > VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
> > VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
> > + VM_NR_PDFLUSH_THREADS_MAX=36, /* nr_pdflush_threads_max */
> > + VM_NR_PDFLUSH_THREADS_MIN=37, /* nr_pdflush_threads_min */
> > };
>
> We don't do this any more...
>

nod.

> > .proc_handler = &proc_dointvec,
> > },
> > {
> > + .ctl_name = VM_NR_PDFLUSH_THREADS_MIN,
>
> please just use CTL_UNNUMBERED here.
>

nod. Thanks,

-PWM

>

2008-12-31 15:41:17

by Peter W. Morreale

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

On Wed, 2008-12-31 at 18:08 +1100, Dave Chinner wrote:
> On Tue, Dec 30, 2008 at 09:11:04PM -0700, Peter W. Morreale wrote:
> > Actually, it seems to me that we need to look at a radically different
> > approach. What about making background writes a property of the super
> > block? (which implies the file system) Has that been discussed before?
>
> Sure - there was a recent discussion in the context of how broken the
> sync(2) syscall is.
>
> That is, some filesystems (e.g. XFS) have certain requirements
> to ensure sync actually works in all circumstances and the current
> methods that sync employs make it impossible to sync correctly.
>
<snip>


Good point, but different. I was thinking merely in terms of the
forthcoming SSD devices and flushing, not syncing. We are approaching
the point (from hardware...) where persistent storage is becoming
balanced (wrt speed) with RAM.

This opens up a whole new world for cache considerations. Consider that
if my persistent storage is as fast as memory, then I want my memory
cache size for that device to be 0 (zero) sized - there is no point.

However, I have a number of different devices on my system, some disk,
some SSD, some optical, etc. Each has different characteristics, yet we
treat them identically.

(well, almost identically - we run through the SB list (and
consequently, the devices) in reverse all the time :-)

WIRWTD ("What I Really Want To Do") is to incorporate the
characteristics of the devices into the caching so I can optimize both
my use of cache as well as the particular device(s).

At the moment, we have two triggers, memory pressure (the dirty_*
tunings) and time (kupdate). Once these thresholds are reached, we
indiscriminately (wrt devices) begin flushing to achieve the minimum
threshold again. These are probably the right triggers from a system
perspective, but there are others we could consider as well.

For example, on a 'slow' device, I probably want to start flushing
sooner, rather than later. On a fast device, perhaps we wait a bit
longer before starting flushing.

At the end of the day we are governed by Little's Law, so we have to
optimize the exit from the system.

In general, we want flushing to reach the minimum dirty threshold as
fast as possible since we are taking cycles away from our applications.
(To me this is far more important than age...) So, WIRWTD is to create
a heuristic that takes into account:

o Device speed
o nr pages dirty 'owned' by the device.
o nr system dirty pages (e.g. existing dirty stuff)
o age (or do we really care?)
o tunings

Now we can weight flushing towards 'fast' devices to reach our
thresholds as well as ignore devices that offer little relief (e.g. have
no dirty pages outstanding)
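
As a back-of-the-envelope sketch of that weighting (untested, every
identifier made up):

struct flush_stats {
        unsigned long nr_dirty;         /* dirty pages 'owned' by the device */
        unsigned long pages_per_sec;    /* measured writeback rate */
};

static unsigned long flush_share(struct flush_stats *fs,
                                 unsigned long total_dirty,
                                 unsigned long pages_to_clean)
{
        unsigned long share;

        if (!fs->nr_dirty)
                return 0;       /* no relief to be had from this device */

        /* weight by this device's contribution to the dirty total... */
        share = pages_to_clean * fs->nr_dirty / total_dirty;

        /* ...but don't ask for more than it can retire in about a second */
        return min(share, fs->pages_per_sec);
}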

Perhaps the "cache maintenance responsibility" belongs to the device???

Best,
-PWM




2008-12-31 16:08:47

by Peter W. Morreale

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

On Wed, 2008-12-31 at 14:27 +0100, Andi Kleen wrote:
> > I say most because the assumption would be that we will be successful in
> > creating the new thread. Not that bad an assumption I think. Besides,
>
> And that the memory read is not reordered (rmb()).
>

At the risk of showing my b*tt here... I'm not very clear on memory
barriers, is this necessary even inside a critical region? (recall
we're protected by the spin lock). If so, does the barrier go after the
read, or before? (Thanks for not laughing, however grins are allowed)


>
> Ok it probably needs some kind of feedback mechanism.
>

Actually, I tend to think we need an entirely different approach to
flushing, please see my post to David Chinner which outlines some
thoughts. Basically a flushing heuristic that takes into account the
characteristics of the various block devices.


> >
> > I was thinking about a patch that would go both directions - forward and
> > reverse depending upon, say, a bit in jiffies... Certainly not perfect,
> > but a bit more fair.
>
> Better a real RNG. But such probabilistic schemes unfortunately tend to drive
> benchmarkers crazy, which is why it is better to avoid them.
>

Nod, but that's ok. Having been one for several years I can truthfully
say that benchmarkers are a little crazy anyways... :-)


> I suppose you could just keep some state per fs to ensure fairness.
>

Nod, this would be ideal.

-PWM

> -Andi

2009-01-01 01:35:16

by Andi Kleen

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

On Wed, Dec 31, 2008 at 09:08:26AM -0700, Peter W. Morreale wrote:
> On Wed, 2008-12-31 at 14:27 +0100, Andi Kleen wrote:
> > > I say most because the assumption would be that we will be successful in
> > > creating the new thread. Not that bad an assumption I think. Besides,
> >
> > And that the memory read is not reordered (rmb()).
> >
>
> At the risk of showing my b*tt here... I'm not very clear on memory
> barriers, is this necessary even inside a critical region? (recall
> we're protected by the spin lock).

You're right the implied barriers in the spinlock are probably enough.
Never mind.

> If so, does the barrier go after the
> read, or before? (Thanks for not laughing, however grins are allowed)

Before.

BTW on x86 it's a nop either way, but not on all other architectures.

>
>
> >
> > Ok it probably needs some kind of feedback mechanism.
> >
>
> Actually, I tend to think we need an entirely different approach to
> flushing, please see my post to David Chinner which outlines some
> thoughts. Basically a flushing heuristic that takes into account the
> characteristics of the various block devices.

Ideally discovered at runtime (e.g. by watching queue lengths/service
times etc.) though. Otherwise the kernel would need to have knowledge
about the properties of all kinds of devices.

-Andi
--
[email protected]

2009-01-01 23:27:32

by Dave Chinner

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

On Wed, Dec 31, 2008 at 08:40:56AM -0700, Peter W. Morreale wrote:
> On Wed, 2008-12-31 at 18:08 +1100, Dave Chinner wrote:
> > On Tue, Dec 30, 2008 at 09:11:04PM -0700, Peter W. Morreale wrote:
> > > Actually, it seems to me that we need to look at a radically different
> > > approach. What about making background writes a property of the super
> > > block? (which implies the file system) Has that been discussed before?
> >
> > Sure - there was a recent discussion in the context of how broken the
> > sync(2) syscall is.
> >
> > That is, some filesystems (e.g. XFS) have certain requirements
> > to ensure sync actually works in all circumstances and the current
> > methods that sync employs make it impossible to sync correctly.
> >
> <snip>
>
>
> Good point, but different. I was thinking merely in terms of the
> forthcoming SSD devices and flushing, not syncing. We are approaching
> the point (from hardware...) where persistent storage is becoming
> balanced (wrt speed) with RAM.

Flushing is simply the first part of syncing ;)

> This opens up a whole new world for cache considerations. Consider that
> if my persistent storage is as fast as memory, then I want my memory
> cache size for that device to be 0 (zero) sized - there is no point.

mmap()?

> However, I have a number of different devices on my system, some disk,
> some SSD, some optical, etc. Each has different characteristics, yet we
> treat them identically.
>
> (well, almost identically - we run through the SB list (and
> consequently, the devices) in reverse all the time :-)
>
> WIRWTD ("What I Really Want To Do") is to incorporate the
> characteristics of the devices into the caching so I can optimize both
> my use of cache as well as the particular device(s).

IOWs you want the block device characteristics to determine the caching
and flushing techniques the higher layers use.

> At the moment, we have two triggers, memory pressure (the dirty_*
> tunings) and time (kupdate). Once these thresholds are reached, we
> indiscriminately (wrt devices) begin flushing to achieve the minimum
> threshold again. These are probably the right triggers from a system
> perspective, but there are others we could consider as well.
>
> For example, on a 'slow' device, I probably want to start flushing
> sooner, rather than later. On a fast device, perhaps we wait a bit
> longer before starting flushing.

I think you'll find it's the other way around. It's *hard* to keep a fast
block device busy - you turn the memory cache over much, much faster
so reclaim of the cache needs to happen much faster and this becomes
the limiting factor when trying to write back multiple GB of data
every second.

> At the end of the day we are governed by Little's Law, so we have to
> optimize the exit from the system.
>
> In general, we want flushing to reach the minimum dirty threshold as
> fast as possible since we are taking cycles away from our applications.

I disagree. You need a certain amount of memory for optimising
operations such as avoiding writeback of short-term temporary files.
e.g. do a build in one directory followed by a clean to make sure
the build works. In that case, you want the object data held in
memory so the only operations that hit the disk are creates and
unlinks.

Remember, SSDs are still limited in their bandwidth and IOPS. The
current SSDs (e.g. intel) have the capability of a small RAID array
with a NVRAM write cache. Even though they are fast, we still need
to optimise at a high level to make most efficient use of their
limited resources....

> (To me this is far more important than age...) So, WIRWTD is to create
> a heuristic that takes into account:
>
> o Device speed

We calculate that on the fly based on the flushing rate of
the BDI.

> o nr pages dirty 'owned' by the device.

Already got that (bdi stats) and it is used for the above
calculation.

> o nr system dirty pages (e.g. existing dirty stuff)

Already got that.

> o age (or do we really care?)

Got that because we do care about ensuring the oldest
dirty stuff gets written back before something newly dirtied.

> o tunings
>
> Now we can weight flushing towards 'fast' devices to reach our
> thresholds as well as ignore devices that offer little relief (e.g. have
> no dirty pages outstanding)

Check out:

/sys/block/*/bdi/min_ratio
/sys/block/*/bdi/max_ratio

To change the proportions of the writeback pie a given block device
will be given. I think that is what you want.

However, this still doesn't address the real problem with pdflush
and flushing. That is, pdflush (like sync) assumes that the fastest
(and only) way to flush a filesystem is to walk across dirty inodes
in age order and flush their data followed by immediate flushing of
the inode. That may work for ext3, but it's far from optimal for
XFS, btrfs, etc which have far different optimal flushing
strategies.

There's a bunch more info about these problems and the path we're
trying to head down for XFS here:

http://xfs.org/index.php/Improving_inode_Caching

and specifically to this topic:

http://xfs.org/index.php/Improving_inode_Caching#Avoiding_the_Generic_pdflush_Code

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-01-02 02:08:20

by Peter W. Morreale

Subject: Re: [PATCH 0/2] pdflush fix and enhancement

On Fri, 2009-01-02 at 10:27 +1100, Dave Chinner wrote:
> On Wed, Dec 31, 2008 at 08:40:56AM -0700, Peter W. Morreale wrote:
..
> >
> > For example, on a 'slow' device, I probably want to start flushing
> > sooner, rather than later. On a fast device, perhaps we wait a bit
> > longer before starting flushing.
>
> I think you'll find it's the other way around. It's *hard* to keep a fast
> block device busy - you turn the memory cache over much, much faster
> so reclaim of the cache needs to happen much faster and this becomes
> the limiting factor when trying to write back multiple GB of data
> every second.

Nod. However, biasing flushing towards the fast block devices penalizes
applications referencing those devices. It's good for reclaim in that we
get our space quicker, but future references may involve another read
since those pages may have been reallocated elsewhere. Consequently, if
the onus of creating free space is biased towards the fastest devices,
applications referencing them may suffer. Do I have that right?

From a 1000ft view, it seems to me that we want all devices to reach the
finish line at the same time, each performing its appropriate share of
the cleanup based on the amount it contributed to the dirty memory
space.

It seems to me that we want (by default) to spread out the cost of
reclaim on an even basis across all block devs.


>
> > At the end of the day we are governed by Little's Law, so we have to
> > optimize the exit from the system.
> >
> > In general, we want flushing to reach the minimum dirty threshold as
> > fast as possible since we are taking cycles away from our applications.
>
> I disagree. You need a certain amount of memory for optimising
> operations such as avoiding writeback of short-term temporary files.
> e.g. do a build in one directory followed by a clean to make sure
> the build works. In that case, you want the object data held in
> memory so the only operations that hit the disk are creates and
> unlinks.

Heh, I don't think you disagree with giving cycles to applications,
right?

Rereading my statement, what I probably should have said was "reach the
_maximum_ dirty threshold", meaning that we stop generic flushing
as soon as possible so things like temporary files are still cached.

>
> Remember, SSDs are still limited in their bandwidth and IOPS. The
> current SSDs (e.g. intel) have the capability of a small RAID array
> with a NVRAM write cache. Even though they are fast, we still need
> to optimise at a high level to make most efficient use of their
> limited resources....

> > (To me this is far more important than age...) So, WIRWTD is to create
> > a heuristic that takes into account:
> >
> > o Device speed
>
> We calculate that on the fly based on the flushing rate of
> the BDI.
>
> > o nr pages dirty 'owned' by the device.
>
> Already got that (bdi stats) and it is used for the above
> calculation.
>
> > o nr system dirty pages (e.g. existing dirty stuff)
>
> Already got that.
>
> > o age (or do we really care?)
>
> Got that because we do care about ensuring the oldest
> dirty stuff gets written back before something newly dirtied.
>
> > o tunings
> >
> > Now we can weight flushing towards 'fast' devices to reach our
> > thresholds as well as ignore devices that offer little relief (e.g. have
> > no dirty pages outstanding)
>
> Check out:
>
> /sys/block/*/bdi/min_ratio
> /sys/block/*/bdi/max_ratio
>
> To change the proportions of the writeback pie a given block device
> will be given. I think that is what you want.
>
> However, this still doesn't address the real problem with pdflush
> and flushing. That is, pdflush (like sync) assumes that the fastest
> (and only) way to flush a filesystem is to walk across dirty inodes
> in age order and flush their data followed by immediate flushing of
> the inode. That may work for ext3, but it's far from optimal for
> XFS, btrfs, etc which have far different optimal flushing
> strategies.
>
> There's a bunch more info about these problems and the path we're
> trying to head down for XFS here:
>
> http://xfs.org/index.php/Improving_inode_Caching
>
> and specifically to this topic:
>
> http://xfs.org/index.php/Improving_inode_Caching#Avoiding_the_Generic_pdflush_Code
>

Thanks for these, I'll read on...

Best,
-PWM


> Cheers,
>
> Dave.