2010-07-16 12:49:33

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 00/16] kill unnecessary bdi wakeups + cleanups

Hi,

here is a series of patches which reduces the number of unnecessary wake-ups in
the Linux kernel.

THE PROBLEM
~~~~~~~~~~~

Each block device has a corresponding "flusher" thread, usually seen as
"flush-x:y" in your 'ps' output. Flusher threads are responsible for doing
background write-back for different kernel code paths, like memory reclamation,
as well as for the periodic background writeout.

The flusher threads wake up every 5 seconds and check whether they have to
write anything back or not. On idle systems with good dynamic power-management
this means that they force the CPU to wake up from deep sleep, find out that
there is nothing to do, and waste power. This hurts small battery-powered
devices, e.g., Linux-based phones.

But the idle wake-ups do not last forever: the bdi threads kill themselves if
nothing useful has been done for 5 minutes.

Also, there is the bdi forker thread, seen as 'bdi-default' in your 'ps'
output. This thread also wakes up every 5 seconds and checks whether it has to
fork a bdi flusher thread, in case there is dirty data to write back but the
corresponding thread was killed. This is also bad for battery-powered embedded
devices.

SOLUTION
~~~~~~~~

This patch set makes the bdi threads and the forker thread wake up only if
there is work to do; otherwise they sleep. The main idea is to wake up the
needed thread when dirty data is added to the bdi.
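For illustration, here is a minimal sketch of the idea (simplified pseudo-kernel
code, not the actual patch - see patch 16 for the real hunks): only the
transition from "no dirty I/O" to "some dirty I/O" needs a wake-up, so the
marking path nudges the flusher exactly once and the thread re-arms itself
afterwards.

	/* Sketch only: locking and the NULL-task case are omitted. */
	static void mark_inode_dirty_sketch(struct inode *inode)
	{
		struct backing_dev_info *bdi = inode_to_bdi(inode);
		bool was_idle = !wb_has_dirty_io(&bdi->wb);

		/* ... move the inode to bdi->wb.b_dirty under inode_lock ... */

		if (was_idle && bdi->wb.task)
			/* First dirty inode for this bdi: wake the flusher. */
			wake_up_process(bdi->wb.task);
	}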

To implement this:

1. I address various rare race conditions in the current code.
2. Then I move the killing logic from the bdi threads to the forker thread, so
that we have one central place where we make decisions about killing inactive
bdi threads. The reason is that otherwise it is difficult to make sure that a
bdi thread does not decide to exit while we are waking it up.
3. I add a small piece of code to '__mark_inode_dirty()' which wakes up the bdi
thread when dirty inodes arrive.
4. There are also clean-up and nicification patches which I found to improve
code readability.
5. Some patches are just preparations which make the following real patches
simpler and easier to review.
6. Some patches are just simplifications of the current code.

With this patchset, bdi threads wake up considerably less often.

This is an RFC so far, because:
1. I'm new to this area
2. the patches really need review
3. the patches need more testing and polishing

Side note: there is another set of patches, which solves a similar problem,
waiting for comments/acceptance:
http://thread.gmane.org/gmane.linux.file-systems/42834

Artem.


2010-07-16 12:49:09

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 02/16] writeback: remove redundant list initialization

From: Artem Bityutskiy <[email protected]>

We do not have to call 'INIT_LIST_HEAD()' for list elements
('bdi->bdi_list') before inserting them into the 'bdi_pending_list',
because 'list_add_tail()' overwrites the element's link pointers
anyway. Thus, kill the redundant initialization.
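For reference, this is what 'list_add_tail()' does (the basic variant from
include/linux/list.h, without CONFIG_DEBUG_LIST checks) - it unconditionally
overwrites both link pointers of the new element, so initializing them first
is dead code:

	static inline void __list_add(struct list_head *new,
				      struct list_head *prev,
				      struct list_head *next)
	{
		next->prev = new;
		new->next = next;
		new->prev = prev;
		prev->next = new;
	}

	static inline void list_add_tail(struct list_head *new,
					 struct list_head *head)
	{
		__list_add(new, head->prev, head);
	}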

Signed-off-by: Artem Bityutskiy <[email protected]>
---
mm/backing-dev.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ff5669a..0fad60d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -416,7 +416,6 @@ static void bdi_add_to_pending(struct rcu_head *head)
struct backing_dev_info *bdi;

bdi = container_of(head, struct backing_dev_info, rcu_head);
- INIT_LIST_HEAD(&bdi->bdi_list);

spin_lock(&bdi_lock);
list_add_tail(&bdi->bdi_list, &bdi_pending_list);
--
1.7.1.1

2010-07-16 12:49:38

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 06/16] writeback: improve bdi_has_dirty_io

From: Artem Bityutskiy <[email protected]>

This is a clean-up patch. Currently the purpose of the
'bdi_has_dirty_io()' function is not very clear - it is equivalent
to 'wb_has_dirty_io()' and there is little point in its existence.
It is also used inconsistently, e.g., in 'bdi_forker_thread()'
we use

if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list))

but in another place we use a different construct to achieve
essentially the same goal:

if (!bdi_has_dirty_io(bdi) && list_empty(&bdi->work_list))

which is inconsistent and makes the code more difficult to follow.

This patch changes the semantics of 'bdi_has_dirty_io()' a bit: now
it checks for both dirty I/O _and_ queued bdi works. Code which needs
to check only for dirty inodes can use 'wb_has_dirty_io()'. This makes
the bdi forker code a bit nicer and justifies the existence of
'bdi_has_dirty_io()'.

Just a small cleanup.
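The resulting function, assembled from the diff below for readability:

	int bdi_has_dirty_io(struct backing_dev_info *bdi)
	{
		/* Dirty inodes queued on the bdi, or pending bdi works. */
		return wb_has_dirty_io(&bdi->wb) ||
		       !list_empty(&bdi->work_list);
	}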

Signed-off-by: Artem Bityutskiy <[email protected]>
---
fs/fs-writeback.c | 2 +-
mm/backing-dev.c | 10 +++++-----
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5e6b7fc..3fc5194 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -891,7 +891,7 @@ void wakeup_flusher_threads(long nr_pages)

rcu_read_lock();
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
- if (!bdi_has_dirty_io(bdi))
+ if (!wb_has_dirty_io(&bdi->wb))
continue;
__bdi_start_writeback(bdi, nr_pages, false, false);
}
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8be2e13..03a3d82 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -263,7 +263,8 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)

int bdi_has_dirty_io(struct backing_dev_info *bdi)
{
- return wb_has_dirty_io(&bdi->wb);
+ return wb_has_dirty_io(&bdi->wb) ||
+ !list_empty(&bdi->work_list);
}

static void bdi_flush_io(struct backing_dev_info *bdi)
@@ -338,7 +339,7 @@ static int bdi_forker_thread(void *ptr)
* Temporary measure, we want to make sure we don't see
* dirty data on the default backing_dev_info
*/
- if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list))
+ if (bdi_has_dirty_io(me->bdi))
wb_do_writeback(me, 0);

spin_lock_bh(&bdi_lock);
@@ -350,8 +351,7 @@ static int bdi_forker_thread(void *ptr)
list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
if (bdi->wb.task)
continue;
- if (list_empty(&bdi->work_list) &&
- !bdi_has_dirty_io(bdi))
+ if (!bdi_has_dirty_io(bdi))
continue;

bdi_add_default_flusher_thread(bdi);
@@ -637,7 +637,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
* Splice our entries to the default_backing_dev_info, if this
* bdi disappears
*/
- if (bdi_has_dirty_io(bdi)) {
+ if (wb_has_dirty_io(&bdi->wb)) {
struct bdi_writeback *dst = &default_backing_dev_info.wb;

spin_lock(&inode_lock);
--
1.7.1.1

2010-07-16 12:49:44

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 10/16] writeback: simplify bdi code a little

From: Artem Bityutskiy <[email protected]>

This patch simplifies the bdi code a little by removing the redundant
'bdi_pending_list'. Indeed, currently the forker thread
('bdi_forker_thread()') works like this:

1. In a loop, fetch all bdi's which have works but have no writeback
thread, and move them to the 'bdi_pending_list'.
2. If the list is empty, sleep for 5 seconds.
3. Otherwise, take one bdi from the list, fork the writeback thread for
this bdi, and repeat the loop.

IOW, it first moves everything to the 'bdi_pending_list', then processes
only one element, and so on. With this patch the algorithm is:

1. Find the first bdi which has work and remove it from the global
list of bdi's (bdi_list).
2. If there is no such bdi, sleep for 5 seconds.
3. Otherwise, fork the writeback thread for this bdi and repeat the loop.

IOW, now we find the first bdi to process, process it, and so on. This
is simpler and involves fewer lists.

This patch also removes unneeded and unused complications involving
'call_rcu()' and 'bdi->rcu_head' - now we use a simple 'synchronize_rcu()'
and remove the 'rcu_head' field from 'struct backing_dev_info'.
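For clarity, here is a condensed sketch of the resulting forker loop (locking,
capability checks and error handling omitted; the diff below is authoritative):

	for (;;) {
		bool fork = false;

		/* 1. Find the first bdi that needs a thread and unlink it. */
		list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
			if (bdi->wb.task || !bdi_has_dirty_io(bdi))
				continue;
			list_del_rcu(&bdi->bdi_list);
			fork = true;
			break;
		}

		/* 2. No such bdi: sleep for dirty_writeback_interval. */
		if (!fork) {
			schedule_timeout(
				msecs_to_jiffies(dirty_writeback_interval * 10));
			continue;
		}

		/* 3. Fork the writeback thread for this bdi and loop again. */
		set_bit(BDI_pending, &bdi->state);
		synchronize_rcu();
		kthread_run(bdi_writeback_thread, &bdi->wb, "flush-%s",
			    dev_name(bdi->dev));
	}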

Signed-off-by: Artem Bityutskiy <[email protected]>
---
include/linux/backing-dev.h | 1 -
mm/backing-dev.c | 75 ++++++++++--------------------------------
2 files changed, 18 insertions(+), 58 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f0936f5..95ecb2b 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -58,7 +58,6 @@ struct bdi_writeback {

struct backing_dev_info {
struct list_head bdi_list;
- struct rcu_head rcu_head;
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long state; /* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ab783b9..a42e5ef 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -50,8 +50,6 @@ static struct timer_list sync_supers_timer;
static int bdi_sync_supers(void *);
static void sync_supers_timer_fn(unsigned long);

-static void bdi_add_default_flusher_thread(struct backing_dev_info *bdi);
-
#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
#include <linux/seq_file.h>
@@ -332,6 +330,7 @@ static int bdi_forker_thread(void *ptr)
set_user_nice(current, 0);

for (;;) {
+ bool fork = false;
struct backing_dev_info *bdi, *tmp;
struct task_struct *task;

@@ -346,22 +345,29 @@ static int bdi_forker_thread(void *ptr)
* a thread registered. If so, set that up.
*/
list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+ if (!bdi_cap_writeback_dirty(bdi))
+ continue;
if (bdi->wb.task)
continue;
if (!bdi_has_dirty_io(bdi))
continue;

- bdi_add_default_flusher_thread(bdi);
+ WARN(!test_bit(BDI_registered, &bdi->state),
+ "bdi %p/%s is not registered!\n", bdi, bdi->name);
+
+ list_del_rcu(&bdi->bdi_list);
+ fork = true;
+ break;
}
+ spin_unlock_bh(&bdi_lock);

/* Keep working if default bdi still has some works */
if (!list_empty(&me->bdi->work_list))
__set_current_state(TASK_RUNNING);

- if (list_empty(&bdi_pending_list)) {
+ if (!fork) {
unsigned long wait;

- spin_unlock_bh(&bdi_lock);
wait = msecs_to_jiffies(dirty_writeback_interval * 10);
if (wait)
schedule_timeout(wait);
@@ -374,13 +380,13 @@ static int bdi_forker_thread(void *ptr)
__set_current_state(TASK_RUNNING);

/*
- * This is our real job - check for pending entries in
- * bdi_pending_list, and create the threads that got added
+ * Set the pending bit - if someone will try to unregister this
+ * bdi - it'll wait on this bit.
*/
- bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
- bdi_list);
- list_del_init(&bdi->bdi_list);
- spin_unlock_bh(&bdi_lock);
+ set_bit(BDI_pending, &bdi->state);
+
+ /* Make sure no one uses the picked bdi */
+ synchronize_rcu();

task = kthread_run(bdi_writeback_thread, &bdi->wb, "flush-%s",
dev_name(bdi->dev));
@@ -393,7 +399,7 @@ static int bdi_forker_thread(void *ptr)
* flush other bdi's to free memory.
*/
spin_lock_bh(&bdi_lock);
- list_add_tail(&bdi->bdi_list, &bdi_pending_list);
+ list_add_tail(&bdi->bdi_list, &bdi_list);
spin_unlock_bh(&bdi_lock);

bdi_flush_io(bdi);
@@ -404,50 +410,6 @@ static int bdi_forker_thread(void *ptr)
return 0;
}

-static void bdi_add_to_pending(struct rcu_head *head)
-{
- struct backing_dev_info *bdi;
-
- bdi = container_of(head, struct backing_dev_info, rcu_head);
-
- spin_lock(&bdi_lock);
- list_add_tail(&bdi->bdi_list, &bdi_pending_list);
- spin_unlock(&bdi_lock);
-}
-
-/*
- * Add the default flusher thread that gets created for any bdi
- * that has dirty data pending writeout
- */
-void static bdi_add_default_flusher_thread(struct backing_dev_info *bdi)
-{
- if (!bdi_cap_writeback_dirty(bdi))
- return;
-
- if (WARN_ON(!test_bit(BDI_registered, &bdi->state))) {
- printk(KERN_ERR "bdi %p/%s is not registered!\n",
- bdi, bdi->name);
- return;
- }
-
- /*
- * Check with the helper whether to proceed adding a thread. Will only
- * abort if we two or more simultanous calls to
- * bdi_add_default_flusher_thread() occured, further additions will
- * block waiting for previous additions to finish.
- */
- if (!test_and_set_bit(BDI_pending, &bdi->state)) {
- list_del_rcu(&bdi->bdi_list);
-
- /*
- * We must wait for the current RCU period to end before
- * moving to the pending list. So schedule that operation
- * from an RCU callback.
- */
- call_rcu(&bdi->rcu_head, bdi_add_to_pending);
- }
-}
-
/*
* Look up for bdi in the bdi_list. If found, remove it, ensure that it is
* no longer visible, and return 0. If not found, return 1.
@@ -603,7 +565,6 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->max_ratio = 100;
bdi->max_prop_frac = PROP_FRAC_BASE;
spin_lock_init(&bdi->wb_lock);
- INIT_RCU_HEAD(&bdi->rcu_head);
INIT_LIST_HEAD(&bdi->bdi_list);
INIT_LIST_HEAD(&bdi->work_list);

--
1.7.1.1

2010-07-16 12:49:51

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 11/16] writeback: move last_active to bdi

From: Artem Bityutskiy <[email protected]>

bdi threads use a local variable 'last_active', which stores the last
jiffies value at which the bdi thread did some useful work. Move this
local variable to 'struct bdi_writeback'. This is just a
preparation for the further patches, which will make the
forker thread decide when bdi threads should be killed.

Signed-off-by: Artem Bityutskiy <[email protected]>
---
fs/fs-writeback.c | 6 +++---
include/linux/backing-dev.h | 13 +++++++------
2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f045450..8469b93 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -800,7 +800,6 @@ int bdi_writeback_thread(void *data)
{
struct bdi_writeback *wb = data;
struct backing_dev_info *bdi = wb->bdi;
- unsigned long last_active = jiffies;
unsigned long wait_jiffies = -1UL;
long pages_written;

@@ -813,6 +812,7 @@ int bdi_writeback_thread(void *data)

current->flags |= PF_FLUSHER | PF_SWAPWRITE;
set_freezable();
+ wb->last_active = jiffies;

/*
* Our parent may run at a different priority, just set us to normal
@@ -834,7 +834,7 @@ int bdi_writeback_thread(void *data)
trace_writeback_pages_written(pages_written);

if (pages_written)
- last_active = jiffies;
+ wb->last_active = jiffies;
else if (wait_jiffies != -1UL) {
unsigned long max_idle;

@@ -844,7 +844,7 @@ int bdi_writeback_thread(void *data)
* recreated automatically.
*/
max_idle = max(5UL * 60 * HZ, wait_jiffies);
- if (time_after(jiffies, max_idle + last_active))
+ if (time_after(jiffies, max_idle + wb->last_active))
break;
}

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 95ecb2b..71b6223 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -45,15 +45,16 @@ enum bdi_stat_item {
#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))

struct bdi_writeback {
- struct backing_dev_info *bdi; /* our parent bdi */
+ struct backing_dev_info *bdi; /* our parent bdi */
unsigned int nr;

- unsigned long last_old_flush; /* last old data flush */
+ unsigned long last_old_flush; /* last old data flush */
+ unsigned long last_active; /* last time bdi thread was active */

- struct task_struct *task; /* writeback thread */
- struct list_head b_dirty; /* dirty inodes */
- struct list_head b_io; /* parked for writeback */
- struct list_head b_more_io; /* parked for more writeback */
+ struct task_struct *task; /* writeback thread */
+ struct list_head b_dirty; /* dirty inodes */
+ struct list_head b_io; /* parked for writeback */
+ struct list_head b_more_io; /* parked for more writeback */
};

struct backing_dev_info {
--
1.7.1.1

2010-07-16 12:49:56

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 09/16] writeback: do not lose wake-ups in bdi threads

From: Artem Bityutskiy <[email protected]>

The bdi threads ('bdi_writeback_thread()') can lose wake-ups if,
for example, 'bdi_queue_work()' is executed after the bdi thread
has finished 'wb_do_writeback()' but before it has called
'schedule_timeout_interruptible()'.

To fix this issue, we have to check whether we have works to
process after we change the task state to 'TASK_INTERRUPTIBLE'.

Also, 'bdi_writeback_thread()' handles the cases when
'dirty_writeback_interval' is zero and non-zero inconsistently. But
there is no fundamental difference between these cases, so they should
be handled the same way, which this patch also does.

This patch also removes the strange 'list_empty_careful()' call.
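In other words, the loop converges on the canonical kernel sleep pattern -
set the task state first, re-check the wake-up condition, and only then
schedule. A condensed sketch of the fixed ordering, mirroring the diff below:

	set_current_state(TASK_INTERRUPTIBLE);
	if (!list_empty(&bdi->work_list)) {
		/* A work arrived after wb_do_writeback() - go process it. */
		__set_current_state(TASK_RUNNING);
		continue;
	}

	if (dirty_writeback_interval)
		schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
	else
		schedule();

Because the state is set before the check, a concurrent 'bdi_queue_work()'
which adds a work and calls 'wake_up_process()' is either seen by the check,
or flips the task back to TASK_RUNNING so the schedule call returns
immediately - the wake-up can no longer be lost.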

Signed-off-by: Artem Bityutskiy <[email protected]>
---
fs/fs-writeback.c | 17 +++++++++--------
1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3fc5194..f045450 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -848,17 +848,18 @@ int bdi_writeback_thread(void *data)
break;
}

- if (dirty_writeback_interval) {
- wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
- schedule_timeout_interruptible(wait_jiffies);
- } else {
- set_current_state(TASK_INTERRUPTIBLE);
- if (list_empty_careful(&wb->bdi->work_list) &&
- !kthread_should_stop())
- schedule();
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (!list_empty(&bdi->work_list)) {
__set_current_state(TASK_RUNNING);
+ continue;
}

+ if (dirty_writeback_interval) {
+ wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
+ schedule_timeout(wait_jiffies);
+ } else
+ schedule();
+
try_to_freeze();
}

--
1.7.1.1

2010-07-16 12:50:00

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 08/16] writeback: do not lose default bdi wake-ups

From: Artem Bityutskiy <[email protected]>

Currently the default bdi can lose wake-ups. E.g., if
'bdi_queue_work()' is called while 'bdi_forker_thread()' is executing
code after

if (bdi_has_dirty_io(me->bdi))
wb_do_writeback(me, 0);

but before 'set_current_state(TASK_INTERRUPTIBLE)'. This is unlikely,
and the result is not very severe - we'll just delay the execution
of the work, but this is still not very nice.

This patch fixes the issue by checking whether the default bdi has
works before the forker thread goes to sleep.

This patch also removes the strange comment about a temporary measure,
because there is nothing temporary there - many file-systems use the
default bdi.

Signed-off-by: Artem Bityutskiy <[email protected]>
---
mm/backing-dev.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 14fe1bb..ab783b9 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -335,10 +335,6 @@ static int bdi_forker_thread(void *ptr)
struct backing_dev_info *bdi, *tmp;
struct task_struct *task;

- /*
- * Temporary measure, we want to make sure we don't see
- * dirty data on the default backing_dev_info
- */
if (bdi_has_dirty_io(me->bdi))
wb_do_writeback(me, 0);

@@ -358,6 +354,10 @@ static int bdi_forker_thread(void *ptr)
bdi_add_default_flusher_thread(bdi);
}

+ /* Keep working if default bdi still has some works */
+ if (!list_empty(&me->bdi->work_list))
+ __set_current_state(TASK_RUNNING);
+
if (list_empty(&bdi_pending_list)) {
unsigned long wait;

--
1.7.1.1

2010-07-16 12:50:08

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 12/16] writeback: add to bdi_list in the forker thread

From: Artem Bityutskiy <[email protected]>

Before creating a bdi thread, the forker thread first removes the
corresponding bdi from the 'bdi_list', then creates the bdi thread
and wakes it up. The thread then adds itself back to the 'bdi_list'.

There is no problem with this, except that it makes the logic a
tiny bit more twisted, because the code reader has to spend some
time figuring out when the bdi is moved back. But this is minor.

However, this patch makes the forker thread add the bdi back to
the 'bdi_list' itself, rather than letting the bdi thread do this.

The reason for the change is to prepare for further changes. Namely,
the following patches will move the bdi thread exiting logic from
the bdi threads to the forker thread, so the forker thread will kill
inactive bdi threads. For the killing case, the forker thread
will have to add bdi's back to the 'bdi_list' itself. And to keep
the code consistent, it is better if we use the same approach for
the bdi thread creation path as well. This is just more elegant.

Note: bdi threads used to add themselves to the head of the 'bdi_list',
but in the error path the forker thread added them to the tail of
the 'bdi_list'. This patch modifies the code so that bdi's are
always added to the tail. There is no fundamental reason for this;
it is, again, just for consistency and to make the code simpler.
I just do not see any problem with adding them back to the tail
instead of the head. And this should even be a bit better, because
the next iteration of the bdi forker thread loop will start looking
at the 'bdi_list' from the head, so it will find a bdi which
requires attention a bit faster.

And for exactly the same reasons, this patch also moves the
piece of code which clears the BDI_pending bit and wakes up the
waiters from the bdi threads to the forker thread.

Signed-off-by: Artem Bityutskiy <[email protected]>
---
fs/fs-writeback.c | 14 --------------
mm/backing-dev.c | 21 +++++++++++++++------
2 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8469b93..968dc8e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -803,13 +803,6 @@ int bdi_writeback_thread(void *data)
unsigned long wait_jiffies = -1UL;
long pages_written;

- /*
- * Add us to the active bdi_list
- */
- spin_lock_bh(&bdi_lock);
- list_add_rcu(&bdi->bdi_list, &bdi_list);
- spin_unlock_bh(&bdi_lock);
-
current->flags |= PF_FLUSHER | PF_SWAPWRITE;
set_freezable();
wb->last_active = jiffies;
@@ -819,13 +812,6 @@ int bdi_writeback_thread(void *data)
*/
set_user_nice(current, 0);

- /*
- * Clear pending bit and wakeup anybody waiting to tear us down
- */
- clear_bit(BDI_pending, &bdi->state);
- smp_mb__after_clear_bit();
- wake_up_bit(&bdi->state, BDI_pending);
-
trace_writeback_thread_start(bdi);

while (!kthread_should_stop()) {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index a42e5ef..0123d6f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -390,21 +390,30 @@ static int bdi_forker_thread(void *ptr)

task = kthread_run(bdi_writeback_thread, &bdi->wb, "flush-%s",
dev_name(bdi->dev));
+
+ spin_lock_bh(&bdi_lock);
+ list_add_tail(&bdi->bdi_list, &bdi_list);
+ spin_unlock_bh(&bdi_lock);
+
if (IS_ERR(task)) {
/*
* If thread creation fails, then readd the bdi back to
* the list and force writeout of the bdi from this
* forker thread. That will free some memory and we can
- * try again. Add it to the tail so we get a chance to
- * flush other bdi's to free memory.
+ * try again. The bdi was added to the tail so we'll
+ * get a chance to flush other bdi's to free memory.
*/
- spin_lock_bh(&bdi_lock);
- list_add_tail(&bdi->bdi_list, &bdi_list);
- spin_unlock_bh(&bdi_lock);
-
bdi_flush_io(bdi);
} else
bdi->wb.task = task;
+
+ /*
+ * Clear pending bit and wakeup anybody waiting to tear us
+ * down.
+ */
+ clear_bit(BDI_pending, &bdi->state);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&bdi->state, BDI_pending);
}

return 0;
--
1.7.1.1

2010-07-16 12:50:13

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 15/16] writeback: clean-up the warning about non-registered bdi

From: Artem Bityutskiy <[email protected]>

In '__mark_inode_dirty()', when the corresponding bdi was not properly
registered, we print a warning. But the corresponding code is a bit
untidy: it uses an 'if' statement, 'WARN_ON(1)', and a 'printk()'.

This patch turns it into one multi-line WARN() statement, which looks
tidier; 'WARN()' also wraps its condition in 'unlikely()', which might
matter a tiny bit for this hot-path function.
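For reference (not part of the diff): 'WARN(condition, fmt, ...)' also
evaluates to the value of its condition, so it could even drive control
flow directly, e.g.:

	if (WARN(err, "bdi-%s not registered\n", bdi->name))
		return;	/* hypothetical - the patch ignores the return value */

Here the return value is simply unused.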

Signed-off-by: Artem Bityutskiy <[email protected]>
---
fs/fs-writeback.c | 9 +++------
1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 559092d..83662fb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -981,12 +981,9 @@ void __mark_inode_dirty(struct inode *inode, int flags)
struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
struct backing_dev_info *bdi = wb->bdi;

- if (bdi_cap_writeback_dirty(bdi) &&
- !test_bit(BDI_registered, &bdi->state)) {
- WARN_ON(1);
- printk(KERN_ERR "bdi-%s not registered\n",
- bdi->name);
- }
+ WARN(bdi_cap_writeback_dirty(bdi) &&
+ !test_bit(BDI_registered, &bdi->state),
+ "bdi-%s not registered\n", bdi->name);

inode->dirtied_when = jiffies;
list_move(&inode->i_list, &wb->b_dirty);
--
1.7.1.1

2010-07-16 12:49:48

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 05/16] writeback: fix possible race when creating bdi threads

From: Artem Bityutskiy <[email protected]>

This patch fixes a very unlikely race condition on the error path.
When bdi thread creation fails, 'bdi->wb.task' may temporarily, for a
very short time, contain an error code. If at the same time someone
submits a work to this bdi, the following code will oops
in 'bdi_queue_work()':

if (wb->task)
wake_up_process(wb->task);
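To make the hazard concrete: 'kthread_run()' returns an 'ERR_PTR()'-encoded
error on failure, not NULL, so for a short window 'wb->task' held a non-NULL
bogus pointer (illustrative sketch):

	wb->task = kthread_run(...);		/* may be ERR_PTR(-ENOMEM) */

	/* meanwhile, bdi_queue_work() on another CPU: */
	if (wb->task)				/* an ERR_PTR is non-NULL... */
		wake_up_process(wb->task);	/* ...so this dereferences garbage */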

Fix this by introducing a temporary variable 'task' and storing
the possible error code there, so that 'wb->task' never contains
bogus values.

Note: this race is very unlikely and I have never hit it, so it is
theoretical, but it is nevertheless worth fixing.

This patch also merges 2 comments which were previously separate.

Signed-off-by: Artem Bityutskiy <[email protected]>
---
mm/backing-dev.c | 28 +++++++++++-----------------
1 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index a445ff0..8be2e13 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -332,7 +332,7 @@ static int bdi_forker_thread(void *ptr)

for (;;) {
struct backing_dev_info *bdi, *tmp;
- struct bdi_writeback *wb;
+ struct task_struct *task;

/*
* Temporary measure, we want to make sure we don't see
@@ -383,29 +383,23 @@ static int bdi_forker_thread(void *ptr)
list_del_init(&bdi->bdi_list);
spin_unlock_bh(&bdi_lock);

- wb = &bdi->wb;
- wb->task = kthread_run(bdi_writeback_thread, wb, "flush-%s",
- dev_name(bdi->dev));
- /*
- * If thread creation fails, then readd the bdi to
- * the pending list and force writeout of the bdi
- * from this forker thread. That will free some memory
- * and we can try again.
- */
- if (IS_ERR(wb->task)) {
- wb->task = NULL;
-
+ task = kthread_run(bdi_writeback_thread, &bdi->wb, "flush-%s",
+ dev_name(bdi->dev));
+ if (IS_ERR(task)) {
/*
- * Add this 'bdi' to the back, so we get
- * a chance to flush other bdi's to free
- * memory.
+ * If thread creation fails, then readd the bdi back to
+ * the list and force writeout of the bdi from this
+ * forker thread. That will free some memory and we can
+ * try again. Add it to the tail so we get a chance to
+ * flush other bdi's to free memory.
*/
spin_lock_bh(&bdi_lock);
list_add_tail(&bdi->bdi_list, &bdi_pending_list);
spin_unlock_bh(&bdi_lock);

bdi_flush_io(bdi);
- }
+ } else
+ bdi->wb.task = task;
}

return 0;
--
1.7.1.1

2010-07-16 12:49:41

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 07/16] writeback: do not lose wake-ups in the forker thread

From: Artem Bityutskiy <[email protected]>

Currently the forker thread can lose wakeups which may lead to
unnecessary delays in processing bdi works. E.g., consider the
following scenario.

1. 'bdi_forker_thread()' walks the 'bdi_list', finds out there
is nothing to do, and is about to finish the loop.
2. A bdi thread decides to exit because it was inactive for a long
time.
3. 'bdi_queue_work()' adds a work to the bdi which just exited,
so it wakes up 'bdi_forker_thread()'.
4. 'bdi_forker_thread()' executes
'set_current_state(TASK_INTERRUPTIBLE)' and goes to sleep. We
lose a wake-up.

This situation is theoretical - I never hit it - but it is worth
fixing. The fix is to execute
'set_current_state(TASK_INTERRUPTIBLE)' before walking 'bdi_list'.
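The window can be drawn as a timeline (illustrative):

	forker thread                         another CPU
	-------------                         -----------
	walks bdi_list, nothing to do
	                                      bdi thread exits (inactive)
	                                      bdi_queue_work():
	                                        wake_up_process(forker)
	                                        (forker is TASK_RUNNING,
	                                         so this is a no-op)
	set_current_state(TASK_INTERRUPTIBLE);
	schedule();  /* sleeps despite pending work */

With the state set before the walk, the wake-up instead flips the forker
back to TASK_RUNNING and 'schedule()' returns immediately.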

Signed-off-by: Artem Bityutskiy <[email protected]>
---
mm/backing-dev.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 03a3d82..14fe1bb 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -343,6 +343,7 @@ static int bdi_forker_thread(void *ptr)
wb_do_writeback(me, 0);

spin_lock_bh(&bdi_lock);
+ set_current_state(TASK_INTERRUPTIBLE);

/*
* Check if any existing bdi's have dirty data without
@@ -357,8 +358,6 @@ static int bdi_forker_thread(void *ptr)
bdi_add_default_flusher_thread(bdi);
}

- set_current_state(TASK_INTERRUPTIBLE);
-
if (list_empty(&bdi_pending_list)) {
unsigned long wait;

--
1.7.1.1

2010-07-16 12:49:36

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 01/16] writeback: do not self-wakeup

From: Artem Bityutskiy <[email protected]>

The bdi forker thread wakes itself up via 'bdi_add_to_pending()',
and the forker thread is the only user of 'bdi_add_to_pending()'. Thus,
we can safely zap the unneeded 'wake_up_process()' call.

Signed-off-by: Artem Bityutskiy <[email protected]>
---
mm/backing-dev.c | 6 ------
1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ac78a33..ff5669a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -421,12 +421,6 @@ static void bdi_add_to_pending(struct rcu_head *head)
spin_lock(&bdi_lock);
list_add_tail(&bdi->bdi_list, &bdi_pending_list);
spin_unlock(&bdi_lock);
-
- /*
- * We are now on the pending list, wake up bdi_forker_task()
- * to finish the job and add us back to the active bdi_list
- */
- wake_up_process(default_backing_dev_info.wb.task);
}

/*
--
1.7.1.1

2010-07-16 12:49:30

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 03/16] writeback: harmonize writeback threads naming

From: Artem Bityutskiy <[email protected]>

The write-back code mixes the words "thread" and "task" for the
same things. This is not a problem per se, but it is still an
inconsistency which makes the code a bit more difficult to read.

hch: a convention I tend to use and I've seen in various places
is to always use _task for the storage of the task_struct pointer,
and thread everywhere else. This especially helps with having
foo_thread for the actual thread and foo_task for a global
variable keeping the task_struct pointer

This patch renames:

'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
'bdi_forker_task()' -> 'bdi_forker_thread()'

This is just more consistent, because the per-bdi flushers are
'bdi_writeback_thread()'.

This patch also amends the comments so that they refer to the forker
thread and the write-back threads as "thread", not "task".

Signed-off-by: Artem Bityutskiy <[email protected]>
---
fs/fs-writeback.c | 2 +-
include/linux/backing-dev.h | 2 +-
mm/backing-dev.c | 26 +++++++++++++-------------
3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bf10cbf..5e6b7fc 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -840,7 +840,7 @@ int bdi_writeback_thread(void *data)

/*
* Longest period of inactivity that we tolerate. If we
- * see dirty data again later, the task will get
+ * see dirty data again later, the thread will get
* recreated automatically.
*/
max_idle = max(5UL * 60 * HZ, wait_jiffies);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index e536f3a..f0936f5 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -50,7 +50,7 @@ struct bdi_writeback {

unsigned long last_old_flush; /* last old data flush */

- struct task_struct *task; /* writeback task */
+ struct task_struct *task; /* writeback thread */
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0fad60d..b34c12a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -50,7 +50,7 @@ static struct timer_list sync_supers_timer;
static int bdi_sync_supers(void *);
static void sync_supers_timer_fn(unsigned long);

-static void bdi_add_default_flusher_task(struct backing_dev_info *bdi);
+static void bdi_add_default_flusher_thread(struct backing_dev_info *bdi);

#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
@@ -279,10 +279,10 @@ static void bdi_flush_io(struct backing_dev_info *bdi)
}

/*
- * kupdated() used to do this. We cannot do it from the bdi_forker_task()
+ * kupdated() used to do this. We cannot do it from the bdi_forker_thread()
* or we risk deadlocking on ->s_umount. The longer term solution would be
* to implement sync_supers_bdi() or similar and simply do it from the
- * bdi writeback tasks individually.
+ * bdi writeback thread individually.
*/
static int bdi_sync_supers(void *unused)
{
@@ -318,7 +318,7 @@ static void sync_supers_timer_fn(unsigned long unused)
bdi_arm_supers_timer();
}

-static int bdi_forker_task(void *ptr)
+static int bdi_forker_thread(void *ptr)
{
struct bdi_writeback *me = ptr;

@@ -354,7 +354,7 @@ static int bdi_forker_task(void *ptr)
!bdi_has_dirty_io(bdi))
continue;

- bdi_add_default_flusher_task(bdi);
+ bdi_add_default_flusher_thread(bdi);
}

set_current_state(TASK_INTERRUPTIBLE);
@@ -376,7 +376,7 @@ static int bdi_forker_task(void *ptr)

/*
* This is our real job - check for pending entries in
- * bdi_pending_list, and create the tasks that got added
+ * bdi_pending_list, and create the threads that got added
*/
bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
bdi_list);
@@ -387,7 +387,7 @@ static int bdi_forker_task(void *ptr)
wb->task = kthread_run(bdi_writeback_thread, wb, "flush-%s",
dev_name(bdi->dev));
/*
- * If task creation fails, then readd the bdi to
+ * If thread creation fails, then readd the bdi to
* the pending list and force writeout of the bdi
* from this forker thread. That will free some memory
* and we can try again.
@@ -423,10 +423,10 @@ static void bdi_add_to_pending(struct rcu_head *head)
}

/*
- * Add the default flusher task that gets created for any bdi
+ * Add the default flusher thread that gets created for any bdi
* that has dirty data pending writeout
*/
-void static bdi_add_default_flusher_task(struct backing_dev_info *bdi)
+void static bdi_add_default_flusher_thread(struct backing_dev_info *bdi)
{
if (!bdi_cap_writeback_dirty(bdi))
return;
@@ -438,10 +438,10 @@ void static bdi_add_default_flusher_task(struct backing_dev_info *bdi)
}

/*
- * Check with the helper whether to proceed adding a task. Will only
+ * Check with the helper whether to proceed adding a thread. Will only
* abort if we two or more simultanous calls to
- * bdi_add_default_flusher_task() occured, further additions will block
- * waiting for previous additions to finish.
+ * bdi_add_default_flusher_thread() occured, further additions will
+ * block waiting for previous additions to finish.
*/
if (!test_and_set_bit(BDI_pending, &bdi->state)) {
list_del_rcu(&bdi->bdi_list);
@@ -499,7 +499,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
if (bdi_cap_flush_forker(bdi)) {
struct bdi_writeback *wb = &bdi->wb;

- wb->task = kthread_run(bdi_forker_task, wb, "bdi-%s",
+ wb->task = kthread_run(bdi_forker_thread, wb, "bdi-%s",
dev_name(dev));
if (IS_ERR(wb->task)) {
wb->task = NULL;
--
1.7.1.1

2010-07-16 12:52:24

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 14/16] writeback: move bdi threads exiting logic to the forker thread

From: Artem Bityutskiy <[email protected]>

Currently, bdi threads can decide to exit if there was no useful
activity for 5 minutes. However, this introduces races. Namely, in
'bdi_queue_work()' we can easily oops if the thread decides
to exit while we are waking it up.

And even if we do not oops, if the bdi thread exits immediately
after we wake it up, we'll have to wait up to 5 seconds for the
bdi forker thread to wake up, find a bdi without a thread, and
create one. This introduces unwanted delays in bdi work handling.

So, to fix this we just make the forker thread the central
place which not only creates bdi threads, but also kills them when
necessary. This is conceptually better as well, IMO.

Note: I tried to fix this in a simpler way, e.g., by protecting
'bdi->wb.task' with 'bdi->wb_lock' in the bdi threads, but this does
not solve the issue of possible delays in bdi work handling indicated
above.

Another reason for this change is to prepare for the further
changes, which will prevent the bdi threads from waking up every
5 seconds and wasting power. Indeed, once a thread does not wake up
periodically anymore, it won't be able to exit on its own either.
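A condensed sketch of the kill path which this patch adds to the forker
loop (locking abbreviated; the diff below is authoritative):

	if (bdi->wb.task && !bdi_has_dirty_io(bdi) &&
	    time_after(jiffies, bdi->wb.last_active +
			bdi_longest_inactive())) {
		task = bdi->wb.task;
		bdi->wb.task = NULL;	/* under wb_lock: no-one can wake it now */
		list_del_rcu(&bdi->bdi_list);
		kill = true;
	}

	/* ... later, outside bdi_lock ... */
	if (kill)
		kthread_stop(task);

Taking 'wb_lock' while clearing 'wb.task' is what closes the race with
'bdi_queue_work()': once the pointer is NULL, work submitters fall back
to waking the forker thread instead.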

Signed-off-by: Artem Bityutskiy <[email protected]>
---
fs/fs-writeback.c | 34 +++++++---------------
mm/backing-dev.c | 81 +++++++++++++++++++++++++++++++++++++++--------------
2 files changed, 71 insertions(+), 44 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 968dc8e..559092d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -74,24 +74,25 @@ int writeback_in_progress(struct backing_dev_info *bdi)
static void bdi_queue_work(struct backing_dev_info *bdi,
struct wb_writeback_work *work)
{
+ bool wakeup_default = false;
+
trace_writeback_queue(bdi, work);

spin_lock(&bdi->wb_lock);
list_add_tail(&work->list, &bdi->work_list);
- spin_unlock(&bdi->wb_lock);
-
/*
* If the default thread isn't there, make sure we add it. When
* it gets created and wakes up, we'll run this work.
*/
- if (unlikely(!bdi->wb.task)) {
+ if (unlikely(!bdi->wb.task))
+ wakeup_default = true;
+ else
+ wake_up_process(bdi->wb.task);
+ spin_unlock(&bdi->wb_lock);
+
+ if (wakeup_default) {
trace_writeback_nothread(bdi, work);
wake_up_process(default_backing_dev_info.wb.task);
- } else {
- struct bdi_writeback *wb = &bdi->wb;
-
- if (wb->task)
- wake_up_process(wb->task);
}
}

@@ -800,7 +801,6 @@ int bdi_writeback_thread(void *data)
{
struct bdi_writeback *wb = data;
struct backing_dev_info *bdi = wb->bdi;
- unsigned long wait_jiffies = -1UL;
long pages_written;

current->flags |= PF_FLUSHER | PF_SWAPWRITE;
@@ -821,18 +821,6 @@ int bdi_writeback_thread(void *data)

if (pages_written)
wb->last_active = jiffies;
- else if (wait_jiffies != -1UL) {
- unsigned long max_idle;
-
- /*
- * Longest period of inactivity that we tolerate. If we
- * see dirty data again later, the thread will get
- * recreated automatically.
- */
- max_idle = max(5UL * 60 * HZ, wait_jiffies);
- if (time_after(jiffies, max_idle + wb->last_active))
- break;
- }

set_current_state(TASK_INTERRUPTIBLE);
if (!list_empty(&bdi->work_list)) {
@@ -841,6 +829,8 @@ int bdi_writeback_thread(void *data)
}

if (dirty_writeback_interval) {
+ unsigned long wait_jiffies;
+
wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
schedule_timeout(wait_jiffies);
} else
@@ -849,8 +839,6 @@ int bdi_writeback_thread(void *data)
try_to_freeze();
}

- wb->task = NULL;
-
/*
* Flush any work that raced with us exiting. No new work
* will be added, since this bdi isn't discoverable anymore.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 18d7c22..65cb88a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -317,6 +317,18 @@ static void sync_supers_timer_fn(unsigned long unused)
bdi_arm_supers_timer();
}

+/*
+ * Calculate the longest interval (jiffies) bdi threads are allowed to be
+ * inactive.
+ */
+static unsigned long bdi_longest_inactive(void)
+{
+ unsigned long interval;
+
+ interval = msecs_to_jiffies(dirty_writeback_interval * 10);
+ return max(5UL * 60 * HZ, wait_jiffies);
+}
+
static int bdi_forker_thread(void *ptr)
{
struct bdi_writeback *me = ptr;
@@ -330,9 +342,9 @@ static int bdi_forker_thread(void *ptr)
set_user_nice(current, 0);

for (;;) {
- bool fork = false;
+ bool fork = false, kill = false;
struct backing_dev_info *bdi, *tmp;
- struct task_struct *task;
+ struct task_struct *uninitialized_var(task);

if (bdi_has_dirty_io(me->bdi))
wb_do_writeback(me, 0);
@@ -357,6 +369,25 @@ static int bdi_forker_thread(void *ptr)
fork = true;
break;
}
+
+ spin_lock(&bdi->wb_lock);
+ /*
+ * If there is no work to do and the bdi thread was
+ * inactive long enough - kill it. The wb_lock is taken
+ * to make sure no-one adds more work to this bdi and
+ * wakes the bdi thread up.
+ */
+ if (bdi->wb.task && !bdi_has_dirty_io(bdi) &&
+ time_after(jiffies, bdi->wb.last_active +
+ bdi_longest_inactive())) {
+ task = bdi->wb.task;
+ bdi->wb.task = NULL;
+ list_del_rcu(&bdi->bdi_list);
+ spin_unlock(&bdi->wb_lock);
+ kill = true;
+ break;
+ }
+ spin_unlock(&bdi->wb_lock);
}
spin_unlock_bh(&bdi_lock);

@@ -364,7 +395,7 @@ static int bdi_forker_thread(void *ptr)
if (!list_empty(&me->bdi->work_list))
__set_current_state(TASK_RUNNING);

- if (!fork) {
+ if (!fork && !kill) {
unsigned long wait;

wait = msecs_to_jiffies(dirty_writeback_interval * 10);
@@ -387,24 +418,32 @@ static int bdi_forker_thread(void *ptr)
/* Make sure no one uses the picked bdi */
synchronize_rcu();

- task = kthread_run(bdi_writeback_thread, &bdi->wb, "flush-%s",
- dev_name(bdi->dev));
+ if (fork) {
+ task = kthread_run(bdi_writeback_thread, &bdi->wb, "flush-%s",
+ dev_name(bdi->dev));

- spin_lock_bh(&bdi_lock);
- list_add_tail(&bdi->bdi_list, &bdi_list);
- spin_unlock_bh(&bdi_lock);
+ spin_lock_bh(&bdi_lock);
+ list_add_tail(&bdi->bdi_list, &bdi_list);
+ spin_unlock_bh(&bdi_lock);

- if (IS_ERR(task)) {
- /*
- * If thread creation fails, then readd the bdi back to
- * the list and force writeout of the bdi from this
- * forker thread. That will free some memory and we can
- * try again. The bdi was added to the tail so we'll
- * get a chance to flush other bdi's to free memory.
- */
- bdi_flush_io(bdi);
- } else
- bdi->wb.task = task;
+ if (IS_ERR(task)) {
+ /*
+ * If thread creation fails, then readd the bdi back to
+ * the list and force writeout of the bdi from this
+ * forker thread. That will free some memory and we can
+ * try again. The bdi was added to the tail so we'll
+ * get a chance to flush other bdi's to free memory.
+ */
+ bdi_flush_io(bdi);
+ } else
+ bdi->wb.task = task;
+ } else {
+ kthread_stop(task);
+
+ spin_lock_bh(&bdi_lock);
+ list_add_tail(&bdi->bdi_list, &bdi_list);
+ spin_unlock_bh(&bdi_lock);
+ }

/*
* Clear pending bit and wakeup anybody waiting to tear us
--
1.7.1.1

2010-07-16 12:49:25

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 04/16] writeback: fix possible race when shutting down bdi

From: Artem Bityutskiy <[email protected]>

Current bdi code has the following race between 'bdi_wb_shutdown()'
and 'bdi_forker_thread()'.

Initial condition: BDI_pending is cleared, the bdi has no writeback
thread (because it was inactive and exited), and 'bdi_wb_shutdown()' and
'bdi_forker_thread()' are executed concurrently.

1. 'bdi_wb_shutdown()' executes wait_on_bit() and tests the BDI_pending
bit; the bit is clear, so it does not wait for anything.

2. 'bdi_forker_thread()' takes 'bdi_lock', finds out that the bdi has
work to do, takes it out of the 'bdi_list', sets the BDI_pending flag,
and unlocks 'bdi_lock'.

3. 'bdi_wb_shutdown()' takes the lock, and nasty things start happening:
a) it removes the bdi from the 'bdi_list', but the bdi is not in any
list;
b) it starts deleting the bdi, but 'bdi_forker_thread()' is still
working with it.

Note: it is very difficult to hit this race, and I never observed it, so
it is quite theoretical, but it is still a race. Also note that this
race exists without my previous clean-ups as well.

This patch fixes the race by making 'bdi_wb_shutdown()' first search for
the bdi in the 'bdi_list', and only if it is there, remove it from the
'bdi_list' and destroy it. If it is not there, it assumes the bdi is in
transit and re-tries waiting on the BDI_pending bit.

Signed-off-by: Artem Bityutskiy <[email protected]>
---
mm/backing-dev.c | 41 ++++++++++++++++++++++++++++-------------
1 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b34c12a..a445ff0 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -456,15 +456,26 @@ void static bdi_add_default_flusher_thread(struct backing_dev_info *bdi)
}

/*
- * Remove bdi from bdi_list, and ensure that it is no longer visible
+ * Look up for bdi in the bdi_list. If found, remove it, ensure that it is
+ * no longer visible, and return 0. If not found, return 1.
*/
-static void bdi_remove_from_list(struct backing_dev_info *bdi)
+static int bdi_remove_from_list(struct backing_dev_info *me)
{
+ struct backing_dev_info *bdi;
+
spin_lock_bh(&bdi_lock);
- list_del_rcu(&bdi->bdi_list);
+ list_for_each_entry(bdi, &bdi_list, bdi_list) {
+ if (bdi == me) {
+ list_del_rcu(&me->bdi_list);
+ spin_unlock_bh(&bdi_lock);
+ synchronize_rcu();
+ return 0;
+ }
+
+ }
spin_unlock_bh(&bdi_lock);
+ return 1;

- synchronize_rcu();
}

int bdi_register(struct backing_dev_info *bdi, struct device *parent,
@@ -532,16 +543,20 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
if (!bdi_cap_writeback_dirty(bdi))
return;

- /*
- * If setup is pending, wait for that to complete first
- */
- wait_on_bit(&bdi->state, BDI_pending, bdi_sched_wait,
- TASK_UNINTERRUPTIBLE);
+ do {
+ /*
+ * If setup is pending, wait for that to complete first
+ */
+ wait_on_bit(&bdi->state, BDI_pending, bdi_sched_wait,
+ TASK_UNINTERRUPTIBLE);

- /*
- * Make sure nobody finds us on the bdi_list anymore
- */
- bdi_remove_from_list(bdi);
+ /*
+ * Make sure nobody finds us on the bdi_list anymore. However,
+ * bdi may be temporary be not in the bdi_list but be in transit
+ * in bdi_forker_thread. Namely, this may happen if we race
+ * with the forker thread.
+ */
+ } while (bdi_remove_from_list(bdi));

/*
* Finally, kill the kernel thread. We don't need to be RCU
--
1.7.1.1

2010-07-16 12:52:47

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 16/16] writeback: prevent unnecessary bdi threads wakeups

From: Artem Bityutskiy <[email protected]>

Finally, we can get rid of the unnecessary wake-ups in bdi threads,
which are very bad for battery-powered devices.

There are two types of work bdi threads do:
1. process bdi works from the 'bdi->work_list'
2. periodic write-back

So there are 2 sources of wake-up events for bdi threads:

1. 'bdi_queue_work()' - submits bdi works
2. '__mark_inode_dirty()' - adds dirty I/O to bdi's

The former already has bdi wake-up code. The latter does not,
and this patch adds it.

'__mark_inode_dirty()' is a hot-path function, and this patch adds
another 'spin_lock(&bdi->wb_lock)' there. However, the lock is taken
only in the rare case when the bdi has no dirty inodes yet. So adding
this spinlock should be fine and should not affect performance.

This patch makes sure bdi threads and the forker thread do not
wake up if there is nothing to do. The forker thread will nevertheless
wake up at least every 5 minutes to check whether it has to kill a
bdi thread. This could also be optimized, but it is not worth it.
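The net effect on the bdi thread's main loop, condensed from the diff
below:

	if (wb_has_dirty_io(wb) && dirty_writeback_interval)
		/* Dirty data present: keep the periodic write-back alive. */
		schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
	else
		/*
		 * Nothing to do: sleep with no timeout. bdi_queue_work() or
		 * __mark_inode_dirty() will wake us when work appears.
		 */
		schedule();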

Signed-off-by: Artem Bityutskiy <[email protected]>
---
fs/fs-writeback.c | 45 ++++++++++++++++++++++++++++++++++++++-------
mm/backing-dev.c | 18 +++++++++++++-----
2 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 83662fb..f94f08c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -828,12 +828,17 @@ int bdi_writeback_thread(void *data)
continue;
}

- if (dirty_writeback_interval) {
- unsigned long wait_jiffies;
+ if (wb_has_dirty_io(wb) && dirty_writeback_interval) {
+ unsigned long wait;

- wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
- schedule_timeout(wait_jiffies);
+ wait = msecs_to_jiffies(dirty_writeback_interval * 10);
+ schedule_timeout(wait);
} else
+ /*
+ * We have nothing to do, so can go sleep without any
+ * timeout and save power. When a work is queued or
+ * something is made dirty - we will be woken up.
+ */
schedule();

try_to_freeze();
@@ -924,7 +929,9 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode)
*/
void __mark_inode_dirty(struct inode *inode, int flags)
{
+ bool wakeup_bdi;
struct super_block *sb = inode->i_sb;
+ struct backing_dev_info *uninitialized_var(bdi);

/*
* Don't do this for I_DIRTY_PAGES - that doesn't actually
@@ -948,6 +955,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (unlikely(block_dump))
block_dump___mark_inode_dirty(inode);

+ wakeup_bdi = false;
+
spin_lock(&inode_lock);
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
@@ -978,19 +987,41 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
- struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
- struct backing_dev_info *bdi = wb->bdi;
+ bdi = inode_to_bdi(inode);

WARN(bdi_cap_writeback_dirty(bdi) &&
!test_bit(BDI_registered, &bdi->state),
"bdi-%s not registered\n", bdi->name);

+ /*
+ * If this is the first dirty inode for this bdi, we
+ * have to wake-up the corresponding bdi thread to make
+ * sure background write-back happens later.
+ */
+ if (!wb_has_dirty_io(&bdi->wb) &&
+ bdi_cap_writeback_dirty(bdi))
+ wakeup_bdi = true;
+
inode->dirtied_when = jiffies;
- list_move(&inode->i_list, &wb->b_dirty);
+ list_move(&inode->i_list, &bdi->wb.b_dirty);
}
}
out:
spin_unlock(&inode_lock);
+
+ if (wakeup_bdi) {
+ bool wakeup_default = false;
+
+ spin_lock(&bdi->wb_lock);
+ if (unlikely(!bdi->wb.task))
+ wakeup_default = true;
+ else
+ wake_up_process(bdi->wb.task);
+ spin_unlock(&bdi->wb_lock);
+
+ if (wakeup_default)
+ wake_up_process(default_backing_dev_info.wb.task);
+ }
}
EXPORT_SYMBOL(__mark_inode_dirty);

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 65cb88a..818f934 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -326,7 +326,7 @@ static unsigned long bdi_longest_inactive(void)
unsigned long interval;

interval = msecs_to_jiffies(dirty_writeback_interval * 10);
- return max(5UL * 60 * HZ, wait_jiffies);
+ return max(5UL * 60 * HZ, interval);
}

static int bdi_forker_thread(void *ptr)
@@ -399,10 +399,18 @@ static int bdi_forker_thread(void *ptr)
unsigned long wait;

wait = msecs_to_jiffies(dirty_writeback_interval * 10);
- if (wait)
- schedule_timeout(wait);
- else
- schedule();
+ if (!wb_has_dirty_io(me) || !wait) {
+ /*
+ * There are no dirty data. The only thing we
+ * should now care is checking for inactive bdi
+ * threads and killing them. Thus, let's sleep
+ * for longer time to avoid unnecessary
+ * wake-ups, save energy and be friendly for
+ * battery-driven devices.
+ */
+ wait = bdi_longest_inactive();
+ }
+ schedule_timeout(wait);
try_to_freeze();
continue;
}
--
1.7.1.1

2010-07-16 12:53:09

by Artem Bityutskiy

[permalink] [raw]
Subject: [RFC][PATCH 13/16] writeback: restructure bdi forker loop a little

From: Artem Bityutskiy <[email protected]>

This patch re-structures the loop which walks the bdi list a little. It
is really just a micro-step towards the coming change where the
forker thread will kill the bdi threads. It is just a tiny preparation
which should simplify reviewing, make it easier to see what I changed,
and help catch mistakes during review.

To put it differently, the final patch, which moves the inactive bdi
thread exiting logic, will be smaller thanks to this little preparation.
Smaller patches are easier to review.

This patch also adds the 'bdi_cap_flush_forker(bdi)' condition to
the loop. The reason for this is that the forker thread can start
_before_ the BDI_pending flag is set (see 'bdi_register()'), so
the WARN() statement would fire for the default bdi. I observed
this warning when the system boots up.

This patch also amends comments a little.

Signed-off-by: Artem Bityutskiy <[email protected]>
---
mm/backing-dev.c | 23 +++++++++++------------
1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0123d6f..18d7c22 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -340,24 +340,23 @@ static int bdi_forker_thread(void *ptr)
spin_lock_bh(&bdi_lock);
set_current_state(TASK_INTERRUPTIBLE);

- /*
- * Check if any existing bdi's have dirty data without
- * a thread registered. If so, set that up.
- */
list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
- if (!bdi_cap_writeback_dirty(bdi))
- continue;
- if (bdi->wb.task)
- continue;
- if (!bdi_has_dirty_io(bdi))
+ if (!bdi_cap_writeback_dirty(bdi) ||
+ bdi_cap_flush_forker(bdi))
continue;

WARN(!test_bit(BDI_registered, &bdi->state),
"bdi %p/%s is not registered!\n", bdi, bdi->name);

- list_del_rcu(&bdi->bdi_list);
- fork = true;
- break;
+ /*
+ * If the bdi has work to do, but the thread does not
+ * exist - create it.
+ */
+ if (!bdi->wb.task && bdi_has_dirty_io(bdi)) {
+ list_del_rcu(&bdi->bdi_list);
+ fork = true;
+ break;
+ }
}
spin_unlock_bh(&bdi_lock);

--
1.7.1.1

2010-07-18 06:44:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 01/16] writeback: do not self-wakeup

On Fri, Jul 16, 2010 at 03:44:57PM +0300, Artem Bityutskiy wrote:
> From: Artem Bityutskiy <[email protected]>
>
> The bdi forker thread wakes up itself via 'bdi_add_to_pending()'.
> And the thread is the only user of 'bdi_add_to_pending()'. Thus,
> we can safely zap the unneeded 'wake_up_process()' call.

At least right now it's called using call_rcu, which offloads the function
to an RCU thread of some sort.

2010-07-18 06:44:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 02/16] writeback: remove redundant list initialization

On Fri, Jul 16, 2010 at 03:44:58PM +0300, Artem Bityutskiy wrote:
> From: Artem Bityutskiy <[email protected]>
>
> We do not have to call 'INIT_LIST_HEAD()' for list elements
> ('bdi->bdi_list') before inserting them to the 'bdi_pending_list',
> because 'list_add_tail' will initialize the element anyway. Thus,
> kill the redundant initialization.

Looks good,


Reviewed-by: Christoph Hellwig <[email protected]>

2010-07-18 06:47:22

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/16] writeback: fix possible race when shutting down bdi

On Fri, Jul 16, 2010 at 03:45:00PM +0300, Artem Bityutskiy wrote:
> From: Artem Bityutskiy <[email protected]>
>
> Current bdi code has the following race between 'bdi_wb_shutdown()'
> and 'bdi_forker_thread()'.
>
> Initial condition: BDI_pending is cleaned, bdi has no writeback thread,
> because it was inactive and exited, 'bdi_wb_shutdown()' and
> 'bdi_forker_thread()' are executed concurrently.

Wouldn't it be better to have a per-bdi mutex to serialize thread
creation and shutdown? And please also kill the bit wait in favour
of a proper wait queue - the bit wait interface really is just a hack
for structures that are very size sensitive, which the backing device
is not.

2010-07-18 06:49:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 06/16] writeback: improve bdi_has_dirty_io

Can you skip this one for now? I have a patch moving the remaining
writeback related fields (wb_lock and work_list) into
struct bdi_writeback, which also means getting rid of bdi_has_dirty_io
entirely in favour of wb_has_dirty_io.

2010-07-18 06:52:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 08/16] writeback: do not lose default bdi wake-ups

> This patch also removes strange comment about a temporary measure,
> because there is nothing temporary there - many FSes use default
> bdi.

I don't think that's correct. We're not using default_backing_dev_info
for filesystems anymore for various reasons. Look at the spread of
private bdi initializations for all the non-block backed filesystems.

The actual __set_current_state change looks good to me, but shouldn't
have been mixed up with this to start with.

2010-07-18 06:56:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 10/16] writeback: simplify bdi code a little

Looks correct, but I would do this slightly different. Add a new
bdi_writeback_start_thread helper, and then call it from the list
walk as long as there are bdis that need a thread. Only break
out of the loop once nothing needs to be done and then sleep.

2010-07-18 06:58:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 12/16] writeback: add to bdi_list in the forker thread

What about never removing a bdi from bdi_list? If we have the
correct checks for dirty_io and the task there's no need to
ever remove a live bdi from the list. Just add it in bdi_register
and remove it in bdi_unregister.

2010-07-18 07:02:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 14/16] writeback: move bdi threads exiting logic to the forker thread

Yes, only killing threads from the caller is much better, that's how
the kthread API is supposed to be used anyway.

> static void bdi_queue_work(struct backing_dev_info *bdi,
> struct wb_writeback_work *work)
> {
> + bool wakeup_default = false;
> +
> trace_writeback_queue(bdi, work);
>
> spin_lock(&bdi->wb_lock);
> list_add_tail(&work->list, &bdi->work_list);
> - spin_unlock(&bdi->wb_lock);
> -
> /*
> * If the default thread isn't there, make sure we add it. When
> * it gets created and wakes up, we'll run this work.
> */
> - if (unlikely(!bdi->wb.task)) {
> + if (unlikely(!bdi->wb.task))
> + wakeup_default = true;
> + else
> + wake_up_process(bdi->wb.task);
> + spin_unlock(&bdi->wb_lock);
> +
> + if (wakeup_default) {
> trace_writeback_nothread(bdi, work);
> wake_up_process(default_backing_dev_info.wb.task);

Why not simply do the default thread wakeup under wb_lock, too?
It keeps the code a lot simpler, and this is not a typical path anyway.
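
I.e., a sketch, keeping the names from the quoted patch:

	spin_lock(&bdi->wb_lock);
	list_add_tail(&work->list, &bdi->work_list);
	if (bdi->wb.task) {
		wake_up_process(bdi->wb.task);
	} else {
		/* the bdi thread is gone - kick the forker instead */
		trace_writeback_nothread(bdi, work);
		wake_up_process(default_backing_dev_info.wb.task);
	}
	spin_unlock(&bdi->wb_lock);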

> if (dirty_writeback_interval) {
> + unsigned long wait_jiffies;
> +
> wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
> schedule_timeout(wait_jiffies);

No real need for a local variable here.

> @@ -364,7 +395,7 @@ static int bdi_forker_thread(void *ptr)
> if (!list_empty(&me->bdi->work_list))
> __set_current_state(TASK_RUNNING);
>
> - if (!fork) {
> + if (!fork && !kill) {

I think the code here would be a lot cleaner if you implement the
suggestion I have for the forking restructuring.

2010-07-18 07:03:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 11/16] writeback: move last_active to bdi

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2010-07-18 07:45:39

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 16/16] writeback: prevent unnecessary bdi threads wakeups

> + if (wb_has_dirty_io(wb) && dirty_writeback_interval) {
> + unsigned long wait;
>
> - wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
> - schedule_timeout(wait_jiffies);
> + wait = msecs_to_jiffies(dirty_writeback_interval * 10);
> + schedule_timeout(wait);

No need for a local variable. If you want to shorten things a bit, a
schedule_timeout_msecs helper in generic code would be nice, as there
are lots of patterns like this in various kernel threads.
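
Such a helper does not exist yet, but it could be as simple as:

	/* suggested helper, not an existing API; as with plain
	 * schedule_timeout(), the caller sets the task state first */
	signed long schedule_timeout_msecs(unsigned int msecs)
	{
		return schedule_timeout(msecs_to_jiffies(msecs));
	}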

> void __mark_inode_dirty(struct inode *inode, int flags)
> {
> + bool wakeup_bdi;
> struct super_block *sb = inode->i_sb;
> + struct backing_dev_info *uninitialized_var(bdi);

Just initialize wakeup_bdi and bdi here - a smart compiler will defer
them until we need them, and it makes the code a lot easier to read, as
well as getting rid of the uninitialized_var hack.
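
I.e., simply:

	bool wakeup_bdi = false;
	struct backing_dev_info *bdi = NULL;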

> */
> if (!was_dirty) {
> - struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
> - struct backing_dev_info *bdi = wb->bdi;
> + bdi = inode_to_bdi(inode);
>
> WARN(bdi_cap_writeback_dirty(bdi) &&
> !test_bit(BDI_registered, &bdi->state),
> "bdi-%s not registered\n", bdi->name);
>
> + /*
> + * If this is the first dirty inode for this bdi, we
> + * have to wake-up the corresponding bdi thread to make
> + * sure background write-back happens later.
> + */
> + if (!wb_has_dirty_io(&bdi->wb) &&
> + bdi_cap_writeback_dirty(bdi))
> + wakeup_bdi = true;

How about redoing this as:

if (bdi_cap_writeback_dirty(bdi)) {
WARN(!test_bit(BDI_registered, &bdi->state),
"bdi-%s not registered\n", bdi->name);

/*
* If this is the first dirty inode for this
* bdi, we have to wake-up the corresponding
* flusher thread to make sure background
* writeback happens later.
*/
if (!wb_has_dirty_io(&bdi->wb))
wakeup_bdi = true;
}

> + if (wakeup_bdi) {
> + bool wakeup_default = false;
> +
> + spin_lock(&bdi->wb_lock);
> + if (unlikely(!bdi->wb.task))
> + wakeup_default = true;
> + else
> + wake_up_process(bdi->wb.task);
> + spin_unlock(&bdi->wb_lock);
> +
> + if (wakeup_default)
> + wake_up_process(default_backing_dev_info.wb.task);

Same comment about just keeping wb_lock over the
default_backing_dev_info wakup as for one of the earlier patches applies
here.


Except for these nitpicks the patch looks good to me.

> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 65cb88a..818f934 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -326,7 +326,7 @@ static unsigned long bdi_longest_inactive(void)
> unsigned long interval;
>
> interval = msecs_to_jiffies(dirty_writeback_interval * 10);
> - return max(5UL * 60 * HZ, wait_jiffies);
> + return max(5UL * 60 * HZ, interval);

So previously we just ignored interval here?

2010-07-18 09:43:15

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [RFC][PATCH 01/16] writeback: do not self-wakeup

On Sun, 2010-07-18 at 02:44 -0400, Christoph Hellwig wrote:
> On Fri, Jul 16, 2010 at 03:44:57PM +0300, Artem Bityutskiy wrote:
> > From: Artem Bityutskiy <[email protected]>
> >
> > The bdi forker thread wakes up itself via 'bdi_add_to_pending()'.
> > And the thread is the only user of 'bdi_add_to_pending()'. Thus,
> > we can safely zap the unneeded 'wake_up_process()' call.
>
> At least right now it's called using call_rcu, which offloads the function
> to an RCU thread of sorts.

Christoph, thanks for your review. I'll amend my series and re-send. And
I also have to think about some of your suggestions.

I did not CC you in order to spam your mailbox less, since I was sure
you would not miss the series on the fsdevel ML. But if you prefer being
CCed, let me know.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-07-20 09:05:22

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/16] writeback: fix possible race when shutting down bdi

On Sun, 2010-07-18 at 02:47 -0400, Christoph Hellwig wrote:
> On Fri, Jul 16, 2010 at 03:45:00PM +0300, Artem Bityutskiy wrote:
> > From: Artem Bityutskiy <[email protected]>
> >
> > Current bdi code has the following race between 'bdi_wb_shutdown()'
> > and 'bdi_forker_thread()'.
> >
> > Initial condition: BDI_pending is cleaned, bdi has no writeback thread,
> > because it was inactive and exited, 'bdi_wb_shutdown()' and
> > 'bdi_forker_thread()' are executed concurrently.
>
> Wouldn't it be better to have a per-bdi mutex to serialize thread
> creation and shutdown?

There are several parties which want some serialization with bdi
thread creation and shutdown:

1. 'bdi_queue_work()' - this should not take any mutex and should be
fast. It uses a spinlock and it should stay this way

2. I'm going to modify '__mark_inode_dirty()' to wake up the bdi thread -
this is similar to 'bdi_queue_work()'

3. 'bdi_wb_shutdown()' - this uses the 'BDI_pending' bit for serialization
now, but could use a mutex instead.

I guess you mean that for 1 and 2 things stay the same, but for 3 we can
use a mutex. Then the forker thread should also take this mutex. Right?

If yes, this looks fine to me. I am going to try this approach. Then
-->

> And please also kill the bit wait in favour
> of a proper wait queue - the bit wait interface really is just a hack
> for structures that are very size sensitive, which the backing device
> is not.

--> the bit should go away and so no wait queue will be needed as well.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-07-20 10:41:16

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [RFC][PATCH 10/16] writeback: simplify bdi code a little

On Sun, 2010-07-18 at 02:56 -0400, Christoph Hellwig wrote:
> Looks correct, but I would do this slightly different. Add a new
> bdi_writeback_start_thread helper, and then call it from the list
> walk as long as there are bdis that need a thread. Only break
> out of the loop once nothing needs to be done and then sleep.

Well, in order to have small and reviewable patches, I'd like to keep my
patch as it is. Indeed, my patch now does one simple thing - it removes
the 'pending_list' variable - and the other changes just follow from that.

I can do what you suggest on top of that, but I do not really see its
elegance: to start a thread we have to drop 'bdi_lock', which means we
would then have to re-start the list walk from the beginning, and that
is not nicer than what we have now.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-07-20 11:13:40

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [RFC][PATCH 12/16] writeback: add to bdi_list in the forker thread

On Sun, 2010-07-18 at 02:58 -0400, Christoph Hellwig wrote:
> What about never removing a bdi from bdi_list? If we have the
> correct checks for dirty_io and the task there's no need to
> ever remove a life bdi from the list. Just add it in bdi_register
> and remove it in bdi_unregister.

Good idea, thanks - indeed, I do not see why we need to remove it.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-07-20 11:38:47

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [RFC][PATCH 12/16] writeback: add to bdi_list in the forker thread

On Sun, 2010-07-18 at 02:58 -0400, Christoph Hellwig wrote:
> What about never removing a bdi from bdi_list? If we have the
> correct checks for dirty_io and the task there's no need to
> ever remove a life bdi from the list. Just add it in bdi_register
> and remove it in bdi_unregister.

But I think it will be less error-prone and nicer to still have this
patch, then move the killing logic from the bdi threads to the forker
task, and then, as a separate patch on top of that, get rid of the part
that removes the bdi from bdi_list. This way the patch series will be
more logical, finer grained, and easier to review.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-07-20 12:30:06

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [RFC][PATCH 14/16] writeback: move bdi threads exiting logic to the forker thread

On Sun, 2010-07-18 at 03:02 -0400, Christoph Hellwig wrote:
> > + if (wakeup_default) {
> > trace_writeback_nothread(bdi, work);
> > wake_up_process(default_backing_dev_info.wb.task);
>
> Why not simply do the default thread wakeup under wb_lock, too?
> It keeps the code a lot simpler, and this is not a typical path anyway.

Will address.

> > if (dirty_writeback_interval) {
> > + unsigned long wait_jiffies;
> > +
> > wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
> > schedule_timeout(wait_jiffies);
>
> No real need for a local variable here.

Will address.

> > @@ -364,7 +395,7 @@ static int bdi_forker_thread(void *ptr)
> > if (!list_empty(&me->bdi->work_list))
> > __set_current_state(TASK_RUNNING);
> >
> > - if (!fork) {
> > + if (!fork && !kill) {
>
> I think the code here would be a lot cleaner if you implement the
> suggestion I have for the forking restructuring.

As I replied earlier, to fork/kill the thread from inside the list walk
we'd need to drop the spinlock, which is not very nice. So I am keeping
this part intact so far.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-07-20 13:01:10

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [RFC][PATCH 14/16] writeback: move bdi threads exiting logic to the forker thread

On Sun, 2010-07-18 at 03:02 -0400, Christoph Hellwig wrote:
> Yes, only killing threads from the caller is much better, that's how
> the kthread API is supposed to be used anyway.
>
> > static void bdi_queue_work(struct backing_dev_info *bdi,
> > struct wb_writeback_work *work)
> > {
> > + bool wakeup_default = false;
> > +
> > trace_writeback_queue(bdi, work);
> >
> > spin_lock(&bdi->wb_lock);
> > list_add_tail(&work->list, &bdi->work_list);
> > - spin_unlock(&bdi->wb_lock);
> > -
> > /*
> > * If the default thread isn't there, make sure we add it. When
> > * it gets created and wakes up, we'll run this work.
> > */
> > - if (unlikely(!bdi->wb.task)) {
> > + if (unlikely(!bdi->wb.task))
> > + wakeup_default = true;
> > + else
> > + wake_up_process(bdi->wb.task);
> > + spin_unlock(&bdi->wb_lock);
> > +
> > + if (wakeup_default) {
> > trace_writeback_nothread(bdi, work);
> > wake_up_process(default_backing_dev_info.wb.task);
>
> Why not simply do the default thread wakeup under wb_lock, too?
> It keeps the code a lot simpler, and this is not a typical path anyway.

Hmm, actually, I want to take this lock in __mark_inode_dirty() as well,
so it makes sense to micro-optimize this path. Also, can
'trace_writeback_nothread()' be called under a spinlock? If not, then a
variable is needed anyway.


--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-07-20 13:20:20

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [RFC][PATCH 16/16] writeback: prevent unnecessary bdi threads wakeups

On Sun, 2010-07-18 at 03:45 -0400, Christoph Hellwig wrote:
> > + if (wb_has_dirty_io(wb) && dirty_writeback_interval) {
> > + unsigned long wait;
> >
> > - wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
> > - schedule_timeout(wait_jiffies);
> > + wait = msecs_to_jiffies(dirty_writeback_interval * 10);
> > + schedule_timeout(wait);
>
> No need for a local variable. If you want to shorten things a bit a
> schedule_timeout_msecs helper in generic code would be nice, as there
> are lots of patterns like this in various kernel threads.

OK, do you want me to ignore the 80-character line limit, or do you want
me to add schedule_timeout_msecs() as part of this patch series?

> > void __mark_inode_dirty(struct inode *inode, int flags)
> > {
> > + bool wakeup_bdi;
> > struct super_block *sb = inode->i_sb;
> > + struct backing_dev_info *uninitialized_var(bdi);
>
> Just initialize wakeup_bdi and bdi here - a smart compiler will defer
> them until we need them, and it makes the code a lot easier to read, as
> well as getting rid of the uninitialized_var hack.

OK.

> > + /*
> > + * If this is the first dirty inode for this bdi, we
> > + * have to wake-up the corresponding bdi thread to make
> > + * sure background write-back happens later.
> > + */
> > + if (!wb_has_dirty_io(&bdi->wb) &&
> > + bdi_cap_writeback_dirty(bdi))
> > + wakeup_bdi = true;
>
> How about redoing this as:
>
> if (bdi_cap_writeback_dirty(bdi)) {
> WARN(!test_bit(BDI_registered, &bdi->state),
> "bdi-%s not registered\n", bdi->name);
>
> /*
> * If this is the first dirty inode for this
> * bdi, we have to wake-up the corresponding
> * flusher thread to make sure background
> * writeback happens later.
> */
> if (!wb_has_dirty_io(&bdi->wb))
> wakeup_bdi = true;
> }

OK.

> > + if (wakeup_bdi) {
> > + bool wakeup_default = false;
> > +
> > + spin_lock(&bdi->wb_lock);
> > + if (unlikely(!bdi->wb.task))
> > + wakeup_default = true;
> > + else
> > + wake_up_process(bdi->wb.task);
> > + spin_unlock(&bdi->wb_lock);
> > +
> > + if (wakeup_default)
> > + wake_up_process(default_backing_dev_info.wb.task);
>
> Same comment about just keeping wb_lock over the
> default_backing_dev_info wakup as for one of the earlier patches applies
> here.

I just figured that I have to add 'trace_writeback_nothread(bdi, work)'
here, just like in 'bdi_queue_work()'. I'd feel safer calling the tracer
outside the spinlock. What do you think?

> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -326,7 +326,7 @@ static unsigned long bdi_longest_inactive(void)
> > unsigned long interval;
> >
> > interval = msecs_to_jiffies(dirty_writeback_interval * 10);
> > - return max(5UL * 60 * HZ, wait_jiffies);
> > + return max(5UL * 60 * HZ, interval);
>
> So previously we just ignored interval here?

Yes, my fault, thanks for catching.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)