2013-10-03 00:19:00

by Mikulas Patocka

Subject: dm-writeboost testing

Hi

I tested dm-writeboost and found these problems:


Performance problems:

I tested dm-writeboost with a disk as the backing device and a ramdisk as
the cache device. When I run mkfs.ext4 on the dm-writeboost device, it
writes data to the cache the first time. However, on subsequent mkfs.ext4
invocations, dm-writeboost writes data to the disk, not to the cache.

mkfs.ext4 on raw disk: 1.5s
mkfs.ext4 on dm-cache using raw disk and ramdisk:
1st time - 0.15s
next time - 0.12s
mkfs.ext4 on dm-writeboost using raw disk and ramdisk:
1st time - 0.11s
next time - 1.71s, 1.31s, 0.91s, 0.86s, 0.82s

- there seems to be an error in dm-writeboost's logic that prevents it
from caching writes when the data is already in the cache. In real-world
scenarios where the same piece of disk is overwritten over and over again
(for example, a journal), this could cause performance problems.

dm-cache doesn't have this problem: if you overwrite the same piece of
data again and again, it goes to the cache device.


Reliability problems (these problems are reproducible; they happen every
time). Tests were done on 3.10.13 on Opteron and 3.11.3 on PA-RISC.

On 3.10.13 on Opteron with a preemptible kernel, I get "BUG: workqueue
leaked lock or atomic" when unloading the device with dmsetup remove.
Also, on this machine, I see a bug: if I load dm-writeboost and terminate
the X server, Xorg hangs flushing a workqueue.

On 3.11.3 on PA-RISC without preemption, the device unloads (although it
takes many seconds and vmstat shows that the machine is idle during this
time), but I get a deadlock when I load the device a second time. The
deadlock happens either on load or when writing to the newly loaded
device.

----

deadlock when the device is loaded the second time:
[ 8336.212499] SysRq : Show Blocked State
[ 8336.212499] task PC stack pid father
[ 8336.212499] dmsetup D 00000000401040c0 0 1594 1572 0x00000010
[ 8336.212499] Backtrace:
[ 8336.212499] [<0000000040111608>] __schedule+0x280/0x678
[ 8336.212499] [<0000000040111e88>] schedule+0x38/0x90
[ 8336.212499] [<000000004010ed1c>] schedule_timeout+0x1b4/0x208
[ 8336.212499] [<0000000040111c1c>] wait_for_common+0x124/0x1e8
[ 8336.212499] [<0000000040111d04>] wait_for_completion+0x24/0x38
[ 8336.212499] [<00000000107d9778>] wait_for_migration+0x38/0xb0 [dm_writeboost]
[ 8336.212499] [<00000000107d7cf8>] resume_cache+0x1100/0x16f8 [dm_writeboost]

Another deadlock when loaded the second time and running mkfs.ext3 on the
writeboost device (it got cut off in the scrollback buffer):
[ 782.579921] [<0000000040112280>] schedule_preempt_disabled+0x20/0x38
[ 782.579921] [<0000000040110764>] __mutex_lock_slowpath+0x15c/0x290
[ 782.579921] [<0000000040110928>] mutex_lock+0x90/0x98
[ 782.579921] [<00000000107f8b74>] flush_current_buffer+0x2c/0xb0 [dm_writeboost]
[ 782.579921] [<00000000107fecbc>] sync_proc+0x7c/0xc8 [dm_writeboost]
[ 782.579921] [<00000000401591d0>] process_one_work+0x160/0x460
[ 782.579921] [<0000000040159bb8>] worker_thread+0x300/0x478
[ 782.579921] [<0000000040161a68>] kthread+0x118/0x128
[ 782.579921] [<0000000040104020>] end_fault_vector+0x20/0x28
[ 782.579921] timer_interrupt(CPU 0): delayed! cycles A099A8C1 rem 22345C
next/now CDFE401953/CDFE1DE4F7
[ 785.403254]
[ 785.403254] mkfs.ext3 D 00000000401040c0 0 2309 2237 0x00000010
[ 785.403254] Backtrace:
[ 785.403254] [<0000000040111608>] __schedule+0x280/0x678
[ 785.403254] [<0000000040111e88>] schedule+0x38/0x90
[ 785.403254] [<000000004010ed1c>] schedule_timeout+0x1b4/0x208
[ 785.403254] [<0000000040111c1c>] wait_for_common+0x124/0x1e8
[ 785.403254] [<0000000040111d04>] wait_for_completion+0x24/0x38
[ 785.403254] [<00000000107fe778>] wait_for_migration+0x38/0xb0 [dm_writeboost]
[ 785.403254] [<00000000107f7fe8>] queue_current_buffer+0x78/0x3c8 [dm_writeboost]
[ 785.403254] [<00000000107f96b8>] writeboost_map+0x660/0x970 [dm_writeboost]
[ 785.403254] [<000000001079477c>] __map_bio+0x9c/0x148 [dm_mod]
[ 785.403254] [<0000000010794cf0>] __clone_and_map_data_bio+0x188/0x288 [dm_mod]
[ 785.403254] [<0000000010795594>] __split_and_process_bio+0x474/0x6c8 [dm_mod]
[ 785.403254] [<0000000010796180>] dm_request+0x118/0x278 [dm_mod]
[ 785.403254] [<00000000402d8360>] generic_make_request+0x128/0x1a0
[ 785.403254] [<00000000402d8448>] submit_bio+0x70/0x140
[ 785.403254] [<0000000040231c68>] _submit_bh+0x200/0x3b8
[ 785.403254] [<0000000040231e34>] submit_bh+0x14/0x20

A leaked preempt count on unload (with a preemptible kernel):
BUG: workqueue leaked lock or atomic: kworker/u26:1/0x00000001/1031
last function: flush_proc [dm_writeboost]
CPU: 10 PID: 1031 Comm: kworker/u26:1 Tainted: P O 3.10.13 #9
Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009
Workqueue: flushwq flush_proc [dm_writeboost]
ffffffff8134e746 ffffffff81052c1d 000000003e5428c0 ffff88023f578c00
ffff88023f578c18 ffff880446870a70 ffff88043ee8b5e0 ffff88043ee8b5e0
ffff880446870a40 ffffffff8105368b ffff88043ee8b5e0 ffff88043ee8b5e0
Call Trace:
[<ffffffff8134e746>] ? dump_stack+0xc/0x15
[<ffffffff81052c1d>] ? process_one_work+0x33d/0x470
[<ffffffff8105368b>] ? worker_thread+0x10b/0x390
[<ffffffff81053580>] ? manage_workers.isra.26+0x290/0x290
[<ffffffff81058d4f>] ? kthread+0xaf/0xc0
[<ffffffff81058ca0>] ? kthread_create_on_node+0x120/0x120
[<ffffffff8135356c>] ? ret_from_fork+0x7c/0xb0
[<ffffffff81058ca0>] ? kthread_create_on_node+0x120/0x120
BUG: scheduling while atomic: kworker/u26:1/1031/0x00000002
Modules linked in: brd dm_mirror dm_region_hash dm_log dm_loop
dm_writeboost(O)r unix
CPU: 10 PID: 1031 Comm: kworker/u26:1 Tainted: P O 3.10.13 #9
Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009
ffffffff8134e746 ffffffff8134c9ef ffffffff8135186b 0000000000012700
ffff88043ef05fd8 ffff88043ef05fd8 0000000000012700 ffff88023f578c00
ffff88023f578c18 ffff880446870a70 ffff88043ee8b5e0 ffff88043ee8b5e0
Call Trace:
[<ffffffff8134e746>] ? dump_stack+0xc/0x15
[<ffffffff8134c9ef>] ? __schedule_bug+0x3f/0x4c
[<ffffffff8135186b>] ? __schedule+0x70b/0x720
[<ffffffff81053730>] ? worker_thread+0x1b0/0x390
[<ffffffff81053580>] ? manage_workers.isra.26+0x290/0x290
[<ffffffff81058d4f>] ? kthread+0xaf/0xc0
[<ffffffff81058ca0>] ? kthread_create_on_node+0x120/0x120
[<ffffffff8135356c>] ? ret_from_fork+0x7c/0xb0
[<ffffffff81058ca0>] ? kthread_create_on_node+0x120/0x120

When I load dm-writeboost and terminate the X server, the Xorg process
hangs in this state (it is reproducible - it happens each time) - it seems
that dm-writeboost corrupts something connected with workqueues in the kernel:
SysRq : Show Blocked State
task PC stack pid father
Xorg D 0000000000000001 0 4023 1 0x00400004
ffff88043e3124c0 0000000000000086 0000000000012700 ffff8804468bbfd8
ffff8804468bbfd8 0000000000012700 7fffffffffffffff ffff8804468bbcb8
ffff8804468bbd58 ffff88043e3124c0 0000000000000001 ffff8804468bbd40
Call Trace:
[<ffffffff8134f519>] ? schedule_timeout+0x1d9/0x2b0
[<ffffffffa0b1d233>] ? _nv015437rm+0x3f/0x78 [nvidia]
[<ffffffff81350735>] ? wait_for_completion+0x95/0x100
[<ffffffff81066970>] ? wake_up_state+0x10/0x10
[<ffffffff810501d5>] ? flush_workqueue+0x115/0x5a0
[<ffffffffa0ffc474>] ? os_flush_work_queue+0x44/0x50 [nvidia]
[<ffffffffa0fd54cf>] ? rm_disable_adapter+0x81/0x107 [nvidia]
[<ffffffffa0ff4927>] ? nv_kern_close+0x137/0x420 [nvidia]
[<ffffffff81119249>] ? __fput+0xd9/0x230
[<ffffffff81055df7>] ? task_work_run+0x87/0xc0
[<ffffffff810023e1>] ? do_notify_resume+0x61/0x90
[<ffffffff811197e1>] ? fput+0x71/0xe0
[<ffffffff81353820>] ? int_signal+0x12/0x17

Mikulas


2013-10-03 13:28:03

by Akira Hayakawa

Subject: Re: [dm-devel] dm-writeboost testing

Hi, Mikulas,

Thank you for reporting.
I am really happy to see this report.

First, I will respond to the performance problem. I will make time later
to investigate the rest and answer; some deadlock issues are difficult to
solve in a short time.



> I tested dm-writeboost with a disk as the backing device and a ramdisk as
> the cache device. When I run mkfs.ext4 on the dm-writeboost device, it
> writes data to the cache the first time. However, on subsequent mkfs.ext4
> invocations, dm-writeboost writes data to the disk, not to the cache.
>
> mkfs.ext4 on raw disk: 1.5s
> mkfs.ext4 on dm-cache using raw disk and ramdisk:
> 1st time - 0.15s
> next time - 0.12s
> mkfs.ext4 on dm-writeboost using raw disk and ramdisk:
> 1st time - 0.11s
> next time - 1.71s, 1.31s, 0.91s, 0.86s, 0.82s
>
> - there seems to be an error in dm-writeboost's logic that prevents it
> from caching writes when the data is already in the cache. In real-world
> scenarios where the same piece of disk is overwritten over and over again
> (for example, a journal), this could cause performance problems.
>
> dm-cache doesn't have this problem: if you overwrite the same piece of
> data again and again, it goes to the cache device.

It is not a bug, but it can and should be optimized.

Below is the cache hit path for writes. writeboost performs very poorly
when a partial write hits, which then turns `needs_cleanup_prev_cache` to
true. Partial write hits are believed to be unlikely, so I decided to give
up on this path and optimize the more likely paths instead. I think this
is just a tradeoff over what to optimize the most.

if (found) {

        if (unlikely(on_buffer)) {
                mutex_unlock(&cache->io_lock);

                update_mb_idx = mb->idx;
                goto write_on_buffer;
        } else {
                u8 dirty_bits = atomic_read_mb_dirtiness(seg, mb);

                /*
                 * First clean up the previous cache
                 * and migrate the cache if needed.
                 */
                bool needs_cleanup_prev_cache =
                        !bio_fullsize || !(dirty_bits == 255);

                if (unlikely(needs_cleanup_prev_cache)) {
                        wait_for_completion(&seg->flush_done);
                        migrate_mb(cache, seg, mb, dirty_bits, true);
                }

I checked that mkfs.ext4 writes only in 4KB units, so it should not turn
the boolean true and go into the slowpath.

Problem: it chooses the slowpath even though the bio is a full-sized
overwrite in the test.

The reason is that the dirty bits are sometimes seen as 0, and the suspect
is the migration daemon.

I guess you created the writeboost device with the default configuration.
In that case the migration daemon is always working, and some metadata is
cleaned up in the background.

If you turn both enable_migration_modulator and allow_migrate to 0 before
beginning the test, to stop migration entirely, it never goes into the
slowpath during the test.

Solution: changing the code to avoid going into the slowpath when the
dirty bits are zero will solve this problem.

And done. Please pull the latest one from the repo.
--- a/Driver/dm-writeboost-target.c
+++ b/Driver/dm-writeboost-target.c
@@ -688,6 +688,14 @@ static int writeboost_map(struct dm_target *ti, struct bio *bio
                 bool needs_cleanup_prev_cache =
                         !bio_fullsize || !(dirty_bits == 255);
 
+                /*
+                 * Migration works in background
+                 * and may have cleaned up the metablock.
+                 * If the metablock is clean we need not to migrate.
+                 */
+                if (!dirty_bits)
+                        needs_cleanup_prev_cache = false;
+
                 if (unlikely(needs_cleanup_prev_cache)) {
                         wait_for_completion(&seg->flush_done);
                         migrate_mb(cache, seg, mb, dirty_bits, true);

Thanks,
Akira

2013-10-04 02:28:54

by Akira Hayakawa

Subject: Re: [dm-devel] dm-writeboost testing

Hi, Mikulas,

I am sorry to say that I don't have such machines to reproduce the
problem.

But I agree that I am dealing with the workqueue subsystem in a somewhat
weird way. I should clean it up.

For example, the free_cache() routine below is the destructor of the
cache metadata, including all the workqueues.

void free_cache(struct wb_cache *cache)
{
        cache->on_terminate = true;

        /* Kill in-kernel daemons */
        cancel_work_sync(&cache->sync_work);
        cancel_work_sync(&cache->recorder_work);
        cancel_work_sync(&cache->modulator_work);

        cancel_work_sync(&cache->flush_work);
        destroy_workqueue(cache->flush_wq);

        cancel_work_sync(&cache->barrier_deadline_work);

        cancel_work_sync(&cache->migrate_work);
        destroy_workqueue(cache->migrate_wq);
        free_migration_buffer(cache);

        /* Destroy in-core structures */
        free_ht(cache);
        free_segment_header_array(cache);

        free_rambuf_pool(cache);
}

The cancel_work_sync() calls before destroy_workqueue() can probably be
removed, because destroy_workqueue() first flushes all the work items.

Although I prepare an independent workqueue for each of flush_work and
migrate_work, the other four works are queued onto system_wq through the
schedule_work() routine. This asymmetry is not welcome in portable code;
dependencies on the subsystem should be minimized. In particular, the
workqueue subsystem keeps changing its concurrency support, so trusting
only single-threaded workqueues is probably a better idea for stability.

To begin with, these works never leave the queue until the destructor is
called; they just keep alternating between running and sleeping. Queuing
this kind of work to system_wq may be unsupported.

So, my strategy is to clean them up so that:
1. every daemon has its own workqueue;
2. cancel_work_sync() is never used; only destroy_workqueue() is called,
in the destructor free_cache() and in the error handling of resume_cache().

Could you please run the same test again after I fix these points, to see
whether it is still reproducible?


> On 3.11.3 on PA-RISC without preemption, the device unloads (although it
> takes many seconds and vmstat shows that the machine is idle during this
> time)
This behavior is benign but should probably be improved. In the
free_cache() above, it first turns the `on_terminate` flag to true to
notify all the daemons that we are shutting down. Since `update_interval`
and `sync_interval` are 60 seconds by default, we must wait a while for
the daemons to finish.
Akira

2013-10-04 13:39:10

by Mikulas Patocka

Subject: Re: [dm-devel] dm-writeboost testing



On Fri, 4 Oct 2013, Akira Hayakawa wrote:

> Hi, Mikulas,
>
> I am sorry to say that I don't have such machines to reproduce the
> problem.
>
> But I agree that I am dealing with the workqueue subsystem in a somewhat
> weird way. I should clean it up.
>
> For example, the free_cache() routine below is the destructor of the
> cache metadata, including all the workqueues.
>
> void free_cache(struct wb_cache *cache)
> {
>         cache->on_terminate = true;
>
>         /* Kill in-kernel daemons */
>         cancel_work_sync(&cache->sync_work);
>         cancel_work_sync(&cache->recorder_work);
>         cancel_work_sync(&cache->modulator_work);
>
>         cancel_work_sync(&cache->flush_work);
>         destroy_workqueue(cache->flush_wq);
>
>         cancel_work_sync(&cache->barrier_deadline_work);
>
>         cancel_work_sync(&cache->migrate_work);
>         destroy_workqueue(cache->migrate_wq);
>         free_migration_buffer(cache);
>
>         /* Destroy in-core structures */
>         free_ht(cache);
>         free_segment_header_array(cache);
>
>         free_rambuf_pool(cache);
> }
>
> The cancel_work_sync() calls before destroy_workqueue() can probably be
> removed, because destroy_workqueue() first flushes all the work items.
>
> Although I prepare an independent workqueue for each of flush_work and
> migrate_work, the other four works are queued onto system_wq through the
> schedule_work() routine. This asymmetry is not welcome in portable code;
> dependencies on the subsystem should be minimized. In particular, the
> workqueue subsystem keeps changing its concurrency support, so trusting
> only single-threaded workqueues is probably a better idea for stability.

The problem is that you are using workqueues the wrong way. You submit a
work item to a workqueue and the work item stays active until the device
is unloaded.

If you submit a work item to a workqueue, it is required that the work
item finishes in finite time. Otherwise, it may stall other tasks. The
deadlock when I terminate the X server is caused by this - the nvidia
driver tries to flush the system workqueue and waits for all work items
to terminate - but your work items don't terminate.

If you need a thread that runs for a long time, you should use
kthread_create, not workqueues (see this
http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/old-3/dm-crypt-encryption-threads.patch
or this
http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/old-3/dm-crypt-offload-writes-to-thread.patch
as an example of how to use kthreads).
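
For illustration, the usual pattern is something like this (a minimal
sketch using the generic kthread API; the names here are made up, not
taken from your code):

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/jiffies.h>
#include <linux/err.h>

static struct task_struct *daemon_task;         /* hypothetical example */

static int daemon_fn(void *data)
{
        /*
         * A kthread may legitimately run for the whole lifetime of the
         * device; it exits only when kthread_stop() is called on it.
         */
        while (!kthread_should_stop()) {
                /* ... one round of background work ... */
                schedule_timeout_interruptible(msecs_to_jiffies(100));
        }
        return 0;
}

static int daemon_start(void *cache)
{
        daemon_task = kthread_run(daemon_fn, cache, "example_daemon");
        return IS_ERR(daemon_task) ? PTR_ERR(daemon_task) : 0;
}

static void daemon_teardown(void)
{
        /*
         * Wakes the thread if it is sleeping and waits until it exits.
         * Nothing here goes through system_wq, so flushing the system
         * workqueue can never block on this thread.
         */
        kthread_stop(daemon_task);
}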

Mikulas

> To begin with, these works never leave the queue until the destructor is
> called; they just keep alternating between running and sleeping. Queuing
> this kind of work to system_wq may be unsupported.
>
> So, my strategy is to clean them up so that:
> 1. every daemon has its own workqueue;
> 2. cancel_work_sync() is never used; only destroy_workqueue() is called,
> in the destructor free_cache() and in the error handling of resume_cache().
>
> Could you please run the same test again after I fix these points, to see
> whether it is still reproducible?
>
>
>
> > On 3.11.3 on PA-RISC without preemption, the device unloads (although it
> > takes many seconds and vmstat shows that the machine is idle during this
> > time)
> This behavior is benign but should probably be improved. In the
> free_cache() above, it first turns the `on_terminate` flag to true to
> notify all the daemons that we are shutting down. Since `update_interval`
> and `sync_interval` are 60 seconds by default, we must wait a while for
> the daemons to finish.
>
> Akira
>

2013-10-04 14:25:20

by Akira Hayakawa

Subject: Re: [dm-devel] dm-writeboost testing

Mikulas,

Thanks for pointing that out.

> The problem is that you are using workqueues the wrong way. You submit a
> work item to a workqueue and the work item stays active until the device
> is unloaded.
>
> If you submit a work item to a workqueue, it is required that the work
> item finishes in finite time. Otherwise, it may stall other tasks. The
> deadlock when I terminate the X server is caused by this - the nvidia
> driver tries to flush the system workqueue and waits for all work items
> to terminate - but your work items don't terminate.
>
> If you need a thread that runs for a long time, you should use
> kthread_create, not workqueues (see this
> http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/old-3/dm-crypt-encryption-threads.patch
> or this
> http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/old-3/dm-crypt-offload-writes-to-thread.patch
> as an example of how to use kthreads).

But I see no reason why you recommend using a kthread for a looping job
instead of putting a looping work item into a single-threaded, non-system
workqueue.

To me, they both seem to work.

Is it documented that a looping job should not be put into any type of
workqueue?

You are only mentioning that putting a looping work item on system_wq is
wrong because the nvidia driver flushes that workqueue.

Akira


2013-10-04 15:06:59

by Joe Thornber

Subject: Re: [dm-devel] dm-writeboost testing

On Thu, Oct 03, 2013 at 10:27:54PM +0900, Akira Hayakawa wrote:

> > dm-cache doesn't have this problem: if you overwrite the same piece of
> > data again and again, it goes to the cache device.
>
> It is not a bug, but it can and should be optimized.
>
> Below is the cache hit path for writes. writeboost performs very poorly
> when a partial write hits, which then turns `needs_cleanup_prev_cache`
> to true.

Are you using fixed-size blocks for caching, then? The whole point of
using a journal/log-based disk layout for caching is that you can slurp
up all writes irrespective of their size.

What are the scenarios where you outperform dm-cache?

- Joe





2013-10-04 16:00:42

by Mikulas Patocka

Subject: Re: [dm-devel] dm-writeboost testing



On Fri, 4 Oct 2013, Akira Hayakawa wrote:

> Mikulas,
>
> Thanks for pointing that out.
>
> > The problem is that you are using workqueues the wrong way. You submit a
> > work item to a workqueue and the work item stays active until the device
> > is unloaded.
> >
> > If you submit a work item to a workqueue, it is required that the work
> > item finishes in finite time. Otherwise, it may stall other tasks. The
> > deadlock when I terminate the X server is caused by this - the nvidia
> > driver tries to flush the system workqueue and waits for all work items
> > to terminate - but your work items don't terminate.
> >
> > If you need a thread that runs for a long time, you should use
> > kthread_create, not workqueues (see this
> > http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/old-3/dm-crypt-encryption-threads.patch
> > or this
> > http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/old-3/dm-crypt-offload-writes-to-thread.patch
> > as an example of how to use kthreads).
>
> But I see no reason why you recommend using a kthread for a looping job
> instead of putting a looping work item into a single-threaded, non-system
> workqueue.
>
> To me, they both seem to work.

As I said, the system locks up when it tries to flush the system
workqueue. This happens, for example, when terminating X with the nvidia
binary driver, but it may happen in other parts of the kernel too. The
fact that it works in your setup doesn't mean that it is correct.

> Is it documented that a looping job should not be put into any type of
> workqueue?

It was a general assumption when workqueues were created. Maybe it's not
documented.
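
To make the failure mode concrete, here is a minimal sketch (not your
actual code) of why a looping work item on system_wq is fatal:

#include <linux/workqueue.h>
#include <linux/delay.h>

static bool on_terminate;       /* set only at device unload */

/* A work item written like your daemons: it returns only on unload. */
static void looping_fn(struct work_struct *w)
{
        while (!on_terminate)
                msleep(100);    /* "running and sleeping" forever */
}
static DECLARE_WORK(looping_work, looping_fn);

/*
 * After schedule_work(&looping_work), the item sits on system_wq.
 * Any unrelated code that then calls
 *
 *      flush_scheduled_work();
 *
 * (as the nvidia driver does on close) must wait for looping_fn() to
 * return, i.e. it blocks until your device is unloaded - if ever.
 */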

> You are only mentioning that putting a looping work item on system_wq is
> wrong because the nvidia driver flushes that workqueue.
>
> Akira

Mikulas

2013-10-05 07:13:56

by Akira Hayakawa

Subject: Re: [dm-devel] dm-writeboost testing

Mikulas,

> nvidia binary driver, but it may happen in other parts of the kernel too.
> The fact that it works in your setup doesn't mean that it is correct.
You are right. I am convinced.

As far as I can tell from looking around the kernel code, a kthread is
chosen when one needs to loop in the background. There are other examples,
such as loop_thread in drivers/block/loop.c.

And done. Please git pull. Commit b6d2d3892e09145cd0b9c5ad69ea2d8c410fdeab
is tested in my environment. All the looping daemons are running on
kthreads now.

The diff between the older version (with workqueues) and the new version
(with kthreads) is shown below. I defined a fancy CREATE_DAEMON macro.

Another by-product is that you no longer need to wait a long time in
dmsetup remove, since kthread_stop() forcefully wakes the thread up.

diff --git a/Driver/dm-writeboost-daemon.c b/Driver/dm-writeboost-daemon.c
index 6090661..65974e2 100644
--- a/Driver/dm-writeboost-daemon.c
+++ b/Driver/dm-writeboost-daemon.c
@@ -10,12 +10,11 @@

/*----------------------------------------------------------------*/

-void flush_proc(struct work_struct *work)
+int flush_proc(void *data)
{
unsigned long flags;

- struct wb_cache *cache =
- container_of(work, struct wb_cache, flush_work);
+ struct wb_cache *cache = data;

while (true) {
struct flush_job *job;
@@ -32,8 +31,12 @@ void flush_proc(struct work_struct *work)
msecs_to_jiffies(100));
spin_lock_irqsave(&cache->flush_queue_lock, flags);

- if (cache->on_terminate)
- return;
+ /*
+ * flush daemon can exit
+ * only if no flush job is queued.
+ */
+ if (kthread_should_stop())
+ return 0;
}

/*
@@ -84,6 +87,7 @@ void flush_proc(struct work_struct *work)

kfree(job);
}
+ return 0;
}

/*----------------------------------------------------------------*/
@@ -353,19 +357,15 @@ migrate_write:
}
}

-void migrate_proc(struct work_struct *work)
+int migrate_proc(void *data)
{
- struct wb_cache *cache =
- container_of(work, struct wb_cache, migrate_work);
+ struct wb_cache *cache = data;

- while (true) {
+ while (!kthread_should_stop()) {
bool allow_migrate;
size_t i, nr_mig_candidates, nr_mig, nr_max_batch;
struct segment_header *seg, *tmp;

- if (cache->on_terminate)
- return;
-
/*
* If reserving_id > 0
* Migration should be immediate.
@@ -430,6 +430,7 @@ void migrate_proc(struct work_struct *work)
list_del(&seg->migrate_list);
}
}
+ return 0;
}

/*
@@ -453,19 +454,16 @@ void wait_for_migration(struct wb_cache *cache, u64 id)

/*----------------------------------------------------------------*/

-void modulator_proc(struct work_struct *work)
+int modulator_proc(void *data)
{
- struct wb_cache *cache =
- container_of(work, struct wb_cache, modulator_work);
+ struct wb_cache *cache = data;
struct wb_device *wb = cache->wb;

struct hd_struct *hd = wb->device->bdev->bd_part;
unsigned long old = 0, new, util;
unsigned long intvl = 1000;

- while (true) {
- if (cache->on_terminate)
- return;
+ while (!kthread_should_stop()) {

new = jiffies_to_msecs(part_stat_read(hd, io_ticks));

@@ -484,6 +482,7 @@ modulator_update:

schedule_timeout_interruptible(msecs_to_jiffies(intvl));
}
+ return 0;
}

/*----------------------------------------------------------------*/
@@ -517,16 +516,12 @@ static void update_superblock_record(struct wb_cache *cache)
kfree(buf);
}

-void recorder_proc(struct work_struct *work)
+int recorder_proc(void *data)
{
- struct wb_cache *cache =
- container_of(work, struct wb_cache, recorder_work);
+ struct wb_cache *cache = data;
unsigned long intvl;

- while (true) {
- if (cache->on_terminate)
- return;
-
+ while (!kthread_should_stop()) {
/* sec -> ms */
intvl = cache->update_record_interval * 1000;

@@ -539,20 +534,17 @@ void recorder_proc(struct work_struct *work)

schedule_timeout_interruptible(msecs_to_jiffies(intvl));
}
+ return 0;
}

/*----------------------------------------------------------------*/

-void sync_proc(struct work_struct *work)
+int sync_proc(void *data)
{
- struct wb_cache *cache =
- container_of(work, struct wb_cache, sync_work);
+ struct wb_cache *cache = data;
unsigned long intvl;

- while (true) {
- if (cache->on_terminate)
- return;
-
+ while (!kthread_should_stop()) {
/* sec -> ms */
intvl = cache->sync_interval * 1000;

@@ -566,4 +558,5 @@ void sync_proc(struct work_struct *work)

schedule_timeout_interruptible(msecs_to_jiffies(intvl));
}
+ return 0;
}
diff --git a/Driver/dm-writeboost-daemon.h b/Driver/dm-writeboost-daemon.h
index 3bdd862..d21d66c 100644
--- a/Driver/dm-writeboost-daemon.h
+++ b/Driver/dm-writeboost-daemon.h
@@ -9,7 +9,7 @@

/*----------------------------------------------------------------*/

-void flush_proc(struct work_struct *);
+int flush_proc(void *);

/*----------------------------------------------------------------*/

@@ -19,20 +19,20 @@ void flush_barrier_ios(struct work_struct *);

/*----------------------------------------------------------------*/

-void migrate_proc(struct work_struct *);
+int migrate_proc(void *);
void wait_for_migration(struct wb_cache *, u64 id);

/*----------------------------------------------------------------*/

-void modulator_proc(struct work_struct *);
+int modulator_proc(void *);

/*----------------------------------------------------------------*/

-void sync_proc(struct work_struct *);
+int sync_proc(void *);

/*----------------------------------------------------------------*/

-void recorder_proc(struct work_struct *);
+int recorder_proc(void *);

/*----------------------------------------------------------------*/

diff --git a/Driver/dm-writeboost-metadata.c b/Driver/dm-writeboost-metadata.c
index 1cd9e0c..4f5fa5e 100644
--- a/Driver/dm-writeboost-metadata.c
+++ b/Driver/dm-writeboost-metadata.c
@@ -994,6 +994,17 @@ void free_migration_buffer(struct wb_cache *cache)

/*----------------------------------------------------------------*/

+#define CREATE_DAEMON(name)\
+ cache->name##_daemon = kthread_create(name##_proc, cache,\
+ #name "_daemon");\
+ if (IS_ERR(cache->name##_daemon)) {\
+ r = PTR_ERR(cache->name##_daemon);\
+ cache->name##_daemon = NULL;\
+ WBERR("couldn't spawn" #name "daemon");\
+ goto bad_##name##_daemon;\
+ }\
+ wake_up_process(cache->name##_daemon);
+
int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)
{
int r = 0;
@@ -1010,8 +1021,6 @@ int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)

mutex_init(&cache->io_lock);

- cache->on_terminate = false;
-
/*
* (i) Harmless Initializations
*/
@@ -1042,12 +1051,6 @@ int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)
* Recovering the cache metadata
* prerequires the migration daemon working.
*/
- cache->migrate_wq = create_singlethread_workqueue("migratewq");
- if (!cache->migrate_wq) {
- r = -ENOMEM;
- WBERR();
- goto bad_migratewq;
- }

/* Migration Daemon */
atomic_set(&cache->migrate_fail_count, 0);
@@ -1070,9 +1073,7 @@ int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)

cache->allow_migrate = true;
cache->reserving_segment_id = 0;
- INIT_WORK(&cache->migrate_work, migrate_proc);
- queue_work(cache->migrate_wq, &cache->migrate_work);
-
+ CREATE_DAEMON(migrate);

r = recover_cache(cache);
if (r) {
@@ -1086,19 +1087,12 @@ int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)
* These are only working
* after the logical device created.
*/
- cache->flush_wq = create_singlethread_workqueue("flushwq");
- if (!cache->flush_wq) {
- r = -ENOMEM;
- WBERR();
- goto bad_flushwq;
- }

/* Flush Daemon */
- INIT_WORK(&cache->flush_work, flush_proc);
spin_lock_init(&cache->flush_queue_lock);
INIT_LIST_HEAD(&cache->flush_queue);
init_waitqueue_head(&cache->flush_wait_queue);
- queue_work(cache->flush_wq, &cache->flush_work);
+ CREATE_DAEMON(flush);

/* Deferred ACK for barrier writes */

@@ -1118,29 +1112,30 @@ int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)

/* Migartion Modulator */
cache->enable_migration_modulator = true;
- INIT_WORK(&cache->modulator_work, modulator_proc);
- schedule_work(&cache->modulator_work);
+ CREATE_DAEMON(modulator);

/* Superblock Recorder */
cache->update_record_interval = 60;
- INIT_WORK(&cache->recorder_work, recorder_proc);
- schedule_work(&cache->recorder_work);
+ CREATE_DAEMON(recorder);

/* Dirty Synchronizer */
cache->sync_interval = 60;
- INIT_WORK(&cache->sync_work, sync_proc);
- schedule_work(&cache->sync_work);
+ CREATE_DAEMON(sync);

return 0;

-bad_flushwq:
+bad_sync_daemon:
+ kthread_stop(cache->recorder_daemon);
+bad_recorder_daemon:
+ kthread_stop(cache->modulator_daemon);
+bad_modulator_daemon:
+ kthread_stop(cache->flush_daemon);
+bad_flush_daemon:
bad_recover:
- cache->on_terminate = true;
- cancel_work_sync(&cache->migrate_work);
+ kthread_stop(cache->migrate_daemon);
+bad_migrate_daemon:
free_migration_buffer(cache);
bad_alloc_migrate_buffer:
- destroy_workqueue(cache->migrate_wq);
-bad_migratewq:
free_ht(cache);
bad_alloc_ht:
free_segment_header_array(cache);
@@ -1153,20 +1148,21 @@ bad_init_rambuf_pool:

void free_cache(struct wb_cache *cache)
{
- cache->on_terminate = true;
+ /*
+ * Must clean up all the volatile data
+ * before termination.
+ */
+ flush_current_buffer(cache);

- /* Kill in-kernel daemons */
- cancel_work_sync(&cache->sync_work);
- cancel_work_sync(&cache->recorder_work);
- cancel_work_sync(&cache->modulator_work);
+ kthread_stop(cache->sync_daemon);
+ kthread_stop(cache->recorder_daemon);
+ kthread_stop(cache->modulator_daemon);

- cancel_work_sync(&cache->flush_work);
- destroy_workqueue(cache->flush_wq);
+ kthread_stop(cache->flush_daemon);

cancel_work_sync(&cache->barrier_deadline_work);

- cancel_work_sync(&cache->migrate_work);
- destroy_workqueue(cache->migrate_wq);
+ kthread_stop(cache->migrate_daemon);
free_migration_buffer(cache);

/* Destroy in-core structures */
diff --git a/Driver/dm-writeboost-target.c b/Driver/dm-writeboost-target.c
index 6aa4c7c..4b5b7aa 100644
--- a/Driver/dm-writeboost-target.c
+++ b/Driver/dm-writeboost-target.c
@@ -973,12 +973,6 @@ static void writeboost_dtr(struct dm_target *ti)
struct wb_device *wb = ti->private;
struct wb_cache *cache = wb->cache;

- /*
- * Synchronize all the dirty writes
- * before termination.
- */
- cache->sync_interval = 1;
-
free_cache(cache);
kfree(cache);

diff --git a/Driver/dm-writeboost.h b/Driver/dm-writeboost.h
index 74ff194..092c100 100644
--- a/Driver/dm-writeboost.h
+++ b/Driver/dm-writeboost.h
@@ -16,6 +16,7 @@
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/mutex.h>
+#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/timer.h>
#include <linux/device-mapper.h>
@@ -265,8 +266,7 @@ struct wb_cache {
* and flush daemon asynchronously
* flush them to the cache device.
*/
- struct work_struct flush_work;
- struct workqueue_struct *flush_wq;
+ struct task_struct *flush_daemon;
spinlock_t flush_queue_lock;
struct list_head flush_queue;
wait_queue_head_t flush_wait_queue;
@@ -288,8 +288,7 @@ struct wb_cache {
* migrate daemon goes into migration
* if they are segments to migrate.
*/
- struct work_struct migrate_work;
- struct workqueue_struct *migrate_wq;
+ struct task_struct *migrate_daemon;
bool allow_migrate; /* param */

/*
@@ -314,7 +313,7 @@ struct wb_cache {
* the migration
* according to the load of backing store.
*/
- struct work_struct modulator_work;
+ struct task_struct *modulator_daemon;
bool enable_migration_modulator; /* param */

/*
@@ -323,7 +322,7 @@ struct wb_cache {
* Update the superblock record
* periodically.
*/
- struct work_struct recorder_work;
+ struct task_struct *recorder_daemon;
unsigned long update_record_interval; /* param */

/*
@@ -332,16 +331,9 @@ struct wb_cache {
* Sync the dirty writes
* periodically.
*/
- struct work_struct sync_work;
+ struct task_struct *sync_daemon;
unsigned long sync_interval; /* param */

- /*
- * on_terminate is true
- * to notify all the background daemons to
- * stop their operations.
- */
- bool on_terminate;
-
atomic64_t stat[STATLEN];
};

Akira

2013-10-06 00:47:32

by Mikulas Patocka

Subject: Re: [dm-devel] dm-writeboost testing



On Sat, 5 Oct 2013, Akira Hayakawa wrote:

> Mikulas,
>
> > nvidia binary driver, but it may happen in other parts of the kernel too.
> > The fact that it works in your setup doesn't mean that it is correct.
> You are right. I am convinced.
>
> As far as I can tell from looking around the kernel code, a kthread is
> chosen when one needs to loop in the background. There are other examples,
> such as loop_thread in drivers/block/loop.c.
>
> And done. Please git pull. Commit b6d2d3892e09145cd0b9c5ad69ea2d8c410fdeab
> is tested in my environment. All the looping daemons are running on
> kthreads now.
>
> The diff between the older version (with workqueues) and the new version
> (with kthreads) is shown below. I defined a fancy CREATE_DAEMON macro.
>
> Another by-product is that you no longer need to wait a long time in
> dmsetup remove, since kthread_stop() forcefully wakes the thread up.

The change seems OK. Please also move this piece of code in flush_proc
out of the spinlock:

        if (kthread_should_stop())
                return 0;

It caused the workqueue warning I reported before, and it still causes a
warning with kthreads:
note: flush_daemon[5145] exited with preempt_count 1
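
For reference, the cause (a condensed sketch of the wait loop in your
flush_proc; the lock is re-taken before the stop check, so the early
return exits with the spinlock still held):

        while (list_empty(&cache->flush_queue)) {
                spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
                wait_event_interruptible_timeout(
                        cache->flush_wait_queue,
                        (!list_empty(&cache->flush_queue)),
                        msecs_to_jiffies(100));
                spin_lock_irqsave(&cache->flush_queue_lock, flags);

                if (kthread_should_stop())
                        return 0;       /* leaves flush_queue_lock held,
                                         * so preempt_count stays at 1 */
        }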

I will send you another email with more bugs that I found in your code.

Mikulas


2013-10-06 02:15:03

by Akira Hayakawa

Subject: Re: [dm-devel] dm-writeboost testing

Mikulas,

> The change seems OK. Please also move this piece of code in flush_proc
> out of the spinlock:
>
>         if (kthread_should_stop())
>                 return 0;
>
> It caused the workqueue warning I reported before, and it still causes a
> warning with kthreads:
> note: flush_daemon[5145] exited with preempt_count 1

You are right.
I fixed the bug.

diff --git a/Driver/dm-writeboost-daemon.c b/Driver/dm-writeboost-daemon.c
index 65974e2..cf790bf 100644
--- a/Driver/dm-writeboost-daemon.c
+++ b/Driver/dm-writeboost-daemon.c
@@ -29,7 +29,6 @@ int flush_proc(void *data)
                         cache->flush_wait_queue,
                         (!list_empty(&cache->flush_queue)),
                         msecs_to_jiffies(100));
-                spin_lock_irqsave(&cache->flush_queue_lock, flags);
 
                 /*
                  * flush daemon can exit
@@ -37,6 +36,8 @@ int flush_proc(void *data)
                  */
                 if (kthread_should_stop())
                         return 0;
+                else
+                        spin_lock_irqsave(&cache->flush_queue_lock, flags);
         }

> I will send you another email with more bugs that I found in your code.
I will reply to you about this later. There are so many bugs, and some
seem to be crucial.

Akira