2013-10-06 00:50:57

by Mikulas Patocka

[permalink] [raw]
Subject: A review of dm-writeboost

Hi

I looked at dm-writeboost source code and here I'm sending the list of
problems I found:


Polling for state
-----------------

Some of the kernel threads that you spawn poll for data at one-second
intervals - see migrate_proc, modulator_proc, recorder_proc, sync_proc.

flush_proc correctly contains wait_event, but there is still some 100ms
polling timeout in flush_proc that shouldn't be there.


Don't do this. You can do polling in a piece of code that is executed
infrequently, such as driver unload, but general functionality must not
depend on polling.


If you set the polling interval too low, it wastes CPU cycles and it wastes
energy due to CPU wake-ups. If you set the polling interval too high, it
causes artificial delays. For example, current middle-class SSDs can do
writes at a rate of 500MB/s. Now, suppose that the user uses a 400MB partition
for the cache - the partition is filled up in 0.8s and the kernel waits
for an additional 0.2s doing absolutely nothing, until your polling code
wakes up and notices that it should start flushing the cache.

So - the polling code introduces artificial delays that can cause
performance degradation. You may think about lowering the polling interval
to lessen the performance impact, but if you lower the polling interval,
it increases CPU and power consumption when the driver is idle. Either
way, it is wrong. Just don't use polling.
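
For example, an event-driven kernel thread can look like this (a minimal
sketch - the names are made up, and the locking around the condition is
discussed further below):

	static int flush_daemon(void *data)
	{
		struct wb_cache *cache = data;

		while (!kthread_should_stop()) {
			/* sleeps until a job is queued or the thread is stopped */
			wait_event_interruptible(cache->flush_wait_queue,
				!list_empty(&cache->flush_queue) ||
				kthread_should_stop());

			if (kthread_should_stop())
				break;

			process_one_flush_job(cache);	/* do the real work */
		}
		return 0;
	}

The producer then queues a job and wakes the daemon immediately, so no
timeout is ever needed:

	list_add_tail(&job->flush_queue, &cache->flush_queue);
	wake_up_interruptible(&cache->flush_wait_queue);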


Lack of locking/barriers
------------------------

migrate_proc lacks any lock or barrier. Note that processors may execute
instructions out of order, and thus concurrent access without locks or
barriers is always wrong.

Think of this piece of code:

	nr_mig_candidates = cache->last_flushed_segment_id -
			    cache->last_migrated_segment_id;
	...
	nr_mig = min(nr_mig_candidates,
		     cache->nr_cur_batched_migration);
	...
	for (i = 1; i <= nr_mig; i++) {
		seg = get_segment_header_by_id(
			cache,
			cache->last_migrated_segment_id + i);
		list_add_tail(&seg->migrate_list, &cache->migrate_list);
	}

The processor may reorder all memory accesses - for example it may read
the data accessed in the "for" cycle before reading the variables
"cache->last_flushed_segment_id" and "cache->last_migrated_segment_id". If
this happens, the code may work with invalid data.

Similarly, the processor that updates cache->last_flushed_segment_id can
update it before updating the segment variables themselves.

You need to use smp_wmb() before incrementing
cache->last_flushed_segment_id in the producer process and smp_rmb() after
reading cache->last_flushed_segment_id in the consumer process. Read
Documentation/memory-barriers.txt for full explanation.

You can use locks instead of memory barriers, locks are simpler to use and
simpler to verify, but they are usually slower than memory barriers.
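
For illustration, a minimal sketch of the pairing (not your actual code):

	/* producer (flush daemon): publish the segment before the id */
	seg->length = length;			/* fill in the segment */
	smp_wmb();				/* segment data before id */
	cache->last_flushed_segment_id++;

	/* consumer (migrate_proc): read the id, then the segment */
	id = cache->last_flushed_segment_id;
	smp_rmb();				/* id before segment data */
	seg = get_segment_header_by_id(cache, id);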


Nonatomic 64-bit variables
--------------------------

Note that 64-bit variables are not atomic on 32-bit architectures.

Linux assumes that 32-bit variables are atomic on 32-bit architectures and
64-bit or 32-bit variables are atomic on 64-bit architectures. That is,
variables having the "long" or "int" type are atomic. Atomicity means that
if you read and modify the variable at the same time, you read either the
old value or the new value.

64-bit variables on 32-bit architectures are usually read and written in
two steps, and thus the atomicity guarantee doesn't hold.

For example, suppose that you change cache->last_flushed_segment_id from
0x00000000ffffffff to 0x0000000100000000. Now, the function migrate_proc
that asynchronously reads cache->last_flushed_segment_id can read
0x00000000ffffffff or 0x0000000100000000 (if it reads one of these values,
it behaves correctly), but it can also read 0x0000000000000000 or
0x00000001ffffffff - if migrate_proc reads one of these two values, it
goes wild, migrating segments that weren't ever supposed to be migrated,
and likely causes a crash or data corruption.

I found this bug in migrate_proc and update_superblock_record (but it may
be also in other parts of the code).

You can use atomic64_t type for atomic 64-bit variables (plus memory
barriers as explained in the previous section). Or you can use locks.

reserving_segment_id is also affected. However, you never actually need
the value of reserving_segment_id, you only compare it to zero. Thus, you
can change this variable to the "int" type and set it to "0" or "1". (you
must use int and not bool, see the next section).
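
For the 64-bit counters, a minimal sketch with atomic64_t (note that the
atomic64 operations themselves don't imply barriers, so the smp_wmb/smp_rmb
pairing from the previous section still applies):

	atomic64_t last_flushed_segment_id;	/* in the cache structure */

	/* writer */
	atomic64_inc(&cache->last_flushed_segment_id);

	/* reader - atomic even on 32-bit architectures */
	u64 id = atomic64_read(&cache->last_flushed_segment_id);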


Variables smaller than 32 bits must not be asynchronously modified
------------------------------------------------------------------

Early Alpha processors can't access memory objects smaller than 32 bits -
so, for example, if your code writes an 8-bit char or bool variable, the
processor reads 32 bits to a register, modifies the 8-bit part of the
register and writes 32 bits from the register back to memory. Now, if you
define something like

	unsigned char a;
	unsigned char b;

and modify "a" from thread 1 and "b" from thread 2, the threads could
destroy each other's modification: thread 1 can destroy "b" (although in
the C code it never references "b") and thread 2 can destroy "a" (although
in the C code it never references "a"). For this reason, if you have
variables that you modify asynchronously, they shouldn't have a type
smaller than 32 bits.

bool allow_migrate, bool enable_migration_modulator, bool on_terminate
have this problem, change them to int.


Lack of ACCESS_ONCE
-------------------

You can read a variable while you modify it (the variable must not be
bigger than "long" type, see the section "Nonatomic 64-bit variables").

However, if you read a variable that may be modified, you must use the
ACCESS_ONCE macro.

For example see this:

	if (cache->on_terminate)
		return;

cache->on_terminate may change while you are reading it, so you should use

	if (ACCESS_ONCE(cache->on_terminate))
		return;

There are many other variables that are read while modifying them and that
need ACCESS_ONCE, for example cache->allow_migrate. There are plenty of
other variables in your code that may be read and modified at the same
time.

The reason why you need ACCESS_ONCE is that the C compiler assumes that
the variable that you read doesn't change. Thus, it can perform certain
optimizations. ACCESS_ONCE suppresses these optimizations.

In most cases, omitting ACCESS_ONCE doesn't cause any misbehavior, because
the compiler happens not to use the assumption that the variable doesn't
change when performing optimizations. But the compiler may use this
assumption, and if it does, it triggers a hard-to-find bug.


GFP_NOIO allocations
--------------------

If possible, you should use mempool instead. Mempool allocation doesn't
fail (if memory is exhausted, it waits until some objects are returned to
the mempool).

kmalloc_retry is not needed - there's a flag __GFP_NOFAIL that does
infinite retry.

The problem with GFP_IO is that if the system is in such a state that all
memory is dirty and the only way to free memory is to write pages to
the swap, it deadlocks - the memory manager waits for your driver to write
some data to the swap - and your driver is waiting for the memory manager
to free some memory so that you can allocate memory and process the write.

To avoid this problem, use mempools.
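
A minimal sketch (the pool size 16 is arbitrary):

	mempool_t *pool;

	pool = mempool_create_kmalloc_pool(16, sizeof(struct flush_job));
	if (!pool)			/* only the creation can fail */
		return -ENOMEM;

	/* never fails - it waits until an object is freed if needed */
	job = mempool_alloc(pool, GFP_NOIO);
	/* ... use the job ... */
	mempool_free(job, pool);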


64-bit divide and modulo
------------------------

You can't use divide and modulo operators ('/' and '%') on 64-bit values.
On 32-bit architectures, these operators need library functions and these
functions are not present in the kernel. You need to use div64_u64
(divides a 64-bit value by a 64-bit value) or div64_u64_rem (divides a
64-bit value by a 64-bit value and also returns the remainder). You can
also use do_div (it divides a 64-bit value by a 32-bit value) or
sector_div (it divides sector_t by a 32-bit value).

Try to compile your driver with a 32-bit kernel and you will see that '/'
and '%' don't work on 64-bit values.
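
For example:

	#include <linux/math64.h>

	u64 q, r64;
	u32 r32;

	q = div64_u64(a, b);		/* q = a / b, both 64-bit */
	q = div64_u64_rem(a, b, &r64);	/* also gives r64 = a % b */

	/* do_div modifies its argument in place and returns the
	   remainder; the divisor must be 32-bit */
	r32 = do_div(a, 512);		/* now a = a / 512 */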


Wrong printf arguments
----------------------

Try to compile your driver on 32-bit system. You get some warnings.

size_t x;
printf("%lu", x) - this is wrong because %lu says that the type is
unsigned long and the type is size_t (size_t may be unsigned or unsigned
long on different architectures). It should be
printf("%zu", z);

sector_t s;
printf("%lu", s) - this is wrong because the sector_t may not have the
same type as unsigned long. (on 32-bit architectures, you can select
32-bit sector_t or 64-bit sector_t in kernel configuration). It should be
printf("%llu", (unsigned long long)s);

DMEMIT("%lu ", atomic64_read(v)); - wrong, because format is unsigned long
and type is 64-bit. It should be
DMEMIT("%llu ", (unsigned long long)atomic64_read(v));


Infinite retry on I/O error in dm_safe_io_retry
-----------------------------------------------

Don't do this. If the cache disk fails for some reason, it kills the
system by doing infinite retry.

Generally, I/O retry is handled by the lowest level driver (for example,
if ATA disk fails to respond, the driver tries to reset the disk and
reissue the request). If you get I/O error, you can assume that the lowest
level driver retried as much as it could and you shouldn't do additional
retry.

If you can't handle a specific I/O request failure gracefully, you should
mark the driver as dead, don't do any more I/Os to the disk or cache
device and return -EIO on all incoming requests.

Always think that I/O failures can happen because of connection problems,
not data corruption problems - for example, a disk cable can go loose, a
network may lose connectivity, etc. In these cases, it is best to stop
doing any I/O at all and let the user resolve the situation.

I think that you should always stop the driver on a write error (stopping it
minimizes data damage when there is a connectivity problem such as a loose
cable). On a read error you should stop the driver if the error is
non-recoverable (for example, when you get an error reading the cache
device in the migration process). You don't have to stop on a read error if
it is recoverable (for example, when you see a read error on the disk, you
can just return -EIO to the user without stopping the driver).
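
For illustration, such a "dead" flag could look like this (the flag name
is made up; note it is an int, per the sections above):

	/* on a fatal I/O error: */
	cache->dead = 1;
	DMERR("writeboost: device is dead, failing all I/O");

	/* at the beginning of the .map method: */
	if (unlikely(ACCESS_ONCE(cache->dead))) {
		bio_endio(bio, -EIO);
		return DM_MAPIO_SUBMITTED;
	}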


Pointless modulator_proc
------------------------

This thread does no work, it just turns cache->allow_migrate on and off.
Thus, the thread is useless, you can just remove it. In the place where
you read cache->allow_migrate (in migrate_proc) you could just do the work
that used to be performed in modulator_proc.


Rewrite consideration
---------------------

Some of the above problems can be fixed easily, but others can't - for
example, if the code uses timed polling, it is not trivial to convert it
to event-based processing. Similarly, the lack of locks or memory barriers
when accessing shared data can't be easily fixed - I pinpointed some cases
where they are missing, but I didn't find all the problems - if you wrote
the code without considering the need for synchronization primitives, it
is not trivial to add them afterwards.

I think the best thing would be to rewrite the code taking the above
notes into account (in the rewrite you can reuse pieces of the code
already written).


First, you must design some event model (a bunch of threads polling at
1-second intervals doesn't seem viable). For example, you can use
workqueues correctly this way:

* You create a workqueue for migration, but you don't submit any work to
it.
* If you fill the cache device above the watermark, you submit a work item
to the migration workqueue (you can submit the same work item to the
same workqueue multiple times - if the work item is still queued and
hasn't started executing, the second submit is ignored; if the work item
has started executing, the second submit causes it to be executed once
more).
* The work item does a little bit of work, it finds data to be migrated,
submits the bios to do the migration and exits. (you can use dm-kcopyd
to do actual migration, it simplifies your code)
* If you need more migration, you just submit the work item again.

If you design it this way, you avoid the polling timers (migration starts
as soon as it is needed) and also you avoid the problem with the looping
workqueue.
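
A sketch of this model (the names are illustrative):

	static void migrate_work_fn(struct work_struct *work)
	{
		struct wb_cache *cache =
			container_of(work, struct wb_cache, migrate_work);

		/* do a little bit of work and exit */
		submit_migration_bios(cache);

		/* if there is still too much dirty data, run again */
		if (above_watermark(cache))
			queue_work(cache->migrate_wq, &cache->migrate_work);
	}

	/* setup - WQ_MEM_RECLAIM because this is the writeback path */
	cache->migrate_wq = alloc_workqueue("wbmigrate", WQ_MEM_RECLAIM, 1);
	INIT_WORK(&cache->migrate_work, migrate_work_fn);

	/* whenever a write pushes the cache above the watermark: */
	queue_work(cache->migrate_wq, &cache->migrate_work);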


Next, you need to design some locking - which variables are protected by
which locks. If you use shared variables without locks, you need to use
memory barriers (it is harder to design code using memory barriers than
locks).


Mikulas


2013-10-06 12:47:27

by Akira Hayakawa

[permalink] [raw]
Subject: Re: A review of dm-writeboost

Mikulas,

Thank you for your review.

I will reply to polling issue first.
For the rest, I will reply later.

> Polling for state
> -----------------
>
> Some of the kernel threads that you spawn poll for data at one-second
> intervals - see migrate_proc, modulator_proc, recorder_proc, sync_proc.
>
> flush_proc correctly contains wait_event, but there is still some 100ms
> polling timeout in flush_proc that shouldn't be there.
>
> If you set the polling interval too low, it wastes CPU cycles and it wastes
> energy due to CPU wake-ups. If you set the polling interval too high, it
> causes artificial delays. For example, current middle-class SSDs can do
> writes at a rate of 500MB/s. Now, suppose that the user uses a 400MB partition
> for the cache - the partition is filled up in 0.8s and the kernel waits
> for an additional 0.2s doing absolutely nothing, until your polling code
> wakes up and notices that it should start flushing the cache.
>
> So - the polling code introduces artificial delays that can cause
> performance degradation. You may think about lowering the polling interval
> to lessen the performance impact, but if you lower the polling interval,
> it increases CPU and power consumption when the driver is idle. Either
> way, it is wrong. Just don't use polling.

You dislike the fact that a sleeping looping thread
can delay newly arriving jobs.

I am afraid your understanding of
the sleeping of the flush daemon is wrong.
Please see queue_flushing(): it wakes up the daemon
if the queue was empty.

	spin_lock_irqsave(&cache->flush_queue_lock, flags);
	empty = list_empty(&cache->flush_queue);
	list_add_tail(&job->flush_queue, &cache->flush_queue);
	spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
	if (empty)
		wake_up_interruptible(&cache->flush_wait_queue);

Even if these wake-up lines are removed,
the flush daemon wakes up by itself,
100ms at worst after the first 1MB is written.

> Don't do this. You can do polling in a piece of code that is executed
> infrequently, such as driver unload, but general functionality must not
> depend on polling.
100ms is the time the user must wait at termination, as you claim.

However, I think this is a good time to
explain why I didn't choose to use workqueue.

Keeping it a polling thread that consumes what is in the queue,
instead of replacing it with a workqueue,
has these reasons:

1. CPU overhead is lower
The overhead of queue_work every time a flush job is queued
is crucial for writeboost. queue_work is too CPU-consuming,
as far as I have looked at the code.
I have measured the theoretical maximum 4KB randwrite throughput
of writeboost using the fio benchmark. It was 1.5GB/sec
with a fast enough cache device.
I don't think this extraordinary score can be achieved with
other caching software.
In this benchmark, the flush daemon didn't sleep at all
because the writes are ceaseless.
Using a workqueue would not only lose throughput
but also waste CPU cycles regularly, which is what I dislike.
2. Ease of error handling
Keeping it a single looping thread that manages its own queue
makes it easy to handle errors.
If we choose to use a workqueue, we need a mechanism to
* notify the occurrence of an error to other work items
* replay the work items in the same order, since waiting for a long
time until the system recovers is prohibited with a workqueue,
as we discussed in the previous posts.
Segments must be flushed and then migrated in sequential
order of their ids.
Leaving the control of the queue, which must stay in the desired order,
to the workqueue subsystem is dangerous in this case.

For migrate_proc,
I see no reason why this daemon's polling is bad,
although it actually sleeps in leisure time.
When using writeboost, we can obviously assume that
the sequential write throughput of the cache device is much higher
than the random write throughput of the backing store,
so the ordinary situation is that the migrate daemon is unlikely
to sleep, since there are many dirty segments to migrate.
Furthermore, since the migration daemon is one more block
away from the user (upper layer) than the flush daemon,
the one-second sleeping problem gets weaker or diluted.
If you still dislike the delay of this daemon,
I am thinking of forcefully waking up the daemon if it is sleeping,
like the flush daemon.

> First, you must design some event model (a bunch of threads polling at
> 1-second intervals doesn't seem viable). For example, you can use
> workqueues correctly this way:
>
> * You create a workqueue for migration, but you don't submit any work to
> it.
> * If you fill the cache device above the watermark, you submit a work item
> to the migration workqueue (you can submit the same work item to the
> same workqueue multiple times - if the work item is still queued and
> hasn't started executing, the second submit is ignored; if the work item
> has started executing, the second submit causes it to be executed once
> more).
> * The work item does a little bit of work, it finds data to be migrated,
> submits the bios to do the migration and exits. (you can use dm-kcopyd
> to do actual migration, it simplifies your code)
> * If you need more migration, you just submit the work item again.
>
> If you design it this way, you avoid the polling timers (migration starts
> as soon as it is needed) and also you avoid the problem with the looping
> workqueue.
Reason 2, for which I don't choose a workqueue,
also applies to the design of the migration daemon, and
I believe keeping it a single looping thread is best
if we have to care about error handling.

dm-kcopyd seems to be a good choice for simplifying the code.
I remember that I discarded this option
for some reason before, but I don't remember the reason.
I am thinking of reconsidering this issue, and
maybe I can eventually reach using kcopyd
because the design has changed a lot since that time.

> to lessen the performance impact, but if you lower the polling interval,
> it increases CPU and power consumption when the driver is idle. Either
> way, it is wrong. Just don't use polling.
One last thing.
The power consumption in idle time is the only shortcoming
of polling.
But who cares about the power consumption of
waking up once a second and checking whether a queue is empty?

Akira

2013-10-07 18:07:12

by Mikulas Patocka

[permalink] [raw]
Subject: Re: A review of dm-writeboost



On Sun, 6 Oct 2013, Akira Hayakawa wrote:

> Mikulas,
>
> Thank you for your reviewing.
>
> I will reply to polling issue first.
> For the rest, I will reply later.
>
> > Polling for state
> > -----------------
> >
> > Some of the kernel threads that you spawn poll for data at one-second
> > intervals - see migrate_proc, modulator_proc, recorder_proc, sync_proc.
> >
> > flush_proc correctly contains wait_event, but there is still some 100ms
> > polling timeout in flush_proc that shouldn't be there.
> >
> > If you set the polling interval too low, it wastes CPU cycles and it wastes
> > energy due to CPU wake-ups. If you set the polling interval too high, it
> > causes artificial delays. For example, current middle-class SSDs can do
> > writes at a rate of 500MB/s. Now, suppose that the user uses a 400MB partition
> > for the cache - the partition is filled up in 0.8s and the kernel waits
> > for an additional 0.2s doing absolutely nothing, until your polling code
> > wakes up and notices that it should start flushing the cache.
> >
> > So - the polling code introduces artificial delays that can cause
> > performance degradation. You may think about lowering the polling interval
> > to lessen the performance impact, but if you lower the polling interval,
> > it increases CPU and power consumption when the driver is idle. Either
> > way, it is wrong. Just don't use polling.
>
> You dislike the fact that a sleeping looping thread
> can delay newly arriving jobs.
>
> I am afraid your understanding of
> the sleeping of the flush daemon is wrong.
> Please see queue_flushing(): it wakes up the daemon
> if the queue was empty.
>
> spin_lock_irqsave(&cache->flush_queue_lock, flags);
> empty = list_empty(&cache->flush_queue);
> list_add_tail(&job->flush_queue, &cache->flush_queue);
> spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
> if (empty)
> wake_up_interruptible(&cache->flush_wait_queue);
>
> Even if these wake-up lines are removed,
> the flush daemon wakes up by itself,
> 100ms at worst after the first 1MB is written.

flush_proc is woken up correctly. But the other threads aren't.
migrate_proc, modulator_proc, recorder_proc, sync_proc all do polling.

Waking up every 100ms in flush_proc is not good because it wastes CPU time
and energy if the driver is idle.


BTW. You should use wait_event_interruptible_lock_irq instead of
wait_event_interruptible and wait_event_interruptible_lock_irq_timeout
instead of wait_event_interruptible_timeout. The functions with "lock_irq"
automatically drop and re-acquire the spinlock, so that the condition is
tested while the lock is held - so that there are no asynchronous accesses
to !list_empty(&cache->flush_queue).
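
The consumer side would then look roughly like this (a sketch - note the
plain spin_lock_irq, because the macro internally drops and re-acquires
the lock with spin_unlock_irq/spin_lock_irq, and the return value should
be checked if signals are allowed):

	spin_lock_irq(&cache->flush_queue_lock);
	/* the condition is evaluated with the spinlock held */
	wait_event_interruptible_lock_irq(cache->flush_wait_queue,
					  !list_empty(&cache->flush_queue),
					  cache->flush_queue_lock);
	job = list_first_entry(&cache->flush_queue,
			       struct flush_job, flush_queue);
	list_del(&job->flush_queue);
	spin_unlock_irq(&cache->flush_queue_lock);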


> > Don't do this. You can do polling in a piece of code that is executed
> > infrequently, such as driver unload, but general functionality must not
> > depend on polling.
> 100ms is the time the user must wait at termination, as you claim.
>
> However, I think this is a good time to
> explain why I didn't choose to use workqueue.
>
> Keeping it a polling thread that consumes what is in the queue,
> instead of replacing it with a workqueue,
> has these reasons:
>
> 1. CPU overhead is lower
> The overhead of queue_work every time a flush job is queued
> is crucial for writeboost. queue_work is too CPU-consuming,
> as far as I have looked at the code.
> I have measured the theoretical maximum 4KB randwrite throughput
> of writeboost using the fio benchmark. It was 1.5GB/sec
> with a fast enough cache device.
> I don't think this extraordinary score can be achieved with
> other caching software.
> In this benchmark, the flush daemon didn't sleep at all
> because the writes are ceaseless.
> Using a workqueue would not only lose throughput
> but also waste CPU cycles regularly, which is what I dislike.
> 2. Ease of error handling
> Keeping it a single looping thread that manages its own queue
> makes it easy to handle errors.
> If we choose to use a workqueue, we need a mechanism to
> * notify the occurrence of an error to other work items
> * replay the work items in the same order, since waiting for a long
> time until the system recovers is prohibited with a workqueue,
> as we discussed in the previous posts.
> Segments must be flushed and then migrated in sequential
> order of their ids.
> Leaving the control of the queue, which must stay in the desired order,
> to the workqueue subsystem is dangerous in this case.

You don't have to use workqueues if you don't like them. You can use
kernel threads and wait_event/wake_up instead. Workqueues are simpler to
use in some circumstances (for example, you can submit the same work item
multiple times) but it's up to you if you want that simplicity or not.

But you shouldn't do looping and waiting for events in migrate_proc.

> For migrate_proc,
> I see no reason why this daemon's polling is bad,
> although it actually sleeps in leisure time.
> When using writeboost, we can obviously assume that
> the sequential write throughput of the cache device is much higher
> than the random write throughput of the backing store,
> so the ordinary situation is that the migrate daemon is unlikely
> to sleep, since there are many dirty segments to migrate.

The problem is that you can fill up the whole cache device in less time
than 1 second. Then, you are waiting for 1 second until the migration
process notices that it has some work to do. That waiting is pointless and
degrades performance - the new write requests received by your driver sit
in queue_current_buffer, waiting for data to be migrated. And the
migration process sits in
schedule_timeout_interruptible(msecs_to_jiffies(1000)), waiting for the
end of the timeout. Your driver does absolutely nothing in this moment.

For this reason, you should wake the migration process as soon as
it is needed.

> Furthermore, since the migration daemon is one more block
> away from the user (upper layer) than the flush daemon,
> the one-second sleeping problem gets weaker or diluted.
> If you still dislike the delay of this daemon,
> I am thinking of forcefully waking up the daemon if it is sleeping,
> like the flush daemon.
>
> One last thing.
> The power consumption in idle time is the only shortcoming
> of polling.

The other shortcoming is performance degradation. If you decrease the
polling time, you improve performance, but increase power consumption. If
you increase the polling time, you reduce power consumption, but you
degrade performance.

Either decreasing or increasing the polling time is wrong.

> But who cares about the power consumption of
> waking up once a second and checking whether a queue is empty?
>
> Akira

Mikulas

2013-10-08 13:17:45

by Akira Hayakawa

[permalink] [raw]
Subject: Re: [dm-devel] A review of dm-writeboost

Mikulas,

Let me ask you about this comment
on choosing the best API.
For the rest, I will reply later.

> BTW. You should use wait_event_interruptible_lock_irq instead of
> wait_event_interruptible and wait_event_interruptible_lock_irq_timeout
> instead of wait_event_interruptible_timeout. The functions with "lock_irq"
> automatically drop and re-acquire the spinlock, so that the condition is
> tested while the lock is held - so that there are no asynchronous accesses
> to !list_empty(&cache->flush_queue).

wait_event_interruptible_lock_irq_timeout was
added to the kernel in 3.12 by this patch:
https://lkml.org/lkml/2013/8/21/362
However, it is only used by zfcp itself.

I am afraid I want to refrain from this new API
and keep the code as it is now.
My opinion is that if this API is widely accepted later,
that will be the time to use it in writeboost.
The fact that this API was not added for a long time
makes me feel it should not be used, at least at this moment.
I want to use only those APIs that are truly stable.
writeboost, as a storage kernel module, should be stable.
I believe depending only on stable APIs is a
good way of making software stable.

All said, I tried to fix this.
Is this change what you meant?

@@ -24,10 +24,10 @@ int flush_proc(void *data)

spin_lock_irqsave(&cache->flush_queue_lock, flags);
while (list_empty(&cache->flush_queue)) {
- spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
- wait_event_interruptible_timeout(
+ wait_event_interruptible_lock_irq_timeout(
cache->flush_wait_queue,
- (!list_empty(&cache->flush_queue)),
+ !list_empty(&cache->flush_queue),
+ cache->flush_queue_lock,
msecs_to_jiffies(100));

/*
@@ -36,8 +36,6 @@ int flush_proc(void *data)
*/
if (kthread_should_stop())
return 0;
- else
- spin_lock_irqsave(&cache->flush_queue_lock, flags);
}

Akira

2013-10-09 02:44:31

by Akira Hayakawa

[permalink] [raw]
Subject: Re: [dm-devel] A review of dm-writeboost

Mikulas,

> Waking up every 100ms in flush_proc is not good because it wastes CPU time
> and energy if the driver is idle.
Yes, 100ms is too short. I will change it to 1 sec then.
We can wait for 1 sec at termination.

> The problem is that you can fill up the whole cache device in less time
> than 1 second. Then, you are waiting for 1 second until the migration
> process notices that it has some work to do. That waiting is pointless and
> degrades performance - the new write requests received by your driver sit
> in queue_current_buffer, waiting for data to be migrated. And the
> migration process sits in
> schedule_timeout_interruptible(msecs_to_jiffies(1000)), waiting for the
> end of the timeout. Your driver does absolutely nothing in this moment.
>
> For this reason, you should wake the migration process as soon as
> it is needed.
I see the reason. I agree.

Migration is not restlessly executed in writeboost.
The cache device being full and needing migration to make room for new
writes is the "urgent" situation.
Setting reserving_segment_id to non-zero tells the
migration daemon that the migration is urgent.

There is also a non-urgent migration,
when the load of the backing store is lower than a threshold.
Restlessly migrating to the backing store may affect the
whole system. For example, it may delay read requests
to the backing store.
So, migrating only in idle time can be a good strategy.

> Pointless modulator_proc
> ------------------------
>
> This thread does no work, it just turns cache->allow_migrate on and off.
> Thus, the thread is useless, you can just remove it. In the place where
> you read cache->allow_migrate (in migrate_proc) you could just do the work
> that used to be performed in modulator_proc.
Sure, it just turns the flag on and off, but this daemon is needed.
This daemon calculates the load of the backing store;
moving this code to migrate_proc and doing the same thing
on every loop iteration would be too CPU-consuming.
Making a decision every second seems reasonable.

However, some systems don't want to delay migration at all,
because the backing store has a large write-back cache
and wants it filled for its optimizations
(e.g. reordering) to be effective.
In this case, setting both enable_migration_modulator
and allow_migrate to 0 will do.

Also, note that, related to migration,
nr_max_batched_migration can determine how many segments
can be migrated at a time.

Back to the urgent migration,
the problem can be solved easily.
How about inserting a wake-up of the migration daemon
just after setting reserving_segment_id to non-zero?
It is similar to waking up the flush daemon when queuing a flush job.

	void wait_for_migration(struct wb_cache *cache, u64 id)
	{
		struct segment_header *seg = get_segment_header_by_id(cache, id);

		/*
		 * Set reserving_segment_id to non zero
		 * to force the migration daemon
		 * to complete migration of this segment
		 * immediately.
		 */
		cache->reserving_segment_id = id;
		// HERE
		wait_for_completion(&seg->migrate_done);
		cache->reserving_segment_id = 0;
	}
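
Something like this at the HERE mark (assuming the daemon's task_struct
is stored in cache->migrate_daemon):

	/* wake the daemon so urgent migration starts immediately,
	   instead of waiting out the remaining poll timeout */
	wake_up_process(cache->migrate_daemon);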

> flush_proc is woken up correctly. But the other threads aren't.
> migrate_proc, modulator_proc, recorder_proc, sync_proc all do polling.
For the other daemons:
modulator: turns migration on and off according to the load of the backing store every second (default ON)
recorder: updates the superblock record every T seconds (default T=60)
sync: makes all the transient data persistent every T seconds (default T=60)

They just loop by themselves.

Maybe recorder and sync should be turned off by default.
- The recorder daemon is just for fast rebooting. The record section contains
the last_migrated_segment_id, which is used in recover_cache()
to decrease the number of segments to recover.
- The sync daemon is for SLAs in enterprise use. Some users want to
make the storage system persistent every given period.
This is intrinsically needless, so turning it off by default is appropriate.

Akira

2013-10-09 07:39:59

by Akira Hayakawa

[permalink] [raw]
Subject: Re: A review of dm-writeboost

Mikulas,

> Next, you need to design some locking - which variables are protected by
> which locks. If you use shared variables without locks, you need to use
> memory barriers (it is harder to design code using memory barriers than
> locks).

First I will explain the locking and the shared variables.
writeboost doesn't have many variables shared among asynchronous threads.

1. We have a hash table for cache lookup and a RAM buffer that holds write data transiently.
These data structures are protected by a mutex, io_lock.
barrier_ios is a queue for bios flagged with REQ_FUA or REQ_FLUSH.
This queue is also protected by io_lock.
These bios are chained to a flush job while io_lock is held.
queue_flushing is a routine that makes a flush job and queues it in
flush_queue, which is protected by a spinlock, flush_queue_lock.
The foreground (.map) is the producer of flush jobs for the flush daemon.
2. The flush daemon loops flush_proc in a single thread.
It pops and consumes one flush job from flush_queue (writing to the cache device).
Therefore, the foreground must ensure that the flush job is
completely initialized before queuing it.
After completing the write, it sets (actually increments) last_flushed_segment_id.
This daemon is the only one to update this shared variable.
3. The migrate daemon loops migrate_proc, also in a single thread.
It first calculates how many (nr_mig) segments to migrate, from
last_migrated_segment_id + 1 to + nr_mig, by referring to
last_flushed_segment_id and last_migrated_segment_id.
The former is the head of the dirty log and the latter
is the tail. Note that the head doesn't have to be synchronized,
since last_flushed_segment_id increases linearly.
An unsynchronized read of last_flushed_segment_id only results in fewer migrations than possible.
It doesn't corrupt anything at all.
After migrating all the nr_mig dirty segments, it increments
last_migrated_segment_id by nr_mig.
This daemon is the only one to update last_migrated_segment_id.

Other daemons don't share such variables.

> Think of this piece of code:
> nr_mig_candidates = cache->last_flushed_segment_id -
> cache->last_migrated_segment_id;
> ...
> nr_mig = min(nr_mig_candidates,
> cache->nr_cur_batched_migration);
> ...
> for (i = 1; i <= nr_mig; i++) {
> seg = get_segment_header_by_id(
> cache,
> cache->last_migrated_segment_id + i);
> list_add_tail(&seg->migrate_list, &cache->migrate_list);
> }
>
> The processor may reorder all memory accesses - for example it may read
> the data accessed in the "for" cycle before reading the variables
> "cache->last_flushed_segment_id" and "cache->last_migrated_segment_id". If
> this happens, the code may work with invalid data.

Is that true?
My understanding is that reordering does not happen
across these three blocks of code because there are sequential data
dependencies. Is the point that the last block is a while loop?

But I sorted out two other places where we need to insert memory barriers.

First is in queue_flushing.
We can't rule out that an incomplete flush job is queued and
flush_proc goes bad.
I am thinking of inserting smp_wmb before queuing and
smp_rmb after popping a job in flush_proc.

Second is in migrate_proc.
We can't rule out that migrate_proc goes into migrate_linked_segments
before cache->migrate_list is properly initialized.
I am thinking of inserting smp_wmb before migrate_linked_segments.

I don't understand memory barriers well.
So, please correct me if I am going the wrong way.

> Nonatomic 64-bit variables
> --------------------------
>
> Note that 64-bit variables are not atomic on 32-bit architectures.
>
> Linux assumes that 32-bit variables are atomic on 32-bit architectures and
> 64-bit or 32-bit variables are atomic on 64-bit architectures. That is,
> variables having the "long" or "int" type are atomic. Atomicity means that
> if you read and modify the variable at the same time, you read either the
> old value or the new value.
>
> 64-bit variables on 32-bit architectures are usually read and written in
> two steps, and thus the atomicity guarantee doesn't hold.
>
> For example, suppose that you change cache->last_flushed_segment_id from
> 0x00000000ffffffff to 0x0000000100000000. Now, the function migrate_proc
> that asynchronously reads cache->last_flushed_segment_id can read
> 0x00000000ffffffff or 0x0000000100000000 (if it reads one of these values,
> it behaves correctly), but it can also read 0x0000000000000000 or
> 0x00000001ffffffff - if migrate_proc reads one of these two values, it
> goes wild, migrating segments that weren't ever supposed to be migrated,
> and likely causes a crash or data corruption.
>
> I found this bug in migrate_proc and update_superblock_record (but it may
> be also in other parts of the code).
>
> You can use atomic64_t type for atomic 64-bit variables (plus memory
> barriers as explained in the previous section). Or you can use locks.
>
> reserving_segment_id is also affected. However, you never actually need
> the value of reserving_segment_id, you only compare it to zero. Thus, you
> can change this variable to the "int" type and set it to "0" or "1". (you
> must use int and not bool, see the next section).
Yes.
reserving_segment_id will be renamed to `urge_migrate` (::int)
that will fit more.

> Variables smaller than 32 bits must not be asynchronously modified
> ------------------------------------------------------------------
>
> Early Alpha processors can't access memory objects smaller than 32 bits -
> so, for example, if your code writes an 8-bit char or bool variable, the
> processor reads 32 bits to a register, modifies the 8-bit part of the
> register and writes 32 bits from the register back to memory. Now, if you
> define something like
> unsigned char a;
> unsigned char b;
> and modify "a" from thread 1 and "b" from thread 2, the threads could
> destroy each other's modification: thread 1 can destroy "b" (although in
> the C code it never references "b") and thread 2 can destroy "a" (although
> in the C code it never references "a"). For this reason, if you have
> variables that you modify asynchronously, they shouldn't have a type
> smaller than 32 bits.
>
> bool allow_migrate, bool enable_migration_modulator, bool on_terminate
> have this problem, change them to int.
Yes. May I use true/false to update these values, or should I use 1/0 instead?

> 64-bit divide and modulo
> ------------------------
>
> You can't use divide and modulo operators ('/' and '%') on 64-bit values.
> On 32-bit architectures, these operators need library functions and these
> functions are not present in the kernel. You need to use div64_u64
> (divides a 64-bit value by a 64-bit value) or div64_u64_rem (divides a
> 64-bit value by a 64-bit value and also returns the remainder). You can
> also use do_div (it divides a 64-bit value by a 32-bit value) or
> sector_div (it divides sector_t by a 32-bit value).
>
> Try to compile your driver with a 32-bit kernel and you will see that '/'
> and '%' don't work on 64-bit values.
Yes.

> Wrong printf arguments
> ----------------------
>
> Try to compile your driver on 32-bit system. You get some warnings.
>
> size_t x;
> printf("%lu", x) - this is wrong because %lu says that the type is
> unsigned long and the type is size_t (size_t may be unsigned or unsigned
> long on different architectures). It should be
> printf("%zu", z);
>
> sector_t s;
> printf("%lu", s) - this is wrong because the sector_t may not have the
> same type as unsigned long. (on 32-bit architectures, you can select
> 32-bit sector_t or 64-bit sector_t in kernel configuration). It should be
> printf("%llu", (unsigned long long)s);
>
> DMEMIT("%lu ", atomic64_read(v)); - wrong, because format is unsigned long
> and type is 64-bit. It should be
> DMEMIT("%llu ", (unsigned long long)atomic64_read(v));
Yes.

> GFP_NOIO allocations
> --------------------
>
> If possible, you should use mempool instead. Mempool allocation doesn't
> fail (if memory is exhausted, it waits until some objects are returned to
> the mempool).
>
> kmalloc_retry is not needed - there's a flag __GFP_NOFAIL that does
> infinite retry.
>
> The problem with GFP_IO is that if the system is in such a state that all
> memory is dirty and the only way to free memory is to write pages to
> the swap, it deadlocks - the memory manager waits for your driver to write
> some data to the swap - and your driver is waiting for the memory manager
> to free some memory so that you can allocate memory and process the write.
>
> To avoid this problem, use mempools.
No, I use GFP_NOIO, not GFP_IO.

Yes, I understand that we don't need kmalloc_retry. I didn't know about __GFP_NOFAIL.
I will rewrite all the kmalloc_retry calls.
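
For the record, a one-line sketch of the replacement:

	/* blocks until the allocation succeeds, instead of retrying by hand */
	buf = kmalloc(1 << SECTOR_SHIFT, GFP_NOIO | __GFP_NOFAIL);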

> Lack of ACCESS_ONCE
> -------------------
>
> You can read a variable while you modify it (the variable must not be
> bigger than "long" type, see the section "Nonatomic 64-bit variables").
>
> However, if you read a variable that may be modified, you must use the
> ACCESS_ONCE macro.
>
> For example see this:
> if (cache->on_terminate)
> return;
> cache->on_terminate may change while you are reading it, so you should use
> if (ACCESS_ONCE(cache->on_terminate))
> return;
>
> There are many other variables that are read while modifying them and that
> need ACCESS_ONCE, for example cache->allow_migrate. There are plenty of
> other variables in your code that may be read and modified at the same
> time.
>
> The reason why you need ACCESS_ONCE is that the C compiler assumes that
> the variable that you read doesn't change. Thus, it can perform certain
> optimizations. ACCESS_ONCE suppresses these optimizations.
>
> In most cases, omitting ACCESS_ONCE doesn't cause any misbehavior, because
> the compiler happens not to use the assumption that the variable doesn't
> change when performing optimizations. But the compiler may use this
> assumption, and if it does, it triggers a hard-to-find bug.
I understand.
The other variables could be the tunable parameters writeboost provides.

> Infinite retry on I/O error in dm_safe_io_retry
> -----------------------------------------------
>
> Don't do this. If the cache disk fails for some reason, it kills the
> system by doing infinite retry.
>
> Generally, I/O retry is handled by the lowest level driver (for example,
> if ATA disk fails to respond, the driver tries to reset the disk and
> reissue the request). If you get I/O error, you can assume that the lowest
> level driver retried as much as it could and you shouldn't do additional
> retry.
>
> If you can't handle a specific I/O request failure gracefully, you should
> mark the driver as dead, don't do any more I/Os to the disk or cache
> device and return -EIO on all incoming requests.
>
> Always think that I/O failures can happen because of connection problems,
> not data corruption problems - for example, a disk cable can go loose, a
> network may lose connectivity, etc. In these cases, it is best to stop
> doing any I/O at all and let the user resolve the situation.
>
> I think that you should always stop the driver on a write error (stopping it
> minimizes data damage when there is a connectivity problem such as a loose
> cable). On a read error you should stop the driver if the error is
> non-recoverable (for example, when you get an error reading the cache
> device in the migration process). You don't have to stop on a read error if
> it is recoverable (for example, when you see a read error on the disk, you
> can just return -EIO to the user without stopping the driver).
The reason I introduced infinite retry is that
I have experienced a cache device connected through a USB-SATA2 adapter
that sometimes fails when the I/O load is too heavy for its cheap firmware.

So, please let me retry only once after waiting for 1 sec.
If the I/O fails again, I will stop the module, marking it as dead.
Or, should I immediately stop the module?

My design idea is to create a new daemon
that sets the module dead after a failure is reported from the other daemons,
and that listens for the client to report that he/she fixed the problem,
at which point this daemon starts the module again.

There is a wait queue on which all daemons except
this one wait for the status to turn green.

The hint for the fix is seen in dmesg,
printed at the point the failure happened.
It is like "in flush_proc writing to cache failed (sector xxx-yyy)",
and then the client finds that something happened in writing to the cache device.

Stopping the module is done by rejecting all
writes at the beginning of the .map method.
Yes, if reads are not dead we could still continue read service, but
we should assume the worst case:
we should not even read from a device that fails writes,
since we can't guess what happens inside the storage.
Maybe it returns crazy data and damages the applications.

Akira

2013-10-14 08:28:45

by Akira Hayakawa

[permalink] [raw]
Subject: Re: A review of dm-writeboost

Hi, DM guys,

I suppose I have finished the tasks needed to
address Mikulas's points.
So, let me update the progress report.

The code is now updated on my Github repo.
Check out the develop branch to get
the latest source code.

Compilation Status
------------------
First, compilation status.
Mikulas advised me to compile the module in a
32-bit environment, and yes, I did.
With all the kernels listed below,
writeboost compiles without any error or warning.

For 64 bit
3.2.36, 3.4.25, 3.5.7, 3.6.9, 3.7.5, 3.8.5, 3.9.8, 3.10.5, 3.11.4 and 3.12-rc1

For 32 bit
3.2.0-4-686-pae (Debian 7.1.0-i386)


Block up on error
-----------------
The most annoying thing in this update
is how to handle I/O errors.
For memory allocation errors,
writeboost now makes use of mempools to avoid
the problem Mikulas described in his last comments,
but handling I/O errors gracefully
while the system is running is very difficult.

My answer is that
all the daemons stop when an I/O error (-EIO returned) happens
in any part of this module.
They wait on a wait queue (blockup_wait_queue)
and reactivate when the sysadmin turns the `blockup` variable to 0
through the message interface.
While `blockup` is 1, all incoming I/O
is returned as -EIO to the upper layer.
A RETRY macro is introduced,
which wraps an I/O submission and
retries it if the error is -ENOMEM, but
turns blockup to 1 and sleeps if the error is -EIO.
-EIO is more serious than -ENOMEM because
it may mean the storage is broken by some accidental problem
that we have no control over in the device-mapper layer
(e.g. the storage controller went crazy).
Blocking up all I/O is meant to minimize the
probable damage.
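
A sketch of what the RETRY macro does (the real call sites are in the
diff below; this assumes "r" and "wb" are in the caller's scope and only
handles the two cases described above):

	#define RETRY(io)						\
		do {							\
			r = (io);					\
			if (r == -ENOMEM) {				\
				/* transient - wait and resubmit */	\
				schedule_timeout_interruptible(		\
					msecs_to_jiffies(1000));	\
			} else if (r == -EIO) {				\
				/* serious - block up the module,	\
				   wait for the sysadmin */		\
				wb->blockup = 1;			\
				wait_on_blockup();			\
			}						\
		} while (r)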

But, XFS stalls ...
-------------------
For testing,
I manually turn `blockup` to 1
while compiling Ruby
on XFS on a writeboost device.
As soon as I do it,
XFS starts to dump error messages
like "metadata I/O error: ... ("xlog_iodone") error ..."
and after a few seconds it starts to dump
messages like "BUG: soft lockup - CPU#3 stuck for 22s!".
The system stalls and doesn't accept keyboard input.

I think this behavior is caused by
the device always returning -EIO after the variable
is turned to 1.
But why does XFS stall on I/O error?
It should just suspend and start returning
errors to the upper layer, as writeboost now does.
As Mikulas said, an I/O error is often
due to a connection failure that is usually recoverable.
Stalling the kernel requires a reboot
after recovery, whereas writeboost
can recover just by turning `blockup` back to 0.
Is there any reason for this design, or is there an
option to keep XFS from stalling on I/O error?

Thanks,
Akira

----------------------------------------------

Followed by
changes to Driver/* and Documentation

diff --git a/Driver/dm-writeboost-daemon.c b/Driver/dm-writeboost-daemon.c
index cb8febe..7a7f353 100644
--- a/Driver/dm-writeboost-daemon.c
+++ b/Driver/dm-writeboost-daemon.c
@@ -12,9 +12,11 @@

int flush_proc(void *data)
{
+ int r;
unsigned long flags;

struct wb_cache *cache = data;
+ struct wb_device *wb = cache->wb;

while (true) {
struct flush_job *job;
@@ -22,13 +24,13 @@ int flush_proc(void *data)
struct dm_io_request io_req;
struct dm_io_region region;

+ wait_on_blockup();
+
spin_lock_irqsave(&cache->flush_queue_lock, flags);
while (list_empty(&cache->flush_queue)) {
spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
- wait_event_interruptible_timeout(
- cache->flush_wait_queue,
- (!list_empty(&cache->flush_queue)),
- msecs_to_jiffies(100));
+
+ schedule_timeout_interruptible(msecs_to_jiffies(1000));

/*
* flush daemon can exit
@@ -49,6 +51,8 @@ int flush_proc(void *data)
list_del(&job->flush_queue);
spin_unlock_irqrestore(&cache->flush_queue_lock, flags);

+ smp_rmb();
+
seg = job->seg;

io_req = (struct dm_io_request) {
@@ -65,9 +69,9 @@ int flush_proc(void *data)
.count = (seg->length + 1) << 3,
};

- dm_safe_io_retry(&io_req, 1, &region, false);
+ RETRY(dm_safe_io(&io_req, 1, &region, NULL, false));

- cache->last_flushed_segment_id = seg->global_id;
+ atomic64_set(&cache->last_flushed_segment_id, seg->global_id);

complete_all(&seg->flush_done);

@@ -78,15 +82,15 @@ int flush_proc(void *data)
*/
if (!bio_list_empty(&job->barrier_ios)) {
struct bio *bio;
- blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);
+ RETRY(blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL));
while ((bio = bio_list_pop(&job->barrier_ios)))
bio_endio(bio, 0);

mod_timer(&cache->barrier_deadline_timer,
- msecs_to_jiffies(cache->barrier_deadline_ms));
+ msecs_to_jiffies(ACCESS_ONCE(cache->barrier_deadline_ms)));
}

- kfree(job);
+ mempool_free(job, cache->flush_job_pool);
}
return 0;
}
@@ -101,7 +105,7 @@ void queue_barrier_io(struct wb_cache *cache, struct bio *bio)

if (!timer_pending(&cache->barrier_deadline_timer))
mod_timer(&cache->barrier_deadline_timer,
- msecs_to_jiffies(cache->barrier_deadline_ms));
+ msecs_to_jiffies(ACCESS_ONCE(cache->barrier_deadline_ms)));
}

void barrier_deadline_proc(unsigned long data)
@@ -147,6 +151,7 @@ static void migrate_endio(unsigned long error, void *context)
static void submit_migrate_io(struct wb_cache *cache,
struct segment_header *seg, size_t k)
{
+ int r;
u8 i, j;
size_t a = cache->nr_caches_inseg * k;
void *p = cache->migrate_buffer + (cache->nr_caches_inseg << 12) * k;
@@ -184,7 +189,7 @@ static void submit_migrate_io(struct wb_cache *cache,
.sector = mb->sector,
.count = (1 << 3),
};
- dm_safe_io_retry(&io_req_w, 1, &region_w, false);
+ RETRY(dm_safe_io(&io_req_w, 1, &region_w, NULL, false));
} else {
for (j = 0; j < 8; j++) {
bool b = dirty_bits & (1 << j);
@@ -205,8 +210,7 @@ static void submit_migrate_io(struct wb_cache *cache,
.sector = mb->sector + j,
.count = 1,
};
- dm_safe_io_retry(
- &io_req_w, 1, &region_w, false);
+ RETRY(dm_safe_io(&io_req_w, 1, &region_w, NULL, false));
}
}
}
@@ -216,7 +220,9 @@ static void memorize_dirty_state(struct wb_cache *cache,
struct segment_header *seg, size_t k,
size_t *migrate_io_count)
{
+ int r;
u8 i, j;
+ struct wb_device *wb = cache->wb;
size_t a = cache->nr_caches_inseg * k;
void *p = cache->migrate_buffer + (cache->nr_caches_inseg << 12) * k;
struct metablock *mb;
@@ -233,7 +239,7 @@ static void memorize_dirty_state(struct wb_cache *cache,
.sector = seg->start_sector + (1 << 3),
.count = seg->length << 3,
};
- dm_safe_io_retry(&io_req_r, 1, &region_r, false);
+ RETRY(dm_safe_io(&io_req_r, 1, &region_r, NULL, false));

/*
* We take snapshot of the dirtiness in the segments.
@@ -281,6 +287,8 @@ static void cleanup_segment(struct wb_cache *cache, struct segment_header *seg)

static void migrate_linked_segments(struct wb_cache *cache)
{
+ struct wb_device *wb = cache->wb;
+ int r;
struct segment_header *seg;
size_t k, migrate_io_count = 0;

@@ -336,7 +344,7 @@ migrate_write:
* on this issue by always
* migrating those data persistently.
*/
- blkdev_issue_flush(cache->wb->device->bdev, GFP_NOIO, NULL);
+ RETRY(blkdev_issue_flush(cache->wb->device->bdev, GFP_NOIO, NULL));

/*
* Discarding the migrated regions
@@ -351,43 +359,46 @@ migrate_write:
* will craze the cache.
*/
list_for_each_entry(seg, &cache->migrate_list, migrate_list) {
- blkdev_issue_discard(cache->device->bdev,
- seg->start_sector + (1 << 3),
- seg->length << 3,
- GFP_NOIO, 0);
+ RETRY(blkdev_issue_discard(cache->device->bdev,
+ seg->start_sector + (1 << 3),
+ seg->length << 3,
+ GFP_NOIO, 0));
}
}

int migrate_proc(void *data)
{
struct wb_cache *cache = data;
+ struct wb_device *wb = cache->wb;

while (!kthread_should_stop()) {
bool allow_migrate;
- size_t i, nr_mig_candidates, nr_mig, nr_max_batch;
+ u32 i, nr_mig_candidates, nr_mig, nr_max_batch;
struct segment_header *seg, *tmp;

+ wait_on_blockup();
+
/*
* If urge_migrate is true
* Migration should be immediate.
*/
- allow_migrate = cache->urge_migrate ||
- cache->allow_migrate;
+ allow_migrate = ACCESS_ONCE(cache->urge_migrate) ||
+ ACCESS_ONCE(cache->allow_migrate);

if (!allow_migrate) {
schedule_timeout_interruptible(msecs_to_jiffies(1000));
continue;
}

- nr_mig_candidates = cache->last_flushed_segment_id -
- cache->last_migrated_segment_id;
+ nr_mig_candidates = atomic64_read(&cache->last_flushed_segment_id) -
+ atomic64_read(&cache->last_migrated_segment_id);

if (!nr_mig_candidates) {
schedule_timeout_interruptible(msecs_to_jiffies(1000));
continue;
}

- nr_max_batch = cache->nr_max_batched_migration;
+ nr_max_batch = ACCESS_ONCE(cache->nr_max_batched_migration);
if (cache->nr_cur_batched_migration != nr_max_batch) {
/*
* Request buffer for nr_max_batch size.
@@ -411,10 +422,17 @@ int migrate_proc(void *data)
for (i = 1; i <= nr_mig; i++) {
seg = get_segment_header_by_id(
cache,
- cache->last_migrated_segment_id + i);
+ atomic64_read(&cache->last_migrated_segment_id) + i);
list_add_tail(&seg->migrate_list, &cache->migrate_list);
}

+ /*
+ * We insert write barrier here
+ * to make sure that migrate list
+ * is complete.
+ */
+ smp_wmb();
+
migrate_linked_segments(cache);

/*
@@ -422,7 +440,7 @@ int migrate_proc(void *data)
* Only line of code changes
* last_migrate_segment_id during runtime.
*/
- cache->last_migrated_segment_id += nr_mig;
+ atomic64_add(nr_mig, &cache->last_migrated_segment_id);

list_for_each_entry_safe(seg, tmp,
&cache->migrate_list,
@@ -449,6 +467,7 @@ void wait_for_migration(struct wb_cache *cache, u64 id)
* immediately.
*/
cache->urge_migrate = true;
+ wake_up_process(cache->migrate_daemon);
wait_for_completion(&seg->migrate_done);
cache->urge_migrate = false;
}
@@ -466,14 +485,16 @@ int modulator_proc(void *data)

while (!kthread_should_stop()) {

+ wait_on_blockup();
+
new = jiffies_to_msecs(part_stat_read(hd, io_ticks));

- if (!cache->enable_migration_modulator)
+ if (!ACCESS_ONCE(cache->enable_migration_modulator))
goto modulator_update;

- util = (100 * (new - old)) / 1000;
+ util = div_u64(100 * (new - old), 1000);

- if (util < wb->migrate_threshold)
+ if (util < ACCESS_ONCE(wb->migrate_threshold))
cache->allow_migrate = true;
else
cache->allow_migrate = false;
@@ -490,15 +511,17 @@ modulator_update:

static void update_superblock_record(struct wb_cache *cache)
{
+ int r;
+ struct wb_device *wb = cache->wb;
struct superblock_record_device o;
void *buf;
struct dm_io_request io_req;
struct dm_io_region region;

o.last_migrated_segment_id =
- cpu_to_le64(cache->last_migrated_segment_id);
+ cpu_to_le64(atomic64_read(&cache->last_migrated_segment_id));

- buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO | __GFP_ZERO);
+ buf = mempool_alloc(cache->buf_1_pool, GFP_NOIO | __GFP_ZERO);
memcpy(buf, &o, sizeof(o));

io_req = (struct dm_io_request) {
@@ -513,18 +536,22 @@ static void update_superblock_record(struct wb_cache *cache)
.sector = (1 << 11) - 1,
.count = 1,
};
- dm_safe_io_retry(&io_req, 1, &region, true);
- kfree(buf);
+ RETRY(dm_safe_io(&io_req, 1, &region, NULL, false));
+ mempool_free(buf, cache->buf_1_pool);
}

int recorder_proc(void *data)
{
struct wb_cache *cache = data;
+ struct wb_device *wb = cache->wb;
unsigned long intvl;

while (!kthread_should_stop()) {
+
+ wait_on_blockup();
+
/* sec -> ms */
- intvl = cache->update_record_interval * 1000;
+ intvl = ACCESS_ONCE(cache->update_record_interval) * 1000;

if (!intvl) {
schedule_timeout_interruptible(msecs_to_jiffies(1000));
@@ -542,12 +569,17 @@ int recorder_proc(void *data)

int sync_proc(void *data)
{
+ int r;
struct wb_cache *cache = data;
+ struct wb_device *wb = cache->wb;
unsigned long intvl;

while (!kthread_should_stop()) {
+
+ wait_on_blockup();
+
/* sec -> ms */
- intvl = cache->sync_interval * 1000;
+ intvl = ACCESS_ONCE(cache->sync_interval) * 1000;

if (!intvl) {
schedule_timeout_interruptible(msecs_to_jiffies(1000));
@@ -555,7 +587,8 @@ int sync_proc(void *data)
}

flush_current_buffer(cache);
- blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);
+
+ RETRY(blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL));

schedule_timeout_interruptible(msecs_to_jiffies(intvl));
}
diff --git a/Driver/dm-writeboost-metadata.c b/Driver/dm-writeboost-metadata.c
index 77ffb28..a6bd584 100644
--- a/Driver/dm-writeboost-metadata.c
+++ b/Driver/dm-writeboost-metadata.c
@@ -16,29 +16,31 @@ struct part {

struct bigarray {
struct part *parts;
- size_t nr_elems;
- size_t elemsize;
+ u64 nr_elems;
+ u32 elemsize;
};

#define ALLOC_SIZE (1 << 16)
-static size_t nr_elems_in_part(struct bigarray *arr)
+static u32 nr_elems_in_part(struct bigarray *arr)
{
- return ALLOC_SIZE / arr->elemsize;
+ return div_u64(ALLOC_SIZE, arr->elemsize);
};

-static size_t nr_parts(struct bigarray *arr)
+static u64 nr_parts(struct bigarray *arr)
{
- return dm_div_up(arr->nr_elems, nr_elems_in_part(arr));
+ u64 a = arr->nr_elems;
+ u32 b = nr_elems_in_part(arr);
+ return div_u64(a + b - 1, b);
}

-static struct bigarray *make_bigarray(size_t elemsize, size_t nr_elems)
+static struct bigarray *make_bigarray(u32 elemsize, u64 nr_elems)
{
- size_t i, j;
+ u64 i, j;
struct part *part;

struct bigarray *arr = kmalloc(sizeof(*arr), GFP_KERNEL);
if (!arr) {
- WBERR();
+ WBERR("failed to alloc arr");
return NULL;
}

@@ -46,7 +48,7 @@ static struct bigarray *make_bigarray(size_t elemsize, size_t nr_elems)
arr->nr_elems = nr_elems;
arr->parts = kmalloc(sizeof(struct part) * nr_parts(arr), GFP_KERNEL);
if (!arr->parts) {
- WBERR();
+ WBERR("failed to alloc parts");
goto bad_alloc_parts;
}

@@ -54,7 +56,7 @@ static struct bigarray *make_bigarray(size_t elemsize, size_t nr_elems)
part = arr->parts + i;
part->memory = kmalloc(ALLOC_SIZE, GFP_KERNEL);
if (!part->memory) {
- WBERR();
+ WBERR("failed to alloc part memory");
for (j = 0; j < i; j++) {
part = arr->parts + j;
kfree(part->memory);
@@ -82,11 +84,11 @@ static void kill_bigarray(struct bigarray *arr)
kfree(arr);
}

-static void *bigarray_at(struct bigarray *arr, size_t i)
+static void *bigarray_at(struct bigarray *arr, u64 i)
{
- size_t n = nr_elems_in_part(arr);
- size_t j = i / n;
- size_t k = i % n;
+ u32 n = nr_elems_in_part(arr);
+ u32 k;
+ u64 j = div_u64_rem(i, n, &k);
struct part *part = arr->parts + j;
return part->memory + (arr->elemsize * k);
}
@@ -104,18 +106,18 @@ static void *bigarray_at(struct bigarray *arr, size_t i)
/*
* Get the in-core metablock of the given index.
*/
-static struct metablock *mb_at(struct wb_cache *cache, cache_nr idx)
+static struct metablock *mb_at(struct wb_cache *cache, u32 idx)
{
- u64 seg_idx = idx / cache->nr_caches_inseg;
+ u32 idx_inseg;
+ u32 seg_idx = div_u64_rem(idx, cache->nr_caches_inseg, &idx_inseg);
struct segment_header *seg =
bigarray_at(cache->segment_header_array, seg_idx);
- cache_nr idx_inseg = idx % cache->nr_caches_inseg;
return seg->mb_array + idx_inseg;
}

static void mb_array_empty_init(struct wb_cache *cache)
{
- size_t i;
+ u32 i;
for (i = 0; i < cache->nr_caches; i++) {
struct metablock *mb = mb_at(cache, i);
INIT_HLIST_NODE(&mb->ht_list);
@@ -126,34 +128,35 @@ static void mb_array_empty_init(struct wb_cache *cache)
}

static sector_t calc_segment_header_start(struct wb_cache *cache,
- u64 segment_idx)
+ u32 segment_idx)
{
return (1 << 11) + (1 << cache->segment_size_order) * (segment_idx);
}

static u32 calc_segment_lap(struct wb_cache *cache, u64 segment_id)
{
- u32 a = (segment_id - 1) / cache->nr_segments;
+ u64 a = div_u64(segment_id - 1, cache->nr_segments);
return a + 1;
};

-static u64 calc_nr_segments(struct dm_dev *dev, struct wb_cache *cache)
+static u32 calc_nr_segments(struct dm_dev *dev, struct wb_cache *cache)
{
sector_t devsize = dm_devsize(dev);
- return (devsize - (1 << 11)) / (1 << cache->segment_size_order);
+ return div_u64(devsize - (1 << 11), 1 << cache->segment_size_order);
}

sector_t calc_mb_start_sector(struct wb_cache *cache,
struct segment_header *seg,
- cache_nr mb_idx)
+ u32 mb_idx)
{
- size_t k = 1 + (mb_idx % cache->nr_caches_inseg);
- return seg->start_sector + (k << 3);
+ u32 idx;
+ div_u64_rem(mb_idx, cache->nr_caches_inseg, &idx);
+ return seg->start_sector + ((1 + idx) << 3);
}

-bool is_on_buffer(struct wb_cache *cache, cache_nr mb_idx)
+bool is_on_buffer(struct wb_cache *cache, u32 mb_idx)
{
- cache_nr start = cache->current_seg->start_idx;
+ u32 start = cache->current_seg->start_idx;
if (mb_idx < start)
return false;

@@ -170,15 +173,14 @@ bool is_on_buffer(struct wb_cache *cache, cache_nr mb_idx)
struct segment_header *get_segment_header_by_id(struct wb_cache *cache,
u64 segment_id)
{
- struct segment_header *r =
- bigarray_at(cache->segment_header_array,
- (segment_id - 1) % cache->nr_segments);
- return r;
+ u32 idx;
+ div_u64_rem(segment_id - 1, cache->nr_segments, &idx);
+ return bigarray_at(cache->segment_header_array, idx);
}

static int __must_check init_segment_header_array(struct wb_cache *cache)
{
- u64 segment_idx, nr_segments = cache->nr_segments;
+ u32 segment_idx, nr_segments = cache->nr_segments;
cache->segment_header_array =
make_bigarray(sizeof_segment_header(cache), nr_segments);
if (!cache->segment_header_array) {
@@ -225,9 +227,8 @@ static void free_segment_header_array(struct wb_cache *cache)
*/
static int __must_check ht_empty_init(struct wb_cache *cache)
{
- cache_nr idx;
- size_t i;
- size_t nr_heads;
+ u32 idx;
+ size_t i, nr_heads;
struct bigarray *arr;

cache->htsize = cache->nr_caches;
@@ -266,7 +267,9 @@ static void free_ht(struct wb_cache *cache)

struct ht_head *ht_get_head(struct wb_cache *cache, struct lookup_key *key)
{
- return bigarray_at(cache->htable, key->sector % cache->htsize);
+ u32 idx;
+ div_u64_rem(key->sector, cache->htsize, &idx);
+ return bigarray_at(cache->htable, idx);
}

static bool mb_hit(struct metablock *mb, struct lookup_key *key)
@@ -328,12 +331,14 @@ void discard_caches_inseg(struct wb_cache *cache, struct segment_header *seg)

/*----------------------------------------------------------------*/

-static int read_superblock_header(struct superblock_header_device *sup,
+static int read_superblock_header(struct wb_cache *cache,
+ struct superblock_header_device *sup,
struct dm_dev *dev)
{
int r = 0;
struct dm_io_request io_req_sup;
struct dm_io_region region_sup;
+ struct wb_device *wb = cache->wb;

void *buf = kmalloc(1 << SECTOR_SHIFT, GFP_KERNEL);
if (!buf) {
@@ -358,7 +363,7 @@ static int read_superblock_header(struct superblock_header_device *sup,
kfree(buf);

if (r) {
- WBERR();
+ WBERR("io failed in reading superblock header");
return r;
}

@@ -378,7 +383,7 @@ int __must_check audit_cache_device(struct dm_dev *dev, struct wb_cache *cache,
{
int r = 0;
struct superblock_header_device sup;
- r = read_superblock_header(&sup, dev);
+ r = read_superblock_header(cache, &sup, dev);
if (r) {
WBERR("read superblock header failed");
return r;
@@ -407,6 +412,7 @@ int __must_check audit_cache_device(struct dm_dev *dev, struct wb_cache *cache,
static int format_superblock_header(struct dm_dev *dev, struct wb_cache *cache)
{
int r = 0;
+ struct wb_device *wb = cache->wb;
struct dm_io_request io_req_sup;
struct dm_io_region region_sup;

@@ -465,7 +471,8 @@ static void format_segmd_endio(unsigned long error, void *__context)
*/
int __must_check format_cache_device(struct dm_dev *dev, struct wb_cache *cache)
{
- u64 i, nr_segments = calc_nr_segments(dev, cache);
+ u32 i, nr_segments = calc_nr_segments(dev, cache);
+ struct wb_device *wb = cache->wb;
struct format_segmd_context context;
struct dm_io_request io_req_sup;
struct dm_io_region region_sup;
@@ -569,6 +576,7 @@ read_superblock_record(struct superblock_record_device *record,
struct wb_cache *cache)
{
int r = 0;
+ struct wb_device *wb = cache->wb;
struct dm_io_request io_req;
struct dm_io_region region;

@@ -590,7 +598,7 @@ read_superblock_record(struct superblock_record_device *record,
.sector = (1 << 11) - 1,
.count = 1,
};
- r = dm_safe_io(&io_req, 1, &region, NULL, true);
+ r = dm_safe_io(&io_req, 1, &region, NULL, false);

kfree(buf);

@@ -606,9 +614,10 @@ read_superblock_record(struct superblock_record_device *record,

static int __must_check
read_segment_header_device(struct segment_header_device *dest,
- struct wb_cache *cache, size_t segment_idx)
+ struct wb_cache *cache, u32 segment_idx)
{
int r = 0;
+ struct wb_device *wb = cache->wb;
struct dm_io_request io_req;
struct dm_io_region region;
void *buf = kmalloc(1 << 12, GFP_KERNEL);
@@ -651,15 +660,16 @@ void prepare_segment_header_device(struct segment_header_device *dest,
struct wb_cache *cache,
struct segment_header *src)
{
- cache_nr i;
u8 left, right;
+ u32 i, tmp32;

dest->global_id = cpu_to_le64(src->global_id);
dest->length = src->length;
dest->lap = cpu_to_le32(calc_segment_lap(cache, src->global_id));

left = src->length - 1;
- right = (cache->cursor) % cache->nr_caches_inseg;
+ div_u64_rem(cache->cursor, cache->nr_caches_inseg, &tmp32);
+ right = tmp32;
BUG_ON(left != right);

for (i = 0; i < src->length; i++) {
@@ -679,7 +689,7 @@ void prepare_segment_header_device(struct segment_header_device *dest,
static void update_by_segment_header_device(struct wb_cache *cache,
struct segment_header_device *src)
{
- cache_nr i;
+ u32 i;
struct segment_header *seg =
get_segment_header_by_id(cache, src->global_id);
seg->length = src->length;
@@ -739,10 +749,9 @@ static int __must_check recover_cache(struct wb_cache *cache)
int r = 0;
struct segment_header_device *header;
struct segment_header *seg;
- u64 i, j,
- max_id, oldest_id, last_flushed_id, init_segment_id,
- oldest_idx, nr_segments = cache->nr_segments,
+ u64 max_id, oldest_id, last_flushed_id, init_segment_id,
header_id, record_id;
+ u32 i, j, oldest_idx, nr_segments = cache->nr_segments;

struct superblock_record_device uninitialized_var(record);
r = read_superblock_record(&record, cache);
@@ -815,7 +824,7 @@ static int __must_check recover_cache(struct wb_cache *cache)
* last_flushed init_seg migrated last_migrated flushed
*/
for (i = oldest_idx; i < (nr_segments + oldest_idx); i++) {
- j = i % nr_segments;
+ div_u64_rem(i, nr_segments, &j);
r = read_segment_header_device(header, cache, j);
if (r) {
WBERR();
@@ -871,14 +880,15 @@ setup_init_segment:
seg->global_id = init_segment_id;
atomic_set(&seg->nr_inflight_ios, 0);

- cache->last_flushed_segment_id = seg->global_id - 1;
+ atomic64_set(&cache->last_flushed_segment_id,
+ seg->global_id - 1);

- cache->last_migrated_segment_id =
- cache->last_flushed_segment_id > cache->nr_segments ?
- cache->last_flushed_segment_id - cache->nr_segments : 0;
+ atomic64_set(&cache->last_migrated_segment_id,
+ atomic64_read(&cache->last_flushed_segment_id) > cache->nr_segments ?
+ atomic64_read(&cache->last_flushed_segment_id) - cache->nr_segments : 0);

- if (record_id > cache->last_migrated_segment_id)
- cache->last_migrated_segment_id = record_id;
+ if (record_id > atomic64_read(&cache->last_migrated_segment_id))
+ atomic64_set(&cache->last_migrated_segment_id, record_id);

wait_for_migration(cache, seg->global_id);

@@ -903,9 +913,14 @@ static int __must_check init_rambuf_pool(struct wb_cache *cache)
size_t i, j;
struct rambuffer *rambuf;

- /* tmp var to avoid 80 cols */
- size_t nr = (RAMBUF_POOL_ALLOCATED * 1000000) /
- (1 << (cache->segment_size_order + SECTOR_SHIFT));
+ u32 nr = div_u64(cache->rambuf_pool_amount * 1000,
+ 1 << (cache->segment_size_order + SECTOR_SHIFT));
+
+ if (!nr) {
+ WBERR("rambuf must be allocated at least one");
+ return -EINVAL;
+ }
+
cache->nr_rambuf_pool = nr;
cache->rambuf_pool = kmalloc(sizeof(struct rambuffer) * nr,
GFP_KERNEL);
@@ -1024,24 +1039,44 @@ int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)
/*
* (i) Harmless Initializations
*/
+ cache->buf_1_pool = mempool_create_kmalloc_pool(16, 1 << SECTOR_SHIFT);
+ if (!cache->buf_1_pool) {
+ r = -ENOMEM;
+ WBERR("couldn't alloc 1 sector pool");
+ goto bad_buf_1_pool;
+ }
+ cache->buf_8_pool = mempool_create_kmalloc_pool(16, 8 << SECTOR_SHIFT);
+ if (!cache->buf_8_pool) {
+ r = -ENOMEM;
+ WBERR("couldn't alloc 8 sector pool");
+ goto bad_buf_8_pool;
+ }
+
r = init_rambuf_pool(cache);
if (r) {
- WBERR();
+ WBERR("couldn't alloc rambuf pool");
goto bad_init_rambuf_pool;
}
+ cache->flush_job_pool = mempool_create_kmalloc_pool(cache->nr_rambuf_pool,
+ sizeof(struct flush_job));
+ if (!cache->flush_job_pool) {
+ r = -ENOMEM;
+ WBERR("couldn't alloc flush job pool");
+ goto bad_flush_job_pool;
+ }

/* Select arbitrary one as the initial rambuffer. */
cache->current_rambuf = cache->rambuf_pool + 0;

r = init_segment_header_array(cache);
if (r) {
- WBERR();
+ WBERR("couldn't alloc segment header array");
goto bad_alloc_segment_header_array;
}

r = ht_empty_init(cache);
if (r) {
- WBERR();
+ WBERR("couldn't alloc hashtable");
goto bad_alloc_ht;
}

@@ -1077,7 +1112,7 @@ int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)

r = recover_cache(cache);
if (r) {
- WBERR();
+ WBERR("recovering cache metadata failed");
goto bad_recover;
}

@@ -1091,7 +1126,6 @@ int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)
/* Flush Daemon */
spin_lock_init(&cache->flush_queue_lock);
INIT_LIST_HEAD(&cache->flush_queue);
- init_waitqueue_head(&cache->flush_wait_queue);
CREATE_DAEMON(flush);

/* Deferred ACK for barrier writes */
@@ -1140,9 +1174,14 @@ bad_alloc_migrate_buffer:
bad_alloc_ht:
free_segment_header_array(cache);
bad_alloc_segment_header_array:
+ mempool_destroy(cache->flush_job_pool);
+bad_flush_job_pool:
free_rambuf_pool(cache);
bad_init_rambuf_pool:
- kfree(cache);
+ mempool_destroy(cache->buf_8_pool);
+bad_buf_8_pool:
+ mempool_destroy(cache->buf_1_pool);
+bad_buf_1_pool:
return r;
}

diff --git a/Driver/dm-writeboost-metadata.h b/Driver/dm-writeboost-metadata.h
index 2e59041..709dfda 100644
--- a/Driver/dm-writeboost-metadata.h
+++ b/Driver/dm-writeboost-metadata.h
@@ -12,8 +12,8 @@
struct segment_header *get_segment_header_by_id(struct wb_cache *,
u64 segment_id);
sector_t calc_mb_start_sector(struct wb_cache *,
- struct segment_header *, cache_nr mb_idx);
-bool is_on_buffer(struct wb_cache *, cache_nr mb_idx);
+ struct segment_header *, u32 mb_idx);
+bool is_on_buffer(struct wb_cache *, u32 mb_idx);

/*----------------------------------------------------------------*/

diff --git a/Driver/dm-writeboost-target.c b/Driver/dm-writeboost-target.c
index 4b5b7aa..8e40f15 100644
--- a/Driver/dm-writeboost-target.c
+++ b/Driver/dm-writeboost-target.c
@@ -13,23 +13,6 @@

/*----------------------------------------------------------------*/

-void *do_kmalloc_retry(size_t size, gfp_t flags, const char *caller)
-{
- size_t count = 0;
- void *p;
-
-retry_alloc:
- p = kmalloc(size, flags);
- if (!p) {
- count++;
- WBWARN("%s() allocation failed size:%lu, count:%lu",
- caller, size, count);
- schedule_timeout_interruptible(msecs_to_jiffies(1));
- goto retry_alloc;
- }
- return p;
-}
-
struct safe_io {
struct work_struct work;
int err;
@@ -52,6 +35,7 @@ static void safe_io_proc(struct work_struct *work)
* @thread run this operation in other thread to avoid deadlock.
*/
int dm_safe_io_internal(
+ struct wb_device *wb,
struct dm_io_request *io_req,
unsigned num_regions, struct dm_io_region *regions,
unsigned long *err_bits, bool thread, const char *caller)
@@ -68,6 +52,11 @@ int dm_safe_io_internal(

INIT_WORK_ONSTACK(&io.work, safe_io_proc);

+ /*
+ * Don't go on submitting I/O:
+ * this minimizes the risk of breaking the data.
+ */
+ wait_on_blockup();
queue_work(safe_io_wq, &io.work);
flush_work(&io.work);

@@ -75,6 +64,7 @@ int dm_safe_io_internal(
if (err_bits)
*err_bits = io.err_bits;
} else {
+ wait_on_blockup();
err = dm_io(io_req, num_regions, regions, err_bits);
}

@@ -87,45 +77,15 @@ int dm_safe_io_internal(
eb = (~(unsigned long)0);
else
eb = *err_bits;
- WBERR("%s() io error err(%d, %lu), rw(%d), sector(%lu), dev(%u:%u)",
+ WBERR("%s() I/O error err(%d, %lu), rw(%d), sector(%llu), dev(%u:%u)",
caller, err, eb,
- io_req->bi_rw, regions->sector,
+ io_req->bi_rw, (unsigned long long) regions->sector,
MAJOR(dev), MINOR(dev));
}

return err;
}

-void dm_safe_io_retry_internal(
- struct dm_io_request *io_req,
- unsigned num_regions, struct dm_io_region *regions,
- bool thread, const char *caller)
-{
- int err, count = 0;
- unsigned long err_bits;
- dev_t dev;
-
-retry_io:
- err_bits = 0;
- err = dm_safe_io_internal(io_req, num_regions, regions, &err_bits,
- thread, caller);
-
- dev = regions->bdev->bd_dev;
- if (err || err_bits) {
- count++;
- WBWARN("%s() io error count(%d)", caller, count);
- schedule_timeout_interruptible(msecs_to_jiffies(1000));
- goto retry_io;
- }
-
- if (count) {
- WBWARN("%s() recover from io error rw(%d), sector(%lu), dev(%u:%u)",
- caller,
- io_req->bi_rw, regions->sector,
- MAJOR(dev), MINOR(dev));
- }
-}
-
sector_t dm_devsize(struct dm_dev *dev)
{
return i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
@@ -165,12 +125,13 @@ static void queue_flushing(struct wb_cache *cache)
bool empty;
struct rambuffer *next_rambuf;
size_t n1 = 0, n2 = 0;
+ u32 tmp32;
u64 next_id;

while (atomic_read(&current_seg->nr_inflight_ios)) {
n1++;
if (n1 == 100)
- WBWARN();
+ WBWARN("inflight ios remained for current seg");
schedule_timeout_interruptible(msecs_to_jiffies(1));
}

@@ -180,7 +141,7 @@ static void queue_flushing(struct wb_cache *cache)
INIT_COMPLETION(current_seg->migrate_done);
INIT_COMPLETION(current_seg->flush_done);

- job = kmalloc_retry(sizeof(*job), GFP_NOIO);
+ job = mempool_alloc(cache->flush_job_pool, GFP_NOIO);
INIT_LIST_HEAD(&job->flush_queue);
job->seg = current_seg;
job->rambuf = cache->current_rambuf;
@@ -189,12 +150,21 @@ static void queue_flushing(struct wb_cache *cache)
bio_list_merge(&job->barrier_ios, &cache->barrier_ios);
bio_list_init(&cache->barrier_ios);

+ /*
+ * Queuing an incomplete flush job
+ * will let the flush daemon go wild.
+ * We put a write barrier to make sure
+ * that the job is completely initialized.
+ */
+ smp_wmb();
+
spin_lock_irqsave(&cache->flush_queue_lock, flags);
empty = list_empty(&cache->flush_queue);
list_add_tail(&job->flush_queue, &cache->flush_queue);
spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
+
if (empty)
- wake_up_interruptible(&cache->flush_wait_queue);
+ wake_up_process(cache->flush_daemon);

next_id = current_seg->global_id + 1;
new_seg = get_segment_header_by_id(cache, next_id);
@@ -203,7 +173,7 @@ static void queue_flushing(struct wb_cache *cache)
while (atomic_read(&new_seg->nr_inflight_ios)) {
n2++;
if (n2 == 100)
- WBWARN();
+ WBWARN("inflight ios remained for new seg");
schedule_timeout_interruptible(msecs_to_jiffies(1));
}

@@ -217,7 +187,8 @@ static void queue_flushing(struct wb_cache *cache)
cache->cursor = current_seg->start_idx + (cache->nr_caches_inseg - 1);
new_seg->length = 0;

- next_rambuf = cache->rambuf_pool + (next_id % cache->nr_rambuf_pool);
+ div_u64_rem(next_id, cache->nr_rambuf_pool, &tmp32);
+ next_rambuf = cache->rambuf_pool + tmp32;
wait_for_completion(&next_rambuf->done);
INIT_COMPLETION(next_rambuf->done);

@@ -255,12 +226,14 @@ static void queue_current_buffer(struct wb_cache *cache)
void flush_current_buffer(struct wb_cache *cache)
{
struct segment_header *old_seg;
+ u32 tmp32;

mutex_lock(&cache->io_lock);
old_seg = cache->current_seg;

queue_current_buffer(cache);
- cache->cursor = (cache->cursor + 1) % cache->nr_caches;
+ div_u64_rem(cache->cursor + 1, cache->nr_caches, &tmp32);
+ cache->cursor = tmp32;
cache->current_seg->length = 1;
mutex_unlock(&cache->io_lock);

@@ -345,13 +318,14 @@ static void clear_stat(struct wb_cache *cache)
static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
struct metablock *mb, u8 dirty_bits, bool thread)
{
+ int r;
struct wb_device *wb = cache->wb;

if (!dirty_bits)
return;

if (dirty_bits == 255) {
- void *buf = kmalloc_retry(1 << 12, GFP_NOIO);
+ void *buf = mempool_alloc(cache->buf_8_pool, GFP_NOIO);
struct dm_io_request io_req_r, io_req_w;
struct dm_io_region region_r, region_w;

@@ -367,8 +341,7 @@ static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
.sector = calc_mb_start_sector(cache, seg, mb->idx),
.count = (1 << 3),
};
-
- dm_safe_io_retry(&io_req_r, 1, &region_r, thread);
+ RETRY(dm_safe_io(&io_req_r, 1, &region_r, NULL, thread));

io_req_w = (struct dm_io_request) {
.client = wb_io_client,
@@ -382,11 +355,11 @@ static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
.sector = mb->sector,
.count = (1 << 3),
};
- dm_safe_io_retry(&io_req_w, 1, &region_w, thread);
+ RETRY(dm_safe_io(&io_req_w, 1, &region_w, NULL, thread));

- kfree(buf);
+ mempool_free(buf, cache->buf_8_pool);
} else {
- void *buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO);
+ void *buf = mempool_alloc(cache->buf_1_pool, GFP_NOIO);
size_t i;
for (i = 0; i < 8; i++) {
bool bit_on = dirty_bits & (1 << i);
@@ -411,7 +384,7 @@ static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
.sector = src,
.count = 1,
};
- dm_safe_io_retry(&io_req_r, 1, &region_r, thread);
+ RETRY(dm_safe_io(&io_req_r, 1, &region_r, NULL, thread));

io_req_w = (struct dm_io_request) {
.client = wb_io_client,
@@ -425,9 +398,9 @@ static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
.sector = mb->sector + 1 * i,
.count = 1,
};
- dm_safe_io_retry(&io_req_w, 1, &region_w, thread);
+ RETRY(dm_safe_io(&io_req_w, 1, &region_w, NULL, thread));
}
- kfree(buf);
+ mempool_free(buf, cache->buf_1_pool);
}
}

@@ -438,12 +411,17 @@ static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
static void migrate_buffered_mb(struct wb_cache *cache,
struct metablock *mb, u8 dirty_bits)
{
+ int r;
struct wb_device *wb = cache->wb;
+ u8 i;
+ sector_t offset;
+ void *buf;

- u8 i, k = 1 + (mb->idx % cache->nr_caches_inseg);
- sector_t offset = (k << 3);
+ u32 k;
+ div_u64_rem(mb->idx, cache->nr_caches_inseg, &k);
+ offset = ((k + 1) << 3);

- void *buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO);
+ buf = mempool_alloc(cache->buf_1_pool, GFP_NOIO);
for (i = 0; i < 8; i++) {
struct dm_io_request io_req;
struct dm_io_region region;
@@ -473,9 +451,9 @@ static void migrate_buffered_mb(struct wb_cache *cache,
.count = 1,
};

- dm_safe_io_retry(&io_req, 1, &region, true);
+ RETRY(dm_safe_io(&io_req, 1, &region, NULL, true));
}
- kfree(buf);
+ mempool_free(buf, cache->buf_1_pool);
}

static void bio_remap(struct bio *bio, struct dm_dev *dev, sector_t sector)
@@ -487,7 +465,7 @@ static void bio_remap(struct bio *bio, struct dm_dev *dev, sector_t sector)
static sector_t calc_cache_alignment(struct wb_cache *cache,
sector_t bio_sector)
{
- return (bio_sector / (1 << 3)) * (1 << 3);
+ return div_u64(bio_sector, 1 << 3) * (1 << 3);
}

static int writeboost_map(struct dm_target *ti, struct bio *bio
@@ -502,13 +480,15 @@ static int writeboost_map(struct dm_target *ti, struct bio *bio
#if LINUX_VERSION_CODE >= PER_BIO_VERSION
struct per_bio_data *map_context;
#endif
- sector_t bio_count, bio_offset, s;
+ sector_t bio_count, s;
+ u8 bio_offset;
+ u32 tmp32;
bool bio_fullsize, found, on_buffer,
refresh_segment, b;
int rw;
struct lookup_key key;
struct ht_head *head;
- cache_nr update_mb_idx, idx_inseg;
+ u32 update_mb_idx;
size_t start;
void *data;

@@ -516,6 +496,9 @@ static int writeboost_map(struct dm_target *ti, struct bio *bio
struct wb_cache *cache = wb->cache;
struct dm_dev *orig = wb->device;

+ if (ACCESS_ONCE(wb->blockup))
+ return -EIO;
+
#if LINUX_VERSION_CODE >= PER_BIO_VERSION
map_context = dm_per_bio_data(bio, ti->per_bio_data_size);
#endif
@@ -552,7 +535,8 @@ static int writeboost_map(struct dm_target *ti, struct bio *bio

bio_count = bio->bi_size >> SECTOR_SHIFT;
bio_fullsize = (bio_count == (1 << 3));
- bio_offset = bio->bi_sector % (1 << 3);
+ div_u64_rem(bio->bi_sector, 1 << 3, &tmp32);
+ bio_offset = tmp32;

rw = bio_data_dir(bio);

@@ -580,8 +564,8 @@ static int writeboost_map(struct dm_target *ti, struct bio *bio
mutex_lock(&cache->io_lock);
mb = ht_lookup(cache, head, &key);
if (mb) {
- seg = ((void *) mb) - (mb->idx % cache->nr_caches_inseg) *
- sizeof(struct metablock)
+ div_u64_rem(mb->idx, cache->nr_caches_inseg, &tmp32);
+ seg = ((void *) mb) - tmp32 * sizeof(struct metablock)
- sizeof(struct segment_header);
atomic_inc(&seg->nr_inflight_ios);
}
@@ -723,12 +707,14 @@ write_not_found:
* We must flush the current segment and
* get the new one.
*/
- refresh_segment = !((cache->cursor + 1) % cache->nr_caches_inseg);
+ div_u64_rem(cache->cursor + 1, cache->nr_caches_inseg, &tmp32);
+ refresh_segment = !tmp32;

if (refresh_segment)
queue_current_buffer(cache);

- cache->cursor = (cache->cursor + 1) % cache->nr_caches;
+ div_u64_rem(cache->cursor + 1, cache->nr_caches, &tmp32);
+ cache->cursor = tmp32;

/*
* update_mb_idx is the cache line index to update.
@@ -738,7 +724,8 @@ write_not_found:
seg = cache->current_seg;
atomic_inc(&seg->nr_inflight_ios);

- new_mb = seg->mb_array + (update_mb_idx % cache->nr_caches_inseg);
+ div_u64_rem(update_mb_idx, cache->nr_caches_inseg, &tmp32);
+ new_mb = seg->mb_array + tmp32;
new_mb->dirty_bits = 0;
ht_register(cache, head, &key, new_mb);
mutex_unlock(&cache->io_lock);
@@ -747,13 +734,12 @@ write_not_found:

write_on_buffer:
;
- idx_inseg = update_mb_idx % cache->nr_caches_inseg;
-
/*
* The first 4KB of the segment is
* used for metadata.
*/
- s = (idx_inseg + 1) << 3;
+ div_u64_rem(update_mb_idx, cache->nr_caches_inseg, &tmp32);
+ s = (tmp32 + 1) << 3;

b = false;
lockseg(seg, flags);
@@ -769,7 +755,7 @@ write_on_buffer:
u8 i;
u8 acc_bits = 0;
s += bio_offset;
- for (i = bio_offset; i < (bio_offset+bio_count); i++)
+ for (i = bio_offset; i < (bio_offset + bio_count); i++)
acc_bits += (1 << i);

mb->dirty_bits |= acc_bits;
@@ -827,8 +813,15 @@ static int writeboost_end_io(struct dm_target *ti, struct bio *bio, int error
return 0;
}

+#define ARG_EXIST(n)\
+ if (argc <= (n)) {\
+ goto exit_parse_arg;\
+ }
+
/*
- * <backing dev> <cache dev> <segment size order>
+ * <backing dev> <cache dev>
+ * [segment size order]
+ * [rambuf pool amount]
*/
static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
@@ -842,7 +835,7 @@ static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 6, 0)
r = dm_set_target_max_io_len(ti, (1 << 3));
if (r) {
- WBERR();
+ WBERR("settting max io len failed");
return r;
}
#else
@@ -862,6 +855,8 @@ static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
*/
wb->migrate_threshold = 70;

+ init_waitqueue_head(&wb->blockup_wait_queue);
+ wb->blockup = false;

cache = kzalloc(sizeof(*cache), GFP_KERNEL);
if (!cache) {
@@ -888,6 +883,10 @@ static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto bad_get_device_cache;
}

+ /* Optional Parameters */
+
+ cache->segment_size_order = 7;
+ ARG_EXIST(2);
if (kstrtoul(argv[2], 10, &tmp)) {
r = -EINVAL;
goto bad_segment_size_order;
@@ -901,6 +900,16 @@ static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)

cache->segment_size_order = tmp;

+ cache->rambuf_pool_amount = 2048;
+ ARG_EXIST(3);
+ if (kstrtoul(argv[3], 10, &tmp)) {
+ r = -EINVAL;
+ goto bad_rambuf_pool_amount;
+ }
+ cache->rambuf_pool_amount = tmp;
+
+exit_parse_arg:
+
r = audit_cache_device(cachedev, cache, &need_format, &allow_format);
if (r) {
WBERR("audit cache device fails err(%d)", r);
@@ -930,7 +939,7 @@ static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)

r = resume_cache(cache, cachedev);
if (r) {
- WBERR("%d", r);
+ WBERR("failed to resume cache err(%d)", r);
goto bad_resume_cache;
}
clear_stat(cache);
@@ -957,6 +966,7 @@ static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
bad_resume_cache:
bad_format_cache:
bad_audit_cache:
+bad_rambuf_pool_amount:
bad_segment_size_order:
dm_put_device(ti, cachedev);
bad_get_device_cache:
@@ -1000,6 +1010,14 @@ static int writeboost_message(struct dm_target *ti, unsigned argc, char **argv)
if (kstrtoul(argv[1], 10, &tmp))
return -EINVAL;

+ if (!strcasecmp(cmd, "blockup")) {
+ if (tmp > 1)
+ return -EINVAL;
+ wb->blockup = tmp;
+ wake_up(&wb->blockup_wait_queue);
+ return 0;
+ }
+
if (!strcasecmp(cmd, "allow_migrate")) {
if (tmp > 1)
return -EINVAL;
@@ -1101,31 +1119,39 @@ writeboost_status(
DMEMIT("%llu %llu %llu %llu %llu %u ",
(long long unsigned int)
atomic64_read(&wb->nr_dirty_caches),
- (long long unsigned int) cache->nr_segments,
- (long long unsigned int) cache->last_migrated_segment_id,
- (long long unsigned int) cache->last_flushed_segment_id,
- (long long unsigned int) cache->current_seg->global_id,
- (unsigned int) cache->cursor);
+ (long long unsigned int)
+ cache->nr_segments,
+ (long long unsigned int)
+ atomic64_read(&cache->last_migrated_segment_id),
+ (long long unsigned int)
+ atomic64_read(&cache->last_flushed_segment_id),
+ (long long unsigned int)
+ cache->current_seg->global_id,
+ (unsigned int)
+ cache->cursor);

for (i = 0; i < STATLEN; i++) {
atomic64_t *v = &cache->stat[i];
- DMEMIT("%lu ", atomic64_read(v));
+ DMEMIT("%llu ", (unsigned long long) atomic64_read(v));
}

- DMEMIT("%d ", 7);
+ DMEMIT("%d ", 8);
DMEMIT("barrier_deadline_ms %lu ",
cache->barrier_deadline_ms);
DMEMIT("allow_migrate %d ",
cache->allow_migrate ? 1 : 0);
DMEMIT("enable_migration_modulator %d ",
cache->enable_migration_modulator ? 1 : 0);
- DMEMIT("migrate_threshold %d ", wb->migrate_threshold);
- DMEMIT("nr_cur_batched_migration %lu ",
+ DMEMIT("migrate_threshold %d ",
+ wb->migrate_threshold);
+ DMEMIT("nr_cur_batched_migration %u ",
cache->nr_cur_batched_migration);
DMEMIT("sync_interval %lu ",
cache->sync_interval);
- DMEMIT("update_record_interval %lu",
+ DMEMIT("update_record_interval %lu ",
cache->update_record_interval);
+ DMEMIT("blockup %d",
+ wb->blockup);
break;

case STATUSTYPE_TABLE:
@@ -1169,13 +1195,13 @@ static int __init writeboost_module_init(void)
safe_io_wq = alloc_workqueue("safeiowq",
WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0);
if (!safe_io_wq) {
- WBERR();
+ WBERR("failed to alloc safe_io_wq");
goto bad_wq;
}

wb_io_client = dm_io_client_create();
if (IS_ERR(wb_io_client)) {
- WBERR();
+ WBERR("failed to alloc wb_io_client");
r = PTR_ERR(wb_io_client);
goto bad_io_client;
}
diff --git a/Driver/dm-writeboost.h b/Driver/dm-writeboost.h
index d394dfa..fdb41d0 100644
--- a/Driver/dm-writeboost.h
+++ b/Driver/dm-writeboost.h
@@ -22,22 +22,17 @@
#include <linux/device-mapper.h>
#include <linux/dm-io.h>

-#define wbdebug(f, args...) \
+#define wbdebug(f, args...)\
DMINFO("debug@%s() L.%d" f, __func__, __LINE__, ## args)

-#define WBERR(f, args...) \
+#define WBERR(f, args...)\
DMERR("err@%s() " f, __func__, ## args)
-#define WBWARN(f, args...) \
+#define WBWARN(f, args...)\
DMWARN("warn@%s() " f, __func__, ## args)
-#define WBINFO(f, args...) \
+#define WBINFO(f, args...)\
DMINFO("info@%s() " f, __func__, ## args)

/*
- * The amount of RAM buffer pool to pre-allocated.
- */
-#define RAMBUF_POOL_ALLOCATED 64 /* MB */
-
-/*
* The Detail of the Disk Format
*
* Whole:
@@ -81,15 +76,6 @@ struct superblock_record_device {
} __packed;

/*
- * Cache line index.
- *
- * dm-writeboost can supoort a cache device
- * with size less than 4KB * (1 << 32)
- * that is 16TB.
- */
-typedef u32 cache_nr;
-
-/*
* Metadata of a 4KB cache line
*
* Dirtiness is defined for each sector
@@ -98,7 +84,7 @@ typedef u32 cache_nr;
struct metablock {
sector_t sector; /* key */

- cache_nr idx; /* Const */
+ u32 idx; /* Const */

struct hlist_node ht_list;

@@ -143,7 +129,7 @@ struct segment_header {
*/
u8 length;

- cache_nr start_idx; /* Const */
+ u32 start_idx; /* Const */
sector_t start_sector; /* Const */

struct list_head migrate_list;
@@ -228,10 +214,14 @@ struct wb_device;
struct wb_cache {
struct wb_device *wb;

+ mempool_t *buf_1_pool; /* 1 sector buffer pool */
+ mempool_t *buf_8_pool; /* 8 sector buffer pool */
+ mempool_t *flush_job_pool;
+
struct dm_dev *device;
struct mutex io_lock;
- cache_nr nr_caches; /* Const */
- u64 nr_segments; /* Const */
+ u32 nr_caches; /* Const */
+ u32 nr_segments; /* Const */
u8 segment_size_order; /* Const */
u8 nr_caches_inseg; /* Const */
struct bigarray *segment_header_array;
@@ -248,15 +238,16 @@ struct wb_cache {
size_t htsize;
struct ht_head *null_head;

- cache_nr cursor; /* Index that has been written the most lately */
+ u32 cursor; /* Index that has been written the most lately */
struct segment_header *current_seg;

struct rambuffer *current_rambuf;
- size_t nr_rambuf_pool; /* Const */
+ u32 rambuf_pool_amount; /* kB */
+ u32 nr_rambuf_pool; /* Const */
struct rambuffer *rambuf_pool;

- u64 last_migrated_segment_id;
- u64 last_flushed_segment_id;
+ atomic64_t last_migrated_segment_id;
+ atomic64_t last_flushed_segment_id;
int urge_migrate;

/*
@@ -269,7 +260,6 @@ struct wb_cache {
struct task_struct *flush_daemon;
spinlock_t flush_queue_lock;
struct list_head flush_queue;
- wait_queue_head_t flush_wait_queue;

/*
* Deferred ACK for barriers.
@@ -289,7 +279,7 @@ struct wb_cache {
* if they are segments to migrate.
*/
struct task_struct *migrate_daemon;
- bool allow_migrate; /* param */
+ int allow_migrate; /* param */

/*
* Batched Migration
@@ -303,8 +293,8 @@ struct wb_cache {
struct list_head migrate_list;
u8 *dirtiness_snapshot;
void *migrate_buffer;
- size_t nr_cur_batched_migration;
- size_t nr_max_batched_migration; /* param */
+ u32 nr_cur_batched_migration;
+ u32 nr_max_batched_migration; /* param */

/*
* Migration modulator
@@ -314,7 +304,7 @@ struct wb_cache {
* according to the load of backing store.
*/
struct task_struct *modulator_daemon;
- bool enable_migration_modulator; /* param */
+ int enable_migration_modulator; /* param */

/*
* Superblock Recorder
@@ -347,6 +337,9 @@ struct wb_device {
u8 migrate_threshold;

atomic64_t nr_dirty_caches;
+
+ wait_queue_head_t blockup_wait_queue;
+ int blockup;
};

struct flush_job {
@@ -384,24 +377,50 @@ u8 atomic_read_mb_dirtiness(struct segment_header *, struct metablock *);
extern struct workqueue_struct *safe_io_wq;
extern struct dm_io_client *wb_io_client;

-void *do_kmalloc_retry(size_t size, gfp_t flags, const char *caller);
-#define kmalloc_retry(size, flags) \
- do_kmalloc_retry((size), (flags), __func__)
+/*
+ * I/O error on either backing or cache
+ * should block up the whole system.
+ * Either reading or writing a device
+ * should not be done if it once returns -EIO.
+ * These devices are untrustable and
+ * we wait for sysadmin to remove the failure cause away.
+ */
+
+#define wait_on_blockup()\
+ do {\
+ BUG_ON(!wb);\
+ if (ACCESS_ONCE(wb->blockup)) {\
+ WBERR("system is blocked up on I/O error. set blockup to 0 after checkup.");\
+ wait_event_interruptible(wb->blockup_wait_queue,\
+ !ACCESS_ONCE(wb->blockup));\
+ WBINFO("reactivated after blockup");\
+ }\
+ } while (0)
+
+#define RETRY(proc)\
+ do {\
+ BUG_ON(!wb);\
+ r = proc;\
+ if (r == -EOPNOTSUPP) {\
+ r = 0;\
+ } else if (r == -EIO) { /* I/O error is critical */\
+ wb->blockup = true;\
+ wait_on_blockup();\
+ } else if (r == -ENOMEM) {\
+ schedule_timeout_interruptible(msecs_to_jiffies(1000));\
+ } else if (r) { \
+ WBERR("please report!! I/O failed but no retry error code %d", r);\
+ r = 0;\
+ }\
+ } while (r)

int dm_safe_io_internal(
+ struct wb_device*,
struct dm_io_request *,
unsigned num_regions, struct dm_io_region *,
unsigned long *err_bits, bool thread, const char *caller);
-#define dm_safe_io(io_req, num_regions, regions, err_bits, thread) \
- dm_safe_io_internal((io_req), (num_regions), (regions), \
- (err_bits), (thread), __func__)
-
-void dm_safe_io_retry_internal(
- struct dm_io_request *,
- unsigned num_regions, struct dm_io_region *,
- bool thread, const char *caller);
-#define dm_safe_io_retry(io_req, num_regions, regions, thread) \
- dm_safe_io_retry_internal((io_req), (num_regions), (regions), \
- (thread), __func__)
+#define dm_safe_io(io_req, num_regions, regions, err_bits, thread)\
+ dm_safe_io_internal(wb, (io_req), (num_regions), (regions),\
+ (err_bits), (thread), __func__)

sector_t dm_devsize(struct dm_dev *);

diff --git a/dm-writeboost.txt b/dm-writeboost.txt
index 9acbd54..00ad6e0 100644
--- a/dm-writeboost.txt
+++ b/dm-writeboost.txt
@@ -66,12 +66,20 @@ All the operations are via dmsetup command.

Constructor
-----------
-writeboost <backing dev> <cache dev> <segment size order>
+writeboost <backing dev> <cache dev>
+ [segment size order]
+ [rambuf pool amount]

backing dev : slow device holding original data blocks.
cache dev : fast device holding cached data and its metadata.
segment size order : the size of RAM buffer
1 << n (sectors), 4 <= n <= 11
+ default 7
+rambuf pool amount : The amount of the RAM buffer pool (kB).
+ Too small an amount may cause waiting for a new buffer
+ to become available again.
+ But too large an amount doesn't improve the performance.
+ default 2048

Note that cache device is re-formatted
if the first sector of the cache device is zeroed out.
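
For illustration, a constructor invocation using both optional
parameters would look something like this (a sketch only; the device
paths and the "wbdev" name are placeholders, not taken from the patch):

# segment size order 7, rambuf pool amount 2048 (kB)
sz=$(blockdev --getsz /dev/sdb)
echo "0 $sz writeboost /dev/sdb /dev/sdc 7 2048" | dmsetup create wbdev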

2013-10-16 00:02:14

by Mikulas Patocka

[permalink] [raw]
Subject: Re: A review of dm-writeboost



On Mon, 14 Oct 2013, Akira Hayakawa wrote:

> Hi, DM Guys
>
> I suppose I have finished the tasks needed to
> address Mikulas's review comments.
> So, let me update the progress report.
>
> The code is updated now on my Github repo.
> Check out the develop branch to get
> the latest source code.
>
> Compilation Status
> ------------------
> First, compilation status.
> Mikulas advised me to compile the module in
> a 32-bit environment, and yes, I did.
> With all the kernels listed below,
> writeboost compiles without any errors or warnings.
>
> For 64 bit
> 3.2.36, 3.4.25, 3.5.7, 3.6.9, 3.7.5, 3.8.5, 3.9.8, 3.10.5, 3.11.4 and 3.12-rc1
>
> For 32 bit
> 3.2.0-4-686-pae (Debian 7.1.0-i386)
>
>
> Block up on error
> -----------------
> The most annoying thing in this update
> is how to handle I/O errors.
> For memory allocation errors,
> writeboost now makes use of mempools to avoid
> the problem Mikulas described in his last comments,
> but handling I/O errors gracefully
> while the system is running is very difficult.
>
> My answer is that
> all the daemons stop when an I/O error (-EIO returned) happens
> in any part of this module.
> They wait on a wait queue (blockup_wait_queue)
> and reactivate when the sysadmin turns the `blockup` variable to 0
> through the message interface.
> When `blockup` is 1, all the incoming I/Os
> are returned as -EIO to the upper layer.
> A RETRY macro is introduced
> which wraps the I/O submission and
> retries it if the error is -ENOMEM but

I/Os shouldn't be returned with -ENOMEM. If they are, you can treat it as
a hard error.

> turns blockup to 1 and sleeps if the error is -EIO.
> -EIO is more serious than -ENOMEM because
> it may mean the storage is being destroyed by some accidental problem
> that we have no control over in the device-mapper layer
> (e.g. the storage controller went crazy).
> Blocking up all the I/O is meant to minimize the
> probable damage.
>
> But, XFS stalls ...
> -------------------
> For testing,
> I manually turn `blockup` to 1
> while compiling Ruby is in progress
> on XFS on a writeboost device.
> As soon as I do it,
> XFS starts to dump error messages
> like "metadata I/O error: ... ("xlog_iodone") error ..."
> and after a few seconds it then starts to dump
> lines like "BUG: soft lockup - CPU#3 stuck for 22s!".
> The system stalls and doesn't accept keyboard input.
>
> I think this behavior is caused by
> the device always returning -EIO after turning
> the variable to 1.
> But why does XFS stall on an I/O error?

Because it is bloated and buggy. We have bug 924301 for XFS crash on I/O
error...

> It should just suspend and start returning
> errors to the upper layer as writeboost now does.
> As Mikulas said, an I/O error is often
> due to a connection failure that is usually recoverable.
> Stalling the kernel requires a reboot
> after recovery, whereas writeboost
> can recover just by turning `blockup` back to 0.
> Is there any reason for this design, or
> is there an option to not stall XFS on I/O error?
>
> Thanks,
> Akira

Blocking I/O until the admin turns a specific variable isn't too
reliable.

Think of this case - your driver detects I/O error and blocks all I/Os.
The admin tries to log in. The login process needs memory. To fulfill this
memory need, the login process writes out some dirty pages. Those writes
are blocked by your driver - in the result, the admin is not able to log
in and flip the switch to unblock I/Os.

Blocking I/O indefinitely isn't good because any system activity
(including typing commands into shell) may wait on this I/O.

Mikulas

2013-10-16 00:53:48

by Akira Hayakawa

[permalink] [raw]
Subject: Re: A review of dm-writeboost

Mikulas,

> I/Os shouldn't be returned with -ENOMEM. If they are, you can treat it as
> a hard error.
It seems that blkdev_issue_discard returns -ENOMEM
when bio_alloc fails, for example.
Waiting for a second until we can alloc the memory is my idea
for handling a returned -ENOMEM.
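
For illustration, with the RETRY macro from the patch, such a call
would be wrapped like this (a sketch only; the sector arguments are
placeholders, and `wb` and `r` must be in scope as RETRY expects):

/* -EOPNOTSUPP is ignored, -ENOMEM sleeps a second and resubmits,
 * -EIO blocks the device up and waits for the admin to clear it. */
int r;
RETRY(blkdev_issue_discard(cache->device->bdev,
start_sector, nr_sects, GFP_NOIO, 0));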

> Blocking I/O until the admin turns a specific variable isn't too
> reliable.
>
> Think of this case - your driver detects I/O error and blocks all I/Os.
> The admin tries to log in. The login process needs memory. To fulfill this
> memory need, the login process writes out some dirty pages. Those writes
> are blocked by your driver - in the result, the admin is not able to log
> in and flip the switch to unblock I/Os.
>
> Blocking I/O indefinitely isn't good because any system activity
> (including typing commands into shell) may wait on this I/O.
I understand the problem. But, what should I do then?
Since writeboost is cache software,
it loses consistency if we simply ignore the cache
when it returns an I/O error.
Panicking in that case is also inappropriate (but inaccessibility to
the storage will eventually halt the whole system; if so, panicking might
be an acceptable solution).

I am afraid my idea is based on your past comment
> If you can't handle a specific I/O request failure gracefully, you should
> mark the driver as dead, don't do any more I/Os to the disk or cache
> device and return -EIO on all incoming requests.
>
> Always think that I/O failures can happen because of connection problems,
> not data corruption problems - for example, a disk cable can go loose, a
> network may lose connectivity, etc. In these cases, it is best to stop
> doing any I/O at all and let the user resolve the situation.
1) On failure, mark the driver dead - set `blockup` to 1 in my case -
and return -EIO on all incoming requests. Yes.
2) And wait for the user to resolve the situation - returning -EIO until
the admin turns `blockup` to 0 after checkup, in my case. Yes.

Did you mean we should not provide any way to recover the system
because the admin may not be able to reach the switch?
Should the writeboost module autonomously check whether the device
in trouble has recovered?
Retrying I/O submission to the device and finding the device has recovered
on I/O success is one solution, and I have implemented it.
I/O retry doesn't destroy any consistency in writeboost;
sooner or later it will not be able to accept writes any more for
lack of a RAM buffer, which can only be reused after I/O to the cache device succeeds.

Akira

2013-10-16 06:07:58

by Dave Chinner

[permalink] [raw]
Subject: Re: A review of dm-writeboost

[cc [email protected]]

On Tue, Oct 15, 2013 at 08:01:45PM -0400, Mikulas Patocka wrote:
> On Mon, 14 Oct 2013, Akira Hayakawa wrote:
> > But, XFS stalls ...
> > -------------------
> > For testing,
> > I manually turn `blockup` to 1
> > while compiling Ruby is in progress
> > on XFS on a writeboost device.
> > As soon as I do it,
> > XFS starts to dump error messages
> > like "metadata I/O error: ... ("xlog_iodone") error ..."
> > and after a few seconds it then starts to dump
> > lines like "BUG: soft lockup - CPU#3 stuck for 22s!".
> > The system stalls and doesn't accept keyboard input.
> >
> > I think this behavior is caused by
> > the device always returning -EIO after turning
> > the variable to 1.
> > But why does XFS stall on an I/O error?
>
> Because it is bloated and buggy.

How did I know you'd take that cheap shot, Mikulas? You are so
predictable...

> We have bug 924301 for XFS crash on I/O
> error...

Which is a problem with memory corruption after filling a dm
snapshot volume to 100%; shortly after XFS has shut down, the
kernel panics from memory corruption. Can't be reproduced without
filling the dm-snapshot volume to 100%, can't be reproduced with any
other filesystem. Crashes are also occurring randomly in printk and
the worker thread infrastructure. Memory and list poisoning clearly
indicates worker thread lists have freed objects on them. There are
lockdep messages from the DM snapshot code, etc.

There's actually very little to point at XFS problems other than the
first hang that was reported where XFS was stuck in a tight loop due
to memory corruption. It reminds me of a very similar bug report
and triage we went through last week:

http://oss.sgi.com/pipermail/xfs/2013-October/030681.html

Further analysis and bisects pointed to the zram driver being buggy,
not XFS:

http://oss.sgi.com/pipermail/xfs/2013-October/030707.html

XFS has historically exposed bugs in block device drivers that no
other filesystem exposes, and so when a new block device driver gets
tested with XFS and we start seeing memory corruption symptoms, it's
a fair bet that it's not XFS that is causing it....

Just sayin'.

---

Akira, can you please post the entire set of messages you are
getting when XFS showing problems? That way I can try to confirm
whether it's a regression in XFS or something else.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-10-16 10:34:45

by Akira Hayakawa

[permalink] [raw]
Subject: Re: A review of dm-writeboost

Dave

> Akira, can you please post the entire set of messages you are
> getting when XFS showing problems? That way I can try to confirm
> whether it's a regression in XFS or something else.

Environment:
- The kernel version is 3.12-rc1
- The debuggee is a KVM virtual machine equipped with 8 vcpus.
- writeboost version is commit 236732eb84684e8473353812acb3302232e1eab0
You can clone it from https://github.com/akiradeveloper/dm-writeboost

Test:
1. Make a writeboost device with a 3MB cache device and a 3GB backing store
with the default options (segment size order is 7 and a 2MB RAM buffer is allocated).
2. Start the testing/1 script (compiling Ruby, then make test after it).
3. Set the blockup variable to 1 via the message interface a few seconds
later (the exact command is sketched below).
The writeboost device starts to return -EIO on all incoming requests.
I guess this behavior causes the problem.
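
For reference, the toggle is a plain dmsetup message ("wbdev" here is
a placeholder for the actual device name):

dmsetup message wbdev 0 blockup 1 # all incoming I/O returns -EIO
dmsetup message wbdev 0 blockup 0 # reactivate after checkup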

In some cases, XFS doesn't collapse after setting blockup to 1.
When I set the variable to 1 about 10 or 20 seconds later,
it didn't collapse but neatly stopped the compile, and
after I set it to 0 again, it restarted the compile.
That XFS does collapse (badly shutting down the filesystem as seen below) in some cases
but doesn't collapse in other cases sounds to me like
the former case runs into a very narrow corner-case bug.

The entire set of messages via virsh console is shown below.
The lines related to writeboost are all benign.
The daemons are just stopping because the blockup variable is 1.

[ 146.284626] XFS (dm-3): metadata I/O error: block 0x300d91 ("xlog_iodone") error 5 numblks 64
[ 146.285825] XFS (dm-3): Log I/O Error Detected. Shutting down filesystem
[ 146.286699] XFS (dm-3): Please umount the filesystem and rectify the problem(s)
[ 146.560036] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 147.244036] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 172.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[ 172.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[ 180.560040] device-mapper: writeboost: err@recorder_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 180.561179] device-mapper: writeboost: err@sync_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 200.052005] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[ 200.436005] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[ 206.484005] INFO: rcu_sched self-detected stall on CPU { 0} (t=15000 jiffies g=1797 c=1796 q=3022)
[ 232.052007] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[ 232.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[ 260.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[ 260.436004] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[ 288.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[ 288.436004] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]

Akira

2013-10-16 11:01:53

by Dave Chinner

[permalink] [raw]
Subject: Re: A review of dm-writeboost

On Wed, Oct 16, 2013 at 07:34:38PM +0900, Akira Hayakawa wrote:
> Dave
>
> > Akira, can you please post the entire set of messages you are
> > getting when XFS showing problems? That way I can try to confirm
> > whether it's a regression in XFS or something else.
>
> Environment:
> - The kernel version is 3.12-rc1
> - The debuggee is a KVM virtual machine equipped with 8 vcpus.
> - writeboost version is commit 236732eb84684e8473353812acb3302232e1eab0
> You can clone it from https://github.com/akiradeveloper/dm-writeboost
>
> Test:
> 1. Make a writeboost device with a 3MB cache device and a 3GB backing store
> with the default options (segment size order is 7 and a 2MB RAM buffer is allocated).
> 2. Start the testing/1 script (compiling Ruby, then make test after it).
> 3. Set the blockup variable to 1 via the message interface a few seconds later.
> The writeboost device starts to return -EIO on all incoming requests.
> I guess this behavior causes the problem.
>
> In some cases, XFS doesn't collapse after setting blockup to 1.
> When I set the variable to 1 about 10 or 20 seconds later,
> it didn't collapse but neatly stopped the compile, and
> after I set it to 0 again, it restarted the compile.
> That XFS does collapse (badly shutting down the filesystem as seen below) in some cases
> but doesn't collapse in other cases sounds to me like
> the former case runs into a very narrow corner-case bug.

XFS shuts down because you've returned EIO to a log IO. That's a
fatal error. If you do the same to an ext4 journal write, it will do
the equivalent of shut down (e.g. complain and turn read-only).

> The entire set of messages via virsh console is shown below.
> The lines related to writeboost are all benign.
> The daemons are just stopping because the blockup variable is 1.
>
> [ 146.284626] XFS (dm-3): metadata I/O error: block 0x300d91 ("xlog_iodone") error 5 numblks 64
> [ 146.285825] XFS (dm-3): Log I/O Error Detected. Shutting down filesystem
> [ 146.286699] XFS (dm-3): Please umount the filesystem and rectify the problem(s)

What happened before this? Please attach the *full* log.

> [ 146.560036] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 147.244036] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 172.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
> [ 172.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]

These should be emitting a stack trace. Can you turn up the logging
level you are using so that they emit a full stack trace? The
messages are useless without the stack dump....

Also, 23 seconds before this timestamp is 149s, about 3s after the
XFS filesystem shut down, so it's not clear that the XFS shutdown is
related to the soft lockup yet. That's what we need the stack traces
for...

> [ 180.560040] device-mapper: writeboost: err@recorder_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 180.561179] device-mapper: writeboost: err@sync_proc() system is blocked up on I/O error. set blockup to 0 after checkup.

What's with the 35s delay between these writeboost messages? Have
you only done a partial shutdown of the block device, and it takes
this length of time for it to completely block IO?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-10-16 12:17:49

by Akira Hayakawa

[permalink] [raw]
Subject: Re: A review of dm-writeboost

Dave

> XFS shuts down because you've returned EIO to a log IO. That's a
> fatal error. If you do the same to an ext4 journal write, it will do
> the equivalent of shut down (e.g. complain and turn read-only).
You mean a block device should not return -EIO at all if
it doesn't want XFS to suddenly shut down?
As Mikulas said, a connection failure is often the cause of
an I/O error from the underlying devices.

Is the fact that ext4 and XFS both decide to shut down on
erroneous journal writes also due to a limitation of journal writing,
or just a compromise in implementation? This is just for my curiosity.

>> [ 180.560040] device-mapper: writeboost: err@recorder_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
>> [ 180.561179] device-mapper: writeboost: err@sync_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
>
> What's with the 35s delay between these writeboost messages? Have
> you only done a partial shutdown of the block device and it takes
> This length of time for it to completely block IO?
Strange.
These daemons should stop in a few seconds in the current configuration.
Yes, partial in a sense. Not all the daemons stop immediately, but
to the client the logical device is seen as blocked up,
returning -EIO on every I/O. I don't think this behavior harms the
upper layer.

Currently, sync_proc is like this.
It sleeps for a few seconds, wakes up,
meets wait_on_blockup() to dump that message, and halts itself.
recorder_proc is implemented in the same way.

int sync_proc(void *data)
{
int r;
struct wb_cache *cache = data;
struct wb_device *wb = cache->wb;
unsigned long intvl;

while (!kthread_should_stop()) {

wait_on_blockup();

/* sec -> ms */
intvl = ACCESS_ONCE(cache->sync_interval) * 1000;

if (!intvl) {
schedule_timeout_interruptible(msecs_to_jiffies(1000));
continue;
}

flush_current_buffer(cache);

RETRY(blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL));

schedule_timeout_interruptible(msecs_to_jiffies(intvl));
}
return 0;
}

XFS shuts down, goes crazy, and that disturbs the kthread's wakeup?

> These should be emitting a stack trace. Can you turn up the logging
> level you are using so that they emit a full stack trace? The
> messages are useless without the stack dump....
I turned the level up to 7. Here is the log.

Connected to domain Hercules
Escape character is ^]
[ 54.683482] device-mapper: writeboost: err@audit_cache_device() superblock header: magic number invalid
[ 54.809262] bio: create slab <bio-2> at 2
[ 68.812800] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
[ 68.825016] XFS (dm-3): Mounting Filesystem
[ 68.847027] XFS (dm-3): Ending clean mount
[ 72.100112] device-mapper: writeboost: err@dm_safe_io_internal() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 72.109702] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 72.812097] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 73.894429] Buffer I/O error on device dm-3, logical block 98354
[ 73.895824] lost page write due to I/O error on dm-3
[ 73.897042] Buffer I/O error on device dm-3, logical block 98355
[ 73.897209] Buffer I/O error on device dm-3, logical block 196641
[ 73.897210] lost page write due to I/O error on dm-3
[ 73.897263] Buffer I/O error on device dm-3, logical block 196688
[ 73.897264] lost page write due to I/O error on dm-3
[ 73.897266] Buffer I/O error on device dm-3, logical block 196689
[ 73.897267] lost page write due to I/O error on dm-3
[ 73.897268] Buffer I/O error on device dm-3, logical block 196690
[ 73.897269] lost page write due to I/O error on dm-3
[ 73.897270] Buffer I/O error on device dm-3, logical block 196691
[ 73.897271] lost page write due to I/O error on dm-3
[ 73.897272] Buffer I/O error on device dm-3, logical block 196692
[ 73.897273] lost page write due to I/O error on dm-3
[ 73.897307] Buffer I/O error on device dm-3, logical block 294955
[ 73.897308] lost page write due to I/O error on dm-3
[ 73.897335] Buffer I/O error on device dm-3, logical block 294956
[ 73.897335] lost page write due to I/O error on dm-3
[ 73.914261] lost page write due to I/O error on dm-3
[ 73.930022] XFS (dm-3): metadata I/O error: block 0x40 ("xfs_buf_iodone_callbacks") error 5 numblks 16
[ 74.036759] XFS (dm-3): metadata I/O error: block 0x300c7f ("xlog_iodone") error 5 numblks 64
[ 74.043456] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa03a6417
[ 74.047556] XFS (dm-3): Log I/O Error Detected. Shutting down filesystem
[ 74.049893] XFS (dm-3): Please umount the filesystem and rectify the problem(s)
[ 74.051467] XFS (dm-3): metadata I/O error: block 0x300cbf ("xlog_iodone") error 5 numblks 64
[ 74.053190] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa03a6417
[ 74.055435] XFS (dm-3): metadata I/O error: block 0x300cff ("xlog_iodone") error 5 numblks 64
[ 74.057162] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa03a6417
[ 74.059402] XFS (dm-3): metadata I/O error: block 0x300d3f ("xlog_iodone") error 5 numblks 64
[ 74.061136] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa03a6417
[ 74.063561] XFS (dm-3): metadata I/O error: block 0x300d7f ("xlog_iodone") error 5 numblks 64
[ 74.065667] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa03a6417
[ 74.068215] XFS (dm-3): metadata I/O error: block 0x300dbf ("xlog_iodone") error 5 numblks 64
[ 74.069920] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa03a6417
[ 74.072325] XFS (dm-3): metadata I/O error: block 0x300dff ("xlog_iodone") error 5 numblks 64
[ 74.074118] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa03a6417
[ 100.052005] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1H:215]
[ 100.052005] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[ 100.052005] CPU: 0 PID: 215 Comm: kworker/0:1H Tainted: G O 3.12.0-rc1 #8
[ 100.052005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 100.052005] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[ 100.052005] task: ffff880216eb67b0 ti: ffff8801f9c7c000 task.ti: ffff8801f9c7c000
[ 100.052005] RIP: 0010:[<ffffffff8108146a>] [<ffffffff8108146a>] do_raw_spin_lock+0x1d/0x23
[ 100.052005] RSP: 0018:ffff8801f9c7dde0 EFLAGS: 00000206
[ 100.052005] RAX: 000000000116011c RBX: ffff880216eb6818 RCX: 0000000000000001
[ 100.052005] RDX: 0000000000000116 RSI: ffff8801fa389218 RDI: ffff880205c69e80
[ 100.052005] RBP: ffff880205c69e40 R08: ffff88021fc12ad8 R09: 0000000000000001
[ 100.052005] R10: 0000000000000001 R11: ffff88002f36e3c0 R12: ffffffff810605a8
[ 100.052005] R13: ffff880216eb6818 R14: ffff88021fc12ef0 R15: ffff880216eb6818
[ 100.052005] FS: 0000000000000000(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
[ 100.052005] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 100.052005] CR2: ffffe8ffffc00520 CR3: 000000002ef57000 CR4: 00000000000006f0
[ 100.052005] Stack:
[ 100.052005] ffffffffa03aa836 ffff88002f36e418 ffff88021fc12ac0 ffffe8ffffc00400
[ 100.052005] ffff88002f36e3c0 ffff88002f36e460 ffffffffa03a99c2 ffff88002f36e3c0
[ 100.052005] ffffffffa03a9bd2 ffff8801faf95bc0 ffff88002f36e460 ffff88021fc12ac0
[ 100.052005] Call Trace:
[ 100.052005] [<ffffffffa03aa836>] ? xfs_buf_iodone+0x1b/0x49 [xfs]
[ 100.052005] [<ffffffffa03a99c2>] ? xfs_buf_do_callbacks+0x22/0x30 [xfs]
[ 100.052005] [<ffffffffa03a9bd2>] ? xfs_buf_iodone_callbacks+0x16b/0x1c4 [xfs]
[ 100.052005] [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
[ 100.052005] [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
[ 100.052005] [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
[ 100.052005] [<ffffffff81050641>] ? kthread+0x81/0x89
[ 100.052005] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[ 100.052005] [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
[ 100.052005] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[ 100.052005] Code: 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c 66 8b 07 66 39 d0 74 04 f3 90 <eb> f4 c3 90 90 90 83 c8 ff 0f b7 ca 66 ff c2 89 c2 0f 45 d1 0f
[ 100.244006] BUG: soft lockup - CPU#2 stuck for 22s! [xfsaild/dm-3:3167]
[ 100.244006] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[ 100.244006] CPU: 2 PID: 3167 Comm: xfsaild/dm-3 Tainted: G O 3.12.0-rc1 #8
[ 100.244006] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 100.244006] task: ffff88002fef2100 ti: ffff88002b384000 task.ti: ffff88002b384000
[ 100.244006] RIP: 0010:[<ffffffff8108146a>] [<ffffffff8108146a>] do_raw_spin_lock+0x1d/0x23
[ 100.244006] RSP: 0018:ffff88002b385b00 EFLAGS: 00000206
[ 100.244006] RAX: 000000000117011c RBX: ffffffff8105e684 RCX: 0000000000000002
[ 100.244006] RDX: 0000000000000117 RSI: ffff8802166d7200 RDI: ffff880205c69e80
[ 100.244006] RBP: ffff8801fa2e2670 R08: 0000000000000005 R09: 0000000000000000
[ 100.244006] R10: 000000000000a944 R11: 0000000000000000 R12: ffff88002f181674
[ 100.244006] R13: 0000000000000001 R14: 0000000000000001 R15: ffff88002fef2100
[ 100.244006] FS: 0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
[ 100.244006] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 100.244006] CR2: ffffe8ffffc80400 CR3: 00000001fae7c000 CR4: 00000000000006e0
[ 100.244006] Stack:
[ 100.244006] ffffffffa0372e3d ffff88021682a830 ffffffffa036050e 0000000000012e80
[ 100.244006] ffff88002b385fd8 ffff88002b385fd8 000000022fef2100 0000000000300e3f
[ 100.244006] ffff88002f071500 0000000000000292 ffffffff81050e94 0000000000000000
[ 100.244006] Call Trace:
[ 100.244006] [<ffffffffa0372e3d>] ? xfs_trans_committed_bulk+0x2f/0x19d [xfs]
[ 100.244006] [<ffffffffa036050e>] ? _xfs_buf_ioapply+0x271/0x29c [xfs]
[ 100.244006] [<ffffffff81050e94>] ? remove_wait_queue+0xe/0x48
[ 100.244006] [<ffffffffa03a5b85>] ? xlog_wait+0x62/0x6b [xfs]
[ 100.244006] [<ffffffff8105bbfb>] ? try_to_wake_up+0x190/0x190
[ 100.244006] [<ffffffffa03a78e6>] ? xlog_state_get_iclog_space+0x5a/0x1fb [xfs]
[ 100.244006] [<ffffffff810fa1c3>] ? __cache_free.isra.46+0x178/0x187
[ 100.244006] [<ffffffffa03a8e0b>] ? xlog_cil_committed+0x2f/0xe6 [xfs]
[ 100.244006] [<ffffffffa03a926c>] ? xlog_cil_push+0x2f6/0x311 [xfs]
[ 100.244006] [<ffffffff81058701>] ? mmdrop+0xd/0x1c
[ 100.244006] [<ffffffffa03a9780>] ? xlog_cil_force_lsn+0x71/0xdd [xfs]
[ 100.244006] [<ffffffffa03a8162>] ? _xfs_log_force+0x55/0x1a0 [xfs]
[ 100.244006] [<ffffffffa03a82cc>] ? xfs_log_force+0x1f/0x4e [xfs]
[ 100.244006] [<ffffffffa03aba15>] ? xfsaild+0x144/0x4cd [xfs]
[ 100.244006] [<ffffffff810592dc>] ? finish_task_switch+0x7f/0xaa
[ 100.244006] [<ffffffffa03ab8d1>] ? xfs_trans_ail_cursor_first+0x76/0x76 [xfs]
[ 100.244006] [<ffffffffa03ab8d1>] ? xfs_trans_ail_cursor_first+0x76/0x76 [xfs]
[ 100.244006] [<ffffffff81050641>] ? kthread+0x81/0x89
[ 100.244006] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[ 100.244006] [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
[ 100.244006] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[ 100.244006] Code: 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c 66 8b 07 66 39 d0 74 04 f3 90 <eb> f4 c3 90 90 90 83 c8 ff 0f b7 ca 66 ff c2 89 c2 0f 45 d1 0f
[ 100.340005] BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1H:207]
[ 100.340005] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[ 100.340005] CPU: 3 PID: 207 Comm: kworker/3:1H Tainted: G O 3.12.0-rc1 #8
[ 100.340005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 100.340005] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[ 100.340005] task: ffff8801f9cc5870 ti: ffff8801f9c06000 task.ti: ffff8801f9c06000
[ 100.340005] RIP: 0010:[<ffffffff81081463>] [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
[ 100.340005] RSP: 0018:ffff8801f9c07de0 EFLAGS: 00000202
[ 100.340005] RAX: 000000000118011c RBX: ffffe8ffffcc0400 RCX: 0000000000000001
[ 100.340005] RDX: 0000000000000118 RSI: ffff8801fa389218 RDI: ffff880205c69e80
[ 100.340005] RBP: ffff880205c69e40 R08: ffff88021fcd2ad8 R09: 0000000000000001
[ 100.340005] R10: 0000000000000001 R11: ffff88002f36e540 R12: ffff88002f36e674
[ 100.340005] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 100.340005] FS: 0000000000000000(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
[ 100.340005] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 100.340005] CR2: 00007f15636c7e50 CR3: 000000000160b000 CR4: 00000000000006e0
[ 100.340005] Stack:
[ 100.340005] ffffffffa03aa836 ffff88002f36e598 ffff88021fcd2ac0 ffffe8ffffcc0400
[ 100.340005] ffff88002f36e540 ffff88002f36e5e0 ffffffffa03a99c2 ffff88002f36e540
[ 100.340005] ffffffffa03a9bd2 ffff8801f9d16f40 ffff88002f36e5e0 ffff88021fcd2ac0
[ 100.340005] Call Trace:
[ 100.340005] [<ffffffffa03aa836>] ? xfs_buf_iodone+0x1b/0x49 [xfs]
[ 100.340005] [<ffffffffa03a99c2>] ? xfs_buf_do_callbacks+0x22/0x30 [xfs]
[ 100.340005] [<ffffffffa03a9bd2>] ? xfs_buf_iodone_callbacks+0x16b/0x1c4 [xfs]
[ 100.340005] [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
[ 100.340005] [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
[ 100.340005] [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
[ 100.340005] [<ffffffff81050641>] ? kthread+0x81/0x89
[ 100.340005] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[ 100.340005] [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
[ 100.340005] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[ 100.340005] Code: 81 ff 98 a2 37 81 72 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c3 90 90 90 83 c8 ff 0f b7 ca 66 ff
[ 100.436010] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:2:537]
[ 100.436010] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[ 100.436010] CPU: 4 PID: 537 Comm: kworker/4:2 Tainted: G O 3.12.0-rc1 #8
[ 100.436010] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 100.436010] Workqueue: xfs-reclaim/dm-3 xfs_reclaim_worker [xfs]
[ 100.436010] task: ffff8801fade6040 ti: ffff88002edaa000 task.ti: ffff88002edaa000
[ 100.436010] RIP: 0010:[<ffffffff81081460>] [<ffffffff81081460>] do_raw_spin_lock+0x13/0x23
[ 100.436010] RSP: 0018:ffff88002edabc20 EFLAGS: 00000202
[ 100.436010] RAX: 00000000011b011c RBX: ffff88002f693870 RCX: 0000000000000000
[ 100.436010] RDX: 000000000000011b RSI: 0000000000000000 RDI: ffff880205c69e80
[ 100.436010] RBP: ffff88002a15c000 R08: 0000000000000000 R09: 0000000000000006
[ 100.436010] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 100.436010] R13: ffff880216e79ec0 R14: ffff88002edabc24 R15: 0000000000012e80
[ 100.436010] FS: 0000000000000000(0000) GS:ffff88021fd00000(0000) knlGS:0000000000000000
[ 100.436010] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 100.436010] CR2: 00007f7e9b394f30 CR3: 000000000160b000 CR4: 00000000000006e0
[ 100.436010] Stack:
[ 100.436010] ffffffffa03ab5f3 0000000000000001 ffff8801fa383710 ffff88002a15c000
[ 100.436010] 0000000000000445 0000000000000002 ffff8801fa3836c0 0000000000000002
[ 100.436010] ffffffffa03667a3 0000000000000444 0000000000000000 ffff8801fa3836c0
[ 100.436010] Call Trace:
[ 100.436010] [<ffffffffa03ab5f3>] ? xfs_iflush_abort+0x35/0x9a [xfs]
[ 100.436010] [<ffffffffa03667a3>] ? xfs_reclaim_inode+0x85/0x246 [xfs]
[ 100.436010] [<ffffffffa0366aab>] ? xfs_reclaim_inodes_ag+0x147/0x1fc [xfs]
[ 100.436010] [<ffffffff8105bbfb>] ? try_to_wake_up+0x190/0x190
[ 100.436010] [<ffffffff81056927>] ? __wake_up_common+0x42/0x78
[ 100.436010] [<ffffffff810d5a18>] ? fold_diff+0x22/0x2e
[ 100.436010] [<ffffffff810408d8>] ? lock_timer_base.isra.35+0x23/0x48
[ 100.436010] [<ffffffff81040750>] ? internal_add_timer+0xd/0x28
[ 100.436010] [<ffffffff8104125b>] ? __mod_timer+0xfa/0x10c
[ 100.436010] [<ffffffffa0367382>] ? xfs_reclaim_inodes+0x16/0x1b [xfs]
[ 100.436010] [<ffffffffa036739c>] ? xfs_reclaim_worker+0x15/0x1e [xfs]
[ 100.436010] [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
[ 100.436010] [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
[ 100.436010] [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
[ 100.436010] [<ffffffff81050641>] ? kthread+0x81/0x89
[ 100.436010] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[ 100.436010] [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
[ 100.436010] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[ 100.436010] Code: 31 c0 48 81 ff 98 a2 37 81 72 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c <66> 8b 07 66 39 d0 74 04 f3 90 eb f4 c3 90 90 90 83 c8 ff 0f b7
[ 100.628005] BUG: soft lockup - CPU#6 stuck for 22s! [script:3151]
[ 100.628005] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[ 100.628005] CPU: 6 PID: 3151 Comm: script Tainted: G O 3.12.0-rc1 #8
[ 100.628005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 100.628005] task: ffff88002fc3d870 ti: ffff88002fda4000 task.ti: ffff88002fda4000
[ 100.628005] RIP: 0010:[<ffffffff81081463>] [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
[ 100.628005] RSP: 0018:ffff88002fda5d10 EFLAGS: 00000202
[ 100.628005] RAX: 00000000011a011c RBX: ffffffff81119902 RCX: 00000000000004e2
[ 100.628005] RDX: 000000000000011a RSI: ffff88002f64e200 RDI: ffff880205c69e80
[ 100.628005] RBP: ffff88002fda5e58 R08: ffffffffa03bec40 R09: 0000000000000000
[ 100.628005] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88002fda5cf0
[ 100.628005] R13: 0000000000000001 R14: ffff8801facfea10 R15: 0000000000000000
[ 100.628005] FS: 00007f599f046700(0000) GS:ffff88021fd80000(0000) knlGS:0000000000000000
[ 100.628005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 100.628005] CR2: 000000000234c698 CR3: 000000002f62e000 CR4: 00000000000006e0
[ 100.628005] Stack:
[ 100.628005] ffffffffa03abe25 ffff8801facfe800 ffff88002fda5e58 ffff88002ffcb000
[ 100.628005] ffffffffa03673bf ffff88002fda5d38 000000002fda5d38 0000000000001036
[ 100.628005] ffffffff8110875e 0000000000000000 0000000000000000 ffff88002ffcb360
[ 100.628005] Call Trace:
[ 100.628005] [<ffffffffa03abe25>] ? xfs_ail_push_all+0x13/0x4f [xfs]
[ 100.628005] [<ffffffffa03673bf>] ? xfs_reclaim_inodes_nr+0x1a/0x34 [xfs]
[ 100.628005] [<ffffffff8110875e>] ? super_cache_scan+0x121/0x13e
[ 100.628005] [<ffffffff810cdb7a>] ? shrink_slab+0x1e3/0x2f9
[ 100.628005] [<ffffffff81119526>] ? iput+0x34/0x13d
[ 100.628005] [<ffffffff81148d14>] ? do_coredump+0xbc3/0xbc3
[ 100.628005] [<ffffffff81148e3f>] ? drop_caches_sysctl_handler+0x65/0x76
[ 100.628005] [<ffffffff81157b7c>] ? proc_sys_call_handler+0x98/0xbf
[ 100.628005] [<ffffffff81105eca>] ? vfs_write+0x9e/0x104
[ 100.628005] [<ffffffff811061c1>] ? SyS_write+0x51/0x85
[ 100.628005] [<ffffffff8137f592>] ? system_call_fastpath+0x16/0x1b
[ 100.628005] Code: 81 ff 98 a2 37 81 72 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c3 90 90 90 83 c8 ff 0f b7 ca 66 ff
[ 100.724004] BUG: soft lockup - CPU#7 stuck for 22s! [kworker/7:1H:211]
[ 100.724005] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[ 100.724005] CPU: 7 PID: 211 Comm: kworker/7:1H Tainted: G O 3.12.0-rc1 #8
[ 100.724005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 100.724005] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[ 100.724005] task: ffff88002f7de830 ti: ffff8801fa100000 task.ti: ffff8801fa100000
[ 100.724005] RIP: 0010:[<ffffffff81081463>] [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
[ 100.724005] RSP: 0018:ffff8801fa101de0 EFLAGS: 00000206
[ 100.724005] RAX: 000000000119011c RBX: ffff8801fa7ad0c0 RCX: 0000000000000001
[ 100.724005] RDX: 0000000000000119 RSI: ffff8801fac4a398 RDI: ffff880205c69e80
[ 100.724005] RBP: ffff880205c69e40 R08: ffff88021fdd2ad8 R09: ffff88002f181500
[ 100.724005] R10: ffff88002f181500 R11: ffff88002f2670c0 R12: ffff8801fa7ad0c0
[ 100.724005] R13: ffff88021658a360 R14: ffffffffa03c3e75 R15: ffffe8ffffdc0400
[ 100.724005] FS: 0000000000000000(0000) GS:ffff88021fdc0000(0000) knlGS:0000000000000000
[ 100.724005] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 100.724005] CR2: 000000000131b724 CR3: 00000001fae7c000 CR4: 00000000000006e0
[ 100.724005] Stack:
[ 100.724005] ffffffffa03aa836 ffff88002f267118 ffff88021fdd2ac0 ffffe8ffffdc0400
[ 100.724005] ffff88002f2670c0 ffff88002f267160 ffffffffa03a99c2 ffff88002f2670c0
[ 100.724005] ffffffffa03a9bd2 ffff880216eb42c0 ffff88002f267160 ffff88021fdd2ac0
[ 100.724005] Call Trace:
[ 100.724005] [<ffffffffa03aa836>] ? xfs_buf_iodone+0x1b/0x49 [xfs]
[ 100.724005] [<ffffffffa03a99c2>] ? xfs_buf_do_callbacks+0x22/0x30 [xfs]
[ 100.724005] [<ffffffffa03a9bd2>] ? xfs_buf_iodone_callbacks+0x16b/0x1c4 [xfs]
[ 100.724005] [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
[ 100.724005] [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
[ 100.724005] [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
[ 100.724005] [<ffffffff81050641>] ? kthread+0x81/0x89
[ 100.724005] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[ 100.724005] [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
[ 100.724005] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[ 100.724005] Code: 81 ff 98 a2 37 81 72 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c3 90 90 90 83 c8 ff 0f b7 ca 66 ff

Akira

2013-10-16 21:42:37

by Dave Chinner

[permalink] [raw]
Subject: Re: A review of dm-writeboost

On Wed, Oct 16, 2013 at 09:17:40PM +0900, Akira Hayakawa wrote:
> Dave
>
> > XFS shuts down because you've returned EIO to a log IO. That's a
> > fatal error. If you do the same to an ext4 journal write, it will do
> > the equivalent of shut down (e.g. complain and turn read-only).
> You mean a block device should not return -EIO at all if
> it doesn't want XFS to suddenly shut down?

Yes. EIO means an IO error has occurred. That causes failure paths
to be triggered in the upper layers.

I really don't understand what you are trying to achieve with this
"blockup" thing. If something goes wrong with the device, then you
*cannot recover* by sending EIO to any new IOs and then continuing
on at a later time as though nothing has happened. The moment a
filesystem gets EIO from a metadata write, it is likely to be
corrupted and if you continue onwards after that you simply
propagate the corruption.

> As Mikulas said, connection failure is often the cause of
> I/O errors from the underlying devices.

Connection failure is *rarely* the cause of IO errors, except in
environments where SANs are in use. Even then multipathing makes
fatal connection failure a rare occurrence. Broken hardware is a
much more common cause of problems at the storage layers.

> Is the fact that ext4 and XFS both decide to shut down on
> erroneous journal writes due to a limitation of journal writes,
> or just a compromise in implementation? This is just out of curiosity.

A failed, unrecoverable journal write violates the filesystem
consistency model of any journalling filesystem. Operations must be
stopped and the hardware and filesystem must be repaired, otherwise
loss of data will occur.

i.e. You're telling the filesystem that it's had a fatal IO error by
returning EIO, and the filesystems are treating it as though they've
seen a fatal IO error.


Simple rule: Don't complete IOs with EIO if you haven't had a fatal
IO error.

> 	struct wb_cache *cache = data;
> 	struct wb_device *wb = cache->wb;
> 	unsigned long intvl;
>
> 	while (!kthread_should_stop()) {
>
> 		wait_on_blockup();

Ugh. You should be using workqueue with timed work for this.
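
Roughly something like this (just a sketch; wb_wq, migrate_dwork and
do_one_migrate_pass() are names I've made up, not dm-writeboost code):

	#include <linux/workqueue.h>

	static struct workqueue_struct *wb_wq;
	static struct delayed_work migrate_dwork;

	static void do_one_migrate_pass(void);	/* hypothetical: one batch */

	static void migrate_work_fn(struct work_struct *work)
	{
		do_one_migrate_pass();

		/* re-arm: runs again after the interval; no thread sits
		 * in a sleep loop, and nothing runs while we're idle */
		queue_delayed_work(wb_wq, &migrate_dwork,
				   msecs_to_jiffies(1000));
	}

	static int wb_start_migrate(void)
	{
		wb_wq = alloc_workqueue("wbmigrate",
					WQ_MEM_RECLAIM | WQ_FREEZABLE, 1);
		if (!wb_wq)
			return -ENOMEM;
		INIT_DELAYED_WORK(&migrate_dwork, migrate_work_fn);
		queue_delayed_work(wb_wq, &migrate_dwork, 0);
		return 0;
	}

	static void wb_stop_migrate(void)
	{
		cancel_delayed_work_sync(&migrate_dwork);
		destroy_workqueue(wb_wq);
	}

Making the workqueue WQ_FREEZABLE also gets you most of the suspend
handling mentioned below for free.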

BTW, you're missing the handling needed by these kernel threads
for suspend-to-disk/ram....

> [ 68.825016] XFS (dm-3): Mounting Filesystem
> [ 68.847027] XFS (dm-3): Ending clean mount
> [ 72.100112] device-mapper: writeboost: err@dm_safe_io_internal() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 72.109702] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 72.812097] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 73.894429] Buffer I/O error on device dm-3, logical block 98354
> [ 73.895824] lost page write due to I/O error on dm-3

Data IO has been lost due to EIOs. You've corrupted user files when
this error is emitted.

...

> [ 73.930022] XFS (dm-3): metadata I/O error: block 0x40 ("xfs_buf_iodone_callbacks") error 5 numblks 16
> [ 74.036759] XFS (dm-3): metadata I/O error: block 0x300c7f ("xlog_iodone") error 5 numblks 64
> [ 74.043456] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa03a6417
> [ 74.047556] XFS (dm-3): Log I/O Error Detected. Shutting down filesystem
> [ 74.049893] XFS (dm-3): Please umount the filesystem and rectify the problem(s)
> [ 74.051467] XFS (dm-3): metadata I/O error: block 0x300cbf ("xlog_iodone") error 5 numblks 64
> [ 74.053190] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa03a6417

And these are all the log buffers containing uncommitted changes
being aborted due to EIO. The filesystem state in memory now doesn't
match the state on disk, and so it's effectively corrupt and shuts
down.

....

> [ 100.052005] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1H:215]
...
> [ 100.052005] RIP: 0010:[<ffffffff81081463>] [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
> [ 100.052005] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
> [ 100.052005] Call Trace:
> [ 100.052005] [<ffffffffa03aa836>] ? xfs_buf_iodone+0x1b/0x49 [xfs]
> [ 100.052005] [<ffffffffa03a99c2>] ? xfs_buf_do_callbacks+0x22/0x30 [xfs]
> [ 100.052005] [<ffffffffa03a9bd2>] ? xfs_buf_iodone_callbacks+0x16b/0x1c4 [xfs]
> [ 100.052005] [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
> [ 100.052005] [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
> [ 100.052005] [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
> [ 100.052005] [<ffffffff81050641>] ? kthread+0x81/0x89
> [ 100.052005] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
> [ 100.052005] [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
> [ 100.052005] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d

You need to compile your kernel with frame pointers enabled so we get
reliable stack traces. I think it's stuck on a spinlock in
xfs_buf_iodone, which would imply the AIL lock.

.....

> [ 100.244006] CPU: 2 PID: 3167 Comm: xfsaild/dm-3 Tainted: G O 3.12.0-rc1 #8

FWIW, you should probably be testing the latest Linus kernel
(3.12-rc5, IIRC) rather than -rc1....

....

> [ 100.244006] RIP: 0010:[<ffffffff81081463>] [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
> [ 100.244006] Call Trace:
> [ 100.244006] [<ffffffffa0372e3d>] ? xfs_trans_committed_bulk+0x2f/0x19d [xfs]
> [ 100.244006] [<ffffffffa036050e>] ? _xfs_buf_ioapply+0x271/0x29c [xfs]
> [ 100.244006] [<ffffffff81050e94>] ? remove_wait_queue+0xe/0x48
> [ 100.244006] [<ffffffffa03a5b85>] ? xlog_wait+0x62/0x6b [xfs]
> [ 100.244006] [<ffffffff8105bbfb>] ? try_to_wake_up+0x190/0x190
> [ 100.244006] [<ffffffffa03a78e6>] ? xlog_state_get_iclog_space+0x5a/0x1fb [xfs]
> [ 100.244006] [<ffffffff810fa1c3>] ? __cache_free.isra.46+0x178/0x187
> [ 100.244006] [<ffffffffa03a8e0b>] ? xlog_cil_committed+0x2f/0xe6 [xfs]
> [ 100.244006] [<ffffffffa03a926c>] ? xlog_cil_push+0x2f6/0x311 [xfs]
> [ 100.244006] [<ffffffff81058701>] ? mmdrop+0xd/0x1c
> [ 100.244006] [<ffffffffa03a9780>] ? xlog_cil_force_lsn+0x71/0xdd [xfs]
> [ 100.244006] [<ffffffffa03a8162>] ? _xfs_log_force+0x55/0x1a0 [xfs]
> [ 100.244006] [<ffffffffa03a82cc>] ? xfs_log_force+0x1f/0x4e [xfs]
> [ 100.244006] [<ffffffffa03aba15>] ? xfsaild+0x144/0x4cd [xfs]
> [ 100.244006] [<ffffffff810592dc>] ? finish_task_switch+0x7f/0xaa
> [ 100.244006] [<ffffffffa03ab8d1>] ? xfs_trans_ail_cursor_first+0x76/0x76 [xfs]
> [ 100.244006] [<ffffffffa03ab8d1>] ? xfs_trans_ail_cursor_first+0x76/0x76 [xfs]
> [ 100.244006] [<ffffffff81050641>] ? kthread+0x81/0x89
> [ 100.244006] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
> [ 100.244006] [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
> [ 100.244006] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d

It's stuck on a spin lock, but I don't know what function it's
in because the stack trace is indeterminate (i.e. need frame
pointers enabled). It might be the AIL lock (as it's the xfsaild),
but I can't tell.

> [ 100.436010] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:2:537]
...
> [ 100.436010] Workqueue: xfs-reclaim/dm-3 xfs_reclaim_worker [xfs]
> [ 100.436010] RIP: 0010:[<ffffffff81081460>] [<ffffffff81081460>] do_raw_spin_lock+0x13/0x23
> [ 100.436010] Call Trace:
> [ 100.436010] [<ffffffffa03ab5f3>] ? xfs_iflush_abort+0x35/0x9a [xfs]
> [ 100.436010] [<ffffffffa03667a3>] ? xfs_reclaim_inode+0x85/0x246 [xfs]
> [ 100.436010] [<ffffffffa0366aab>] ? xfs_reclaim_inodes_ag+0x147/0x1fc [xfs]
> [ 100.436010] [<ffffffff8105bbfb>] ? try_to_wake_up+0x190/0x190
> [ 100.436010] [<ffffffff81056927>] ? __wake_up_common+0x42/0x78
> [ 100.436010] [<ffffffff810d5a18>] ? fold_diff+0x22/0x2e
> [ 100.436010] [<ffffffff810408d8>] ? lock_timer_base.isra.35+0x23/0x48
> [ 100.436010] [<ffffffff81040750>] ? internal_add_timer+0xd/0x28
> [ 100.436010] [<ffffffff8104125b>] ? __mod_timer+0xfa/0x10c
> [ 100.436010] [<ffffffffa0367382>] ? xfs_reclaim_inodes+0x16/0x1b [xfs]
> [ 100.436010] [<ffffffffa036739c>] ? xfs_reclaim_worker+0x15/0x1e [xfs]
> [ 100.436010] [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
> [ 100.436010] [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
> [ 100.436010] [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
> [ 100.436010] [<ffffffff81050641>] ? kthread+0x81/0x89
> [ 100.436010] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
> [ 100.436010] [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
> [ 100.436010] [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d

Also stuck on a spin lock, but again it is not obvious what function
it is in and hence what spinlock is affected. xfs_iflush_abort()
does take the AIL lock, so that might be it.

> [ 100.628005] BUG: soft lockup - CPU#6 stuck for 22s! [script:3151]
> [ 100.628005] RIP: 0010:[<ffffffff81081463>] [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
> [ 100.628005] Call Trace:
> [ 100.628005] [<ffffffffa03abe25>] ? xfs_ail_push_all+0x13/0x4f [xfs]
> [ 100.628005] [<ffffffffa03673bf>] ? xfs_reclaim_inodes_nr+0x1a/0x34 [xfs]
> [ 100.628005] [<ffffffff8110875e>] ? super_cache_scan+0x121/0x13e
> [ 100.628005] [<ffffffff810cdb7a>] ? shrink_slab+0x1e3/0x2f9
> [ 100.628005] [<ffffffff81119526>] ? iput+0x34/0x13d
> [ 100.628005] [<ffffffff81148d14>] ? do_coredump+0xbc3/0xbc3
> [ 100.628005] [<ffffffff81148e3f>] ? drop_caches_sysctl_handler+0x65/0x76
> [ 100.628005] [<ffffffff81157b7c>] ? proc_sys_call_handler+0x98/0xbf
> [ 100.628005] [<ffffffff81105eca>] ? vfs_write+0x9e/0x104
> [ 100.628005] [<ffffffff811061c1>] ? SyS_write+0x51/0x85
> [ 100.628005] [<ffffffff8137f592>] ? system_call_fastpath+0x16/0x1b

That can only be stuck on the AIL spin lock.

So, I've just audited all the uses of the AIL lock, and I cannot
find an unbalanced user of the AIL lock. If we've leaked the spin
lock, it's not an obvious or easy to trigger bug. Can you turn on
lockdep as well as CONFIG_XFS_DEBUG and see what warnings that
throws?
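
For reference, the options I mean are roughly these (a sketch; option
names as in mainline Kconfig):

	CONFIG_FRAME_POINTER=y
	CONFIG_PROVE_LOCKING=y	# lockdep
	CONFIG_DEBUG_SPINLOCK=y
	CONFIG_XFS_DEBUG=y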

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-10-19 11:00:08

by Akira Hayakawa

[permalink] [raw]
Subject: Re: A review of dm-writeboost

Dave,

# -EIO returned corrupts XFS
I turned on lockdep, frame pointers, and XFS debug,
and also moved from 3.12.0-rc1 to 3.12.0-rc5.

With these changes, the problem we discussed in the
previous mails *never* reproduces.

However, if I turn off only lockdep,
it hangs up when I set blockup to 1 and then back to 0.
An underlying device that once went dead and then revives
may confuse the filesystem, but
this does not look related to locking.
This problem may be reproducible using dm-flakey.

This behavior is not reproducible with ext4.

--------------------------------------------
root@Lyle:/home/akira# virsh console Hercules
Connected to domain Hercules
Escape character is ^]
[ 143.042980] device-mapper: writeboost: info@modulator_proc() reactivated after blockup
[ 143.042999] device-mapper: writeboost: info@dm_safe_io_internal() reactivated after blockup
[ 143.043006] device-mapper: writeboost: info@migrate_proc() reactivated after blockup
[ 143.045328] XFS: metadata I/O error: block 0x300d0f ("xlog_iodone") error 5 numblks 64
[ 143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa0430f9c
[ 143.045335] XFS: Log I/O Error Detected. Shutting down filesystem
[ 143.045335] XFS: Please umount the filesystem and rectify the problem(s)
[ 143.045338] XFS: Assertion failed: atomic_read(&iclog->ic_refcnt) == 0, file: fs/xfs/xfs_log.c, line: 2751
[ 143.045392] ------------[ cut here ]------------
[ 143.045393] kernel BUG at fs/xfs/xfs_message.c:108!
[ 143.045396] invalid opcode: 0000 [#1] SMP
[ 143.045429] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm joydev hid_generic usbhid processor snd_page_alloc i2c_piix4 i2c_core snd_timer snd soundcore psmouse serio_raw evdev virtio_balloon pcspkr hid thermal_sys microcode button ext4 crc16 jbd2 mbcache dm_mod sg sr_mod cdrom ata_generic virtio_blk virtio_net floppy uhci_hcd ehci_hcd ata_piix usbcore libata usb_common scsi_mod virtio_pci virtio_ring virtio
[ 143.045433] CPU: 2 PID: 216 Comm: kworker/2:1H Tainted: G O 3.12.0-rc5 #11
[ 143.045434] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 143.045465] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[ 143.045466] task: ffff88002fe7e0c0 ti: ffff88002ff66000 task.ti: ffff88002ff66000
[ 143.045493] RIP: 0010:[<ffffffffa03e72f3>] [<ffffffffa03e72f3>] assfail+0x1d/0x1f [xfs]
[ 143.045494] RSP: 0000:ffff88002ff67d98 EFLAGS: 00010292
[ 143.045495] RAX: 000000000000005e RBX: ffff88020de3c280 RCX: 0000000000000c8a
[ 143.045496] RDX: 0000000000000000 RSI: ffffffff819c3f2c RDI: ffffffff818356e0
[ 143.045497] RBP: ffff88002ff67d98 R08: 0000000000000000 R09: ffffffff819c3ecc
[ 143.045498] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801fa583d28
[ 143.045499] R13: ffff8801fa583c00 R14: 0000000000000002 R15: 0000000000000000
[ 143.045501] FS: 0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
[ 143.045502] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 143.045505] CR2: 00007f6e7680db67 CR3: 000000000180b000 CR4: 00000000000006e0
[ 143.045511] Stack:
[ 143.045514] ffff88002ff67dc8 ffffffffa0430ed1 ffff8801fb124e40 0000000000000002
[ 143.045517] ffff88020de3c280 ffffe8ffffc80500 ffff88002ff67df8 ffffffffa0431018
[ 143.045519] ffff8801fb124ee0 ffff8801fb124ee0 ffff8801fb124e40 ffff88021fc92a00
[ 143.045520] Call Trace:
[ 143.045558] [<ffffffffa0430ed1>] xlog_state_done_syncing+0x6d/0xe4 [xfs]
[ 143.045590] [<ffffffffa0431018>] xlog_iodone+0xd0/0xd9 [xfs]
[ 143.045609] [<ffffffffa03d834f>] xfs_buf_iodone_work+0x57/0xa2 [xfs]
[ 143.045627] [<ffffffff8104f21a>] process_one_work+0x18b/0x297
[ 143.045631] [<ffffffff8104f6e6>] worker_thread+0x12e/0x1fb
[ 143.045634] [<ffffffff8104f5b8>] ? rescuer_thread+0x268/0x268
[ 143.045638] [<ffffffff81054507>] kthread+0x88/0x90
[ 143.045641] [<ffffffff8105447f>] ? __kthread_parkme+0x60/0x60
[ 143.045646] [<ffffffff813a707c>] ret_from_fork+0x7c/0xb0
[ 143.045649] [<ffffffff8105447f>] ? __kthread_parkme+0x60/0x60
[ 143.045670] Code: 48 c7 c7 6b 26 45 a0 e8 a4 2a c5 e0 5d c3 55 48 89 f1 41 89 d0 48 c7 c6 80 26 45 a0 48 89 fa 31 c0 48 89 e5 31 ff e8 aa fc ff ff <0f> 0b 55 48 63 f6 49 89 f9 41 b8 01 00 00 00 b9 10 00 00 00 ba
[ 143.045694] RIP [<ffffffffa03e72f3>] assfail+0x1d/0x1f [xfs]
[ 143.045695] RSP <ffff88002ff67d98>
[ 143.045697] ---[ end trace e4f1a3c306353178 ]---
[ 143.045741] BUG: unable to handle kernel paging request at ffffffffffffffd8
[ 143.045745] IP: [<ffffffff810548fd>] kthread_data+0xb/0x11
[ 143.045747] PGD 180c067 PUD 180e067 PMD 0
[ 143.045749] Oops: 0000 [#2] SMP
[ 143.045774] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm joydev hid_generic usbhid processor snd_page_alloc i2c_piix4 i2c_core snd_timer snd soundcore psmouse serio_raw evdev virtio_balloon pcspkr hid thermal_sys microcode button ext4 crc16 jbd2 mbcache dm_mod sg sr_mod cdrom ata_generic virtio_blk virtio_net floppy uhci_hcd ehci_hcd ata_piix usbcore libata usb_common scsi_mod virtio_pci virtio_ring virtio
[ 143.045777] CPU: 2 PID: 216 Comm: kworker/2:1H Tainted: G D O 3.12.0-rc5 #11
[ 143.045778] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 143.045795] task: ffff88002fe7e0c0 ti: ffff88002ff66000 task.ti: ffff88002ff66000
[ 143.045798] RIP: 0010:[<ffffffff810548fd>] [<ffffffff810548fd>] kthread_data+0xb/0x11
[ 143.045802] RSP: 0000:ffff88002ff67a38 EFLAGS: 00010002
[ 143.045804] RAX: 0000000000000000 RBX: ffff88021fc92e80 RCX: ffff88021fc92ef0
[ 143.045805] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88002fe7e0c0
[ 143.045805] RBP: ffff88002ff67a38 R08: ffffffff819b6640 R09: 000000000000bfcf
[ 143.045806] R10: dead000000200200 R11: 0000000000000000 R12: 0000000000000002
[ 143.045807] R13: 0000000000000001 R14: 0000000000000002 R15: ffff88002fe7e0b0
[ 143.045809] FS: 0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
[ 143.045810] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 143.045813] CR2: 0000000000000028 CR3: 000000000180b000 CR4: 00000000000006e0
[ 143.045818] Stack:
[ 143.045821] ffff88002ff67a58 ffffffff8104faa5 ffff88021fc92e80 ffff88002fe7e408
[ 143.045824] ffff88002ff67ad8 ffffffff813a0860 0000000000000000 ffff88002f53b6e0
[ 143.045826] ffff88002f53b6f0 ffff88002fe7e0c0 ffff88002ff67fd8 ffff88002ff67fd8
[ 143.045827] Call Trace:
[ 143.045831] [<ffffffff8104faa5>] wq_worker_sleeping+0xf/0x87
[ 143.045836] [<ffffffff813a0860>] __schedule+0x142/0x548
[ 143.045839] [<ffffffff813a0f4d>] schedule+0x60/0x62
[ 143.045842] [<ffffffff8103c014>] do_exit+0x8f8/0x90f
[ 143.045846] [<ffffffff8139b740>] ? printk+0x48/0x4a
[ 143.045848] [<ffffffff813a2b8d>] oops_end+0xa9/0xb1
[ 143.045854] [<ffffffff81004e5c>] die+0x55/0x61
[ 143.045858] [<ffffffff813a25e0>] do_trap+0x69/0x135
[ 143.045861] [<ffffffff813a4ec8>] ? __atomic_notifier_call_chain+0xd/0xf
[ 143.045865] [<ffffffff81002ae1>] do_invalid_op+0x93/0x9c
[ 143.045889] [<ffffffffa03e72f3>] ? assfail+0x1d/0x1f [xfs]
[ 143.045893] [<ffffffff8139b740>] ? printk+0x48/0x4a
[ 143.045896] [<ffffffff813a8608>] invalid_op+0x18/0x20
[ 143.045919] [<ffffffffa03e72f3>] ? assfail+0x1d/0x1f [xfs]
[ 143.045940] [<ffffffffa03e72f3>] ? assfail+0x1d/0x1f [xfs]
[ 143.045973] [<ffffffffa0430ed1>] xlog_state_done_syncing+0x6d/0xe4 [xfs]
[ 143.046003] [<ffffffffa0431018>] xlog_iodone+0xd0/0xd9 [xfs]
[ 143.046023] [<ffffffffa03d834f>] xfs_buf_iodone_work+0x57/0xa2 [xfs]
[ 143.046027] [<ffffffff8104f21a>] process_one_work+0x18b/0x297
[ 143.046030] [<ffffffff8104f6e6>] worker_thread+0x12e/0x1fb
[ 143.046033] [<ffffffff8104f5b8>] ? rescuer_thread+0x268/0x268
[ 143.046036] [<ffffffff81054507>] kthread+0x88/0x90
[ 143.046039] [<ffffffff8105447f>] ? __kthread_parkme+0x60/0x60
[ 143.046042] [<ffffffff813a707c>] ret_from_fork+0x7c/0xb0
[ 143.046045] [<ffffffff8105447f>] ? __kthread_parkme+0x60/0x60
[ 143.046066] Code: 65 48 8b 04 25 80 b9 00 00 48 8b 80 f0 02 00 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 48 8b 87 f0 02 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 55 ba 08 00 00 00 48 89 e5 48 83 ec 10 48 8b
[ 143.046069] RIP [<ffffffff810548fd>] kthread_data+0xb/0x11
[ 143.046069] RSP <ffff88002ff67a38>
[ 143.046070] CR2: ffffffffffffffd8
[ 143.046071] ---[ end trace e4f1a3c306353179 ]---
[ 143.046072] Fixing recursive fault but reboot is needed!
--------------------------------------------



# Design for failure
> Yes. EIO means an IO error has occurred. That causes failure paths
> to be triggered in the upper layers.
>
> I really don't understand what you are trying to achieve with this
> "blockup" thing. If something goes wrong with the device, then you
> *cannot recover* by sending EIO to any new IOs and then continuing
> on at a later time as though nothing has happened. The moment a
> filesystem gets EIO from a metadata write, it is likely to be
> corrupted and if you continue onwards after that you simply
> propagate the corruption.
>
>> As Mikulas said, connection failure is often the cause of
>> I/O errors from the underlying devices.
>
> Connection failure is *rarely* the cause of IO errors, except in
> environments where SANs are in use. Even then multipathing makes
> fatal connection failure a rare occurrence. Broken hardware is a
> much more common cause of problems at the storage layers.
Yes. Connection errors happen in SANs due to loose cabling;
multipathing may resolve that problem, and
hardware problems can also be resolved by RAID.

My idea:
- If something problematic happens in the underlying devices,
the writeboost device returns -EIO on any requests and
stops all the daemons.

- I will not remove `blockup`.
Yes, if the problem is in hardware it is very difficult
to recover.
However, leaving a chance of recovering the device is
at least better than just shutting the device down,
as long as it doesn't pollute the code too much.
So, I will leave this option.

- As Mikulas pointed out, the current design, which asks the admin
to turn off the `blockup` variable once the system has been checked,
is not reliable,
since the admin may not be able to log in to the system due
to memory starvation.
A more sophisticated way to recover the device would be
something like autonomously checking whether the underlying
devices have come back. But I think this is
too much, for the reasons listed below:
- redundancy techniques are already sophisticated, and -EIO returned
means something really serious that definitely *cannot* be recovered from.
- The example of the admin being unable to log in is extreme.
The admin should be logged in beforehand anyway.

BTW, what do you mean by the word "fatal"?
> Simple rule: Don't complete IOs with EIO if you haven't had a fatal
> IO error.

There are two choices for writeboost on I/O failure:
1) Return -EIO.
2) Just block all the incoming processes.
writeboost doesn't corrupt data under either of these two
ways of handling the error. Maybe this is because of the
nature of log-structured caching.
My question is: which is the better behavior of an
underlying device for the upper-layer software?
The pro of 2) is that the upper layer may carry on its operation
without restarting from a shutdown.
The con of 2) is that blocking the incoming processes means all the
write barrier requests sent are not completed for a long time,
and the upper layer can't tell whether the device is dead or not.
I guess you would like to simply propagate the error to the parent.
But I want to find the better way for the upper layer,
particularly for the filesystems.
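
For illustration, the two options in a dm target's map function would
look roughly like this (a sketch only; the blockup and blockup_waitq
fields and backing_dev name are placeholders, and bio_endio() is the
3.12-era two-argument form):

	#include <linux/device-mapper.h>

	static int wb_map_sketch(struct dm_target *ti, struct bio *bio)
	{
		struct wb_device *wb = ti->private;

		if (atomic_read(&wb->blockup)) {
			/* 1) fail fast: the upper layer sees a fatal
			 * error and a journalling fs shuts down */
			bio_endio(bio, -EIO);
			return DM_MAPIO_SUBMITTED;

			/* 2) would instead block the submitter:
			 *
			 *	wait_event(wb->blockup_waitq,
			 *		   !atomic_read(&wb->blockup));
			 *
			 * which stalls FLUSH/FUA bios too, so the
			 * upper layer cannot tell whether the device
			 * is dead or merely blocked.
			 */
		}
		bio->bi_bdev = wb->backing_dev->bdev;	/* pass through */
		return DM_MAPIO_REMAPPED;
	}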

# Misc
> Ugh. You should be using workqueue with timed work for this.
How is that better? I don't see what the workqueue
implementation would gain.
All I want to do is repeat at every interval,
and as far as I know the current code is the simplest.

> BTW, you're missing the handling needed by these kernel threads
> for suspend-to-disk/ram....
I haven't implemented suspend/resume yet.
Maybe I should stop all the daemons on suspend, so that no
further code is needed in the kernel threads.
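
Or, if the kthreads stay, I guess the usual freezer pattern is enough
(a sketch, untested; do_one_pass() is a placeholder):

	#include <linux/freezer.h>
	#include <linux/kthread.h>

	static int daemon_proc(void *data)
	{
		set_freezable();	/* opt in to freezing on suspend */

		while (!kthread_should_stop()) {
			try_to_freeze();	/* parks here across suspend */

			do_one_pass(data);	/* placeholder for real work */

			schedule_timeout_interruptible(
				msecs_to_jiffies(1000));
		}
		return 0;
	}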

Akira

2013-10-21 01:31:54

by Dave Chinner

[permalink] [raw]
Subject: Re: A review of dm-writeboost

On Sat, Oct 19, 2013 at 07:59:55PM +0900, Akira Hayakawa wrote:
> Dave,
>
> # -EIO returned corrupts XFS
> I turned on lockdep, frame pointers, and XFS debug,
> and also moved from 3.12.0-rc1 to 3.12.0-rc5.
>
> With these changes, the problem we discussed in the
> previous mails *never* reproduces.
>
> However, if I turn off only lockdep,
> it hangs up when I set blockup to 1 and then back to 0.

Which indicates that the corruption is likely to be caused by a race
condition, and that adding lockdep slows things down enough to
change the timing and hence not hit the race condition...

e.g. the use after free that causes the memory corruption now occurs
after the critical point where XFS can get stuck on it.

> An underlying device that once went dead and then revives
> may confuse the filesystem, but
> this does not look related to locking.
> This problem may be reproducible using dm-flakey.

I use dm-flakey quite a bit, and I haven't seen this....

> This behavior is not reproducible with ext4.
>
> --------------------------------------------
> root@Lyle:/home/akira# virsh console Hercules
> Connected to domain Hercules
> Escape character is ^]
> [ 143.042980] device-mapper: writeboost: info@modulator_proc() reactivated after blockup
> [ 143.042999] device-mapper: writeboost: info@dm_safe_io_internal() reactivated after blockup
> [ 143.043006] device-mapper: writeboost: info@migrate_proc() reactivated after blockup
> [ 143.045328] XFS: metadata I/O error: block 0x300d0f ("xlog_iodone") error 5 numblks 64
> [ 143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa0430f9c
> [ 143.045335] XFS: Log I/O Error Detected. Shutting down filesystem
> [ 143.045335] XFS: Please umount the filesystem and rectify the problem(s)
> [ 143.045338] XFS: Assertion failed: atomic_read(&iclog->ic_refcnt) == 0, file: fs/xfs/xfs_log.c, line: 2751
.....
> [ 143.045465] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
.....
> [ 143.045558] [<ffffffffa0430ed1>] xlog_state_done_syncing+0x6d/0xe4 [xfs]
> [ 143.045590] [<ffffffffa0431018>] xlog_iodone+0xd0/0xd9 [xfs]
> [ 143.045609] [<ffffffffa03d834f>] xfs_buf_iodone_work+0x57/0xa2 [xfs]
> [ 143.045627] [<ffffffff8104f21a>] process_one_work+0x18b/0x297
> [ 143.045631] [<ffffffff8104f6e6>] worker_thread+0x12e/0x1fb
> [ 143.045634] [<ffffffff8104f5b8>] ? rescuer_thread+0x268/0x268
> [ 143.045638] [<ffffffff81054507>] kthread+0x88/0x90
> [ 143.045641] [<ffffffff8105447f>] ? __kthread_parkme+0x60/0x60

So that has got through xlog_iodone() and has found that the
reference count of the iclog was not zero when it went to run the
log IO completion callbacks.

We asserted that the reference count was zero before issuing the IO,
and we only ever increment the reference count when the iclog is in
an active state. We cannot increment the reference count while the
log is under IO because the state of the iclog is "syncing", not
"active".

Once the log is shut down, the iclog state goes to
XLOG_STATE_IOERROR and never goes out of that state. The assert just
prior to the one that tripped above indicates that we are either in
an ACTIVE state or IOERROR:

	ASSERT(iclog->ic_state == XLOG_STATE_SYNCING ||
	       iclog->ic_state == XLOG_STATE_IOERROR);
>>>>>>	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);

It was the second assert that failed, and hence it's clear that
in either state the reference count cannot be incremented until the
IO is fully completed and it transitioned back the active state.

> [ 143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa0430f9c

This indicates that the shutdown was called from xlog_iodone() as a
result of an IO error on the iclog buffer, and the call to
xlog_state_done_syncing() happens 5 lines of code later, which
immediately assert fails with a corrupt iclog state.

Because we shutdown with SHUTDOWN_LOG_IO_ERROR set, we call into
xfs_log_force_umount() with logerror = true. This first call ends up
setting the log flag XLOG_IO_ERROR, then calling
xlog_state_ioerror() which sets ic_state on all iclogs to
XLOG_STATE_IOERROR.

The shutdown then runs xlog_state_do_callback() which aborts the
completions on all the iclogs that have callbacks attached, and
wakes anyone waiting on log space or log forces so they can get
errors returned to them.

We now return to xlog_iodone() in a shutdown state, with all the
iclog buffers in with ic_state = XLOG_STATE_IOERROR. So, there is
absolutely no opportunity for anyone to take a new reference to the
iclog in the 10 *microseconds* between the IO error being detected,
the shutdown being processed and the call to
xlog_state_done_syncing() where the assert fails. At this point, I
cannot see how this can occur in the XFS code.

Indeed, I can trigger this error path easily under heavy load just
using the godown utility in xfstests:

[1044535.986040] XFS (vdc): Mounting Filesystem
[1044536.040630] XFS (vdc): Ending clean mount
[1044536.059299] XFS: Clearing xfsstats
# src/xfstests-dev/src/godown -v /mnt/scratch
Opening "/mnt/scratch"
Calling XFS_IOC_GOINGDOWN
[1044553.342767] XFS (vdc): metadata I/O error: block 0x19000c2e98 ("xlog_iodone") error 5 numblks 512
[1044553.345993] XFS (vdc): xfs_do_force_shutdown(0x2) called from line 1171 of file fs/xfs/xfs_log.c. Return address = 0xffffffff814e61ad
#[1044554.136965] XFS (vdc): xfs_log_force: error 5 returned.
[1044554.154892] XFS: Clearing xfsstats
[1044566.108484] XFS (vdc): xfs_log_force: error 5 returned.
[1044596.188515] XFS (vdc): xfs_log_force: error 5 returned.
[1044626.268166] XFS (vdc): xfs_log_force: error 5 returned.
[1044656.348146] XFS (vdc): xfs_log_force: error 5 returned.

IOWs, this looks like something is corrupting the state of the log
*before* the xlog_iodone() callback is being run. i.e. it is in a
valid state before IO dispatch and it's in a corrupt state on IO
completion despite XFS having *never touched that state* during the
IO.

Indeed, this is a different issue to the one you posted
earlier, which was the AIL lock (which is in a different XFS
structure) had not been released. Again, I couldn't see how that
could occur, and we're now seeing a pattern of random structures
being corrupted. Hmmm, looking at pahole:

	atomic_t		ic_refcnt;	/* 192	4 */

	spinlock_t		xa_lock;	/* 64	2 */

Both the items that were corrupted are on the first word of a
cacheline. That further points to memory corruption of some kind...

IOWs, the more I look, the more this looks like memory corruption is
the cause of the problems. The only unusual thing that is happening
is an error handling path in a brand new block device driver is
being run immediately before the memory corruption causes problems,
and that tends to indicate that this corruption is not caused by
XFS...

> My idea:
> - If something problematic happens in the underlying devices,
> the writeboost device returns -EIO on any requests and
> stops all the daemons.
....
> - I will not remove `blockup`.
> Yes, if the problem is in hardware it is very difficult
> to recover.
> However, leaving a chance of recovering the device is
> at least better than just shutting the device down,
> as long as it doesn't pollute the code too much.
> So, I will leave this option.

It doesn't matter whether the underlying hardware might be able to
recover - once you've sent EIOs during IO completion, that
data/metadata is considered *unrecoverable* and so the filesystem
will end up inconsistent or the *user will lose data* as a result.

IOWs, your idea is flawed, will not work and will result in
data/filesystem loss when used. You cannot call this a "recovery
mechanism" because it simply isn't.

> BTW, what do you mean by the word "fatal"?

"He suffered a fatal gunshot wound to the head."

i.e. the person is *dead* and life is *unrecoverable*.

That's what you are doing here - returning EIOs to IOs in progress
is effectively shooting the filesystem (and user data) in the
head.....

Cheers,

Dave.

--
Dave Chinner
[email protected]