LinuxLists.cc - Help on Implementation of EXT3 type Ordered Mode in EXT4

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

Hi,

> I recently found that in EXT4 with delayed block the Ordered mode does not
> bahave same as in EXT3.
> I found a patch for this at http://lwn.net/Articles/324023/, but it has some
> journal block estimation problem resulting into deadlock.
>
> I would like to know if it has been solved.
> If not, is it possible to solve it? What are the complexities involved?
It has not been solved. The problem is that to commit data on transaction
commit (which is what data=ordered mode has historically done), you have to
allocate space for these blocks. But that allocation needs to modify a
filesystem and thus journal more blocks... And that is tricky - we would have
to reserve space in the current transaction for allocation of delayed data. So
it gets a bit messy...
Why exactly do you need the old data=ordered guarantees?

Honza

--
Jan Kara <[email protected]>
SuSE CR Labs

2010-02-09 17:41:51

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Tue, Feb 09, 2010 at 05:05:22PM +0100, Jan Kara wrote:
> Hi,
>
> > I recently found that in EXT4 with delayed block the Ordered mode does not
> > bahave same as in EXT3.
> > I found a patch for this at http://lwn.net/Articles/324023/, but it has some
> > journal block estimation problem resulting into deadlock.
> >
> > I would like to know if it has been solved.
> > If not, is it possible to solve it? What are the complexities involved?
>
> It has not been solved. The problem is that to commit data on
> transaction commit (which is what data=ordered mode has historically
> done), you have to allocate space for these blocks. But that
> allocation needs to modify a filesystem and thus journal more
> blocks... And that is tricky - we would have to reserve space in the
> current transaction for allocation of delayed data. So it gets a
> bit messy...

The dioread_nolock patches from Jiaying, which are currently in the
unstable portion of the tree, is a partial solution to the
data=ordered problem, although it solves it in a slightly different
way.

As a side effect of trying to avoid locking on the direct I/O read
path, on the buffered I/O write path it changes things so the extent
tree is first changed so the blocks are allocated with the "extent
uninitialized" bit, and then only after the blocks hit the disk, via
the bh completion callback, do we set the extent so that it is marked
as containing initialized data.

As a result, if you crash before the extent tree is updated, when you
read from the file, you will get all zero's, instead of the data, thus
preventing the security leak.

It does mean that fsync() is slightly slower, since we now have to
flush the data blocks out, wait for the completion handler to fire and
update the extent in the same jbd2 transaction, and only then wait for
the barrier in the jbd2 transaction. (And in fact, I'm not sure
fsync() is completely working correctly in the current patch in the
unstable patch stream, and there aren't race conditions where the
extent tree update slips into the next transaction.) But it does
solve the problem.

The other downside with this solution is that it only works for files
that are extent-mapped, and if you do this with a converted ext3 file
system, and there are files that are still mapped using
direct/indirect blocks, when you change the mount option to be
data=writeback,dioread_nolock, the block allocating writes to these
legacy files could result in data getting exposed after a crash.

Depending on the workload the upside is that by using data=writeback
instead of data=ordered could far outweigh the downside of needing to
do an extra block I/O queue flush before the fsync, since it reduces
the number of entangled writes to only the metadata blocks, where
previously the entagled write problem affected metadata blocks plus
all freshly allocated blocks.

Kalias, this is something that I plan to look in the near future; if
you are interested in helping to benchmark and characterize this
solution, I'd be very interested in working with you. Can you tell me
a little more about your use case and requirements?

- Ted

2010-02-11 07:32:16

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On 11 February 2010 12:31, Kailas Joshi <[email protected]> wrote:
>
> On 9 February 2010 23:11, <[email protected]> wrote:
>>
>> On Tue, Feb 09, 2010 at 05:05:22PM +0100, Jan Kara wrote:
>> > ? Hi,
>> >
>> > > I recently found that in EXT4 with delayed block the Ordered mode does not
>> > > bahave same as in EXT3.
>> > > I found a patch for this at http://lwn.net/Articles/324023/, but it has some
>> > > journal block estimation problem resulting into deadlock.
>> > >
>> > > I would like to know if it has been solved.
>> > > If not, is it possible to solve it? What are the complexities involved?
>> >
>> > It has not been solved. The problem is that to commit data on
>> > transaction commit (which is what data=ordered mode has historically
>> > done), you have to allocate space for these blocks. But that
>> > allocation needs to modify a filesystem and thus journal more
>> > blocks... And that is tricky - we would have to reserve space in the
>> > current transaction for allocation of delayed data. ?So it gets a
>> > bit messy...
>>
>> The dioread_nolock patches from Jiaying, which are currently in the
>> unstable portion of the tree, is a partial solution to the
>> data=ordered problem, although it solves it in a slightly different
>> way.
>>
>> As a side effect of trying to avoid locking on the direct I/O read
>> path, on the buffered I/O write path it changes things so the extent
>> tree is first changed so the blocks are allocated with the "extent
>> uninitialized" bit, and then only after the blocks hit the disk, via
>> the bh completion callback, do we set the extent so that it is marked
>> as containing initialized data.
>>
>> As a result, if you crash before the extent tree is updated, when you
>> read from the file, you will get all zero's, instead of the data, thus
>> preventing the security leak.
>>
>> It does mean that fsync() is slightly slower, since we now have to
>> flush the data blocks out, wait for the completion handler to fire and
>> update the extent in the same jbd2 transaction, and only then wait for
>> the barrier in the jbd2 transaction. ?(And in fact, I'm not sure
>> fsync() is completely working correctly in the current patch in the
>> unstable patch stream, and there aren't race conditions where the
>> extent tree update slips into the next transaction.) ?But it does
>> solve the problem.
>>
>> The other downside with this solution is that it only works for files
>> that are extent-mapped, and if you do this with a converted ext3 file
>> system, and there are files that are still mapped using
>> direct/indirect blocks, when you change the mount option to be
>> data=writeback,dioread_nolock, the block allocating writes to these
>> legacy files could result in data getting exposed after a crash.
>>
>> Depending on the workload the upside is that by using data=writeback
>> instead of data=ordered could far outweigh the downside of needing to
>> do an extra block I/O queue flush before the fsync, since it reduces
>> the number of entangled writes to only the metadata blocks, where
>> previously the entagled write problem affected metadata blocks plus
>> all freshly allocated blocks.
>>
>> Kalias, this is something that I plan to look in the near future; if
>> you are interested in helping to benchmark and characterize this
>> solution, I'd be very interested in working with you. ?Can you tell me
>> a little more about your use case and requirements?
>>
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?- Ted
>

Jan and Ted, thank you very much for detailed replies.

We are assessing the use of copy-on-write technique to provide data
level consistency in EXT3/EXT4. We have implemented this in EXT3 by
using the Ordered mode of operation. Benchmark results for IOZone and
Postmark are quiet good. We could get the consistency equivalent to
Journal mode with the overhead almost same as Ordered mode. However,
there are few cases(for example, file rewrite) where performance of
Journal mode is better than our technique. We think that in EXT4, with
the support for delayed block allocation and extents, these problems
can be removed.

However, Ordered mode with delayed block allocation in EXT4 does not
behave in the same way as in EXT3. It does not flush 'all' dirty
blocks to the disk as in EXT3. For implementing our technique in EXT4,
we need EXT3 style Ordered mode, that is
alloc_on_commit(http://lwn.net/Articles/324023/).

I understand that this is not required in EXT4 since the Ordered mode
is provided for security and not consistency. However, from the
discussions on blogs/post, it seems that developers expect Ordered
mode to provide (limited) data consistency as well.

Since the implementation of our technique heavily depends on EXT3
style Ordered mode, I would like to implement alloc_on_commit on EXT4.
I have designed following strategy to address credit reservation
problem in earlier patch. Please let me know your comments on it.

1. In Write path, the call to journal_start() for updating metadata
will reserve credits for delayed allocation also.
2. If the fs is mounted with alloc_on_commit, journal_stop() will not
return remaining credits to the journal (t_outstanding_credits will
not be changed).
3. In journal_commit() -
i. After LOCKing the current transaction, a new special handle will be
created by calling journal_start() with zero credits . Such a call to
journal_start() can be treated as a special case for creating handle
to use accumulated credits (in t_outstanding_credits) of currently
locked transaction.
ii. Before changing transaction state to FLUSH, callback will be used
to perform delayed block allocation for all inodes. This mechanism
will be same as in alloc_on_commit at http://lwn.net/Articles/324023/
, but it will be performed after changing the transaction to LOCKED
state. In the callback, specially created handle will passed to the
callback function and it will use that handle for performing delayed
block allocation.
iii. The special handle will be closed, outstanding credits for
transaction will be zeroed and the transaction flush will continue.

Regarding dioread_nolock work:
Ted, I am new in filesystem development. If this is fine and your
deadlines are not very critical, I will be very happy to work with you
on dioread_nolock even though its not directly related to our current
work. Please let me know more on this.

Thanks & Regards,
Kailas

2010-02-11 19:56:26

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Thu, Feb 11, 2010 at 01:02:15PM +0530, Kailas Joshi wrote:
>
> We are assessing the use of copy-on-write technique to provide data
> level consistency in EXT3/EXT4. We have implemented this in EXT3 by
> using the Ordered mode of operation. Benchmark results for IOZone and
> Postmark are quiet good. We could get the consistency equivalent to
> Journal mode with the overhead almost same as Ordered mode. However,
> there are few cases(for example, file rewrite) where performance of
> Journal mode is better than our technique. We think that in EXT4, with
> the support for delayed block allocation and extents, these problems
> can be removed.

Ah, sorry, I misread your initial post; I thouht you were trying to
reimplement the proposed ext4 mode data=guarded.

I've mostly given up on trying to get alloc_on_commit work, for two
reasons.

The first is that one of the reasons why you might be closing the
transaction is if there's not enough space left in the journal. But
if we you going to a large number of data allocations at commit time,
there's no guaratee that there will be space in the journal for all of
the metadata blocks that might have to be modified in order to make
the block allocations.

The second problem with this scheme is a performance problem; while
you are doing handling delayed allocation blocks, you have to do this
while the journal is still locked, using magic handles that are
allowed to be created while the journal is locked. That adds all
sorts of complexity, and that seems to what you are thinking about
doing. The problem though is that while this is going on, all other
file system activity has to be blocked. So this will cause all sorts
of processes to become suspended waiting for the all of the allocation
activity to complete, which may require bitmap allocation blocks to be
read into disk, etc.

The trade off for all of these problems is that it allows you to delay
the block allocation for only 5 seconds. The question is, is this
worth it, compared with simply mounting the file system with
nodelalloc? It may be all of this complexity doesn't produce enough
of a performance gain over simply using nodelalloc.

So maybe the solution for certain distributions that are catering to
the "inexperienced user" / "users who like to use unstable video
drivers" market is to mount with nodelalloc by default, and tell them
that if they want the performance improvements of delayed allocation,
they need to lobby to get the applications fixed.

(After all, these problems are going to be around no matter whether
people use XFS or btrfs; most modern file systems are going to use
delayed allocation, so sooner or later the broken applications really
need to get fixed. The defiant user's cry, "well, if you don't fix
this I'll switch to xfs/btrfs!" isn't going to help in this case....)

- Ted

2010-02-12 03:22:15

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On 12 February 2010 01:26, <[email protected]> wrote:
> I've mostly given up on trying to get alloc_on_commit work, for two
> reasons.
>
> The first is that one of the reasons why you might be closing the
> transaction is if there's not enough space left in the journal. ?But
> if we you going to a large number of data allocations at commit time,
> there's no guaratee that there will be space in the journal for all of
> the metadata blocks that might have to be modified in order to make
> the block allocations.
Won't this get fixed by performing early reservations as mentioned in
my scheme? We are reserving required credits in the path of write
system call and these will be kept reserved until transaction commit.
So, the journal space for allocation at commit will be guaranteed.

>
> The second problem with this scheme is a performance problem; while
> you are doing handling delayed allocation blocks, you have to do this
> while the journal is still locked, using magic handles that are
> allowed to be created while the journal is locked. ?That adds all
> sorts of complexity, and that seems to what you are thinking about
> doing. ?The problem though is that while this is going on, all other
> file system activity has to be blocked. ?So this will cause all sorts
> of processes to become suspended waiting for the all of the allocation
> activity to complete, which may require bitmap allocation blocks to be
> read into disk, etc.
Sorry, I didn't understand why processes need to be suspended.
In my scheme, I am issuing magic handle only after locking the current
transaction. AFAIK after the transaction is locked, it can receive the
block journaling requests for already created handles(in our case, for
already reserved journal space), and the new concurrent requests for
journal_start() will go to the new current transaction. Since, the
credits for locked transaction are fixed (by means of early
reservations) we can know whether journal has enough space for the new
journal_start(). So, as long as journal has enough space available,
new processes need now be stalled.

Please correct me if this is wrong.

> The trade off for all of these problems is that it allows you to delay
> the block allocation for only 5 seconds. ?The question is, is this
> worth it, compared with simply mounting the file system with
> nodelalloc? ?It may be all of this complexity doesn't produce enough
> of a performance gain over simply using nodelalloc.
I agree. The performance gain might not be good enough compared to
'nodelalloc'. However, our goal is providing data consistency
equivalent to Journal mode at low cost. So, we are interested in
comparing performance of alloc_on_commit (and our technique) with
performance of Journal mode.

Thanks & Regards,
Kailas

2010-02-12 20:07:29

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Fri, Feb 12, 2010 at 08:52:15AM +0530, Kailas Joshi wrote:
> Won't this get fixed by performing early reservations as mentioned in
> my scheme? We are reserving required credits in the path of write
> system call and these will be kept reserved until transaction commit.
> So, the journal space for allocation at commit will be guaranteed.

Yes, if you account for these separately. One challenge is
over-estimating the needed credits will be tricky. If we go down this
path, be sure that the bonnie style write(fd, &ch, 1) in a tight loop
doesn't end up reserving a separate set of credits for each write
system call to the same block. (It can be done; if the DA block is
already instantiated, you can assume that credits have already been
reserved.)

> Sorry, I didn't understand why processes need to be suspended.
> In my scheme, I am issuing magic handle only after locking the current
> transaction. AFAIK after the transaction is locked, it can receive the
> block journaling requests for already created handles(in our case, for
> already reserved journal space), and the new concurrent requests for
> journal_start() will go to the new current transaction. Since, the
> credits for locked transaction are fixed (by means of early
> reservations) we can know whether journal has enough space for the new
> journal_start(). So, as long as journal has enough space available,
> new processes need now be stalled.

But while you are modifying blocks that need to go into the journal
via the locked (old) transaction, it's not safe to start a new
transaction and start issuing handles against the new transaction.

Just to give one example, suppose we need to update the extent
allocation tree for an inode in the locked/committing transaction as
the delayed allocation blocks are being resolved --- and in another
process, that inode is getting truncated or unlinked, which also needs
to modify the extent allocation tree? Hilarty ensues, unless you use
a block all attempts to create a new handle (practically speaking, by
blocking all attempts to start a new transaction), until this new
delayed allocation resolution phase which you have proposed is
complete.

- Ted

2010-02-13 08:43:18

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On 13 February 2010 01:37, <[email protected]> wrote:
> On Fri, Feb 12, 2010 at 08:52:15AM +0530, Kailas Joshi wrote:
>> Won't this get fixed by performing early reservations as mentioned in
>> my scheme? We are reserving required credits in the path of write
>> system call and these will be kept reserved until transaction commit.
>> So, the journal space for allocation at commit will be guaranteed.
>
> Yes, if you account for these separately. ?One challenge is
> over-estimating the needed credits will be tricky. ?If we go down this
> path, be sure that the bonnie style write(fd, &ch, 1) in a tight loop
> doesn't end up reserving a separate set of credits for each write
> system call to the same block. ?(It can be done; if the DA block is
> already instantiated, you can assume that credits have already been
> reserved.)
Okay

>> Sorry, I didn't understand why processes need to be suspended.
>> In my scheme, I am issuing magic handle only after locking the current
>> transaction. ?AFAIK after the transaction is locked, it can receive the
>> block journaling requests for already created handles(in our case, for
>> already reserved journal space), and the new concurrent requests for
>> journal_start() will go to the new current transaction. Since, the
>> credits for locked transaction are fixed (by means of early
>> reservations) we can know whether journal has enough space for the new
>> journal_start(). So, as long as journal has enough space available,
>> new processes need now be stalled.
>
> But while you are modifying blocks that need to go into the journal
> via the locked (old) transaction, it's not safe to start a new
> transaction and start issuing handles against the new transaction.
>
> Just to give one example, suppose we need to update the extent
> allocation tree for an inode in the locked/committing transaction as
> the delayed allocation blocks are being resolved --- and in another
> process, that inode is getting truncated or unlinked, which also needs
> to modify the extent allocation tree? ?Hilarty ensues, unless you use
> a block all attempts to create a new handle (practically speaking, by
> blocking all attempts to start a new transaction), until this new
> delayed allocation resolution phase which you have proposed is
> complete.
Okay. So, basically process stalling is unavoidable as we cannot
modify a buffer data in past transaction after it has been modified in
current transaction.
Can we restrict the scope for this blocking? Blocking on
journal_start() will block all processes even though they are
operating on mutually exclusive sets of metadata buffers. Can we
restrict this blocking to allocation/deallocation paths by blocking in
get_write_access() on specific cases(some condition on buffer)? This
way, since all files will use commit-time allocation, very few(sync
and direct-io mode) file operations will be stalled.

I am not sure whether this is feasible or not. Please let me know more on this.

Thanks & Regards,
Kailas

2010-02-15 15:00:13

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Sat 13-02-10 14:13:17, Kailas Joshi wrote:
> On 13 February 2010 01:37, <[email protected]> wrote:
> > On Fri, Feb 12, 2010 at 08:52:15AM +0530, Kailas Joshi wrote:
> >> Sorry, I didn't understand why processes need to be suspended.
> >> In my scheme, I am issuing magic handle only after locking the current
> >> transaction. ?AFAIK after the transaction is locked, it can receive the
> >> block journaling requests for already created handles(in our case, for
> >> already reserved journal space), and the new concurrent requests for
> >> journal_start() will go to the new current transaction. Since, the
> >> credits for locked transaction are fixed (by means of early
> >> reservations) we can know whether journal has enough space for the new
> >> journal_start(). So, as long as journal has enough space available,
> >> new processes need now be stalled.
> >
> > But while you are modifying blocks that need to go into the journal
> > via the locked (old) transaction, it's not safe to start a new
> > transaction and start issuing handles against the new transaction.
> >
> > Just to give one example, suppose we need to update the extent
> > allocation tree for an inode in the locked/committing transaction as
> > the delayed allocation blocks are being resolved --- and in another
> > process, that inode is getting truncated or unlinked, which also needs
> > to modify the extent allocation tree? ?Hilarty ensues, unless you use
> > a block all attempts to create a new handle (practically speaking, by
> > blocking all attempts to start a new transaction), until this new
> > delayed allocation resolution phase which you have proposed is
> > complete.
> Okay. So, basically process stalling is unavoidable as we cannot
> modify a buffer data in past transaction after it has been modified in
> current transaction.
> Can we restrict the scope for this blocking? Blocking on
> journal_start() will block all processes even though they are
> operating on mutually exclusive sets of metadata buffers. Can we
> restrict this blocking to allocation/deallocation paths by blocking in
> get_write_access() on specific cases(some condition on buffer)? This
> way, since all files will use commit-time allocation, very few(sync
> and direct-io mode) file operations will be stalled.
I doubt blocking at buffer-level would be enough. I think that the
journalling layer just does not have enough information for such decisions.
It could be feasible to block on per-inode basis but you'd still have to
give a good thought to modification of filesystem global structures like
bitmaps, superblock, or inode blocks.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-16 10:10:22

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On 15 February 2010 20:30, Jan Kara <[email protected]> wrote:
> On Sat 13-02-10 14:13:17, Kailas Joshi wrote:
>> On 13 February 2010 01:37, ?<[email protected]> wrote:
>> > On Fri, Feb 12, 2010 at 08:52:15AM +0530, Kailas Joshi wrote:
>> >> Sorry, I didn't understand why processes need to be suspended.
>> >> In my scheme, I am issuing magic handle only after locking the current
>> >> transaction. ?AFAIK after the transaction is locked, it can receive the
>> >> block journaling requests for already created handles(in our case, for
>> >> already reserved journal space), and the new concurrent requests for
>> >> journal_start() will go to the new current transaction. Since, the
>> >> credits for locked transaction are fixed (by means of early
>> >> reservations) we can know whether journal has enough space for the new
>> >> journal_start(). So, as long as journal has enough space available,
>> >> new processes need now be stalled.
>> >
>> > But while you are modifying blocks that need to go into the journal
>> > via the locked (old) transaction, it's not safe to start a new
>> > transaction and start issuing handles against the new transaction.
>> >
>> > Just to give one example, suppose we need to update the extent
>> > allocation tree for an inode in the locked/committing transaction as
>> > the delayed allocation blocks are being resolved --- and in another
>> > process, that inode is getting truncated or unlinked, which also needs
>> > to modify the extent allocation tree? ?Hilarty ensues, unless you use
>> > a block all attempts to create a new handle (practically speaking, by
>> > blocking all attempts to start a new transaction), until this new
>> > delayed allocation resolution phase which you have proposed is
>> > complete.
>> Okay. So, basically process stalling is unavoidable as we cannot
>> modify a buffer data in past transaction after it has been modified in
>> current transaction.
>> Can we restrict the scope for this blocking? Blocking on
>> journal_start() will block all processes even though they are
>> operating on mutually exclusive sets of metadata buffers. Can we
>> restrict this blocking to allocation/deallocation paths by blocking in
>> get_write_access() on specific cases(some condition on buffer)? This
>> way, since all files will use commit-time allocation, very few(sync
>> and direct-io mode) file operations will be stalled.
> ?I doubt blocking at buffer-level would be enough. I think that the
> journalling layer just does not have enough information for such decisions.
> It could be feasible to block on per-inode basis but you'd still have to
> give a good thought to modification of filesystem global structures like
> bitmaps, superblock, or inode blocks.
Okay. So, blocking at buffer level will not be easy as global
structures shared among inodes will need modifications(for example,
changing access time for a file in inode block).

One last doubt, while looking at the code, I saw that journal_start()
always stalls all file operations while currently running transaction
is in LOCKED state. Only when the current transaction moves to FLUSH,
the new transaction is created and the stalled operations continue. Is
this interpretation correct?
If yes, why this stalling does not have significant negative impact on
performance of file operations? Also, if it does not have, will
stalling for delayed block allocation really have such significant
negative impact?

Please reply.

Thanks & Regards,
Kailas

2010-02-16 13:10:30

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Tue 16-02-10 15:40:22, Kailas Joshi wrote:
> On 15 February 2010 20:30, Jan Kara <[email protected]> wrote:
> > On Sat 13-02-10 14:13:17, Kailas Joshi wrote:
> >> On 13 February 2010 01:37, ?<[email protected]> wrote:
> >> > On Fri, Feb 12, 2010 at 08:52:15AM +0530, Kailas Joshi wrote:
> >> >> Sorry, I didn't understand why processes need to be suspended.
> >> >> In my scheme, I am issuing magic handle only after locking the current
> >> >> transaction. ?AFAIK after the transaction is locked, it can receive the
> >> >> block journaling requests for already created handles(in our case, for
> >> >> already reserved journal space), and the new concurrent requests for
> >> >> journal_start() will go to the new current transaction. Since, the
> >> >> credits for locked transaction are fixed (by means of early
> >> >> reservations) we can know whether journal has enough space for the new
> >> >> journal_start(). So, as long as journal has enough space available,
> >> >> new processes need now be stalled.
> >> >
> >> > But while you are modifying blocks that need to go into the journal
> >> > via the locked (old) transaction, it's not safe to start a new
> >> > transaction and start issuing handles against the new transaction.
> >> >
> >> > Just to give one example, suppose we need to update the extent
> >> > allocation tree for an inode in the locked/committing transaction as
> >> > the delayed allocation blocks are being resolved --- and in another
> >> > process, that inode is getting truncated or unlinked, which also needs
> >> > to modify the extent allocation tree? ?Hilarty ensues, unless you use
> >> > a block all attempts to create a new handle (practically speaking, by
> >> > blocking all attempts to start a new transaction), until this new
> >> > delayed allocation resolution phase which you have proposed is
> >> > complete.
> >> Okay. So, basically process stalling is unavoidable as we cannot
> >> modify a buffer data in past transaction after it has been modified in
> >> current transaction.
> >> Can we restrict the scope for this blocking? Blocking on
> >> journal_start() will block all processes even though they are
> >> operating on mutually exclusive sets of metadata buffers. Can we
> >> restrict this blocking to allocation/deallocation paths by blocking in
> >> get_write_access() on specific cases(some condition on buffer)? This
> >> way, since all files will use commit-time allocation, very few(sync
> >> and direct-io mode) file operations will be stalled.
> > ?I doubt blocking at buffer-level would be enough. I think that the
> > journalling layer just does not have enough information for such decisions.
> > It could be feasible to block on per-inode basis but you'd still have to
> > give a good thought to modification of filesystem global structures like
> > bitmaps, superblock, or inode blocks.
> Okay. So, blocking at buffer level will not be easy as global
> structures shared among inodes will need modifications(for example,
> changing access time for a file in inode block).
Yes.

> One last doubt, while looking at the code, I saw that journal_start()
> always stalls all file operations while currently running transaction
> is in LOCKED state. Only when the current transaction moves to FLUSH,
> the new transaction is created and the stalled operations continue. Is
> this interpretation correct?
Yes, it is correct.

> If yes, why this stalling does not have significant negative impact on
> performance of file operations? Also, if it does not have, will
> stalling for delayed block allocation really have such significant
> negative impact?
Actually, stalling on a transaction in LOCKED state does have a negative
impact on the filesystem performance. But it's hard to avoid it. The
transaction is in LOCKED state while we've decided it needs a commit but
there are still tasks which have handle to it and are adding new metadata
buffers to it. So this transaction is effectively still running and we
cannot start a next transaction because then we'd have two running
transactions and the journalling logic isn't able to handle that.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-16 14:18:59

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Tue, Feb 16, 2010 at 02:10:39PM +0100, Jan Kara wrote:
> Actually, stalling on a transaction in LOCKED state does have a negative
> impact on the filesystem performance. But it's hard to avoid it. The
> transaction is in LOCKED state while we've decided it needs a commit but
> there are still tasks which have handle to it and are adding new metadata
> buffers to it. So this transaction is effectively still running and we
> cannot start a next transaction because then we'd have two running
> transactions and the journalling logic isn't able to handle that.

This is also why we try to avoid staying in LOCKED state for very
long.... and why increasing the journal size can help performance
(since if we get ourselves into trouble where are forced to do a
journal checkpoint, we can end up stalling all file system updates for
a non-trivial amount of time).

So changes that increase the amount of time that we spend in LOCKED
are going to be really bad, especially if you have one thread which is
frequently calling fsync() (for example, like Firefox, which can be
*very* fsync() happy) and another thread which is doing lots of file
creates and deletes. Each fsync() will force a transaction commit,
and if you have to stop all transaction updates while the delayed
allocation blocks are getting resolved, life can really get bad.

This is why, ultimately, we really need to distinguish between files
where we might not care when they get written to disk (i.e., object
files being created by the compiler, ISO files being downloaded from
the web since we can always restart them after the hopefully rare
crash --- unless you're using crappy video drivers, of course) from
files written by buggy applications which are precious and yet where
the application writer didn't bother to use fsync().

Maybe something we ought to consider is doing things both ways. Maybe
we should have a way for applications to indicate they have been
audited and any precious files will be properly fsync()'ed. This
could be done via two process personality flags; one which is
inherited across an exec, and which which isn't. (We need this so
that jobs being fired out of make can be properly exempted from
calling fsync(), even if they are using programs like sort, or shell
redirections, where the coreutils authors don't know whether the files
they are writing are precious or not, and thus whether they should be
fsync'ed.)

These flags would be used to exempt processes from a mount option
which could be set by people who are nervous about not trusting their
application writers, which would force an fsync at every file close
(except for those processes which have these process personality flags
set). People who are more confident about having a stable set of
kernel drivers (and/or who are running servers where they have UPS's
and where they aren't using crappy desktop applications that seem to
be the most likely to not properly call fsync for precious files) can
simply avoid using this mount option, but we can give users and system
administrators a choice.

Maybe, just for those whiners at Phoronix, we can give them an mount
option where applications which have this flag set will get delayed
allocation, and applications which don't get their files written with
O_SYNC. :-)

- Ted

2010-02-17 15:37:26

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On 16 February 2010 19:48, <[email protected]> wrote:
> On Tue, Feb 16, 2010 at 02:10:39PM +0100, Jan Kara wrote:
>> ? Actually, stalling on a transaction in LOCKED state does have a negative
>> impact on the filesystem performance. But it's hard to avoid it. The
>> transaction is in LOCKED state while we've decided it needs a commit but
>> there are still tasks which have handle to it and are adding new metadata
>> buffers to it. So this transaction is effectively still running and we
>> cannot start a next transaction because then we'd have two running
>> transactions and the journalling logic isn't able to handle that.
>
> This is also why we try to avoid staying in LOCKED state for very
> long.... and why increasing the journal size can help performance
> (since if we get ourselves into trouble where are forced to do a
> journal checkpoint, we can end up stalling all file system updates for
> a non-trivial amount of time).
>
> So changes that increase the amount of time that we spend in LOCKED
> are going to be really bad, especially if you have one thread which is
> frequently calling fsync() (for example, like Firefox, which can be
> *very* fsync() happy) and another thread which is doing lots of file
> creates and deletes. ?Each fsync() will force a transaction commit,
> and if you have to stop all transaction updates while the delayed
> allocation blocks are getting resolved, life can really get bad.

Okay. It seems that there is no easy way to solve this. Probably, the
personality flag based solution is more appropriate.
Still, as we need this mode of operation for our further analysis, for
now we will go with the same design to implement alloc_on_commit and
see how can we optimize it and how much negative impact it has. Will
update you on this.

Thank you very much for the help.

Regards,
Kailas

2010-03-22 16:51:57

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

Hi,

On Fri 19-03-10 08:53:08, Kailas Joshi wrote:
> I am facing some problems while implementing alloc_on_commit.
> While performing exhaustive write operations(for example using postmark),
> system locks up after some time.
> It runs fine for (simple)non-exhaustive write operations.
>
> I am using filemap_write_and_wait() in journal commit callback for
> performing synchronous block allocation. It uses special journal handle
> which enables use of early reservations.
> Is it right to use this function here? If no, is there any other alternative
> that should be used in this scenario?
>
> I am using following strategy -
> 1) ext4_da_get_block_prep() marks delayed-allocation buffers with BH_DA
> after reserving space for them.
We have a BH_Delay flag for this already. OK, probably you need a
temporary flag which you can clear in ext4_da_write_begin. I'd find
counting number of BH_Delay buffers before and after block_write_begin
call nicer...

> 2) ext4_da_write_begin() counts the number of buffers marked with BH_DA and
> reserves credits for block allocation.
> 3) journal_stop() accumulates the unused credits of a handle in the
> transaction.
> 4) journal_start() when called with nblocks=0, creates a special handle with
> the credits accumulated by all previous handles(by step 2).
This is a hack. I'd rather create a separate JBD2 function for this.

> 5) journal_commit() creates special handle for block allocation(as in step
> 4) and calls filemap_write_and_wait() to perform block allocation.
>
> I am also sending the patch(for kernel 2.6.32.4) for my implementation (also
> available at
> http://www.cse.iitb.ac.in/~kailasjoshi/files/alloc_on_commit.patch).
>
> Being new to filesystem development, I am not able to identify the problem.
> I will be very greatful if someone can help me out.
Probably you are hitting some lock inversion problem. I suggest you
compile the kernel with lockdep enabled (in Kernel hacking -> Lock debugging
-> Prove lock correctness or something like that) and see whether it issues
some warnings. If not, you can get backtraces of the locked up processes
by pressing Alt-Sysrq-w (or echo "w" >/proc/sysrq-trigger).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-03-23 10:41:46

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

Thanks Jan. This has definitely given me some pointers to work upon.

I have Lock Debugging enables but that didn't give any warnings.
However, when I did echo "w" >/proc/sysrq-trigger after system lockup,
I got the stack trace for locked up process.

Following are the stack traces of the processes (I suspect) resulting
in total system lockup -
-----------------------------------------------------------------------------------------------------------------------------------------
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] jbd2/sdb1-8 D
00000046 0 5913 2 0x00000000
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] c4473b90
00000046 00000001 00000046 ce9e4a00 00000000 c4473b70 00000046
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] 00000000
c061de60 c061de60 c061de60 c01513bd 00000000 ce9e4a94 ce9e4a00
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] ce9e4b94
c1407e60 c01513ed 00000001 c13430bc 00000296 cfac9278 c4473b90
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] Call Trace:
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01513bd>] ?
prepare_to_wait_exclusive+0x1d/0x60
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01513ed>] ?
prepare_to_wait_exclusive+0x4d/0x60
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c04004f5>]
io_schedule+0x35/0x50
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a6f75>]
sync_page+0x35/0x40
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0400820>]
__wait_on_bit_lock+0x40/0x80
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a6f40>] ?
sync_page+0x0/0x40
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a6f1d>]
__lock_page+0x4d/0x60
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0151250>] ?
wake_bit_function+0x0/0x60
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01ad987>]
write_cache_pages+0x437/0x5d0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0237930>] ?
__mpage_da_writepage+0x0/0x170
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01ad310>] ?
mapping_tagged+0x0/0x70
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01ad310>] ?
mapping_tagged+0x0/0x70
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c02387ec>]
ext4_da_writepages+0x2ec/0x7a0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c02c514a>] ?
number+0x25a/0x270
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0328ada>] ?
vt_console_print+0x1da/0x2a0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c040238d>] ?
_spin_unlock+0x1d/0x20
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0328ada>] ?
vt_console_print+0x1da/0x2a0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0155aeb>] ?
up+0x2b/0x40
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01375e7>] ?
release_console_sem+0x197/0x1d0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0238500>] ?
ext4_da_writepages+0x0/0x7a0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01adb6d>]
do_writepages+0x1d/0x30
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a76d6>]
__filemap_fdatawrite_range+0x66/0x80
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a81e6>]
filemap_fdatawrite+0x26/0x30
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a821c>]
filemap_write_and_wait+0x2c/0x50
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c023228a>]
ext4_sync_alloc_da_blocks+0x5a/0x90
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0244c0c>]
alloc_on_commit_callback+0x6c/0xc0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c02693a5>]
jbd2_journal_commit_transaction+0x335/0x1ae0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c012c10c>] ?
finish_task_switch+0x6c/0xe0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0143225>] ?
lock_timer_base+0x25/0x50
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c04025ad>] ?
_spin_lock_irqsave+0x4d/0x60
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0143287>] ?
try_to_del_timer_sync+0x37/0xb0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c014336a>] ?
del_timer_sync+0x6a/0x80
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0143300>] ?
del_timer_sync+0x0/0x80
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c02703d6>]
kjournald2+0xb6/0x380
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0151210>] ?
autoremove_wake_function+0x0/0x40
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0270320>] ?
kjournald2+0x0/0x380
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0151144>]
kthread+0x74/0x80
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01510d0>] ?
kthread+0x0/0x80
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0103a07>]
kernel_thread_helper+0x7/0x10
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] flush-8:16 D
c47f1cc0 0 5916 2 0x00000000
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] c47f1cd4
00000046 00000002 c47f1cc0 c14031c4 00000000 c47f1cb4 00000046
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] 00000000
c061de60 c061de60 c061de60 c06191c4 c1407e70 ce9e6f94 ce9e6f00
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] ce9e7094
c1407e60 000095a3 00000000 00000000 00000000 00000000 00000000
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] Call Trace:
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c04004f5>]
io_schedule+0x35/0x50
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a6f75>]
sync_page+0x35/0x40
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0400820>]
__wait_on_bit_lock+0x40/0x80
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a6f40>] ?
sync_page+0x0/0x40
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a6f1d>]
__lock_page+0x4d/0x60
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0151250>] ?
wake_bit_function+0x0/0x60
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0238c06>]
ext4_da_writepages+0x706/0x7a0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a6c30>] ?
find_get_pages_tag+0x0/0x120
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0162127>] ?
lock_release_non_nested+0x187/0x2b0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f51d5>] ?
writeback_inodes_wb+0x245/0x3b0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f4925>] ?
writeback_single_inode+0x95/0x260
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f51d5>] ?
writeback_inodes_wb+0x245/0x3b0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f4925>] ?
writeback_single_inode+0x95/0x260
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0238500>] ?
ext4_da_writepages+0x0/0x7a0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01adb6d>]
do_writepages+0x1d/0x30
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f4930>]
writeback_single_inode+0xa0/0x260
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f5216>]
writeback_inodes_wb+0x286/0x3b0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f543f>]
wb_writeback+0xff/0x1a0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f5606>] ?
wb_do_writeback+0x86/0x1e0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f573b>]
wb_do_writeback+0x1bb/0x1e0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f55a2>] ?
wb_do_writeback+0x22/0x1e0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01f5792>]
bdi_writeback_task+0x32/0xa0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01bca5e>]
bdi_start_fn+0x5e/0xb0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01bca00>] ?
bdi_start_fn+0x0/0xb0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0151144>]
kthread+0x74/0x80
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01510d0>] ?
kthread+0x0/0x80
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0103a07>]
kernel_thread_helper+0x7/0x10
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] write_test D
c01272a0 0 5966 1 0x00000005
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] c2149c50
00200046 00200046 c01272a0 00000001 c4f09910 00200296 c4f09910
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] c2149c14
c061de60 c061de60 c061de60 c2149c34 c01272a0 ce9e2594 ce9e2500
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] ce9e2694
c1607e60 00200246 c0267d9c 00000001 c4f09814 c4f09800 c4f09814
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] Call Trace:
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01272a0>] ?
__wake_up+0x40/0x50
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01272a0>] ?
__wake_up+0x40/0x50
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0267d9c>] ?
start_this_handle+0x36c/0x580
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0267da1>]
start_this_handle+0x371/0x580
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0160bac>] ?
lockdep_init_map+0x3c/0x500
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0151210>] ?
autoremove_wake_function+0x0/0x40
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0268055>]
jbd2_journal_start+0xa5/0xd0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0248663>]
ext4_journal_start_sb+0x53/0xa0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0245db3>] ?
__ext4_journal_stop+0x43/0x70
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0238ef4>]
ext4_da_write_begin+0x254/0x3a0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0239470>] ?
ext4_da_get_block_prep+0x0/0x360
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a785e>]
generic_file_buffered_write+0xde/0x260
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a7e36>]
__generic_file_aio_write+0x276/0x510
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0400e04>] ?
mutex_lock_nested+0x1e4/0x270
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01a8128>]
generic_file_aio_write+0x58/0xc0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c022fa0f>]
ext4_file_write+0x3f/0xd0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0147ed5>] ?
ptrace_stop+0xa5/0xf0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01d85bd>]
do_sync_write+0xcd/0x110
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c014840a>] ?
ptrace_notify+0x9a/0xb0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0151210>] ?
autoremove_wake_function+0x0/0x40
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c028291f>] ?
security_file_permission+0xf/0x20
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01d877c>] ?
rw_verify_area+0x6c/0xe0
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01d8de6>]
vfs_write+0x96/0x190
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01d84f0>] ?
do_sync_write+0x0/0x110
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c01d950d>]
sys_write+0x3d/0x70
Mar 23 15:24:40 kailas-desktop kernel: [ 532.237073] [<c0102f55>]
syscall_call+0x7/0xb
-----------------------------------------------------------------------------------------------------------------------------------------
I am also attaching complete trace for your reference.

write_test is my testing executable that performs many sequential
writes to a file.
This stack trace confirms that lock_page() in write_cache_pages() is
resulting into this lockup.

I have few questions here.
I guess process named jbd2/sdb1-8 is kjournald thread. But what is
flush-8:16 process? Is it the kernel thread for periodically writing
dirty pages to disk?

Is it the case that these threads are running concurrently at certain
time and are trying to get lock on same pages resulting into deadlock?
If yes, what can be the reason? Am I doing mistake in calling
filemap_write_and_wait() in journal_commit flow?

Please reply.

Thanks & Regards,
Kailas

On 22 March 2010 22:22, Jan Kara <[email protected]> wrote:
>
> ?Hi,
>
> On Fri 19-03-10 08:53:08, Kailas Joshi wrote:
> > I am facing some problems while implementing alloc_on_commit.
> > While performing exhaustive write operations(for example using postmark),
> > system locks up after some time.
> > It runs fine for (simple)non-exhaustive write operations.
> >
> > I am using filemap_write_and_wait() in journal commit callback for
> > performing synchronous block allocation. It uses special journal handle
> > which enables use of early reservations.
> > Is it right to use this function here? If no, is there any other alternative
> > that should be used in this scenario?
> >
> > I am using following strategy -
> > 1) ext4_da_get_block_prep() marks delayed-allocation buffers with BH_DA
> > after reserving space for them.
> ?We have a BH_Delay flag for this already. OK, probably you need a
> temporary flag which you can clear in ext4_da_write_begin. I'd find
> counting number of BH_Delay buffers before and after block_write_begin
> call nicer...
>
> > 2) ext4_da_write_begin() counts the number of buffers marked with BH_DA and
> > reserves credits for block allocation.
> > 3) journal_stop() accumulates the unused credits of a handle in the
> > transaction.
> > 4) journal_start() when called with nblocks=0, creates a special handle with
> > the credits accumulated by all previous handles(by step 2).
> ?This is a hack. I'd rather create a separate JBD2 function for this.
>
> > 5) journal_commit() creates special handle for block allocation(as in step
> > 4) and calls filemap_write_and_wait() to perform block allocation.
> >
> > I am also sending the patch(for kernel 2.6.32.4) for my implementation (also
> > available at
> > http://www.cse.iitb.ac.in/~kailasjoshi/files/alloc_on_commit.patch).
> >
> > Being new to filesystem development, I am not able to identify the problem.
> > I will be very greatful if someone can help me out.
> ?Probably you are hitting some lock inversion problem. I suggest you
> compile the kernel with lockdep enabled (in Kernel hacking -> Lock debugging
> -> Prove lock correctness or something like that) and see whether it issues
> some warnings. If not, you can get backtraces of the locked up processes
> by pressing Alt-Sysrq-w (or echo "w" >/proc/sysrq-trigger).
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

Attachments:

syslockup.log (28.41 kB)

2010-03-29 16:45:21

[permalink] [raw]

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

Hi,

On Tue 23-03-10 16:11:45, Kailas Joshi wrote:
> I have Lock Debugging enables but that didn't give any warnings.
> However, when I did echo "w" >/proc/sysrq-trigger after system lockup,
> I got the stack trace for locked up process.
>
> Following are the stack traces of the processes (I suspect) resulting
> in total system lockup -
<snip>

So kjournald is waiting on a page lock and everyone else waits for
kjournald to finish committing or for page lock as well. The strange thing
is that I don't see anybody who could hold the page lock everyone is
waiting on. So I think further debugging should go in this direction - find
out on which page do we wait and who is holding it's lock (you'd need to
add tracking of page lock owner but that shouldn't be too hard).

> I have few questions here.
> I guess process named jbd2/sdb1-8 is kjournald thread. But what is
Yes.

> flush-8:16 process? Is it the kernel thread for periodically writing
> dirty pages to disk?
Yes.

> Is it the case that these threads are running concurrently at certain
> time and are trying to get lock on same pages resulting into deadlock?
It should not happen - they should always acquire page lock in
index-increasing order so that way deadlocks should be avoided...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-04-17 04:42:52