2010-02-04 05:50:07

by Kailas Joshi

Subject: Help on Implementation of EXT3 type Ordered Mode in EXT4

Hi

I recently found that in EXT4 with delayed block allocation, Ordered mode does not
behave the same way as it does in EXT3.
I found a patch for this at http://lwn.net/Articles/324023/, but it has a
journal block estimation problem resulting in a deadlock.

I would like to know if it has been solved.
If not, is it possible to solve it? What are the complexities involved?

Please help.
Thanks in advance.

Regards,
Kailas



2010-02-09 16:05:24

by Jan Kara

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

Hi,

> I recently found that in EXT4 with delayed block allocation, Ordered mode does not
> behave the same way as it does in EXT3.
> I found a patch for this at http://lwn.net/Articles/324023/, but it has a
> journal block estimation problem resulting in a deadlock.
>
> I would like to know if it has been solved.
> If not, is it possible to solve it? What are the complexities involved?
It has not been solved. The problem is that to commit data on transaction
commit (which is what data=ordered mode has historically done), you have to
allocate space for these blocks. But that allocation needs to modify a
filesystem and thus journal more blocks... And that is tricky - we would have
to reserve space in the current transaction for allocation of delayed data. So
it gets a bit messy...
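
To make the credit problem concrete, here is a minimal sketch (illustrative only, not ext4 source; error handling omitted) of the usual allocating-write pattern, in which journal credits are reserved up front against the running transaction:

static int allocate_and_journal(struct inode *inode)
{
        /* worst-case estimate of the metadata blocks this write may dirty */
        int credits = ext4_writepage_trans_blocks(inode);
        handle_t *handle = ext4_journal_start(inode, credits);

        if (IS_ERR(handle))
                return PTR_ERR(handle);
        /* ... allocate blocks, dirtying bitmaps and the extent tree under this handle ... */
        return ext4_journal_stop(handle);
}

At commit time the transaction being committed is no longer the running transaction, so this pattern cannot be used as-is for its delayed data; the space would have to have been reserved earlier, which is exactly the messy part.
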
Why exactly do you need the old data=ordered guarantees?

Honza

--
Jan Kara <[email protected]>
SuSE CR Labs

2010-02-09 17:41:51

by Theodore Ts'o

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Tue, Feb 09, 2010 at 05:05:22PM +0100, Jan Kara wrote:
> Hi,
>
> > I recently found that in EXT4 with delayed block allocation, Ordered mode does not
> > behave the same way as it does in EXT3.
> > I found a patch for this at http://lwn.net/Articles/324023/, but it has a
> > journal block estimation problem resulting in a deadlock.
> >
> > I would like to know if it has been solved.
> > If not, is it possible to solve it? What are the complexities involved?
>
> It has not been solved. The problem is that to commit data on
> transaction commit (which is what data=ordered mode has historically
> done), you have to allocate space for these blocks. But that
> allocation needs to modify a filesystem and thus journal more
> blocks... And that is tricky - we would have to reserve space in the
> current transaction for allocation of delayed data. So it gets a
> bit messy...

The dioread_nolock patches from Jiaying, which are currently in the
unstable portion of the tree, are a partial solution to the
data=ordered problem, although they solve it in a slightly different
way.

As a side effect of trying to avoid locking on the direct I/O read
path, on the buffered I/O write path it changes things so that the extent
tree is first updated so that the blocks are allocated with the "extent
uninitialized" bit set, and then only after the blocks hit the disk, via
the bh completion callback, do we mark the extent as containing
initialized data.

As a result, if you crash before the extent tree is updated, when you
read from the file, you will get all zeros instead of the data, thus
preventing the security leak.
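
Schematically, the write path described above looks something like the following sketch (the helper names are hypothetical placeholders, not the actual dioread_nolock code):

/* I/O completion: the data is on disk, so the extent may now be exposed. */
static void data_write_end_io(struct inode *inode, ext4_lblk_t lblk,
                              unsigned int len, int error)
{
        if (error)
                return;
        /* clear the "uninitialized" marker, journalling the extent-tree
         * update inside a jbd2 handle (hypothetical helper) */
        convert_unwritten_extent(inode, lblk, len);
}

static void buffered_write_path(struct inode *inode, struct page *page,
                                ext4_lblk_t lblk, unsigned int len)
{
        /* allocate blocks but leave the extent marked uninitialized
         * (hypothetical helper) */
        allocate_unwritten_extent(inode, lblk, len);

        /* write the data blocks; data_write_end_io() runs once they
         * have actually hit the disk (hypothetical helper) */
        submit_data_io(page, data_write_end_io);
}

A crash between the two steps leaves an uninitialized extent on disk, which reads back as zeros.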

It does mean that fsync() is slightly slower, since we now have to
flush the data blocks out, wait for the completion handler to fire and
update the extent in the same jbd2 transaction, and only then wait for
the barrier in the jbd2 transaction. (And in fact, I'm not sure that
fsync() is completely working correctly in the current patch in the
unstable patch stream, and that there aren't race conditions where the
extent tree update slips into the next transaction.) But it does
solve the problem.

The other downside with this solution is that it only works for files
that are extent-mapped; if you do this with a converted ext3 file
system, and there are files that are still mapped using
direct/indirect blocks, then when you change the mount option to
data=writeback,dioread_nolock, the block-allocating writes to these
legacy files could result in data getting exposed after a crash.

Depending on the workload, the upside of using data=writeback
instead of data=ordered could far outweigh the downside of needing to
do an extra block I/O queue flush before the fsync, since it reduces
the number of entangled writes to only the metadata blocks, whereas
previously the entangled write problem affected the metadata blocks plus
all freshly allocated blocks.

Kailas, this is something that I plan to look at in the near future; if
you are interested in helping to benchmark and characterize this
solution, I'd be very interested in working with you. Can you tell me
a little more about your use case and requirements?

- Ted


2010-02-11 07:32:16

by Kailas Joshi

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On 11 February 2010 12:31, Kailas Joshi <[email protected]> wrote:
>
> On 9 February 2010 23:11, <[email protected]> wrote:
>>
>> On Tue, Feb 09, 2010 at 05:05:22PM +0100, Jan Kara wrote:
>> > Hi,
>> >
>> > > I recently found that in EXT4 with delayed block allocation, Ordered mode does not
>> > > behave the same way as it does in EXT3.
>> > > I found a patch for this at http://lwn.net/Articles/324023/, but it has a
>> > > journal block estimation problem resulting in a deadlock.
>> > >
>> > > I would like to know if it has been solved.
>> > > If not, is it possible to solve it? What are the complexities involved?
>> >
>> > It has not been solved. The problem is that to commit data on
>> > transaction commit (which is what data=ordered mode has historically
>> > done), you have to allocate space for these blocks. But that
>> > allocation needs to modify a filesystem and thus journal more
>> > blocks... And that is tricky - we would have to reserve space in the
>> > current transaction for allocation of delayed data. So it gets a
>> > bit messy...
>>
>> The dioread_nolock patches from Jiaying, which are currently in the
>> unstable portion of the tree, are a partial solution to the
>> data=ordered problem, although they solve it in a slightly different
>> way.
>>
>> As a side effect of trying to avoid locking on the direct I/O read
>> path, on the buffered I/O write path it changes things so that the extent
>> tree is first updated so that the blocks are allocated with the "extent
>> uninitialized" bit set, and then only after the blocks hit the disk, via
>> the bh completion callback, do we mark the extent as containing
>> initialized data.
>>
>> As a result, if you crash before the extent tree is updated, when you
>> read from the file, you will get all zeros instead of the data, thus
>> preventing the security leak.
>>
>> It does mean that fsync() is slightly slower, since we now have to
>> flush the data blocks out, wait for the completion handler to fire and
>> update the extent in the same jbd2 transaction, and only then wait for
>> the barrier in the jbd2 transaction. (And in fact, I'm not sure that
>> fsync() is completely working correctly in the current patch in the
>> unstable patch stream, and that there aren't race conditions where the
>> extent tree update slips into the next transaction.) But it does
>> solve the problem.
>>
>> The other downside with this solution is that it only works for files
>> that are extent-mapped; if you do this with a converted ext3 file
>> system, and there are files that are still mapped using
>> direct/indirect blocks, then when you change the mount option to
>> data=writeback,dioread_nolock, the block-allocating writes to these
>> legacy files could result in data getting exposed after a crash.
>>
>> Depending on the workload, the upside of using data=writeback
>> instead of data=ordered could far outweigh the downside of needing to
>> do an extra block I/O queue flush before the fsync, since it reduces
>> the number of entangled writes to only the metadata blocks, whereas
>> previously the entangled write problem affected the metadata blocks plus
>> all freshly allocated blocks.
>>
>> Kailas, this is something that I plan to look at in the near future; if
>> you are interested in helping to benchmark and characterize this
>> solution, I'd be very interested in working with you. Can you tell me
>> a little more about your use case and requirements?
>>
>>                                      - Ted
>

Jan and Ted, thank you very much for the detailed replies.

We are assessing the use of a copy-on-write technique to provide data-level
consistency in EXT3/EXT4. We have implemented this in EXT3 by
using the Ordered mode of operation. Benchmark results for IOZone and
Postmark are quite good. We could get the consistency equivalent to
Journal mode with the overhead almost the same as Ordered mode. However,
there are a few cases (for example, file rewrite) where performance of
Journal mode is better than our technique. We think that in EXT4, with
the support for delayed block allocation and extents, these problems
can be removed.

However, Ordered mode with delayed block allocation in EXT4 does not
behave in the same way as in EXT3: it does not flush 'all' dirty
blocks to the disk as EXT3 does. For implementing our technique in EXT4,
we need EXT3-style Ordered mode, that is, alloc_on_commit
(http://lwn.net/Articles/324023/).

I understand that this is not required in EXT4, since its Ordered mode
is provided for security and not consistency. However, from the
discussions on blogs/posts, it seems that developers expect Ordered
mode to provide (limited) data consistency as well.

Since the implementation of our technique heavily depends on EXT3-style
Ordered mode, I would like to implement alloc_on_commit in EXT4.
I have designed the following strategy to address the credit reservation
problem in the earlier patch (a rough sketch of step 3 follows the list).
Please let me know your comments on it.

1. In the write path, the call to journal_start() for updating metadata
will also reserve credits for the delayed allocation.
2. If the fs is mounted with alloc_on_commit, journal_stop() will not
return the remaining credits to the journal (t_outstanding_credits will
not be changed).
3. In journal_commit():
i. After LOCKing the current transaction, a new special handle will be
created by calling journal_start() with zero credits. Such a call to
journal_start() can be treated as a special case for creating a handle
to use the accumulated credits (in t_outstanding_credits) of the currently
locked transaction.
ii. Before changing the transaction state to FLUSH, a callback will be used
to perform the delayed block allocation for all inodes. This mechanism
will be the same as in alloc_on_commit at http://lwn.net/Articles/324023/,
but it will be performed after changing the transaction to the LOCKED
state. The specially created handle will be passed to the
callback function, and it will use that handle for performing the delayed
block allocation.
iii. The special handle will be closed, the outstanding credits for the
transaction will be zeroed, and the transaction flush will continue.

Regarding the dioread_nolock work:
Ted, I am new to filesystem development. If this is fine and your
deadlines are not very critical, I will be very happy to work with you
on dioread_nolock, even though it's not directly related to our current
work. Please let me know more about this.

Thanks & Regards,
Kailas

2010-02-11 19:56:26

by Theodore Ts'o

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Thu, Feb 11, 2010 at 01:02:15PM +0530, Kailas Joshi wrote:
>
> We are assessing the use of copy-on-write technique to provide data
> level consistency in EXT3/EXT4. We have implemented this in EXT3 by
> using the Ordered mode of operation. Benchmark results for IOZone and
> Postmark are quite good. We could get the consistency equivalent to
> Journal mode with the overhead almost the same as Ordered mode. However,
> there are a few cases (for example, file rewrite) where performance of
> Journal mode is better than our technique. We think that in EXT4, with
> the support for delayed block allocation and extents, these problems
> can be removed.

Ah, sorry, I misread your initial post; I thought you were trying to
reimplement the proposed ext4 mode data=guarded.

I've mostly given up on trying to get alloc_on_commit to work, for two
reasons.

The first is that one of the reasons why you might be closing the
transaction is if there's not enough space left in the journal. But
if you are going to do a large number of data allocations at commit time,
there's no guarantee that there will be space in the journal for all of
the metadata blocks that might have to be modified in order to make
the block allocations.

The second problem with this scheme is a performance problem; while
you are handling the delayed allocation blocks, you have to do this
while the journal is still locked, using magic handles that are
allowed to be created while the journal is locked. That adds all
sorts of complexity, and that seems to be what you are thinking about
doing. The problem though is that while this is going on, all other
file system activity has to be blocked. So this will cause all sorts
of processes to become suspended waiting for all of the allocation
activity to complete, which may require bitmap allocation blocks to be
read in from disk, etc.

The trade-off for all of these problems is that it allows you to delay
the block allocation for only 5 seconds. The question is, is this
worth it, compared with simply mounting the file system with
nodelalloc? It may be that all of this complexity doesn't produce enough
of a performance gain over simply using nodelalloc.

So maybe the solution for certain distributions that are catering to
the "inexperienced user" / "users who like to use unstable video
drivers" market is to mount with nodelalloc by default, and tell them
that if they want the performance improvements of delayed allocation,
they need to lobby to get the applications fixed.

(After all, these problems are going to be around no matter whether
people use XFS or btrfs; most modern file systems are going to use
delayed allocation, so sooner or later the broken applications really
need to get fixed. The defiant user's cry, "well, if you don't fix
this I'll switch to xfs/btrfs!" isn't going to help in this case....)

- Ted


2010-02-12 03:22:15

by Kailas Joshi

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On 12 February 2010 01:26, <[email protected]> wrote:
> I've mostly given up on trying to get alloc_on_commit to work, for two
> reasons.
>
> The first is that one of the reasons why you might be closing the
> transaction is if there's not enough space left in the journal. But
> if you are going to do a large number of data allocations at commit time,
> there's no guarantee that there will be space in the journal for all of
> the metadata blocks that might have to be modified in order to make
> the block allocations.
Won't this get fixed by performing the early reservations mentioned in
my scheme? We are reserving the required credits in the path of the write
system call, and these will be kept reserved until the transaction commit.
So, the journal space for allocation at commit time will be guaranteed.

>
> The second problem with this scheme is a performance problem; while
> you are handling the delayed allocation blocks, you have to do this
> while the journal is still locked, using magic handles that are
> allowed to be created while the journal is locked. That adds all
> sorts of complexity, and that seems to be what you are thinking about
> doing. The problem though is that while this is going on, all other
> file system activity has to be blocked. So this will cause all sorts
> of processes to become suspended waiting for all of the allocation
> activity to complete, which may require bitmap allocation blocks to be
> read in from disk, etc.
Sorry, I didn't understand why processes need to be suspended.
In my scheme, I am issuing the magic handle only after locking the current
transaction. AFAIK, after the transaction is locked, it can receive the
block journaling requests for already created handles (in our case, for
already reserved journal space), and the new concurrent requests for
journal_start() will go to the new current transaction. Since the
credits for the locked transaction are fixed (by means of early
reservations), we can know whether the journal has enough space for the new
journal_start(). So, as long as the journal has enough space available,
new processes need not be stalled.

Please correct me if this is wrong.


> The trade-off for all of these problems is that it allows you to delay
> the block allocation for only 5 seconds. The question is, is this
> worth it, compared with simply mounting the file system with
> nodelalloc? It may be that all of this complexity doesn't produce enough
> of a performance gain over simply using nodelalloc.
I agree. The performance gain might not be good enough compared to
'nodelalloc'. However, our goal is to provide data consistency
equivalent to Journal mode at a low cost. So, we are interested in
comparing the performance of alloc_on_commit (and our technique) with
the performance of Journal mode.

Thanks & Regards,
Kailas

2010-02-12 20:07:29

by Theodore Ts'o

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Fri, Feb 12, 2010 at 08:52:15AM +0530, Kailas Joshi wrote:
> Won't this get fixed by performing the early reservations mentioned in
> my scheme? We are reserving the required credits in the path of the write
> system call, and these will be kept reserved until the transaction commit.
> So, the journal space for allocation at commit time will be guaranteed.

Yes, if you account for these separately. One challenge is that
over-estimating the needed credits will be tricky. If we go down this
path, be sure that the bonnie style write(fd, &ch, 1) in a tight loop
doesn't end up reserving a separate set of credits for each write
system call to the same block. (It can be done; if the DA block is
already instantiated, you can assume that credits have already been
reserved.)
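
A hedged sketch of that check (da_credits_needed() is an illustrative helper, not existing ext4 code; buffer_delay() tests the existing BH_Delay flag):

static int da_credits_needed(struct buffer_head *bh, int credits_per_block)
{
        /* BH_Delay already set: this block was counted when it first
         * became a delayed-allocation block, so no new reservation. */
        if (buffer_delay(bh))
                return 0;

        return credits_per_block;
}

With something like this in the write path, a tight write(fd, &ch, 1) loop to the same block only reserves credits the first time the block goes delayed.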

> Sorry, I didn't understand why processes need to be suspended.
> In my scheme, I am issuing magic handle only after locking the current
> transaction. AFAIK after the transaction is locked, it can receive the
> block journaling requests for already created handles(in our case, for
> already reserved journal space), and the new concurrent requests for
> journal_start() will go to the new current transaction. Since, the
> credits for locked transaction are fixed (by means of early
> reservations) we can know whether journal has enough space for the new
> journal_start(). So, as long as journal has enough space available,
> new processes need not be stalled.

But while you are modifying blocks that need to go into the journal
via the locked (old) transaction, it's not safe to start a new
transaction and start issuing handles against the new transaction.

Just to give one example, suppose we need to update the extent
allocation tree for an inode in the locked/committing transaction as
the delayed allocation blocks are being resolved --- and in another
process, that inode is getting truncated or unlinked, which also needs
to modify the extent allocation tree? Hilarty ensues, unless you use
a block all attempts to create a new handle (practically speaking, by
blocking all attempts to start a new transaction), until this new
delayed allocation resolution phase which you have proposed is
complete.

- Ted

2010-02-13 08:43:18

by Kailas Joshi

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On 13 February 2010 01:37, <[email protected]> wrote:
> On Fri, Feb 12, 2010 at 08:52:15AM +0530, Kailas Joshi wrote:
>> Won't this get fixed by performing the early reservations mentioned in
>> my scheme? We are reserving the required credits in the path of the write
>> system call, and these will be kept reserved until the transaction commit.
>> So, the journal space for allocation at commit time will be guaranteed.
>
> Yes, if you account for these separately. One challenge is that
> over-estimating the needed credits will be tricky. If we go down this
> path, be sure that the bonnie style write(fd, &ch, 1) in a tight loop
> doesn't end up reserving a separate set of credits for each write
> system call to the same block. (It can be done; if the DA block is
> already instantiated, you can assume that credits have already been
> reserved.)
Okay

>> Sorry, I didn't understand why processes need to be suspended.
>> In my scheme, I am issuing magic handle only after locking the current
>> transaction. AFAIK after the transaction is locked, it can receive the
>> block journaling requests for already created handles(in our case, for
>> already reserved journal space), and the new concurrent requests for
>> journal_start() will go to the new current transaction. Since, the
>> credits for locked transaction are fixed (by means of early
>> reservations) we can know whether journal has enough space for the new
>> journal_start(). So, as long as journal has enough space available,
>> new processes need not be stalled.
>
> But while you are modifying blocks that need to go into the journal
> via the locked (old) transaction, it's not safe to start a new
> transaction and start issuing handles against the new transaction.
>
> Just to give one example, suppose we need to update the extent
> allocation tree for an inode in the locked/committing transaction as
> the delayed allocation blocks are being resolved --- and in another
> process, that inode is getting truncated or unlinked, which also needs
> to modify the extent allocation tree? Hilarity ensues, unless you
> block all attempts to create a new handle (practically speaking, by
> blocking all attempts to start a new transaction), until this new
> delayed allocation resolution phase which you have proposed is
> complete.
Okay. So, basically, process stalling is unavoidable, as we cannot
modify a buffer's data in a past transaction after it has been modified in
the current transaction.
Can we restrict the scope of this blocking? Blocking on
journal_start() will block all processes even though they are
operating on mutually exclusive sets of metadata buffers. Can we
restrict this blocking to the allocation/deallocation paths by blocking in
get_write_access() in specific cases (based on some condition on the buffer)? This
way, since all files will use commit-time allocation, very few (sync
and direct-I/O mode) file operations will be stalled.

I am not sure whether this is feasible or not. Please let me know more on this.

Thanks & Regards,
Kailas

2010-02-15 15:00:13

by Jan Kara

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Sat 13-02-10 14:13:17, Kailas Joshi wrote:
> On 13 February 2010 01:37, <[email protected]> wrote:
> > On Fri, Feb 12, 2010 at 08:52:15AM +0530, Kailas Joshi wrote:
> >> Sorry, I didn't understand why processes need to be suspended.
> >> In my scheme, I am issuing magic handle only after locking the current
> >> transaction. AFAIK after the transaction is locked, it can receive the
> >> block journaling requests for already created handles(in our case, for
> >> already reserved journal space), and the new concurrent requests for
> >> journal_start() will go to the new current transaction. Since, the
> >> credits for locked transaction are fixed (by means of early
> >> reservations) we can know whether journal has enough space for the new
> >> journal_start(). So, as long as journal has enough space available,
> >> new processes need not be stalled.
> >
> > But while you are modifying blocks that need to go into the journal
> > via the locked (old) transaction, it's not safe to start a new
> > transaction and start issuing handles against the new transaction.
> >
> > Just to give one example, suppose we need to update the extent
> > allocation tree for an inode in the locked/committing transaction as
> > the delayed allocation blocks are being resolved --- and in another
> > process, that inode is getting truncated or unlinked, which also needs
> > to modify the extent allocation tree? Hilarity ensues, unless you
> > block all attempts to create a new handle (practically speaking, by
> > blocking all attempts to start a new transaction), until this new
> > delayed allocation resolution phase which you have proposed is
> > complete.
> Okay. So, basically process stalling is unavoidable as we cannot
> modify a buffer data in past transaction after it has been modified in
> current transaction.
> Can we restrict the scope for this blocking? Blocking on
> journal_start() will block all processes even though they are
> operating on mutually exclusive sets of metadata buffers. Can we
> restrict this blocking to allocation/deallocation paths by blocking in
> get_write_access() on specific cases(some condition on buffer)? This
> way, since all files will use commit-time allocation, very few(sync
> and direct-io mode) file operations will be stalled.
I doubt blocking at the buffer level would be enough. I think that the
journalling layer just does not have enough information for such decisions.
It could be feasible to block on a per-inode basis, but you'd still have to
give a good thought to the modification of filesystem-global structures like
bitmaps, the superblock, or inode blocks.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-16 10:10:22

by Kailas Joshi

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On 15 February 2010 20:30, Jan Kara <[email protected]> wrote:
> On Sat 13-02-10 14:13:17, Kailas Joshi wrote:
>> On 13 February 2010 01:37, <[email protected]> wrote:
>> > On Fri, Feb 12, 2010 at 08:52:15AM +0530, Kailas Joshi wrote:
>> >> Sorry, I didn't understand why processes need to be suspended.
>> >> In my scheme, I am issuing magic handle only after locking the current
>> >> transaction. AFAIK after the transaction is locked, it can receive the
>> >> block journaling requests for already created handles(in our case, for
>> >> already reserved journal space), and the new concurrent requests for
>> >> journal_start() will go to the new current transaction. Since, the
>> >> credits for locked transaction are fixed (by means of early
>> >> reservations) we can know whether journal has enough space for the new
>> >> journal_start(). So, as long as journal has enough space available,
>> >> new processes need not be stalled.
>> >
>> > But while you are modifying blocks that need to go into the journal
>> > via the locked (old) transaction, it's not safe to start a new
>> > transaction and start issuing handles against the new transaction.
>> >
>> > Just to give one example, suppose we need to update the extent
>> > allocation tree for an inode in the locked/committing transaction as
>> > the delayed allocation blocks are being resolved --- and in another
>> > process, that inode is getting truncated or unlinked, which also needs
>> > to modify the extent allocation tree? Hilarity ensues, unless you
>> > block all attempts to create a new handle (practically speaking, by
>> > blocking all attempts to start a new transaction), until this new
>> > delayed allocation resolution phase which you have proposed is
>> > complete.
>> Okay. So, basically process stalling is unavoidable as we cannot
>> modify a buffer data in past transaction after it has been modified in
>> current transaction.
>> Can we restrict the scope for this blocking? Blocking on
>> journal_start() will block all processes even though they are
>> operating on mutually exclusive sets of metadata buffers. Can we
>> restrict this blocking to allocation/deallocation paths by blocking in
>> get_write_access() on specific cases(some condition on buffer)? This
>> way, since all files will use commit-time allocation, very few(sync
>> and direct-io mode) file operations will be stalled.
> I doubt blocking at buffer-level would be enough. I think that the
> journalling layer just does not have enough information for such decisions.
> It could be feasible to block on per-inode basis but you'd still have to
> give a good thought to modification of filesystem global structures like
> bitmaps, superblock, or inode blocks.
Okay. So, blocking at the buffer level will not be easy, as global
structures shared among inodes will need modifications (for example,
changing the access time for a file in an inode block).

One last doubt: while looking at the code, I saw that journal_start()
always stalls all file operations while the currently running transaction
is in the LOCKED state. Only when the current transaction moves to FLUSH
is the new transaction created and the stalled operations continue. Is
this interpretation correct?
If yes, why does this stalling not have a significant negative impact on
the performance of file operations? Also, if it does not, will
stalling for delayed block allocation really have such a significant
negative impact?

Please reply.

Thanks & Regards,
Kailas

2010-02-16 13:10:30

by Jan Kara

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Tue 16-02-10 15:40:22, Kailas Joshi wrote:
> On 15 February 2010 20:30, Jan Kara <[email protected]> wrote:
> > On Sat 13-02-10 14:13:17, Kailas Joshi wrote:
> >> On 13 February 2010 01:37, <[email protected]> wrote:
> >> > On Fri, Feb 12, 2010 at 08:52:15AM +0530, Kailas Joshi wrote:
> >> >> Sorry, I didn't understand why processes need to be suspended.
> >> >> In my scheme, I am issuing magic handle only after locking the current
> >> >> transaction. AFAIK after the transaction is locked, it can receive the
> >> >> block journaling requests for already created handles(in our case, for
> >> >> already reserved journal space), and the new concurrent requests for
> >> >> journal_start() will go to the new current transaction. Since, the
> >> >> credits for locked transaction are fixed (by means of early
> >> >> reservations) we can know whether journal has enough space for the new
> >> >> journal_start(). So, as long as journal has enough space available,
> >> >> new processes need not be stalled.
> >> >
> >> > But while you are modifying blocks that need to go into the journal
> >> > via the locked (old) transaction, it's not safe to start a new
> >> > transaction and start issuing handles against the new transaction.
> >> >
> >> > Just to give one example, suppose we need to update the extent
> >> > allocation tree for an inode in the locked/committing transaction as
> >> > the delayed allocation blocks are being resolved --- and in another
> >> > process, that inode is getting truncated or unlinked, which also needs
> >> > to modify the extent allocation tree? Hilarity ensues, unless you
> >> > block all attempts to create a new handle (practically speaking, by
> >> > blocking all attempts to start a new transaction), until this new
> >> > delayed allocation resolution phase which you have proposed is
> >> > complete.
> >> Okay. So, basically process stalling is unavoidable as we cannot
> >> modify a buffer data in past transaction after it has been modified in
> >> current transaction.
> >> Can we restrict the scope for this blocking? Blocking on
> >> journal_start() will block all processes even though they are
> >> operating on mutually exclusive sets of metadata buffers. Can we
> >> restrict this blocking to allocation/deallocation paths by blocking in
> >> get_write_access() on specific cases(some condition on buffer)? This
> >> way, since all files will use commit-time allocation, very few(sync
> >> and direct-io mode) file operations will be stalled.
> > I doubt blocking at buffer-level would be enough. I think that the
> > journalling layer just does not have enough information for such decisions.
> > It could be feasible to block on per-inode basis but you'd still have to
> > give a good thought to modification of filesystem global structures like
> > bitmaps, superblock, or inode blocks.
> Okay. So, blocking at the buffer level will not be easy, as global
> structures shared among inodes will need modifications (for example,
> changing the access time for a file in an inode block).
Yes.

> One last doubt: while looking at the code, I saw that journal_start()
> always stalls all file operations while the currently running transaction
> is in the LOCKED state. Only when the current transaction moves to FLUSH
> is the new transaction created and the stalled operations continue. Is
> this interpretation correct?
Yes, it is correct.

> If yes, why does this stalling not have a significant negative impact on
> the performance of file operations? Also, if it does not, will
> stalling for delayed block allocation really have such a significant
> negative impact?
Actually, stalling on a transaction in the LOCKED state does have a negative
impact on the filesystem performance. But it's hard to avoid it. The
transaction is in the LOCKED state while we've decided it needs a commit but
there are still tasks which have a handle to it and are adding new metadata
buffers to it. So this transaction is effectively still running and we
cannot start the next transaction because then we'd have two running
transactions and the journalling logic isn't able to handle that.
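
For reference, a simplified paraphrase (not the verbatim jbd2 code) of where that stall happens when a task asks for a new handle:

static void wait_while_transaction_locked(journal_t *journal)
{
        DEFINE_WAIT(wait);

        spin_lock(&journal->j_state_lock);
        while (journal->j_running_transaction &&
               journal->j_running_transaction->t_state == T_LOCKED) {
                prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
                                TASK_UNINTERRUPTIBLE);
                spin_unlock(&journal->j_state_lock);
                schedule();     /* woken once the commit moves the transaction on */
                spin_lock(&journal->j_state_lock);
        }
        spin_unlock(&journal->j_state_lock);
        finish_wait(&journal->j_wait_transaction_locked, &wait);
}

Anything that lengthens the LOCKED window (such as doing block allocation there) directly lengthens this wait for every task that wants to start a handle.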

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-16 14:18:59

by Theodore Ts'o

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On Tue, Feb 16, 2010 at 02:10:39PM +0100, Jan Kara wrote:
> Actually, stalling on a transaction in the LOCKED state does have a negative
> impact on the filesystem performance. But it's hard to avoid it. The
> transaction is in the LOCKED state while we've decided it needs a commit but
> there are still tasks which have a handle to it and are adding new metadata
> buffers to it. So this transaction is effectively still running and we
> cannot start the next transaction because then we'd have two running
> transactions and the journalling logic isn't able to handle that.

This is also why we try to avoid staying in LOCKED state for very
long.... and why increasing the journal size can help performance
(since if we get ourselves into trouble where we are forced to do a
journal checkpoint, we can end up stalling all file system updates for
a non-trivial amount of time).

So changes that increase the amount of time that we spend in LOCKED
are going to be really bad, especially if you have one thread which is
frequently calling fsync() (for example, like Firefox, which can be
*very* fsync() happy) and another thread which is doing lots of file
creates and deletes. Each fsync() will force a transaction commit,
and if you have to stop all transaction updates while the delayed
allocation blocks are getting resolved, life can really get bad.

This is why, ultimately, we really need to distinguish files
where we might not care when they get written to disk (i.e., object
files being created by the compiler, ISO files being downloaded from
the web since we can always restart them after the hopefully rare
crash --- unless you're using crappy video drivers, of course) from
files written by buggy applications which are precious and yet where
the application writer didn't bother to use fsync().

Maybe something we ought to consider is doing things both ways. Maybe
we should have a way for applications to indicate they have been
audited and any precious files will be properly fsync()'ed. This
could be done via two process personality flags; one which is
inherited across an exec, and one which isn't. (We need this so
that jobs being fired out of make can be properly exempted from
calling fsync(), even if they are using programs like sort, or shell
redirections, where the coreutils authors don't know whether the files
they are writing are precious or not, and thus whether they should be
fsync'ed.)

These flags would be used to exempt processes from a mount option
which could be set by people who are nervous about not trusting their
application writers, which would force an fsync at every file close
(except for those processes which have these process personality flags
set). People who are more confident about having a stable set of
kernel drivers (and/or who are running servers where they have UPS's
and where they aren't using crappy desktop applications that seem to
be the most likely to not properly call fsync for precious files) can
simply avoid using this mount option, but we can give users and system
administrators a choice.

Maybe, just for those whiners at Phoronix, we can give them a mount
option where applications which have this flag set will get delayed
allocation, and applications which don't get their files written with
O_SYNC. :-)

- Ted

2010-02-17 15:37:26

by Kailas Joshi

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

On 16 February 2010 19:48, <[email protected]> wrote:
> On Tue, Feb 16, 2010 at 02:10:39PM +0100, Jan Kara wrote:
>> Actually, stalling on a transaction in the LOCKED state does have a negative
>> impact on the filesystem performance. But it's hard to avoid it. The
>> transaction is in the LOCKED state while we've decided it needs a commit but
>> there are still tasks which have a handle to it and are adding new metadata
>> buffers to it. So this transaction is effectively still running and we
>> cannot start the next transaction because then we'd have two running
>> transactions and the journalling logic isn't able to handle that.
>
> This is also why we try to avoid staying in LOCKED state for very
> long.... and why increasing the journal size can help performance
> (since if we get ourselves into trouble where we are forced to do a
> journal checkpoint, we can end up stalling all file system updates for
> a non-trivial amount of time).
>
> So changes that increase the amount of time that we spend in LOCKED
> are going to be really bad, especially if you have one thread which is
> frequently calling fsync() (for example, like Firefox, which can be
> *very* fsync() happy) and another thread which is doing lots of file
> creates and deletes. Each fsync() will force a transaction commit,
> and if you have to stop all transaction updates while the delayed
> allocation blocks are getting resolved, life can really get bad.

Okay. It seems that there is no easy way to solve this. Probably the
personality-flag-based solution is more appropriate.
Still, as we need this mode of operation for our further analysis, for
now we will go with the same design to implement alloc_on_commit and
see how we can optimize it and how much negative impact it has. I will
update you on this.

Thank you very much for the help.

Regards,
Kailas

2010-03-22 16:51:57

by Jan Kara

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

Hi,

On Fri 19-03-10 08:53:08, Kailas Joshi wrote:
> I am facing some problems while implementing alloc_on_commit.
> While performing exhaustive write operations (for example, using postmark),
> the system locks up after some time.
> It runs fine for (simple) non-exhaustive write operations.
>
> I am using filemap_write_and_wait() in the journal commit callback for
> performing synchronous block allocation. It uses a special journal handle
> which enables the use of early reservations.
> Is it right to use this function here? If not, is there any other alternative
> that should be used in this scenario?
>
> I am using following strategy -
> 1) ext4_da_get_block_prep() marks delayed-allocation buffers with BH_DA
> after reserving space for them.
We have a BH_Delay flag for this already. OK, probably you need a
temporary flag which you can clear in ext4_da_write_begin. I'd find
counting the number of BH_Delay buffers before and after the block_write_begin
call nicer...
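
A rough sketch of that counting approach (illustrative only; it assumes the ext4_da_write_begin() context and uses existing buffer-head helpers):

static unsigned int count_delayed_buffers(struct page *page)
{
        struct buffer_head *bh, *head;
        unsigned int nr = 0;

        if (!page_has_buffers(page))
                return 0;
        bh = head = page_buffers(page);
        do {
                if (buffer_delay(bh))
                        nr++;
        } while ((bh = bh->b_this_page) != head);
        return nr;
}

Calling this before and after block_write_begin() and reserving credits for the difference avoids adding a new buffer-head flag altogether.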

> 2) ext4_da_write_begin() counts the number of buffers marked with BH_DA and
> reserves credits for block allocation.
> 3) journal_stop() accumulates the unused credits of a handle in the
> transaction.
> 4) journal_start() when called with nblocks=0, creates a special handle with
> the credits accumulated by all previous handles(by step 2).
This is a hack. I'd rather create a separate JBD2 function for this.

> 5) journal_commit() creates special handle for block allocation(as in step
> 4) and calls filemap_write_and_wait() to perform block allocation.
>
> I am also sending the patch(for kernel 2.6.32.4) for my implementation (also
> available at
> http://www.cse.iitb.ac.in/~kailasjoshi/files/alloc_on_commit.patch).
>
> Being new to filesystem development, I am not able to identify the problem.
> I will be very grateful if someone can help me out.
Probably you are hitting some lock inversion problem. I suggest you
compile the kernel with lockdep enabled (in Kernel hacking -> Lock debugging
-> Prove lock correctness or something like that) and see whether it issues
some warnings. If not, you can get backtraces of the locked up processes
by pressing Alt-Sysrq-w (or echo "w" >/proc/sysrq-trigger).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-03-23 10:41:46

by Kailas Joshi

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

Thanks, Jan. This has definitely given me some pointers to work on.

I have Lock Debugging enabled but that didn't give any warnings.
However, when I did echo "w" >/proc/sysrq-trigger after system lockup,
I got the stack traces of the locked-up processes.

Following are the stack traces of the processes (I suspect) resulting
in total system lockup -
-----------------------------------------------------------------------------------------------------------------------------------------
jbd2/sdb1-8     D 00000046     0  5913      2 0x00000000
 c4473b90 00000046 00000001 00000046 ce9e4a00 00000000 c4473b70 00000046
 00000000 c061de60 c061de60 c061de60 c01513bd 00000000 ce9e4a94 ce9e4a00
 ce9e4b94 c1407e60 c01513ed 00000001 c13430bc 00000296 cfac9278 c4473b90
Call Trace:
 [<c01513bd>] ? prepare_to_wait_exclusive+0x1d/0x60
 [<c01513ed>] ? prepare_to_wait_exclusive+0x4d/0x60
 [<c04004f5>] io_schedule+0x35/0x50
 [<c01a6f75>] sync_page+0x35/0x40
 [<c0400820>] __wait_on_bit_lock+0x40/0x80
 [<c01a6f40>] ? sync_page+0x0/0x40
 [<c01a6f1d>] __lock_page+0x4d/0x60
 [<c0151250>] ? wake_bit_function+0x0/0x60
 [<c01ad987>] write_cache_pages+0x437/0x5d0
 [<c0237930>] ? __mpage_da_writepage+0x0/0x170
 [<c01ad310>] ? mapping_tagged+0x0/0x70
 [<c01ad310>] ? mapping_tagged+0x0/0x70
 [<c02387ec>] ext4_da_writepages+0x2ec/0x7a0
 [<c02c514a>] ? number+0x25a/0x270
 [<c0328ada>] ? vt_console_print+0x1da/0x2a0
 [<c040238d>] ? _spin_unlock+0x1d/0x20
 [<c0328ada>] ? vt_console_print+0x1da/0x2a0
 [<c0155aeb>] ? up+0x2b/0x40
 [<c01375e7>] ? release_console_sem+0x197/0x1d0
 [<c0238500>] ? ext4_da_writepages+0x0/0x7a0
 [<c01adb6d>] do_writepages+0x1d/0x30
 [<c01a76d6>] __filemap_fdatawrite_range+0x66/0x80
 [<c01a81e6>] filemap_fdatawrite+0x26/0x30
 [<c01a821c>] filemap_write_and_wait+0x2c/0x50
 [<c023228a>] ext4_sync_alloc_da_blocks+0x5a/0x90
 [<c0244c0c>] alloc_on_commit_callback+0x6c/0xc0
 [<c02693a5>] jbd2_journal_commit_transaction+0x335/0x1ae0
 [<c012c10c>] ? finish_task_switch+0x6c/0xe0
 [<c0143225>] ? lock_timer_base+0x25/0x50
 [<c04025ad>] ? _spin_lock_irqsave+0x4d/0x60
 [<c0143287>] ? try_to_del_timer_sync+0x37/0xb0
 [<c014336a>] ? del_timer_sync+0x6a/0x80
 [<c0143300>] ? del_timer_sync+0x0/0x80
 [<c02703d6>] kjournald2+0xb6/0x380
 [<c0151210>] ? autoremove_wake_function+0x0/0x40
 [<c0270320>] ? kjournald2+0x0/0x380
 [<c0151144>] kthread+0x74/0x80
 [<c01510d0>] ? kthread+0x0/0x80
 [<c0103a07>] kernel_thread_helper+0x7/0x10

flush-8:16      D c47f1cc0     0  5916      2 0x00000000
 c47f1cd4 00000046 00000002 c47f1cc0 c14031c4 00000000 c47f1cb4 00000046
 00000000 c061de60 c061de60 c061de60 c06191c4 c1407e70 ce9e6f94 ce9e6f00
 ce9e7094 c1407e60 000095a3 00000000 00000000 00000000 00000000 00000000
Call Trace:
 [<c04004f5>] io_schedule+0x35/0x50
 [<c01a6f75>] sync_page+0x35/0x40
 [<c0400820>] __wait_on_bit_lock+0x40/0x80
 [<c01a6f40>] ? sync_page+0x0/0x40
 [<c01a6f1d>] __lock_page+0x4d/0x60
 [<c0151250>] ? wake_bit_function+0x0/0x60
 [<c0238c06>] ext4_da_writepages+0x706/0x7a0
 [<c01a6c30>] ? find_get_pages_tag+0x0/0x120
 [<c0162127>] ? lock_release_non_nested+0x187/0x2b0
 [<c01f51d5>] ? writeback_inodes_wb+0x245/0x3b0
 [<c01f4925>] ? writeback_single_inode+0x95/0x260
 [<c01f51d5>] ? writeback_inodes_wb+0x245/0x3b0
 [<c01f4925>] ? writeback_single_inode+0x95/0x260
 [<c0238500>] ? ext4_da_writepages+0x0/0x7a0
 [<c01adb6d>] do_writepages+0x1d/0x30
 [<c01f4930>] writeback_single_inode+0xa0/0x260
 [<c01f5216>] writeback_inodes_wb+0x286/0x3b0
 [<c01f543f>] wb_writeback+0xff/0x1a0
 [<c01f5606>] ? wb_do_writeback+0x86/0x1e0
 [<c01f573b>] wb_do_writeback+0x1bb/0x1e0
 [<c01f55a2>] ? wb_do_writeback+0x22/0x1e0
 [<c01f5792>] bdi_writeback_task+0x32/0xa0
 [<c01bca5e>] bdi_start_fn+0x5e/0xb0
 [<c01bca00>] ? bdi_start_fn+0x0/0xb0
 [<c0151144>] kthread+0x74/0x80
 [<c01510d0>] ? kthread+0x0/0x80
 [<c0103a07>] kernel_thread_helper+0x7/0x10

write_test      D c01272a0     0  5966      1 0x00000005
 c2149c50 00200046 00200046 c01272a0 00000001 c4f09910 00200296 c4f09910
 c2149c14 c061de60 c061de60 c061de60 c2149c34 c01272a0 ce9e2594 ce9e2500
 ce9e2694 c1607e60 00200246 c0267d9c 00000001 c4f09814 c4f09800 c4f09814
Call Trace:
 [<c01272a0>] ? __wake_up+0x40/0x50
 [<c01272a0>] ? __wake_up+0x40/0x50
 [<c0267d9c>] ? start_this_handle+0x36c/0x580
 [<c0267da1>] start_this_handle+0x371/0x580
 [<c0160bac>] ? lockdep_init_map+0x3c/0x500
 [<c0151210>] ? autoremove_wake_function+0x0/0x40
 [<c0268055>] jbd2_journal_start+0xa5/0xd0
 [<c0248663>] ext4_journal_start_sb+0x53/0xa0
 [<c0245db3>] ? __ext4_journal_stop+0x43/0x70
 [<c0238ef4>] ext4_da_write_begin+0x254/0x3a0
 [<c0239470>] ? ext4_da_get_block_prep+0x0/0x360
 [<c01a785e>] generic_file_buffered_write+0xde/0x260
 [<c01a7e36>] __generic_file_aio_write+0x276/0x510
 [<c0400e04>] ? mutex_lock_nested+0x1e4/0x270
 [<c01a8128>] generic_file_aio_write+0x58/0xc0
 [<c022fa0f>] ext4_file_write+0x3f/0xd0
 [<c0147ed5>] ? ptrace_stop+0xa5/0xf0
 [<c01d85bd>] do_sync_write+0xcd/0x110
 [<c014840a>] ? ptrace_notify+0x9a/0xb0
 [<c0151210>] ? autoremove_wake_function+0x0/0x40
 [<c028291f>] ? security_file_permission+0xf/0x20
 [<c01d877c>] ? rw_verify_area+0x6c/0xe0
 [<c01d8de6>] vfs_write+0x96/0x190
 [<c01d84f0>] ? do_sync_write+0x0/0x110
 [<c01d950d>] sys_write+0x3d/0x70
 [<c0102f55>] syscall_call+0x7/0xb
-----------------------------------------------------------------------------------------------------------------------------------------
I am also attaching the complete trace for your reference.

write_test is my test executable that performs many sequential
writes to a file.
This stack trace confirms that lock_page() in write_cache_pages() is
resulting in this lockup.

I have a few questions here.
I guess the process named jbd2/sdb1-8 is the kjournald thread. But what is
the flush-8:16 process? Is it the kernel thread for periodically writing
dirty pages to disk?

Is it the case that these threads are running concurrently at a certain
time and are trying to take locks on the same pages, resulting in a deadlock?
If yes, what can be the reason? Am I making a mistake by calling
filemap_write_and_wait() in the journal_commit flow?


Please reply.

Thanks & Regards,
Kailas

On 22 March 2010 22:22, Jan Kara <[email protected]> wrote:
>
> Hi,
>
> On Fri 19-03-10 08:53:08, Kailas Joshi wrote:
> > I am facing some problems while implementing alloc_on_commit.
> > While performing exhaustive write operations (for example, using postmark),
> > the system locks up after some time.
> > It runs fine for (simple) non-exhaustive write operations.
> >
> > I am using filemap_write_and_wait() in the journal commit callback for
> > performing synchronous block allocation. It uses a special journal handle
> > which enables the use of early reservations.
> > Is it right to use this function here? If not, is there any other alternative
> > that should be used in this scenario?
> >
> > I am using following strategy -
> > 1) ext4_da_get_block_prep() marks delayed-allocation buffers with BH_DA
> > after reserving space for them.
> We have a BH_Delay flag for this already. OK, probably you need a
> temporary flag which you can clear in ext4_da_write_begin. I'd find
> counting the number of BH_Delay buffers before and after the block_write_begin
> call nicer...
>
> > 2) ext4_da_write_begin() counts the number of buffers marked with BH_DA and
> > reserves credits for block allocation.
> > 3) journal_stop() accumulates the unused credits of a handle in the
> > transaction.
> > 4) journal_start() when called with nblocks=0, creates a special handle with
> > the credits accumulated by all previous handles(by step 2).
> This is a hack. I'd rather create a separate JBD2 function for this.
>
> > 5) journal_commit() creates special handle for block allocation(as in step
> > 4) and calls filemap_write_and_wait() to perform block allocation.
> >
> > I am also sending the patch(for kernel 2.6.32.4) for my implementation (also
> > available at
> > http://www.cse.iitb.ac.in/~kailasjoshi/files/alloc_on_commit.patch).
> >
> > Being new to filesystem development, I am not able to identify the problem.
> > I will be very grateful if someone can help me out.
> Probably you are hitting some lock inversion problem. I suggest you
> compile the kernel with lockdep enabled (in Kernel hacking -> Lock debugging
> -> Prove lock correctness or something like that) and see whether it issues
> some warnings. If not, you can get backtraces of the locked up processes
> by pressing Alt-Sysrq-w (or echo "w" >/proc/sysrq-trigger).
>
>                                                                        Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR


Attachments:
syslockup.log (28.41 kB)

2010-03-29 16:45:21

by Jan Kara

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

Hi,

On Tue 23-03-10 16:11:45, Kailas Joshi wrote:
> I have Lock Debugging enabled but that didn't give any warnings.
> However, when I did echo "w" >/proc/sysrq-trigger after system lockup,
> I got the stack traces of the locked-up processes.
>
> Following are the stack traces of the processes (I suspect) resulting
> in total system lockup -
<snip>

So kjournald is waiting on a page lock, and everyone else waits for
kjournald to finish committing or for a page lock as well. The strange thing
is that I don't see anybody who could hold the page lock everyone is
waiting on. So I think further debugging should go in this direction - find
out which page we are waiting on and who is holding its lock (you'd need to
add tracking of the page lock owner, but that shouldn't be too hard).
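
A purely illustrative sketch of that tracking idea (the config option and the lock_owner field are hypothetical debug-only additions, not an existing kernel facility):

#ifdef CONFIG_DEBUG_PAGE_LOCK_OWNER     /* hypothetical option */
static inline void lock_page_track(struct page *page)
{
        lock_page(page);
        page->lock_owner = current;     /* hypothetical debug-only field */
}

static inline void unlock_page_track(struct page *page)
{
        page->lock_owner = NULL;
        unlock_page(page);
}
#endif

Replacing the lock_page()/unlock_page() calls in the suspect paths with these wrappers would let you see, from the hung task's side, which task last took the page lock.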

> I have a few questions here.
> I guess the process named jbd2/sdb1-8 is the kjournald thread. But what is the
Yes.

> flush-8:16 process? Is it the kernel thread for periodically writing
> dirty pages to disk?
Yes.

> Is it the case that these threads are running concurrently at a certain
> time and are trying to take locks on the same pages, resulting in a deadlock?
It should not happen - they should always acquire page locks in
index-increasing order, so deadlocks should be avoided that way...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-04-17 04:42:52

by Kailas Joshi

Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

Hi

I have implemented alloc_on_commit for EXT4.
I haven't tested it thoroughly, but I could run some test scripts and
postmark without any errors.

Though it's working, the performance is very poor.
As Ted predicted, I guess this is because filesystem operations now stall
for longer, since block allocation is done while the transaction is in
LOCKED mode.

I am sending the patch (for kernel 2.6.32.4) for my implementation.
Please go through the patch and let me know if I am making any mistakes
that result in poor performance.
Also, let me know if it is possible to improve performance by some other means.

Thanks in advance.

Regards,
Kailas Joshi

Index: linux-2.6.32.4/fs/fs-writeback.c
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/fs/fs-writeback.c,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 fs-writeback.c
*** linux-2.6.32.4/fs/fs-writeback.c 19 Jan 2010 17:27:50 -0000 1.1.1.1
--- linux-2.6.32.4/fs/fs-writeback.c 15 Apr 2010 13:14:56 -0000
*************** int write_inode_now(struct inode *inode,
*** 1259,1264 ****
--- 1259,1278 ----
}
EXPORT_SYMBOL(write_inode_now);

+ /** alloc_on_commit - kailas
+ * map_inode_now - allocate (map) the delayed blocks of an inode's dirty pages
+ * @inode: inode whose delayed blocks should be mapped
+ * @sync: not used
+ *
+ * The caller must either have a ref on the inode or must have set I_WILL_FREE.
+ */
+ int map_inode_now(struct inode *inode, int sync)
+ {
+ return filemap_fdatamap(inode->i_mapping);
+ }
+ EXPORT_SYMBOL(map_inode_now);
+
+
/**
* sync_inode - write an inode and its pages to disk.
* @inode: the inode to sync
Index: linux-2.6.32.4/fs/ext4/ext4.h
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/fs/ext4/ext4.h,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 ext4.h
*** linux-2.6.32.4/fs/ext4/ext4.h 19 Jan 2010 17:27:58 -0000 1.1.1.1
--- linux-2.6.32.4/fs/ext4/ext4.h 4 Mar 2010 00:01:53 -0000
*************** struct ext4_inode_info {
*** 743,750 ****
#define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
#define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
! #define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
#define EXT4_MOUNT_WRITEBACK_DATA 0x00C00 /* No data ordering */
#define EXT4_MOUNT_UPDATE_JOURNAL 0x01000 /* Update the journal format */
#define EXT4_MOUNT_NO_UID32 0x02000 /* Disable 32-bit UIDs */
#define EXT4_MOUNT_XATTR_USER 0x04000 /* Extended user attributes */
--- 743,751 ----
#define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
#define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
! #define EXT4_MOUNT_ORDERED_DATA 0x00000 /* Flush data before commit */
#define EXT4_MOUNT_WRITEBACK_DATA 0x00C00 /* No data ordering */
+ #define EXT4_MOUNT_ALLOC_COMMIT_DATA 0x00800 /* Alloc data on commit */
#define EXT4_MOUNT_UPDATE_JOURNAL 0x01000 /* Update the journal format */
#define EXT4_MOUNT_NO_UID32 0x02000 /* Disable 32-bit UIDs */
#define EXT4_MOUNT_XATTR_USER 0x04000 /* Extended user attributes */
*************** struct ext4_sb_info {
*** 1020,1025 ****
--- 1021,1029 ----

/* workqueue for dio unwritten */
struct workqueue_struct *dio_unwritten_wq;
+
+ /* alloc_on_commit - kailas */
+ handle_t *da_handle;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
*************** static inline int ext4_valid_inum(struct
*** 1153,1162 ****
#define EXT4_DEFM_XATTR_USER 0x0004
#define EXT4_DEFM_ACL 0x0008
#define EXT4_DEFM_UID16 0x0010
! #define EXT4_DEFM_JMODE 0x0060
#define EXT4_DEFM_JMODE_DATA 0x0020
#define EXT4_DEFM_JMODE_ORDERED 0x0040
#define EXT4_DEFM_JMODE_WBACK 0x0060

/*
* Default journal batch times
--- 1157,1167 ----
#define EXT4_DEFM_XATTR_USER 0x0004
#define EXT4_DEFM_ACL 0x0008
#define EXT4_DEFM_UID16 0x0010
! #define EXT4_DEFM_JMODE 0x00E0
#define EXT4_DEFM_JMODE_DATA 0x0020
#define EXT4_DEFM_JMODE_ORDERED 0x0040
#define EXT4_DEFM_JMODE_WBACK 0x0060
+ #define EXT4_DEFM_JMODE_ALLOC_COMMIT 0x00C0

/*
* Default journal batch times
*************** extern void ext4_truncate(struct inode *
*** 1428,1435 ****
--- 1433,1442 ----
extern int ext4_truncate_restart_trans(handle_t *, struct inode *,
int nblocks);
extern void ext4_set_inode_flags(struct inode *);
extern void ext4_get_inode_flags(struct ext4_inode_info *);
+ extern int ext4_sync_alloc_da_blocks(struct inode *inode, handle_t *da_handle);
extern int ext4_alloc_da_blocks(struct inode *inode);
extern void ext4_set_aops(struct inode *inode);
+ extern int ext4_ordered_da_writepage_trans_blocks(struct inode *, int nrblocks);
extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
Index: linux-2.6.32.4/fs/ext4/ext4_jbd2.h
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/fs/ext4/ext4_jbd2.h,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 ext4_jbd2.h
*** linux-2.6.32.4/fs/ext4/ext4_jbd2.h 19 Jan 2010 17:27:58 -0000 1.1.1.1
--- linux-2.6.32.4/fs/ext4/ext4_jbd2.h 25 Feb 2010 07:51:37 -0000
*************** static inline int ext4_should_order_data
*** 295,301 ****
return 0;
if (EXT4_I(inode)->i_flags & EXT4_JOURNAL_DATA_FL)
return 0;
! if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
return 1;
return 0;
}
--- 295,302 ----
return 0;
if (EXT4_I(inode)->i_flags & EXT4_JOURNAL_DATA_FL)
return 0;
! if ((test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA) ||
! (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ALLOC_COMMIT_DATA))
return 1;
return 0;
}
Index: linux-2.6.32.4/fs/ext4/inode.c
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/fs/ext4/inode.c,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 inode.c
*** linux-2.6.32.4/fs/ext4/inode.c 19 Jan 2010 17:27:58 -0000 1.1.1.1
--- linux-2.6.32.4/fs/ext4/inode.c 15 Apr 2010 08:50:16 -0000
*************** static int walk_page_buffers(handle_t *h
*** 1498,1503 ****
--- 1498,1530 ----
return ret;
}

+ static int count_page_buffers(struct buffer_head *head,
+ unsigned from,
+ unsigned to,
+ int *partial,
+ int (*fn)(struct buffer_head *bh))
+ {
+ struct buffer_head *bh;
+ unsigned block_start, block_end;
+ unsigned blocksize = head->b_size;
+ int ret = 0;
+ struct buffer_head *next;
+
+ for (bh = head, block_start = 0;
+ bh != head || !block_start;
+ block_start = block_end, bh = next) {
+ next = bh->b_this_page;
+ block_end = block_start + blocksize;
+ if (block_end <= from || block_start >= to) {
+ if (partial && !buffer_uptodate(bh))
+ *partial = 1;
+ continue;
+ }
+ ret += ((*fn)(bh)? 1 : 0);
+ }
+ return ret;
+ }
+
/*
* To preserve ordering, it is essential that the hole instantiation and
* the data write be encapsulated in a single transaction. We cannot
*************** static int mpage_da_submit_io(struct mpa
*** 1970,1976 ****
long pages_skipped;
struct pagevec pvec;
unsigned long index, end;
! int ret = 0, err, nr_pages, i;
struct inode *inode = mpd->inode;
struct address_space *mapping = inode->i_mapping;

--- 1997,2003 ----
long pages_skipped;
struct pagevec pvec;
unsigned long index, end;
! int ret = 0, err = 0, nr_pages, i;
struct inode *inode = mpd->inode;
struct address_space *mapping = inode->i_mapping;

*************** static int mpage_da_submit_io(struct mpa
*** 2000,2006 ****
--- 2027,2042 ----
BUG_ON(!PageLocked(page));
BUG_ON(PageWriteback(page));

+ /* alloc_on_commit - kailas */
+ if(mpd->wbc->map_only) {
+ mpd->pages_written++;
+ __set_page_mapped_nobuffers(page);
+ unlock_page(page);
+ continue;
+ }
+
pages_skipped = mpd->wbc->pages_skipped;
+
err = mapping->a_ops->writepage(page, mpd->wbc);
if (!err && (pages_skipped == mpd->wbc->pages_skipped))
/*
*************** static int ext4_da_get_block_prep(struct
*** 2538,2543 ****
--- 2574,2581 ----
map_bh(bh_result, inode->i_sb, invalid_block);
set_buffer_new(bh_result);
set_buffer_delay(bh_result);
+ if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ALLOC_COMMIT_DATA)
+ set_buffer_da(bh_result);
} else if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
if (buffer_unwritten(bh_result)) {
*************** static int ext4_da_writepages_trans_bloc
*** 2801,2806 ****
--- 2839,2906 ----
return ext4_chunk_trans_blocks(inode, max_blocks);
}

+ /* alloc_on_commit - kailas */
+ static int ext4_clear_page_mapped(struct address_space *mapping,
+ struct writeback_control *wbc)
+ {
+ int ret = 0;
+ struct pagevec pvec;
+ int nr_pages;
+ pgoff_t index;
+ pgoff_t end;
+ int i;
+
+ index = wbc->range_start >> PAGE_CACHE_SHIFT;
+ end = wbc->range_end >> PAGE_CACHE_SHIFT;
+ pagevec_init(&pvec, 0);
+
+ nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+ PAGECACHE_TAG_MAPPED,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+ if (nr_pages == 0)
+ return ret;
+
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pvec.pages[i];
+
+ /*
+ * At this point, the page may be truncated or
+ * invalidated (changing page->mapping to NULL), or
+ * even swizzled back from swapper_space to tmpfs file
+ * mapping. However, page->index will not change
+ * because we have a reference on the page.
+ */
+ if (page->index > end)
+ break;
+
+ lock_page(page);
+
+ /*
+ * Page truncated or invalidated. We can freely skip it
+ * then, even for data integrity operations: the page
+ * has disappeared concurrently, so there could be no
+ * real expectation of this data integrity operation
+ * even if there is now a new, dirty page at the same
+ * pagecache address.
+ */
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ continue;
+ }
+
+ __set_page_dirty_nobuffers(page);
+
+ unlock_page(page);
+ }
+
+ /* Release the pagevec only after all its pages have been processed */
+ pagevec_release(&pvec);
+ cond_resched();
+
+ return ret;
+ }
+
+
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
*************** retry:
*** 3003,3008 ****
--- 3104,3111 ----
mapping->writeback_index = index;

out_writepages:
+ if(wbc->map_only) /* alloc_on_commit - kailas */
+ ext4_clear_page_mapped(mapping, wbc);
if (!no_nrwrite_index_update)
wbc->no_nrwrite_index_update = 0;
if (wbc->nr_to_write > nr_to_writebump)
*************** static int ext4_nonda_switch(struct supe
*** 3039,3044 ****
--- 3142,3157 ----
return 0;
}

+ static int buffer_da_count(struct buffer_head *head)
+ {
+ if(buffer_da(head)) {
+ clear_buffer_da(head);
+ return 1;
+ }
+
+ return 0;
+ }
+
static int ext4_da_write_begin(struct file *file, struct
address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
*************** static int ext4_da_write_begin(struct fi
*** 3062,3067 ****
--- 3175,3182 ----
*fsdata = (void *)0;
trace_ext4_da_write_begin(inode, pos, len, flags);
retry:
+
+ /* alloc_on_commit - kailas */
/*
* With delayed allocation, we don't log the i_disksize update
* if there is delayed block allocation. But we still need
*************** retry:
*** 3102,3107 ****
--- 3217,3258 ----

if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
+
+ /* alloc_on_commit - kailas */
+ /*
+ * With delayed allocation, we don't log the i_disksize update
+ * if there is delayed block allocation. But we still need
+ * to journalling the i_disksize update if writes to the end
+ * of file which has an already mapped buffer.
+ */
+ /* Count number of page buffers with BH_DA */
+ if (test_opt(inode->i_sb, DATA_FLAGS) ==
+ EXT4_MOUNT_ALLOC_COMMIT_DATA) {
+ int needed_blocks;
+ int credits;
+ int err;
+
+ needed_blocks = count_page_buffers(page_buffers(page),
+ from, to, NULL, buffer_da_count);
+ credits = ext4_ordered_da_writepage_trans_blocks(inode, needed_blocks);
+
+ if (!ext4_handle_has_enough_credits(handle, credits)) {
+ err = ext4_journal_extend(handle, credits - 1);
+ if (err > 0) {
+ unlock_page(page);
+ err = ext4_journal_restart(handle, credits);
+ lock_page(page);
+ }
+ if (err != 0) {
+ ext4_warning(inode->i_sb, __func__,
+ "couldn't extend journal
(err %d)", err);
+ ext4_journal_stop(handle);
+ ret = err;
+ goto out;
+ }
+ }
+ }
+
out:
return ret;
}
*************** static int ext4_da_write_end(struct file
*** 3153,3158 ****
--- 3304,3319 ----
}
}

+ if (test_opt(inode->i_sb, DATA_FLAGS) ==
+ EXT4_MOUNT_ALLOC_COMMIT_DATA) {
+ ret = ext4_jbd2_file_inode(handle, inode);
+ if (ret)
+ goto errout;
+ ret = ext4_mark_inode_dirty(handle, inode);
+ if (ret)
+ goto errout;
+ }
+
trace_ext4_da_write_end(inode, pos, len, copied);
start = pos & (PAGE_CACHE_SIZE - 1);
end = start + copied - 1;
*************** static int ext4_da_write_end(struct file
*** 3191,3196 ****
--- 3352,3358 ----
copied = ret2;
if (ret2 < 0)
ret = ret2;
+ errout:
ret2 = ext4_journal_stop(handle);
if (!ret)
ret = ret2;
*************** int ext4_write_inode(struct inode *inode
*** 5188,5196 ****

if (EXT4_SB(inode->i_sb)->s_journal) {
if (ext4_journal_current_handle()) {
! jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
! dump_stack();
! return -EIO;
}

if (!wait)
--- 5351,5360 ----

if (EXT4_SB(inode->i_sb)->s_journal) {
if (ext4_journal_current_handle()) {
! /* jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n"); */
! /* dump_stack(); */
! /* return -EIO; */
! return 0;
}

if (!wait)
*************** int ext4_meta_trans_blocks(struct inode
*** 5457,5462 ****
--- 5621,5642 ----

/*
* Calulate the total number of credits to reserve to fit
+ * the modification of nrblocks blocks into a single transaction,
* which may include multiple chunks of block allocations.
+ *
+ * This could be called via ext4_write_begin() for alloc_on_commit mode.
+ *
+ * We need to consider the worst case, when there is
+ * one new block per extent.
+ */
+ int ext4_ordered_da_writepage_trans_blocks(struct inode *inode, int nrblocks)
+ {
+ return ext4_meta_trans_blocks(inode, nrblocks, 0);
+ }
+
+
+ /*
+ * Calulate the total number of credits to reserve to fit
* the modification of a single pages into a single transaction,
* which may include multiple chunks of block allocations.
*
*************** out_unlock:
*** 5823,5825 ****
--- 6004,6021 ----
up_read(&inode->i_alloc_sem);
return ret;
}
+
+ /* alloc_on_commit - Kailas */
+ int ext4_sync_alloc_da_blocks(struct inode *inode, handle_t *da_handle)
+ {
+ int ret = 0;
+
+ igrab(inode);
+
+ if(!(inode->i_state & I_SYNC))
+ ret = map_inode_now(inode, 1);
+
+ iput(inode);
+
+ return ret;
+ }
Index: linux-2.6.32.4/fs/ext4/super.c
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/fs/ext4/super.c,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 super.c
*** linux-2.6.32.4/fs/ext4/super.c 19 Jan 2010 17:27:58 -0000 1.1.1.1
--- linux-2.6.32.4/fs/ext4/super.c 25 Mar 2010 11:27:14 -0000
*************** static int ext4_statfs(struct dentry *de
*** 68,73 ****
--- 68,74 ----
static int ext4_unfreeze(struct super_block *sb);
static void ext4_write_super(struct super_block *sb);
static int ext4_freeze(struct super_block *sb);
+ static void alloc_on_commit_callback(journal_t *journal, handle_t *da_handle);


ext4_fsblk_t ext4_block_bitmap(struct super_block *sb,
*************** static void ext4_put_nojournal(handle_t
*** 223,228 ****
--- 224,230 ----
handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
{
journal_t *journal;
+ handle_t *handle;

if (sb->s_flags & MS_RDONLY)
return ERR_PTR(-EROFS);
*************** handle_t *ext4_journal_start_sb(struct s
*** 236,242 ****
ext4_abort(sb, __func__, "Detected aborted journal");
return ERR_PTR(-EROFS);
}
! return jbd2_journal_start(journal, nblocks);
}
return ext4_get_nojournal();
}
--- 238,251 ----
ext4_abort(sb, __func__, "Detected aborted journal");
return ERR_PTR(-EROFS);
}
!
! handle = jbd2_journal_start(journal, nblocks);
!
! /* alloc_on_commit - kailas */
! if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ALLOC_COMMIT_DATA)
! handle->h_retain_credits = 1;
!
! return handle;
}
return ext4_get_nojournal();
}
*************** static int ext4_show_options(struct seq_
*** 895,900 ****
--- 904,911 ----
seq_puts(seq, ",data=ordered");
else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)
seq_puts(seq, ",data=writeback");
+ else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ALLOC_COMMIT_DATA)
+ seq_puts(seq, ",data=alloc_on_commit");

if (sbi->s_inode_readahead_blks != EXT4_DEF_INODE_READAHEAD_BLKS)
seq_printf(seq, ",inode_readahead_blks=%u",
*************** enum {
*** 1087,1093 ****
Opt_journal_update, Opt_journal_dev,
Opt_journal_checksum, Opt_journal_async_commit,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
! Opt_data_err_abort, Opt_data_err_ignore,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_nobarrier, Opt_err, Opt_resize,
--- 1098,1104 ----
Opt_journal_update, Opt_journal_dev,
Opt_journal_checksum, Opt_journal_async_commit,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
! Opt_data_alloc_on_commit, Opt_data_err_abort, Opt_data_err_ignore,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_nobarrier, Opt_err, Opt_resize,
*************** static const match_table_t tokens = {
*** 1134,1139 ****
--- 1145,1151 ----
{Opt_data_journal, "data=journal"},
{Opt_data_ordered, "data=ordered"},
{Opt_data_writeback, "data=writeback"},
+ {Opt_data_alloc_on_commit, "data=alloc_on_commit"},
{Opt_data_err_abort, "data_err=abort"},
{Opt_data_err_ignore, "data_err=ignore"},
{Opt_offusrjquota, "usrjquota="},
*************** static int parse_options(char *options,
*** 1359,1364 ****
--- 1371,1379 ----
case Opt_data_ordered:
data_opt = EXT4_MOUNT_ORDERED_DATA;
goto datacheck;
+ case Opt_data_alloc_on_commit:
+ data_opt = EXT4_MOUNT_ALLOC_COMMIT_DATA;
+ goto datacheck;
case Opt_data_writeback:
data_opt = EXT4_MOUNT_WRITEBACK_DATA;
datacheck:
*************** static void ext4_orphan_cleanup(struct s
*** 1958,1963 ****
--- 1973,2016 ----
sb->s_flags = s_flags; /* Restore MS_RDONLY status */
}

+
+ /*
+ * This callback is called before each commit when we are using
+ * alloc-on-commit mode.
+ */
+ static void alloc_on_commit_callback(journal_t *journal, handle_t *da_handle)
+ {
+ struct jbd2_inode *jinode, *next_i;
+ transaction_t *transaction = journal->j_running_transaction;
+ struct ext4_sb_info *sbi;
+
+ spin_lock(&journal->j_list_lock);
+ list_for_each_entry_safe(jinode, next_i,
+ &transaction->t_inode_list, i_list) {
+ spin_unlock(&journal->j_list_lock);
+
+ /* sbi = EXT4_SB(jinode->i_vfs_inode->i_sb); */
+ /* sbi->da_handle = da_handle; */
+
+ printk(KERN_ALERT "Writing handle:%p inode:%lu\n",
+ da_handle, jinode->i_vfs_inode->i_ino);
+
+ /* ext4_alloc_da_blocks(jinode->i_vfs_inode); */
+ ext4_sync_alloc_da_blocks(jinode->i_vfs_inode, da_handle);
+
+
+ printk(KERN_ALERT "Written handle:%p inode:%lu\n",
+ da_handle, jinode->i_vfs_inode->i_ino);
+
+ /* sbi->da_handle = NULL; */
+
+ spin_lock(&journal->j_list_lock);
+ }
+ spin_unlock(&journal->j_list_lock);
+ }
+
+
+
/*
* Maximal extent format file size.
* Resulting logical blkno at s_maxbytes must fit in our on-disk
*************** static int ext4_fill_super(struct super_
*** 2434,2439 ****
--- 2487,2495 ----
sbi->s_mount_opt |= EXT4_MOUNT_ORDERED_DATA;
else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_WBACK)
sbi->s_mount_opt |= EXT4_MOUNT_WRITEBACK_DATA;
+ else if ((def_mount_opts & EXT4_DEFM_JMODE) ==
+ EXT4_DEFM_JMODE_ALLOC_COMMIT)
+ sbi->s_mount_opt |= EXT4_MOUNT_ALLOC_COMMIT_DATA;

if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC)
set_opt(sbi->s_mount_opt, ERRORS_PANIC);
*************** static int ext4_fill_super(struct super_
*** 2804,2821 ****
/* We have now updated the journal if required, so we can
* validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {
! case 0:
! /* No mode set, assume a default based on the journal
! * capabilities: ORDERED_DATA if the journal can
! * cope, else JOURNAL_DATA
! */
! if (jbd2_journal_check_available_features
! (sbi->s_journal, 0, 0, JBD2_FEATURE_INCOMPAT_REVOKE))
! set_opt(sbi->s_mount_opt, ORDERED_DATA);
! else
! set_opt(sbi->s_mount_opt, JOURNAL_DATA);
! break;
!
case EXT4_MOUNT_ORDERED_DATA:
case EXT4_MOUNT_WRITEBACK_DATA:
if (!jbd2_journal_check_available_features
--- 2860,2868 ----
/* We have now updated the journal if required, so we can
* validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {
! case EXT4_MOUNT_ALLOC_COMMIT_DATA:
! sbi->s_journal->j_pre_commit_callback =
! alloc_on_commit_callback;
case EXT4_MOUNT_ORDERED_DATA:
case EXT4_MOUNT_WRITEBACK_DATA:
if (!jbd2_journal_check_available_features
*************** no_journal:
*** 2939,2944 ****
--- 2986,2994 ----
descr = " journalled data mode";
else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
descr = " ordered data mode";
+ else if (test_opt(sb, DATA_FLAGS) ==
+ EXT4_MOUNT_ALLOC_COMMIT_DATA)
+ descr = " alloc on commit data mode";
else
descr = " writeback data mode";
} else
Index: linux-2.6.32.4/fs/jbd/journal.c
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/fs/jbd/journal.c,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 journal.c
*** linux-2.6.32.4/fs/jbd/journal.c 19 Jan 2010 17:27:59 -0000 1.1.1.1
--- linux-2.6.32.4/fs/jbd/journal.c 19 Feb 2010 10:07:43 -0000
*************** static void __init jbd_create_debugfs_en
*** 1913,1919 ****
{
jbd_debugfs_dir = debugfs_create_dir("jbd", NULL);
if (jbd_debugfs_dir)
! jbd_debug = debugfs_create_u8("jbd-debug", S_IRUGO,
jbd_debugfs_dir,
&journal_enable_debug);
}
--- 1913,1919 ----
{
jbd_debugfs_dir = debugfs_create_dir("jbd", NULL);
if (jbd_debugfs_dir)
! jbd_debug = debugfs_create_u8("jbd-debug", S_IRUGO | S_IWUSR,
jbd_debugfs_dir,
&journal_enable_debug);
}
Index: linux-2.6.32.4/fs/jbd2/commit.c
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/fs/jbd2/commit.c,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 commit.c
*** linux-2.6.32.4/fs/jbd2/commit.c 19 Jan 2010 17:27:55 -0000 1.1.1.1
--- linux-2.6.32.4/fs/jbd2/commit.c 27 Mar 2010 06:25:47 -0000
*************** void jbd2_journal_commit_transaction(jou
*** 369,374 ****
--- 369,375 ----
struct buffer_head *cbh = NULL; /* For transactional checksums */
__u32 crc32_sum = ~0;
int write_op = WRITE;
+ handle_t *da_handle = NULL;

/*
* First job: lock down the current transaction and wait for
*************** void jbd2_journal_commit_transaction(jou
*** 399,404 ****
--- 400,417 ----
jbd_debug(1, "JBD: starting commit of transaction %d\n",
commit_transaction->t_tid);

+ printk(KERN_ALERT "alloc_on_commit: committing, t_updates=%d\n",
+ commit_transaction->t_updates);
+
+ /* alloc_on_commit - kailas */
+ if (journal->j_pre_commit_callback) {
+
+ printk(KERN_ALERT "alloc_on_commit: starting pre-commit handle, t_updates=%d\n",
+ commit_transaction->t_updates);
+
+ da_handle = jbd2_journal_start(journal, 0);
+ }
+
spin_lock(&journal->j_state_lock);
commit_transaction->t_state = T_LOCKED;

*************** void jbd2_journal_commit_transaction(jou
*** 416,426 ****
stats.run.rs_locked);

spin_lock(&commit_transaction->t_handle_lock);
! while (commit_transaction->t_updates) {
DEFINE_WAIT(wait);

prepare_to_wait(&journal->j_wait_updates, &wait,
TASK_UNINTERRUPTIBLE);
if (commit_transaction->t_updates) {
spin_unlock(&commit_transaction->t_handle_lock);
spin_unlock(&journal->j_state_lock);
--- 429,469 ----
stats.run.rs_locked);

spin_lock(&commit_transaction->t_handle_lock);
! /*
! * alloc_on_commit - kailas: the pre-commit da_handle is itself counted
! * in t_updates, so when it is held wait until it is the only remaining
! * handle; otherwise wait for all handles as before.
! */
! while (1) {
!
! if (da_handle) {
! if (commit_transaction->t_updates <= 1)
! break;
! }
! else
! if(!commit_transaction->t_updates)
! break;
!
! {
DEFINE_WAIT(wait);

prepare_to_wait(&journal->j_wait_updates, &wait,
TASK_UNINTERRUPTIBLE);
+ /* alloc_on_commit - kailas */
+ /* if (commit_transaction->t_updates != 1) { */
+ /* if (commit_transaction->t_updates) { */
+
+ if (da_handle) {
+ if (commit_transaction->t_updates > 1) {
+ spin_unlock(&commit_transaction->t_handle_lock);
+ spin_unlock(&journal->j_state_lock);
+ /* printk(KERN_ALERT "alloc_on_commit: %d\n" */
+ /* , commit_transaction->t_updates); */
+ schedule();
+ spin_lock(&journal->j_state_lock);
+ spin_lock(&commit_transaction->t_handle_lock);
+ }
+ }
+ else
if (commit_transaction->t_updates) {
spin_unlock(&commit_transaction->t_handle_lock);
spin_unlock(&journal->j_state_lock);
*************** void jbd2_journal_commit_transaction(jou
*** 428,437 ****
--- 471,502 ----
spin_lock(&journal->j_state_lock);
spin_lock(&commit_transaction->t_handle_lock);
}
+
finish_wait(&journal->j_wait_updates, &wait);
}
+ }
+
spin_unlock(&commit_transaction->t_handle_lock);

+ /* alloc_on_commit - kailas */
+ if (da_handle) {
+ J_ASSERT (da_handle->h_buffer_credits == 0);
+ da_handle->h_buffer_credits = commit_transaction->t_retained_credits;
+
+ spin_unlock(&journal->j_state_lock);
+
+ printk(KERN_ALERT "alloc_on_commit: starting callback, t_updates=%d\n",
+ commit_transaction->t_updates);
+
+ journal->j_pre_commit_callback(journal, da_handle);
+
+ printk(KERN_ALERT "alloc_on_commit: callback finished, t_updates=%d\n",
+ commit_transaction->t_updates);
+
+ jbd2_journal_stop(da_handle);
+ spin_lock(&journal->j_state_lock);
+ }
+
J_ASSERT (commit_transaction->t_outstanding_credits <=
journal->j_max_transaction_buffers);

*************** restart_loop:
*** 1057,1065 ****
}
spin_unlock(&journal->j_list_lock);

- if (journal->j_commit_callback)
- journal->j_commit_callback(journal, commit_transaction);
-
trace_jbd2_end_commit(journal, commit_transaction);
jbd_debug(1, "JBD: commit %d complete, head %d\n",
journal->j_commit_sequence, journal->j_tail_sequence);
--- 1122,1127 ----
Index: linux-2.6.32.4/fs/jbd2/journal.c
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/fs/jbd2/journal.c,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 journal.c
*** linux-2.6.32.4/fs/jbd2/journal.c 19 Jan 2010 17:27:55 -0000 1.1.1.1
--- linux-2.6.32.4/fs/jbd2/journal.c 19 Feb 2010 10:09:26 -0000
*************** static void __init jbd2_create_debugfs_e
*** 2115,2121 ****
{
jbd2_debugfs_dir = debugfs_create_dir("jbd2", NULL);
if (jbd2_debugfs_dir)
! jbd2_debug = debugfs_create_u8(JBD2_DEBUG_NAME, S_IRUGO,
jbd2_debugfs_dir,
&jbd2_journal_enable_debug);
}
--- 2115,2121 ----
{
jbd2_debugfs_dir = debugfs_create_dir("jbd2", NULL);
if (jbd2_debugfs_dir)
! jbd2_debug = debugfs_create_u8(JBD2_DEBUG_NAME, S_IRUGO | S_IWUSR,
jbd2_debugfs_dir,
&jbd2_journal_enable_debug);
}
Index: linux-2.6.32.4/fs/jbd2/transaction.c
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/fs/jbd2/transaction.c,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 transaction.c
*** linux-2.6.32.4/fs/jbd2/transaction.c 19 Jan 2010 17:27:55 -0000 1.1.1.1
--- linux-2.6.32.4/fs/jbd2/transaction.c 27 Mar 2010 07:20:27 -0000
*************** int jbd2_journal_stop(handle_t *handle)
*** 1313,1325 ****
--- 1314,1345 ----
current->journal_info = NULL;
spin_lock(&journal->j_state_lock);
spin_lock(&transaction->t_handle_lock);
+
+ /* alloc_on_commit - kailas */
+ if (handle->h_retain_credits) {
+ transaction->t_retained_credits += handle->h_buffer_credits;
+ }
+ else {
transaction->t_outstanding_credits -= handle->h_buffer_credits;
+ }
+
transaction->t_updates--;
+
+ /* alloc_on_commit - kailas */
+ if(!handle->h_retain_credits) {
if (!transaction->t_updates) {
wake_up(&journal->j_wait_updates);
if (journal->j_barrier_count)
wake_up(&journal->j_wait_transaction_locked);
}
+ }
+ else {
+ if (transaction->t_updates == 1) {
+ wake_up(&journal->j_wait_updates);
+ if (journal->j_barrier_count)
+ wake_up(&journal->j_wait_transaction_locked);
+ }
+ }

/*
* If the handle is marked SYNC, we need to set another commit
Index: linux-2.6.32.4/include/linux/buffer_head.h
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/include/linux/buffer_head.h,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 buffer_head.h
*** linux-2.6.32.4/include/linux/buffer_head.h 19 Jan 2010 17:27:35
-0000 1.1.1.1
--- linux-2.6.32.4/include/linux/buffer_head.h 19 Feb 2010 12:14:17 -0000
*************** enum bh_state_bits {
*** 40,45 ****
--- 40,46 ----
BH_PrivateStart,/* not a state bit, but the first bit available
* for private allocation by other entities
*/
+ BH_DA, /* Needs credit reservation for delayed block allocation*/
};

#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
*************** BUFFER_FNS(Write_EIO, write_io_error)
*** 128,133 ****
--- 129,135 ----
BUFFER_FNS(Ordered, ordered)
BUFFER_FNS(Eopnotsupp, eopnotsupp)
BUFFER_FNS(Unwritten, unwritten)
+ BUFFER_FNS(DA, da)

#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
#define touch_buffer(bh) mark_page_accessed(bh->b_page)
Index: linux-2.6.32.4/include/linux/fs.h
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/include/linux/fs.h,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 fs.h
*** linux-2.6.32.4/include/linux/fs.h 19 Jan 2010 17:27:37 -0000 1.1.1.1
--- linux-2.6.32.4/include/linux/fs.h 15 Apr 2010 08:11:00 -0000
*************** struct block_device {
*** 679,684 ****
--- 679,685 ----
*/
#define PAGECACHE_TAG_DIRTY 0
#define PAGECACHE_TAG_WRITEBACK 1
+ #define PAGECACHE_TAG_MAPPED 2 /* alloc_on_commit - kailas */

int mapping_tagged(struct address_space *mapping, int tag);

*************** extern int invalidate_inode_pages2(struc
*** 2082,2088 ****
--- 2083,2092 ----
extern int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end);
extern int write_inode_now(struct inode *, int);
+ extern int map_inode_now(struct inode *, int); /* alloc_on_commit - kailas */
extern int filemap_fdatawrite(struct address_space *);
+ extern int filemap_fdatamap(struct address_space *); /* alloc_on_commit - kailas */
+ extern int sync_filemap_flush(struct address_space *mapping);
extern int filemap_flush(struct address_space *);
extern int filemap_fdatawait(struct address_space *);
extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
Index: linux-2.6.32.4/include/linux/jbd2.h
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/include/linux/jbd2.h,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 jbd2.h
*** linux-2.6.32.4/include/linux/jbd2.h 19 Jan 2010 17:27:37 -0000 1.1.1.1
--- linux-2.6.32.4/include/linux/jbd2.h 27 Feb 2010 18:30:13 -0000
*************** struct handle_s
*** 453,458 ****
--- 453,463 ----
unsigned int h_jdata: 1; /* force data journaling */
unsigned int h_aborted: 1; /* fatal error on handle */

+ /* alloc_on_commit - kailas */
+ unsigned int h_retain_credits:1; /* Handle will retain credits
+ * till transaction commit.
+ */
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map h_lockdep_map;
#endif
*************** struct transaction_s
*** 627,632 ****
--- 632,644 ----
int t_outstanding_credits;

/*
+ * Number of buffer credits retained by summing the unused credits of all
+ * handles in this transaction. These credits will be used by the magic
+ * (commit-time) handle of this transaction. [t_handle_lock]
+ */
+ int t_retained_credits;
+
+ /*
* Forward and backward links for the circular list of all transactions
* awaiting checkpoint. [j_list_lock]
*/
*************** struct journal_s
*** 974,979 ****
--- 986,993 ----
u32 j_min_batch_time;
u32 j_max_batch_time;

+ /* This function is called before a transaction is closed */
+ void (*j_pre_commit_callback)(journal_t *, handle_t *handle);
/* This function is called when a transaction is closed */
void (*j_commit_callback)(journal_t *,
transaction_t *);
Index: linux-2.6.32.4/include/linux/mm.h
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/include/linux/mm.h,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 mm.h
*** linux-2.6.32.4/include/linux/mm.h 19 Jan 2010 17:27:38 -0000 1.1.1.1
--- linux-2.6.32.4/include/linux/mm.h 15 Apr 2010 09:31:13 -0000
*************** extern int try_to_release_page(struct pa
*** 829,834 ****
--- 829,835 ----
extern void do_invalidatepage(struct page *page, unsigned long offset);

int __set_page_dirty_nobuffers(struct page *page);
+ int __set_page_mapped_nobuffers(struct page *page); /* alloc_on_commit - kailas */
int __set_page_dirty_no_writeback(struct page *page);
int redirty_page_for_writepage(struct writeback_control *wbc,
struct page *page);
Index: linux-2.6.32.4/include/linux/writeback.h
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/include/linux/writeback.h,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 writeback.h
*** linux-2.6.32.4/include/linux/writeback.h 19 Jan 2010 17:27:34 -0000 1.1.1.1
--- linux-2.6.32.4/include/linux/writeback.h 15 Apr 2010 12:48:47 -0000
*************** struct writeback_control {
*** 61,66 ****
--- 61,67 ----
* so we use a single control to update them
*/
unsigned no_nrwrite_index_update:1;
+ unsigned map_only:1; /* Map inode blocks only. alloc_on_commit - kailas */
};

/*
Index: linux-2.6.32.4/mm/filemap.c
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/mm/filemap.c,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 filemap.c
*** linux-2.6.32.4/mm/filemap.c 19 Jan 2010 17:27:49 -0000 1.1.1.1
--- linux-2.6.32.4/mm/filemap.c 15 Apr 2010 08:09:00 -0000
*************** int filemap_fdatawrite(struct address_sp
*** 239,244 ****
--- 239,267 ----
}
EXPORT_SYMBOL(filemap_fdatawrite);

+ /** alloc_on_commit - kailas
+ * filemap_fdatamap - start block mapping writeback on mapping
+ * @mapping: target address_space
+ */
+ int filemap_fdatamap(struct address_space *mapping)
+ {
+ int ret;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = LONG_MAX,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ .map_only = 1,
+ };
+
+ if (!mapping_cap_writeback_dirty(mapping))
+ return 0;
+
+ ret = do_writepages(mapping, &wbc);
+ return ret;
+ }
+ EXPORT_SYMBOL(filemap_fdatamap);
+
int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
loff_t end)
{
Index: linux-2.6.32.4/mm/page-writeback.c
===================================================================
RCS file: /repo/kernel-source/linux-2.6.32.4/mm/page-writeback.c,v
retrieving revision 1.1.1.1
diff -p -w -B -r1.1.1.1 page-writeback.c
*** linux-2.6.32.4/mm/page-writeback.c 19 Jan 2010 17:27:49 -0000 1.1.1.1
--- linux-2.6.32.4/mm/page-writeback.c 15 Apr 2010 09:28:48 -0000
*************** int __set_page_dirty_nobuffers(struct pa
*** 1141,1146 ****
--- 1141,1156 ----
}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);

+ /* alloc_on_commit - kailas */
+ int __set_page_mapped_nobuffers(struct page *page)
+ {
+ struct address_space *mapping = page_mapping(page);
+ radix_tree_tag_set(&mapping->page_tree,
+ page_index(page), PAGECACHE_TAG_MAPPED);
+ return 0;
+ }
+ EXPORT_SYMBOL(__set_page_mapped_nobuffers);
+
/*
* When a writepage implementation decides that it doesn't want to write this
* page for some reason, it should redirty the locked page via