2014-09-11 10:09:51

by Yuanhan Liu

[permalink] [raw]
Subject: [PATCH] jbd2: remove __jbd2_journal_clean_checkpoint_list to solve scalability issue

Firstly, if I read the code correctly(I have a feeling I missed something),
__jbd2_journal_clean_checkpoint_list is not necessary. jbd2_log_do_checkpoint
will remove them for the checkpointed transaction. In another word,
__jbd2_journal_clean_checkpoint_list can't drop an transaction if it's not
checkpointed.

Secondly, I noticed severe scalability limit caused by this function with fsmark:
$ fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096

It mainly writes tons of small files, each with 4K, and calls fsync after
each write. It's one thread only, and it's tested on a 32G memory disk(brd).

"perf top" shows that journal_clean_one_cp_list() is very hot, like 40%.
However, that function is quite cheap, actually. But __jbd2_journal_clean_checkpoint_list()
will repeatedly call it for each transaction on the j_checkpoint_transactions.
The list becomes longer and longer in linear with time. It soon grows to
a list with 3,000 and more transactions. What's worse, we will iterate all
those transactions on this list every time before committing a transaction,
aka, fsync a file in our case.

So, that's roughly the reason why journal_clean_one_cp_list() is hot. However,
I'm wondering why the j_checkpoint_transactions list keeps growing. Is there
a bug? I did more investigates, and I guess I found the root cause.

In this case, we invoke fsync after each write, so, it basically means one
transaction for one file. One transaction normally contains few blocks of meta
data, like bitmap, inode and so on. All files in same group shares one bitmap
block. But one inode table block could contains 16 files. Hence, it's possible
that 2 different file points to same meta blocks.

For example, if file A and B uses same meta blocks, and if we write A and B in
following style:
write(A);
fsync(A);

write(B);
fsync(B);

then, when we are about to commit transation for B, and assume transaction of A
is not checkpointted yet, we can safely drop transaction A and replace it with
transaction B. Hence, the j_checkpoint_transactions grows by 1 only.

And then assume A is the last inode in one inode block, hence, B will use another
inode table block. Thus transaction A and B is different. Hence, both A and B are
inserted to the j_checkpoint_transactions; the list grows by 2.

Here I also got the proves; I added a trace point in jbd2_log_do_checkpoint(),
and here are some of them:

fs_mark-2646 [000] .... 5.830990: jbd2_start_checkpoint: tid=20
fs_mark-2646 [000] .... 5.832391: jbd2_start_checkpoint: tid=36
fs_mark-2646 [000] .... 5.833794: jbd2_start_checkpoint: tid=52
fs_mark-2646 [000] .... 5.835153: jbd2_start_checkpoint: tid=68
fs_mark-2646 [000] .... 5.836517: jbd2_start_checkpoint: tid=84
fs_mark-2646 [000] .... 5.837982: jbd2_start_checkpoint: tid=100
fs_mark-2646 [000] .... 5.839464: jbd2_start_checkpoint: tid=116
fs_mark-2646 [000] .... 5.840820: jbd2_start_checkpoint: tid=132

As you can see, the tid jumps by 16 each time, and other transactions are just
replaced by the way I described above.

Step by step like above, that list grows. And that's why journal_clean_one_cp_list() is hot.

It removes the scalability issue when I removed this function, and the fsmark
result(Files/sec) jumps from 9000 to 16000, which is an increase about 80%.

Thoughts?

CC: Andi Kleen <[email protected]>
Signed-off-by: Yuanhan Liu <[email protected]>
---
fs/jbd2/checkpoint.c | 115 ---------------------------------------------------
fs/jbd2/commit.c | 9 ----
include/linux/jbd2.h | 1 -
3 files changed, 125 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 9ffb19c..31fce78 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -83,26 +83,6 @@ static inline void __buffer_relink_io(struct journal_head *jh)
}

/*
- * Try to release a checkpointed buffer from its transaction.
- * Returns 1 if we released it and 2 if we also released the
- * whole transaction.
- *
- * Requires j_list_lock
- */
-static int __try_to_free_cp_buf(struct journal_head *jh)
-{
- int ret = 0;
- struct buffer_head *bh = jh2bh(jh);
-
- if (jh->b_transaction == NULL && !buffer_locked(bh) &&
- !buffer_dirty(bh) && !buffer_write_io_error(bh)) {
- JBUFFER_TRACE(jh, "remove from checkpoint list");
- ret = __jbd2_journal_remove_checkpoint(jh) + 1;
- }
- return ret;
-}
-
-/*
* __jbd2_log_wait_for_space: wait until there is space in the journal.
*
* Called under j-state_lock *only*. It will be unlocked if we have to wait
@@ -412,101 +392,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)

/* Checkpoint list management */

-/*
- * journal_clean_one_cp_list
- *
- * Find all the written-back checkpoint buffers in the given list and
- * release them.
- *
- * Called with the journal locked.
- * Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
- */
-
-static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
-{
- struct journal_head *last_jh;
- struct journal_head *next_jh = jh;
- int ret, freed = 0;
-
- *released = 0;
- if (!jh)
- return 0;
-
- last_jh = jh->b_cpprev;
- do {
- jh = next_jh;
- next_jh = jh->b_cpnext;
- ret = __try_to_free_cp_buf(jh);
- if (ret) {
- freed++;
- if (ret == 2) {
- *released = 1;
- return freed;
- }
- }
- /*
- * This function only frees up some memory
- * if possible so we dont have an obligation
- * to finish processing. Bail out if preemption
- * requested:
- */
- if (need_resched())
- return freed;
- } while (jh != last_jh);
-
- return freed;
-}
-
-/*
- * journal_clean_checkpoint_list
- *
- * Find all the written-back checkpoint buffers in the journal and release them.
- *
- * Called with the journal locked.
- * Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
- */
-
-int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
-{
- transaction_t *transaction, *last_transaction, *next_transaction;
- int ret = 0;
- int released;
-
- transaction = journal->j_checkpoint_transactions;
- if (!transaction)
- goto out;
-
- last_transaction = transaction->t_cpprev;
- next_transaction = transaction;
- do {
- transaction = next_transaction;
- next_transaction = transaction->t_cpnext;
- ret += journal_clean_one_cp_list(transaction->
- t_checkpoint_list, &released);
- /*
- * This function only frees up some memory if possible so we
- * dont have an obligation to finish processing. Bail out if
- * preemption requested:
- */
- if (need_resched())
- goto out;
- if (released)
- continue;
- /*
- * It is essential that we are as careful as in the case of
- * t_checkpoint_list with removing the buffer from the list as
- * we can possibly see not yet submitted buffers on io_list
- */
- ret += journal_clean_one_cp_list(transaction->
- t_checkpoint_io_list, &released);
- if (need_resched())
- goto out;
- } while (transaction != last_transaction);
-out:
- return ret;
-}

/*
* journal_remove_checkpoint: called after a buffer has been committed
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b73e021..1ebdd70 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -504,15 +504,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
jbd2_journal_refile_buffer(journal, jh);
}

- /*
- * Now try to drop any written-back buffers from the journal's
- * checkpoint lists. We do this *before* commit because it potentially
- * frees some memory
- */
- spin_lock(&journal->j_list_lock);
- __jbd2_journal_clean_checkpoint_list(journal);
- spin_unlock(&journal->j_list_lock);


2014-09-11 13:51:50

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] jbd2: remove __jbd2_journal_clean_checkpoint_list to solve scalability issue

On Thu 11-09-14 18:09:53, Yuanhan Liu wrote:
> Firstly, if I read the code correctly(I have a feeling I missed something),
> __jbd2_journal_clean_checkpoint_list is not necessary. jbd2_log_do_checkpoint
> will remove them for the checkpointed transaction. In another word,
> __jbd2_journal_clean_checkpoint_list can't drop an transaction if it's not
> checkpointed.
Yes, that's correct. __jbd2_journal_clean_checkpoint_list() is there only
to free up buffers that were already checkpointed.

> Secondly, I noticed severe scalability limit caused by this function with fsmark:
> $ fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096
>
> It mainly writes tons of small files, each with 4K, and calls fsync after
> each write. It's one thread only, and it's tested on a 32G memory disk(brd).
>
> "perf top" shows that journal_clean_one_cp_list() is very hot, like 40%.
> However, that function is quite cheap, actually. But __jbd2_journal_clean_checkpoint_list()
> will repeatedly call it for each transaction on the j_checkpoint_transactions.
> The list becomes longer and longer in linear with time. It soon grows to
> a list with 3,000 and more transactions. What's worse, we will iterate all
> those transactions on this list every time before committing a transaction,
> aka, fsync a file in our case.
OK, this is kind of a pathological workload (generating lots of tiny
transactions) but it's not insane. I'd also note that we should eventually
reach a steady state once the journal fills up and we will be forced to
checkpoint transactions from the journal to make space for new ones.
However at that point we will have few thousands of transactions in the
journal and I agree it takes a while to scan them in
__jbd2_journal_clean_checkpoint_list().

I'm not yet convinced that just dropping
__jbd2_journal_clean_checkpoint_list() is the right thing to do. Some
periodic cleaning of checkpointed entries would seem reasonable so that we
don't save all that work for the moment we run out of space in the journal.
I'll think how to do that in a more efficient way...

Honza

> So, that's roughly the reason why journal_clean_one_cp_list() is hot. However,
> I'm wondering why the j_checkpoint_transactions list keeps growing. Is there
> a bug? I did more investigates, and I guess I found the root cause.
>
> In this case, we invoke fsync after each write, so, it basically means one
> transaction for one file. One transaction normally contains few blocks of meta
> data, like bitmap, inode and so on. All files in same group shares one bitmap
> block. But one inode table block could contains 16 files. Hence, it's possible
> that 2 different file points to same meta blocks.
>
> For example, if file A and B uses same meta blocks, and if we write A and B in
> following style:
> write(A);
> fsync(A);
>
> write(B);
> fsync(B);
>
> then, when we are about to commit transation for B, and assume transaction of A
> is not checkpointted yet, we can safely drop transaction A and replace it with
> transaction B. Hence, the j_checkpoint_transactions grows by 1 only.
>
> And then assume A is the last inode in one inode block, hence, B will use another
> inode table block. Thus transaction A and B is different. Hence, both A and B are
> inserted to the j_checkpoint_transactions; the list grows by 2.
>
> Here I also got the proves; I added a trace point in jbd2_log_do_checkpoint(),
> and here are some of them:
>
> fs_mark-2646 [000] .... 5.830990: jbd2_start_checkpoint: tid=20
> fs_mark-2646 [000] .... 5.832391: jbd2_start_checkpoint: tid=36
> fs_mark-2646 [000] .... 5.833794: jbd2_start_checkpoint: tid=52
> fs_mark-2646 [000] .... 5.835153: jbd2_start_checkpoint: tid=68
> fs_mark-2646 [000] .... 5.836517: jbd2_start_checkpoint: tid=84
> fs_mark-2646 [000] .... 5.837982: jbd2_start_checkpoint: tid=100
> fs_mark-2646 [000] .... 5.839464: jbd2_start_checkpoint: tid=116
> fs_mark-2646 [000] .... 5.840820: jbd2_start_checkpoint: tid=132
>
> As you can see, the tid jumps by 16 each time, and other transactions are just
> replaced by the way I described above.
>
> Step by step like above, that list grows. And that's why journal_clean_one_cp_list() is hot.
>
> It removes the scalability issue when I removed this function, and the fsmark
> result(Files/sec) jumps from 9000 to 16000, which is an increase about 80%.
>
> Thoughts?
>
> CC: Andi Kleen <[email protected]>
> Signed-off-by: Yuanhan Liu <[email protected]>
> ---
> fs/jbd2/checkpoint.c | 115 ---------------------------------------------------
> fs/jbd2/commit.c | 9 ----
> include/linux/jbd2.h | 1 -
> 3 files changed, 125 deletions(-)
>
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 9ffb19c..31fce78 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -83,26 +83,6 @@ static inline void __buffer_relink_io(struct journal_head *jh)
> }
>
> /*
> - * Try to release a checkpointed buffer from its transaction.
> - * Returns 1 if we released it and 2 if we also released the
> - * whole transaction.
> - *
> - * Requires j_list_lock
> - */
> -static int __try_to_free_cp_buf(struct journal_head *jh)
> -{
> - int ret = 0;
> - struct buffer_head *bh = jh2bh(jh);
> -
> - if (jh->b_transaction == NULL && !buffer_locked(bh) &&
> - !buffer_dirty(bh) && !buffer_write_io_error(bh)) {
> - JBUFFER_TRACE(jh, "remove from checkpoint list");
> - ret = __jbd2_journal_remove_checkpoint(jh) + 1;
> - }
> - return ret;
> -}
> -
> -/*
> * __jbd2_log_wait_for_space: wait until there is space in the journal.
> *
> * Called under j-state_lock *only*. It will be unlocked if we have to wait
> @@ -412,101 +392,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
>
> /* Checkpoint list management */
>
> -/*
> - * journal_clean_one_cp_list
> - *
> - * Find all the written-back checkpoint buffers in the given list and
> - * release them.
> - *
> - * Called with the journal locked.
> - * Called with j_list_lock held.
> - * Returns number of buffers reaped (for debug)
> - */
> -
> -static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> -{
> - struct journal_head *last_jh;
> - struct journal_head *next_jh = jh;
> - int ret, freed = 0;
> -
> - *released = 0;
> - if (!jh)
> - return 0;
> -
> - last_jh = jh->b_cpprev;
> - do {
> - jh = next_jh;
> - next_jh = jh->b_cpnext;
> - ret = __try_to_free_cp_buf(jh);
> - if (ret) {
> - freed++;
> - if (ret == 2) {
> - *released = 1;
> - return freed;
> - }
> - }
> - /*
> - * This function only frees up some memory
> - * if possible so we dont have an obligation
> - * to finish processing. Bail out if preemption
> - * requested:
> - */
> - if (need_resched())
> - return freed;
> - } while (jh != last_jh);
> -
> - return freed;
> -}
> -
> -/*
> - * journal_clean_checkpoint_list
> - *
> - * Find all the written-back checkpoint buffers in the journal and release them.
> - *
> - * Called with the journal locked.
> - * Called with j_list_lock held.
> - * Returns number of buffers reaped (for debug)
> - */
> -
> -int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
> -{
> - transaction_t *transaction, *last_transaction, *next_transaction;
> - int ret = 0;
> - int released;
> -
> - transaction = journal->j_checkpoint_transactions;
> - if (!transaction)
> - goto out;
> -
> - last_transaction = transaction->t_cpprev;
> - next_transaction = transaction;
> - do {
> - transaction = next_transaction;
> - next_transaction = transaction->t_cpnext;
> - ret += journal_clean_one_cp_list(transaction->
> - t_checkpoint_list, &released);
> - /*
> - * This function only frees up some memory if possible so we
> - * dont have an obligation to finish processing. Bail out if
> - * preemption requested:
> - */
> - if (need_resched())
> - goto out;
> - if (released)
> - continue;
> - /*
> - * It is essential that we are as careful as in the case of
> - * t_checkpoint_list with removing the buffer from the list as
> - * we can possibly see not yet submitted buffers on io_list
> - */
> - ret += journal_clean_one_cp_list(transaction->
> - t_checkpoint_io_list, &released);
> - if (need_resched())
> - goto out;
> - } while (transaction != last_transaction);
> -out:
> - return ret;
> -}
>
> /*
> * journal_remove_checkpoint: called after a buffer has been committed
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b73e021..1ebdd70 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -504,15 +504,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> jbd2_journal_refile_buffer(journal, jh);
> }
>
> - /*
> - * Now try to drop any written-back buffers from the journal's
> - * checkpoint lists. We do this *before* commit because it potentially
> - * frees some memory
> - */
> - spin_lock(&journal->j_list_lock);
> - __jbd2_journal_clean_checkpoint_list(journal);
> - spin_unlock(&journal->j_list_lock);
> -
> jbd_debug(3, "JBD2: commit phase 1\n");
>
> /*
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 0dae71e..c41ab38 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1042,7 +1042,6 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> extern void jbd2_journal_commit_transaction(journal_t *);
>
> /* Checkpoint list management */
> -int __jbd2_journal_clean_checkpoint_list(journal_t *journal);
> int __jbd2_journal_remove_checkpoint(struct journal_head *);
> void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
>
> --
> 1.9.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-09-11 14:14:03

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] jbd2: remove __jbd2_journal_clean_checkpoint_list to solve scalability issue

On Thu, Sep 11, 2014 at 03:51:47PM +0200, Jan Kara wrote:
>
> I'm not yet convinced that just dropping
> __jbd2_journal_clean_checkpoint_list() is the right thing to do. Some
> periodic cleaning of checkpointed entries would seem reasonable so that we
> don't save all that work for the moment we run out of space in the journal.
> I'll think how to do that in a more efficient way...

I'm sure Jan is aware of this, but for those following along, when we
completely run out of space in the journal, and we have to do a formal
checkpoint operation --- at which point, all file system activity has
to grind to a halt, at which point *all* of the CPU cores will be dead
in the water, which is bad for scalability, not to mention really
annoying users whose terminals are stalled or whose web requests are
completely stuck until the checkpoint completes. :-)

So being able to clean up the journal so we don't have to do a forced
checkpoint operation which requires waiting for I/O operations is
rather important from a performance and usability point of view....

Cheers,

- Ted

2014-09-11 15:46:15

by Yuanhan Liu

[permalink] [raw]
Subject: Re: [PATCH] jbd2: remove __jbd2_journal_clean_checkpoint_list to solve scalability issue

On Thu, Sep 11, 2014 at 03:51:47PM +0200, Jan Kara wrote:
> On Thu 11-09-14 18:09:53, Yuanhan Liu wrote:
> > Firstly, if I read the code correctly(I have a feeling I missed something),
> > __jbd2_journal_clean_checkpoint_list is not necessary. jbd2_log_do_checkpoint
> > will remove them for the checkpointed transaction. In another word,
> > __jbd2_journal_clean_checkpoint_list can't drop an transaction if it's not
> > checkpointed.
> Yes, that's correct. __jbd2_journal_clean_checkpoint_list() is there only
> to free up buffers that were already checkpointed.

Yes, and that might be something I missed: if a transaction is
checkpointed, those buffers attached to the transaction should also be
released, right? If that's ture, what __jbd2_journal_clean_checkpoint_list()
is for?

>
> > Secondly, I noticed severe scalability limit caused by this function with fsmark:
> > $ fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096
> >
> > It mainly writes tons of small files, each with 4K, and calls fsync after
> > each write. It's one thread only, and it's tested on a 32G memory disk(brd).
> >
> > "perf top" shows that journal_clean_one_cp_list() is very hot, like 40%.
> > However, that function is quite cheap, actually. But __jbd2_journal_clean_checkpoint_list()
> > will repeatedly call it for each transaction on the j_checkpoint_transactions.
> > The list becomes longer and longer in linear with time. It soon grows to
> > a list with 3,000 and more transactions. What's worse, we will iterate all
> > those transactions on this list every time before committing a transaction,
> > aka, fsync a file in our case.
> OK, this is kind of a pathological workload (generating lots of tiny
> transactions) but it's not insane. I'd also note that we should eventually
> reach a steady state once the journal fills up and we will be forced to
> checkpoint transactions from the journal to make space for new ones.
> However at that point we will have few thousands of transactions in the
> journal and I agree it takes a while to scan them in
> __jbd2_journal_clean_checkpoint_list().
>
> I'm not yet convinced that just dropping
> __jbd2_journal_clean_checkpoint_list() is the right thing to do.

Yes, I feel the same. BTW, I should add a RFC tag before the subject.
This patch is mainly for informing you guys there might be a scalability
issue with current JBD2 code.

--yliu

> Some
> periodic cleaning of checkpointed entries would seem reasonable so that we
> don't save all that work for the moment we run out of space in the journal.
> I'll think how to do that in a more efficient way...
>
> Honza
>
> > So, that's roughly the reason why journal_clean_one_cp_list() is hot. However,
> > I'm wondering why the j_checkpoint_transactions list keeps growing. Is there
> > a bug? I did more investigates, and I guess I found the root cause.
> >
> > In this case, we invoke fsync after each write, so, it basically means one
> > transaction for one file. One transaction normally contains few blocks of meta
> > data, like bitmap, inode and so on. All files in same group shares one bitmap
> > block. But one inode table block could contains 16 files. Hence, it's possible
> > that 2 different file points to same meta blocks.
> >
> > For example, if file A and B uses same meta blocks, and if we write A and B in
> > following style:
> > write(A);
> > fsync(A);
> >
> > write(B);
> > fsync(B);
> >
> > then, when we are about to commit transation for B, and assume transaction of A
> > is not checkpointted yet, we can safely drop transaction A and replace it with
> > transaction B. Hence, the j_checkpoint_transactions grows by 1 only.
> >
> > And then assume A is the last inode in one inode block, hence, B will use another
> > inode table block. Thus transaction A and B is different. Hence, both A and B are
> > inserted to the j_checkpoint_transactions; the list grows by 2.
> >
> > Here I also got the proves; I added a trace point in jbd2_log_do_checkpoint(),
> > and here are some of them:
> >
> > fs_mark-2646 [000] .... 5.830990: jbd2_start_checkpoint: tid=20
> > fs_mark-2646 [000] .... 5.832391: jbd2_start_checkpoint: tid=36
> > fs_mark-2646 [000] .... 5.833794: jbd2_start_checkpoint: tid=52
> > fs_mark-2646 [000] .... 5.835153: jbd2_start_checkpoint: tid=68
> > fs_mark-2646 [000] .... 5.836517: jbd2_start_checkpoint: tid=84
> > fs_mark-2646 [000] .... 5.837982: jbd2_start_checkpoint: tid=100
> > fs_mark-2646 [000] .... 5.839464: jbd2_start_checkpoint: tid=116
> > fs_mark-2646 [000] .... 5.840820: jbd2_start_checkpoint: tid=132
> >
> > As you can see, the tid jumps by 16 each time, and other transactions are just
> > replaced by the way I described above.
> >
> > Step by step like above, that list grows. And that's why journal_clean_one_cp_list() is hot.
> >
> > It removes the scalability issue when I removed this function, and the fsmark
> > result(Files/sec) jumps from 9000 to 16000, which is an increase about 80%.
> >
> > Thoughts?
> >
> > CC: Andi Kleen <[email protected]>
> > Signed-off-by: Yuanhan Liu <[email protected]>
> > ---
> > fs/jbd2/checkpoint.c | 115 ---------------------------------------------------
> > fs/jbd2/commit.c | 9 ----
> > include/linux/jbd2.h | 1 -
> > 3 files changed, 125 deletions(-)
> >
> > diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> > index 9ffb19c..31fce78 100644
> > --- a/fs/jbd2/checkpoint.c
> > +++ b/fs/jbd2/checkpoint.c
> > @@ -83,26 +83,6 @@ static inline void __buffer_relink_io(struct journal_head *jh)
> > }
> >
> > /*
> > - * Try to release a checkpointed buffer from its transaction.
> > - * Returns 1 if we released it and 2 if we also released the
> > - * whole transaction.
> > - *
> > - * Requires j_list_lock
> > - */
> > -static int __try_to_free_cp_buf(struct journal_head *jh)
> > -{
> > - int ret = 0;
> > - struct buffer_head *bh = jh2bh(jh);
> > -
> > - if (jh->b_transaction == NULL && !buffer_locked(bh) &&
> > - !buffer_dirty(bh) && !buffer_write_io_error(bh)) {
> > - JBUFFER_TRACE(jh, "remove from checkpoint list");
> > - ret = __jbd2_journal_remove_checkpoint(jh) + 1;
> > - }
> > - return ret;
> > -}
> > -
> > -/*
> > * __jbd2_log_wait_for_space: wait until there is space in the journal.
> > *
> > * Called under j-state_lock *only*. It will be unlocked if we have to wait
> > @@ -412,101 +392,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
> >
> > /* Checkpoint list management */
> >
> > -/*
> > - * journal_clean_one_cp_list
> > - *
> > - * Find all the written-back checkpoint buffers in the given list and
> > - * release them.
> > - *
> > - * Called with the journal locked.
> > - * Called with j_list_lock held.
> > - * Returns number of buffers reaped (for debug)
> > - */
> > -
> > -static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> > -{
> > - struct journal_head *last_jh;
> > - struct journal_head *next_jh = jh;
> > - int ret, freed = 0;
> > -
> > - *released = 0;
> > - if (!jh)
> > - return 0;
> > -
> > - last_jh = jh->b_cpprev;
> > - do {
> > - jh = next_jh;
> > - next_jh = jh->b_cpnext;
> > - ret = __try_to_free_cp_buf(jh);
> > - if (ret) {
> > - freed++;
> > - if (ret == 2) {
> > - *released = 1;
> > - return freed;
> > - }
> > - }
> > - /*
> > - * This function only frees up some memory
> > - * if possible so we dont have an obligation
> > - * to finish processing. Bail out if preemption
> > - * requested:
> > - */
> > - if (need_resched())
> > - return freed;
> > - } while (jh != last_jh);
> > -
> > - return freed;
> > -}
> > -
> > -/*
> > - * journal_clean_checkpoint_list
> > - *
> > - * Find all the written-back checkpoint buffers in the journal and release them.
> > - *
> > - * Called with the journal locked.
> > - * Called with j_list_lock held.
> > - * Returns number of buffers reaped (for debug)
> > - */
> > -
> > -int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
> > -{
> > - transaction_t *transaction, *last_transaction, *next_transaction;
> > - int ret = 0;
> > - int released;
> > -
> > - transaction = journal->j_checkpoint_transactions;
> > - if (!transaction)
> > - goto out;
> > -
> > - last_transaction = transaction->t_cpprev;
> > - next_transaction = transaction;
> > - do {
> > - transaction = next_transaction;
> > - next_transaction = transaction->t_cpnext;
> > - ret += journal_clean_one_cp_list(transaction->
> > - t_checkpoint_list, &released);
> > - /*
> > - * This function only frees up some memory if possible so we
> > - * dont have an obligation to finish processing. Bail out if
> > - * preemption requested:
> > - */
> > - if (need_resched())
> > - goto out;
> > - if (released)
> > - continue;
> > - /*
> > - * It is essential that we are as careful as in the case of
> > - * t_checkpoint_list with removing the buffer from the list as
> > - * we can possibly see not yet submitted buffers on io_list
> > - */
> > - ret += journal_clean_one_cp_list(transaction->
> > - t_checkpoint_io_list, &released);
> > - if (need_resched())
> > - goto out;
> > - } while (transaction != last_transaction);
> > -out:
> > - return ret;
> > -}
> >
> > /*
> > * journal_remove_checkpoint: called after a buffer has been committed
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index b73e021..1ebdd70 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -504,15 +504,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > jbd2_journal_refile_buffer(journal, jh);
> > }
> >
> > - /*
> > - * Now try to drop any written-back buffers from the journal's
> > - * checkpoint lists. We do this *before* commit because it potentially
> > - * frees some memory
> > - */
> > - spin_lock(&journal->j_list_lock);
> > - __jbd2_journal_clean_checkpoint_list(journal);
> > - spin_unlock(&journal->j_list_lock);
> > -
> > jbd_debug(3, "JBD2: commit phase 1\n");
> >
> > /*
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index 0dae71e..c41ab38 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -1042,7 +1042,6 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> > extern void jbd2_journal_commit_transaction(journal_t *);
> >
> > /* Checkpoint list management */
> > -int __jbd2_journal_clean_checkpoint_list(journal_t *journal);
> > int __jbd2_journal_remove_checkpoint(struct journal_head *);
> > void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
> >
> > --
> > 1.9.0
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2014-09-12 09:40:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] jbd2: remove __jbd2_journal_clean_checkpoint_list to solve scalability issue

On Thu 11-09-14 23:45:39, Yuanhan Liu wrote:
> On Thu, Sep 11, 2014 at 03:51:47PM +0200, Jan Kara wrote:
> > On Thu 11-09-14 18:09:53, Yuanhan Liu wrote:
> > > Firstly, if I read the code correctly(I have a feeling I missed something),
> > > __jbd2_journal_clean_checkpoint_list is not necessary. jbd2_log_do_checkpoint
> > > will remove them for the checkpointed transaction. In another word,
> > > __jbd2_journal_clean_checkpoint_list can't drop an transaction if it's not
> > > checkpointed.
> > Yes, that's correct. __jbd2_journal_clean_checkpoint_list() is there only
> > to free up buffers that were already checkpointed.
>
> Yes, and that might be something I missed: if a transaction is
> checkpointed, those buffers attached to the transaction should also be
> released, right? If that's ture, what __jbd2_journal_clean_checkpoint_list()
> is for?
They aren't released unless memory reclaim wants to free the
corresponding pagecache page. So they need to be detached from the
transaction somehow once they are written so that the transaction can be
eventually freed which frees the space in the journal. Currently we do this
detaching either in __jbd2_journal_clean_checkpoint_list() or in
jbd2_log_do_checkpoint(). However the latter function gets called only when
we really run out of space in the journal and thus everything in the
filesystem waits for some space to be reclaimed. So relying on that function
isn't good for performance either...

One possibility is to remove buffer from transaction on IO completion.
However that's likely going to bounce j_list_lock between CPUs badly. So
I'd rather somehow batch the updates of the checkpointing lists...

Honza

> > > Secondly, I noticed severe scalability limit caused by this function with fsmark:
> > > $ fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096
> > >
> > > It mainly writes tons of small files, each with 4K, and calls fsync after
> > > each write. It's one thread only, and it's tested on a 32G memory disk(brd).
> > >
> > > "perf top" shows that journal_clean_one_cp_list() is very hot, like 40%.
> > > However, that function is quite cheap, actually. But __jbd2_journal_clean_checkpoint_list()
> > > will repeatedly call it for each transaction on the j_checkpoint_transactions.
> > > The list becomes longer and longer in linear with time. It soon grows to
> > > a list with 3,000 and more transactions. What's worse, we will iterate all
> > > those transactions on this list every time before committing a transaction,
> > > aka, fsync a file in our case.
> > OK, this is kind of a pathological workload (generating lots of tiny
> > transactions) but it's not insane. I'd also note that we should eventually
> > reach a steady state once the journal fills up and we will be forced to
> > checkpoint transactions from the journal to make space for new ones.
> > However at that point we will have few thousands of transactions in the
> > journal and I agree it takes a while to scan them in
> > __jbd2_journal_clean_checkpoint_list().
> >
> > I'm not yet convinced that just dropping
> > __jbd2_journal_clean_checkpoint_list() is the right thing to do.
>
> Yes, I feel the same. BTW, I should add a RFC tag before the subject.
> This patch is mainly for informing you guys there might be a scalability
> issue with current JBD2 code.
>
> --yliu
>
> > Some
> > periodic cleaning of checkpointed entries would seem reasonable so that we
> > don't save all that work for the moment we run out of space in the journal.
> > I'll think how to do that in a more efficient way...
> >
> > Honza
> >
> > > So, that's roughly the reason why journal_clean_one_cp_list() is hot. However,
> > > I'm wondering why the j_checkpoint_transactions list keeps growing. Is there
> > > a bug? I did more investigates, and I guess I found the root cause.
> > >
> > > In this case, we invoke fsync after each write, so, it basically means one
> > > transaction for one file. One transaction normally contains few blocks of meta
> > > data, like bitmap, inode and so on. All files in same group shares one bitmap
> > > block. But one inode table block could contains 16 files. Hence, it's possible
> > > that 2 different file points to same meta blocks.
> > >
> > > For example, if file A and B uses same meta blocks, and if we write A and B in
> > > following style:
> > > write(A);
> > > fsync(A);
> > >
> > > write(B);
> > > fsync(B);
> > >
> > > then, when we are about to commit transation for B, and assume transaction of A
> > > is not checkpointted yet, we can safely drop transaction A and replace it with
> > > transaction B. Hence, the j_checkpoint_transactions grows by 1 only.
> > >
> > > And then assume A is the last inode in one inode block, hence, B will use another
> > > inode table block. Thus transaction A and B is different. Hence, both A and B are
> > > inserted to the j_checkpoint_transactions; the list grows by 2.
> > >
> > > Here I also got the proves; I added a trace point in jbd2_log_do_checkpoint(),
> > > and here are some of them:
> > >
> > > fs_mark-2646 [000] .... 5.830990: jbd2_start_checkpoint: tid=20
> > > fs_mark-2646 [000] .... 5.832391: jbd2_start_checkpoint: tid=36
> > > fs_mark-2646 [000] .... 5.833794: jbd2_start_checkpoint: tid=52
> > > fs_mark-2646 [000] .... 5.835153: jbd2_start_checkpoint: tid=68
> > > fs_mark-2646 [000] .... 5.836517: jbd2_start_checkpoint: tid=84
> > > fs_mark-2646 [000] .... 5.837982: jbd2_start_checkpoint: tid=100
> > > fs_mark-2646 [000] .... 5.839464: jbd2_start_checkpoint: tid=116
> > > fs_mark-2646 [000] .... 5.840820: jbd2_start_checkpoint: tid=132
> > >
> > > As you can see, the tid jumps by 16 each time, and other transactions are just
> > > replaced by the way I described above.
> > >
> > > Step by step like above, that list grows. And that's why journal_clean_one_cp_list() is hot.
> > >
> > > It removes the scalability issue when I removed this function, and the fsmark
> > > result(Files/sec) jumps from 9000 to 16000, which is an increase about 80%.
> > >
> > > Thoughts?
> > >
> > > CC: Andi Kleen <[email protected]>
> > > Signed-off-by: Yuanhan Liu <[email protected]>
> > > ---
> > > fs/jbd2/checkpoint.c | 115 ---------------------------------------------------
> > > fs/jbd2/commit.c | 9 ----
> > > include/linux/jbd2.h | 1 -
> > > 3 files changed, 125 deletions(-)
> > >
> > > diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> > > index 9ffb19c..31fce78 100644
> > > --- a/fs/jbd2/checkpoint.c
> > > +++ b/fs/jbd2/checkpoint.c
> > > @@ -83,26 +83,6 @@ static inline void __buffer_relink_io(struct journal_head *jh)
> > > }
> > >
> > > /*
> > > - * Try to release a checkpointed buffer from its transaction.
> > > - * Returns 1 if we released it and 2 if we also released the
> > > - * whole transaction.
> > > - *
> > > - * Requires j_list_lock
> > > - */
> > > -static int __try_to_free_cp_buf(struct journal_head *jh)
> > > -{
> > > - int ret = 0;
> > > - struct buffer_head *bh = jh2bh(jh);
> > > -
> > > - if (jh->b_transaction == NULL && !buffer_locked(bh) &&
> > > - !buffer_dirty(bh) && !buffer_write_io_error(bh)) {
> > > - JBUFFER_TRACE(jh, "remove from checkpoint list");
> > > - ret = __jbd2_journal_remove_checkpoint(jh) + 1;
> > > - }
> > > - return ret;
> > > -}
> > > -
> > > -/*
> > > * __jbd2_log_wait_for_space: wait until there is space in the journal.
> > > *
> > > * Called under j-state_lock *only*. It will be unlocked if we have to wait
> > > @@ -412,101 +392,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
> > >
> > > /* Checkpoint list management */
> > >
> > > -/*
> > > - * journal_clean_one_cp_list
> > > - *
> > > - * Find all the written-back checkpoint buffers in the given list and
> > > - * release them.
> > > - *
> > > - * Called with the journal locked.
> > > - * Called with j_list_lock held.
> > > - * Returns number of buffers reaped (for debug)
> > > - */
> > > -
> > > -static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> > > -{
> > > - struct journal_head *last_jh;
> > > - struct journal_head *next_jh = jh;
> > > - int ret, freed = 0;
> > > -
> > > - *released = 0;
> > > - if (!jh)
> > > - return 0;
> > > -
> > > - last_jh = jh->b_cpprev;
> > > - do {
> > > - jh = next_jh;
> > > - next_jh = jh->b_cpnext;
> > > - ret = __try_to_free_cp_buf(jh);
> > > - if (ret) {
> > > - freed++;
> > > - if (ret == 2) {
> > > - *released = 1;
> > > - return freed;
> > > - }
> > > - }
> > > - /*
> > > - * This function only frees up some memory
> > > - * if possible so we dont have an obligation
> > > - * to finish processing. Bail out if preemption
> > > - * requested:
> > > - */
> > > - if (need_resched())
> > > - return freed;
> > > - } while (jh != last_jh);
> > > -
> > > - return freed;
> > > -}
> > > -
> > > -/*
> > > - * journal_clean_checkpoint_list
> > > - *
> > > - * Find all the written-back checkpoint buffers in the journal and release them.
> > > - *
> > > - * Called with the journal locked.
> > > - * Called with j_list_lock held.
> > > - * Returns number of buffers reaped (for debug)
> > > - */
> > > -
> > > -int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
> > > -{
> > > - transaction_t *transaction, *last_transaction, *next_transaction;
> > > - int ret = 0;
> > > - int released;
> > > -
> > > - transaction = journal->j_checkpoint_transactions;
> > > - if (!transaction)
> > > - goto out;
> > > -
> > > - last_transaction = transaction->t_cpprev;
> > > - next_transaction = transaction;
> > > - do {
> > > - transaction = next_transaction;
> > > - next_transaction = transaction->t_cpnext;
> > > - ret += journal_clean_one_cp_list(transaction->
> > > - t_checkpoint_list, &released);
> > > - /*
> > > - * This function only frees up some memory if possible so we
> > > - * dont have an obligation to finish processing. Bail out if
> > > - * preemption requested:
> > > - */
> > > - if (need_resched())
> > > - goto out;
> > > - if (released)
> > > - continue;
> > > - /*
> > > - * It is essential that we are as careful as in the case of
> > > - * t_checkpoint_list with removing the buffer from the list as
> > > - * we can possibly see not yet submitted buffers on io_list
> > > - */
> > > - ret += journal_clean_one_cp_list(transaction->
> > > - t_checkpoint_io_list, &released);
> > > - if (need_resched())
> > > - goto out;
> > > - } while (transaction != last_transaction);
> > > -out:
> > > - return ret;
> > > -}
> > >
> > > /*
> > > * journal_remove_checkpoint: called after a buffer has been committed
> > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > > index b73e021..1ebdd70 100644
> > > --- a/fs/jbd2/commit.c
> > > +++ b/fs/jbd2/commit.c
> > > @@ -504,15 +504,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > > jbd2_journal_refile_buffer(journal, jh);
> > > }
> > >
> > > - /*
> > > - * Now try to drop any written-back buffers from the journal's
> > > - * checkpoint lists. We do this *before* commit because it potentially
> > > - * frees some memory
> > > - */
> > > - spin_lock(&journal->j_list_lock);
> > > - __jbd2_journal_clean_checkpoint_list(journal);
> > > - spin_unlock(&journal->j_list_lock);
> > > -
> > > jbd_debug(3, "JBD2: commit phase 1\n");
> > >
> > > /*
> > > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > > index 0dae71e..c41ab38 100644
> > > --- a/include/linux/jbd2.h
> > > +++ b/include/linux/jbd2.h
> > > @@ -1042,7 +1042,6 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> > > extern void jbd2_journal_commit_transaction(journal_t *);
> > >
> > > /* Checkpoint list management */
> > > -int __jbd2_journal_clean_checkpoint_list(journal_t *journal);
> > > int __jbd2_journal_remove_checkpoint(struct journal_head *);
> > > void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
> > >
> > > --
> > > 1.9.0
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > > the body of a message to [email protected]
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-09-15 03:38:48

by Yuanhan Liu

[permalink] [raw]
Subject: Re: [PATCH] jbd2: remove __jbd2_journal_clean_checkpoint_list to solve scalability issue

On Fri, Sep 12, 2014 at 11:40:54AM +0200, Jan Kara wrote:
> On Thu 11-09-14 23:45:39, Yuanhan Liu wrote:
> > On Thu, Sep 11, 2014 at 03:51:47PM +0200, Jan Kara wrote:
> > > On Thu 11-09-14 18:09:53, Yuanhan Liu wrote:
> > > > Firstly, if I read the code correctly(I have a feeling I missed something),
> > > > __jbd2_journal_clean_checkpoint_list is not necessary. jbd2_log_do_checkpoint
> > > > will remove them for the checkpointed transaction. In another word,
> > > > __jbd2_journal_clean_checkpoint_list can't drop an transaction if it's not
> > > > checkpointed.
> > > Yes, that's correct. __jbd2_journal_clean_checkpoint_list() is there only
> > > to free up buffers that were already checkpointed.
> >
> > Yes, and that might be something I missed: if a transaction is
> > checkpointed, those buffers attached to the transaction should also be
> > released, right? If that's ture, what __jbd2_journal_clean_checkpoint_list()
> > is for?
> They aren't released unless memory reclaim wants to free the
> corresponding pagecache page. So they need to be detached from the
> transaction somehow once they are written so that the transaction can be
> eventually freed which frees the space in the journal. Currently we do this
> detaching either in __jbd2_journal_clean_checkpoint_list() or in
> jbd2_log_do_checkpoint(). However the latter function gets called only when
> we really run out of space in the journal and thus everything in the
> filesystem waits for some space to be reclaimed. So relying on that function
> isn't good for performance either...
>
> One possibility is to remove buffer from transaction on IO completion.

Yes, that's the same thing I thought of. b_end_io hook is the first
thing I thought of, and I tried it. Badly, it was never being invoked.

I then realised that it should be written by page cache(writeback or
page reclaim as you mentioned), and that's the stuff I missed before.
So, my patch was wrong and sorry for that.

> However that's likely going to bounce j_list_lock between CPUs badly. So

I checked __jbd2_journal_remove_checkpoint(jh) again, which is the
entrance to free a journal buffer. It has two main parts:

__buffer_unlink(jh)

drop_transaction() if all jh are written

The two are all protected by j_list_lock. IMO, we can split it, say
let __buffer_unlink(jh) be protected by a per-transaction spin lock,
and let drop_transaction() be protected by by j_list_lock as it was.

As drop_transaction() doesn't happen frequently as __buffer_unlink(jh),
it should alleviate the lock contention a bit.

Thoughts?


--yliu

> I'd rather somehow batch the updates of the checkpointing lists...
>
> Honza
>
> > > > Secondly, I noticed severe scalability limit caused by this function with fsmark:
> > > > $ fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096
> > > >
> > > > It mainly writes tons of small files, each with 4K, and calls fsync after
> > > > each write. It's one thread only, and it's tested on a 32G memory disk(brd).
> > > >
> > > > "perf top" shows that journal_clean_one_cp_list() is very hot, like 40%.
> > > > However, that function is quite cheap, actually. But __jbd2_journal_clean_checkpoint_list()
> > > > will repeatedly call it for each transaction on the j_checkpoint_transactions.
> > > > The list becomes longer and longer in linear with time. It soon grows to
> > > > a list with 3,000 and more transactions. What's worse, we will iterate all
> > > > those transactions on this list every time before committing a transaction,
> > > > aka, fsync a file in our case.
> > > OK, this is kind of a pathological workload (generating lots of tiny
> > > transactions) but it's not insane. I'd also note that we should eventually
> > > reach a steady state once the journal fills up and we will be forced to
> > > checkpoint transactions from the journal to make space for new ones.
> > > However at that point we will have few thousands of transactions in the
> > > journal and I agree it takes a while to scan them in
> > > __jbd2_journal_clean_checkpoint_list().
> > >
> > > I'm not yet convinced that just dropping
> > > __jbd2_journal_clean_checkpoint_list() is the right thing to do.
> >
> > Yes, I feel the same. BTW, I should add a RFC tag before the subject.
> > This patch is mainly for informing you guys there might be a scalability
> > issue with current JBD2 code.
> >
> > --yliu
> >
> > > Some
> > > periodic cleaning of checkpointed entries would seem reasonable so that we
> > > don't save all that work for the moment we run out of space in the journal.
> > > I'll think how to do that in a more efficient way...
> > >
> > > Honza
> > >
> > > > So, that's roughly the reason why journal_clean_one_cp_list() is hot. However,
> > > > I'm wondering why the j_checkpoint_transactions list keeps growing. Is there
> > > > a bug? I did more investigates, and I guess I found the root cause.
> > > >
> > > > In this case, we invoke fsync after each write, so, it basically means one
> > > > transaction for one file. One transaction normally contains few blocks of meta
> > > > data, like bitmap, inode and so on. All files in same group shares one bitmap
> > > > block. But one inode table block could contains 16 files. Hence, it's possible
> > > > that 2 different file points to same meta blocks.
> > > >
> > > > For example, if file A and B uses same meta blocks, and if we write A and B in
> > > > following style:
> > > > write(A);
> > > > fsync(A);
> > > >
> > > > write(B);
> > > > fsync(B);
> > > >
> > > > then, when we are about to commit transation for B, and assume transaction of A
> > > > is not checkpointted yet, we can safely drop transaction A and replace it with
> > > > transaction B. Hence, the j_checkpoint_transactions grows by 1 only.
> > > >
> > > > And then assume A is the last inode in one inode block, hence, B will use another
> > > > inode table block. Thus transaction A and B is different. Hence, both A and B are
> > > > inserted to the j_checkpoint_transactions; the list grows by 2.
> > > >
> > > > Here I also got the proves; I added a trace point in jbd2_log_do_checkpoint(),
> > > > and here are some of them:
> > > >
> > > > fs_mark-2646 [000] .... 5.830990: jbd2_start_checkpoint: tid=20
> > > > fs_mark-2646 [000] .... 5.832391: jbd2_start_checkpoint: tid=36
> > > > fs_mark-2646 [000] .... 5.833794: jbd2_start_checkpoint: tid=52
> > > > fs_mark-2646 [000] .... 5.835153: jbd2_start_checkpoint: tid=68
> > > > fs_mark-2646 [000] .... 5.836517: jbd2_start_checkpoint: tid=84
> > > > fs_mark-2646 [000] .... 5.837982: jbd2_start_checkpoint: tid=100
> > > > fs_mark-2646 [000] .... 5.839464: jbd2_start_checkpoint: tid=116
> > > > fs_mark-2646 [000] .... 5.840820: jbd2_start_checkpoint: tid=132
> > > >
> > > > As you can see, the tid jumps by 16 each time, and other transactions are just
> > > > replaced by the way I described above.
> > > >
> > > > Step by step like above, that list grows. And that's why journal_clean_one_cp_list() is hot.
> > > >
> > > > It removes the scalability issue when I removed this function, and the fsmark
> > > > result(Files/sec) jumps from 9000 to 16000, which is an increase about 80%.
> > > >
> > > > Thoughts?
> > > >
> > > > CC: Andi Kleen <[email protected]>
> > > > Signed-off-by: Yuanhan Liu <[email protected]>
> > > > ---
> > > > fs/jbd2/checkpoint.c | 115 ---------------------------------------------------
> > > > fs/jbd2/commit.c | 9 ----
> > > > include/linux/jbd2.h | 1 -
> > > > 3 files changed, 125 deletions(-)
> > > >
> > > > diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> > > > index 9ffb19c..31fce78 100644
> > > > --- a/fs/jbd2/checkpoint.c
> > > > +++ b/fs/jbd2/checkpoint.c
> > > > @@ -83,26 +83,6 @@ static inline void __buffer_relink_io(struct journal_head *jh)
> > > > }
> > > >
> > > > /*
> > > > - * Try to release a checkpointed buffer from its transaction.
> > > > - * Returns 1 if we released it and 2 if we also released the
> > > > - * whole transaction.
> > > > - *
> > > > - * Requires j_list_lock
> > > > - */
> > > > -static int __try_to_free_cp_buf(struct journal_head *jh)
> > > > -{
> > > > - int ret = 0;
> > > > - struct buffer_head *bh = jh2bh(jh);
> > > > -
> > > > - if (jh->b_transaction == NULL && !buffer_locked(bh) &&
> > > > - !buffer_dirty(bh) && !buffer_write_io_error(bh)) {
> > > > - JBUFFER_TRACE(jh, "remove from checkpoint list");
> > > > - ret = __jbd2_journal_remove_checkpoint(jh) + 1;
> > > > - }
> > > > - return ret;
> > > > -}
> > > > -
> > > > -/*
> > > > * __jbd2_log_wait_for_space: wait until there is space in the journal.
> > > > *
> > > > * Called under j-state_lock *only*. It will be unlocked if we have to wait
> > > > @@ -412,101 +392,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
> > > >
> > > > /* Checkpoint list management */
> > > >
> > > > -/*
> > > > - * journal_clean_one_cp_list
> > > > - *
> > > > - * Find all the written-back checkpoint buffers in the given list and
> > > > - * release them.
> > > > - *
> > > > - * Called with the journal locked.
> > > > - * Called with j_list_lock held.
> > > > - * Returns number of buffers reaped (for debug)
> > > > - */
> > > > -
> > > > -static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> > > > -{
> > > > - struct journal_head *last_jh;
> > > > - struct journal_head *next_jh = jh;
> > > > - int ret, freed = 0;
> > > > -
> > > > - *released = 0;
> > > > - if (!jh)
> > > > - return 0;
> > > > -
> > > > - last_jh = jh->b_cpprev;
> > > > - do {
> > > > - jh = next_jh;
> > > > - next_jh = jh->b_cpnext;
> > > > - ret = __try_to_free_cp_buf(jh);
> > > > - if (ret) {
> > > > - freed++;
> > > > - if (ret == 2) {
> > > > - *released = 1;
> > > > - return freed;
> > > > - }
> > > > - }
> > > > - /*
> > > > - * This function only frees up some memory
> > > > - * if possible so we dont have an obligation
> > > > - * to finish processing. Bail out if preemption
> > > > - * requested:
> > > > - */
> > > > - if (need_resched())
> > > > - return freed;
> > > > - } while (jh != last_jh);
> > > > -
> > > > - return freed;
> > > > -}
> > > > -
> > > > -/*
> > > > - * journal_clean_checkpoint_list
> > > > - *
> > > > - * Find all the written-back checkpoint buffers in the journal and release them.
> > > > - *
> > > > - * Called with the journal locked.
> > > > - * Called with j_list_lock held.
> > > > - * Returns number of buffers reaped (for debug)
> > > > - */
> > > > -
> > > > -int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
> > > > -{
> > > > - transaction_t *transaction, *last_transaction, *next_transaction;
> > > > - int ret = 0;
> > > > - int released;
> > > > -
> > > > - transaction = journal->j_checkpoint_transactions;
> > > > - if (!transaction)
> > > > - goto out;
> > > > -
> > > > - last_transaction = transaction->t_cpprev;
> > > > - next_transaction = transaction;
> > > > - do {
> > > > - transaction = next_transaction;
> > > > - next_transaction = transaction->t_cpnext;
> > > > - ret += journal_clean_one_cp_list(transaction->
> > > > - t_checkpoint_list, &released);
> > > > - /*
> > > > - * This function only frees up some memory if possible so we
> > > > - * dont have an obligation to finish processing. Bail out if
> > > > - * preemption requested:
> > > > - */
> > > > - if (need_resched())
> > > > - goto out;
> > > > - if (released)
> > > > - continue;
> > > > - /*
> > > > - * It is essential that we are as careful as in the case of
> > > > - * t_checkpoint_list with removing the buffer from the list as
> > > > - * we can possibly see not yet submitted buffers on io_list
> > > > - */
> > > > - ret += journal_clean_one_cp_list(transaction->
> > > > - t_checkpoint_io_list, &released);
> > > > - if (need_resched())
> > > > - goto out;
> > > > - } while (transaction != last_transaction);
> > > > -out:
> > > > - return ret;
> > > > -}
> > > >
> > > > /*
> > > > * journal_remove_checkpoint: called after a buffer has been committed
> > > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > > > index b73e021..1ebdd70 100644
> > > > --- a/fs/jbd2/commit.c
> > > > +++ b/fs/jbd2/commit.c
> > > > @@ -504,15 +504,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > > > jbd2_journal_refile_buffer(journal, jh);
> > > > }
> > > >
> > > > - /*
> > > > - * Now try to drop any written-back buffers from the journal's
> > > > - * checkpoint lists. We do this *before* commit because it potentially
> > > > - * frees some memory
> > > > - */
> > > > - spin_lock(&journal->j_list_lock);
> > > > - __jbd2_journal_clean_checkpoint_list(journal);
> > > > - spin_unlock(&journal->j_list_lock);
> > > > -
> > > > jbd_debug(3, "JBD2: commit phase 1\n");
> > > >
> > > > /*
> > > > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > > > index 0dae71e..c41ab38 100644
> > > > --- a/include/linux/jbd2.h
> > > > +++ b/include/linux/jbd2.h
> > > > @@ -1042,7 +1042,6 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> > > > extern void jbd2_journal_commit_transaction(journal_t *);
> > > >
> > > > /* Checkpoint list management */
> > > > -int __jbd2_journal_clean_checkpoint_list(journal_t *journal);
> > > > int __jbd2_journal_remove_checkpoint(struct journal_head *);
> > > > void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
> > > >
> > > > --
> > > > 1.9.0
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > > > the body of a message to [email protected]
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > --
> > > Jan Kara <[email protected]>
> > > SUSE Labs, CR
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2014-09-16 14:15:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] jbd2: remove __jbd2_journal_clean_checkpoint_list to solve scalability issue

On Mon 15-09-14 11:38:52, Yuanhan Liu wrote:
> On Fri, Sep 12, 2014 at 11:40:54AM +0200, Jan Kara wrote:
> > On Thu 11-09-14 23:45:39, Yuanhan Liu wrote:
> > > On Thu, Sep 11, 2014 at 03:51:47PM +0200, Jan Kara wrote:
> > > > On Thu 11-09-14 18:09:53, Yuanhan Liu wrote:
> > > > > Firstly, if I read the code correctly(I have a feeling I missed something),
> > > > > __jbd2_journal_clean_checkpoint_list is not necessary. jbd2_log_do_checkpoint
> > > > > will remove them for the checkpointed transaction. In another word,
> > > > > __jbd2_journal_clean_checkpoint_list can't drop an transaction if it's not
> > > > > checkpointed.
> > > > Yes, that's correct. __jbd2_journal_clean_checkpoint_list() is there only
> > > > to free up buffers that were already checkpointed.
> > >
> > > Yes, and that might be something I missed: if a transaction is
> > > checkpointed, those buffers attached to the transaction should also be
> > > released, right? If that's ture, what __jbd2_journal_clean_checkpoint_list()
> > > is for?
> > They aren't released unless memory reclaim wants to free the
> > corresponding pagecache page. So they need to be detached from the
> > transaction somehow once they are written so that the transaction can be
> > eventually freed which frees the space in the journal. Currently we do this
> > detaching either in __jbd2_journal_clean_checkpoint_list() or in
> > jbd2_log_do_checkpoint(). However the latter function gets called only when
> > we really run out of space in the journal and thus everything in the
> > filesystem waits for some space to be reclaimed. So relying on that function
> > isn't good for performance either...
> >
> > One possibility is to remove buffer from transaction on IO completion.
>
> Yes, that's the same thing I thought of. b_end_io hook is the first
> thing I thought of, and I tried it. Badly, it was never being invoked.
>
> I then realised that it should be written by page cache(writeback or
> page reclaim as you mentioned), and that's the stuff I missed before.
> So, my patch was wrong and sorry for that.
No problem :).

> > However that's likely going to bounce j_list_lock between CPUs badly. So
>
> I checked __jbd2_journal_remove_checkpoint(jh) again, which is the
> entrance to free a journal buffer. It has two main parts:
>
> __buffer_unlink(jh)
>
> drop_transaction() if all jh are written
>
> The two are all protected by j_list_lock. IMO, we can split it, say
> let __buffer_unlink(jh) be protected by a per-transaction spin lock,
> and let drop_transaction() be protected by by j_list_lock as it was.
>
> As drop_transaction() doesn't happen frequently as __buffer_unlink(jh),
> it should alleviate the lock contention a bit.
Yes, drop_transaction() isn't really an issue. But when doing
__buffer_unlink() during IO completion, making the checkpoint list lock
per-transaction one won't help the contention for the common case where
transactions are reasonably large.

I was thinking about it for a while. I was thinking about various clever
solutions but in the end what I think would be the best is just
to change __jbd2_journal_clean_checkpoint_list() to bail out once it spots
a transaction it cannot free. That will deal with the pointless scanning of
checkpointing lists you are hitting with your workload, it will also do all
the cleanup that is necessary for transactions to be eventually cleaned up
from the journal without having to wait in log_do_checkpoint().

Attached are two patches - lightly tested only. Can you try whether they
fix the contention for you? Thanks!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (3.55 kB)
0001-jbd2-Avoid-pointless-scanning-of-checkpoint-lists.patch (3.54 kB)
0002-jbd2-Simplify-calling-convention-around-__jbd2_journ.patch (4.28 kB)
Download all attachments

2014-09-17 08:14:05

by Yuanhan Liu

[permalink] [raw]
Subject: Re: [PATCH] jbd2: remove __jbd2_journal_clean_checkpoint_list to solve scalability issue

On Tue, Sep 16, 2014 at 01:05:25PM +0200, Jan Kara wrote:
...
> > As drop_transaction() doesn't happen frequently as __buffer_unlink(jh),
> > it should alleviate the lock contention a bit.
> Yes, drop_transaction() isn't really an issue. But when doing
> __buffer_unlink() during IO completion, making the checkpoint list lock
> per-transaction one won't help the contention for the common case where
> transactions are reasonably large.

Good point.

>
> I was thinking about it for a while. I was thinking about various clever
> solutions but in the end what I think would be the best is just
> to change __jbd2_journal_clean_checkpoint_list() to bail out once it spots
> a transaction it cannot free. That will deal with the pointless scanning of
> checkpointing lists you are hitting with your workload, it will also do all
> the cleanup that is necessary for transactions to be eventually cleaned up
> from the journal without having to wait in log_do_checkpoint().
>
> Attached are two patches - lightly tested only. Can you try whether they
> fix the contention for you? Thanks!

Yes, it fixs the contention as expected. So, feel free to add:

Tested-by: Yuanhan Liu <[email protected]>


Thanks.

--yliu
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

> >From ee5a1b87e267a8ae8935952c3c6971fa6cc41c63 Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>
> Date: Tue, 16 Sep 2014 12:25:20 +0200
> Subject: [PATCH 1/2] jbd2: Avoid pointless scanning of checkpoint lists
>
> Yuanhan has reported that when he is running fsync(2) heavy workload
> creating new files over ramdisk, significant amount of time is spent in
> __jbd2_journal_clean_checkpoint_list() trying to clean old transactions
> (but they cannot be cleaned up because flusher hasn't yet checkpointed
> those buffers).
>
> Reduce the amount of scanning by stopping to scan the transaction list
> once we find a transaction that cannot be checkpointed. Note that this
> way of cleaning is still enough to keep freeing space in the journal
> after fully checkpointed transactions.
>
> Reported-by: Yuanhan Liu <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/jbd2/checkpoint.c | 32 ++++++++++++++++++--------------
> 1 file changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 7f34f4716165..e39b2d0e1079 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -478,7 +478,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
> * Find all the written-back checkpoint buffers in the given list and
> * release them.
> *
> - * Called with the journal locked.
> * Called with j_list_lock held.
> * Returns number of buffers reaped (for debug)
> */
> @@ -498,12 +497,12 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> jh = next_jh;
> next_jh = jh->b_cpnext;
> ret = __try_to_free_cp_buf(jh);
> - if (ret) {
> - freed++;
> - if (ret == 2) {
> - *released = 1;
> - return freed;
> - }
> + if (!ret)
> + return freed;
> + freed++;
> + if (ret == 2) {
> + *released = 1;
> + return freed;
> }
> /*
> * This function only frees up some memory
> @@ -523,7 +522,6 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> *
> * Find all the written-back checkpoint buffers in the journal and release them.
> *
> - * Called with the journal locked.
> * Called with j_list_lock held.
> * Returns number of buffers reaped (for debug)
> */
> @@ -531,7 +529,8 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
> {
> transaction_t *transaction, *last_transaction, *next_transaction;
> - int ret = 0;
> + int ret;
> + int freed = 0;
> int released;
>
> transaction = journal->j_checkpoint_transactions;
> @@ -543,17 +542,21 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
> do {
> transaction = next_transaction;
> next_transaction = transaction->t_cpnext;
> - ret += journal_clean_one_cp_list(transaction->
> + ret = journal_clean_one_cp_list(transaction->
> t_checkpoint_list, &released);
> /*
> * This function only frees up some memory if possible so we
> * dont have an obligation to finish processing. Bail out if
> * preemption requested:
> */
> - if (need_resched())
> + if (need_resched()) {
> + freed += ret;
> goto out;
> - if (released)
> + }
> + if (released) {
> + freed += ret;
> continue;
> + }
> /*
> * It is essential that we are as careful as in the case of
> * t_checkpoint_list with removing the buffer from the list as
> @@ -561,11 +564,12 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
> */
> ret += journal_clean_one_cp_list(transaction->
> t_checkpoint_io_list, &released);
> - if (need_resched())
> + freed += ret;
> + if (need_resched() || !ret)
> goto out;
> } while (transaction != last_transaction);
> out:
> - return ret;
> + return freed;
> }
>
> /*
> --
> 1.8.1.4
>

> >From 8e17e25b044489429ef27433ee83066950a202b3 Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>
> Date: Tue, 16 Sep 2014 12:46:34 +0200
> Subject: [PATCH 2/2] jbd2: Simplify calling convention around
> __jbd2_journal_clean_checkpoint_list
>
> __jbd2_journal_clean_checkpoint_list() returns number of buffers it
> freed but noone was using the value so just stop doing that. This
> also allows for simplifying the calling convention for
> journal_clean_once_cp_list().
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/jbd2/checkpoint.c | 56 ++++++++++++++++++++++------------------------------
> include/linux/jbd2.h | 2 +-
> 2 files changed, 25 insertions(+), 33 deletions(-)
>
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index e39b2d0e1079..ef23d6f4a6a2 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -479,16 +479,15 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
> * release them.
> *
> * Called with j_list_lock held.
> - * Returns number of buffers reaped (for debug)
> + * Returns 1 if we freed the transaction, 0 otherwise.
> */
> -
> -static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> +static int journal_clean_one_cp_list(struct journal_head *jh)
> {
> struct journal_head *last_jh;
> struct journal_head *next_jh = jh;
> - int ret, freed = 0;
> + int ret;
> + int freed = 0;
>
> - *released = 0;
> if (!jh)
> return 0;
>
> @@ -499,11 +498,9 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> ret = __try_to_free_cp_buf(jh);
> if (!ret)
> return freed;
> - freed++;
> - if (ret == 2) {
> - *released = 1;
> - return freed;
> - }
> + if (ret == 2)
> + return 1;
> + freed = 1;
> /*
> * This function only frees up some memory
> * if possible so we dont have an obligation
> @@ -523,53 +520,48 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> * Find all the written-back checkpoint buffers in the journal and release them.
> *
> * Called with j_list_lock held.
> - * Returns number of buffers reaped (for debug)
> */
> -
> -int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
> +void __jbd2_journal_clean_checkpoint_list(journal_t *journal)
> {
> transaction_t *transaction, *last_transaction, *next_transaction;
> int ret;
> - int freed = 0;
> - int released;
>
> transaction = journal->j_checkpoint_transactions;
> if (!transaction)
> - goto out;
> + return;
>
> last_transaction = transaction->t_cpprev;
> next_transaction = transaction;
> do {
> transaction = next_transaction;
> next_transaction = transaction->t_cpnext;
> - ret = journal_clean_one_cp_list(transaction->
> - t_checkpoint_list, &released);
> + ret = journal_clean_one_cp_list(transaction->t_checkpoint_list);
> /*
> * This function only frees up some memory if possible so we
> * dont have an obligation to finish processing. Bail out if
> * preemption requested:
> */
> - if (need_resched()) {
> - freed += ret;
> - goto out;
> - }
> - if (released) {
> - freed += ret;
> + if (need_resched())
> + return;
> + if (ret)
> continue;
> - }
> /*
> * It is essential that we are as careful as in the case of
> * t_checkpoint_list with removing the buffer from the list as
> * we can possibly see not yet submitted buffers on io_list
> */
> - ret += journal_clean_one_cp_list(transaction->
> - t_checkpoint_io_list, &released);
> - freed += ret;
> - if (need_resched() || !ret)
> - goto out;
> + ret = journal_clean_one_cp_list(transaction->
> + t_checkpoint_io_list);
> + if (need_resched())
> + return;
> + /*
> + * Stop scanning if we couldn't free the transaction. This
> + * avoids pointless scanning of transactions which still
> + * weren't checkpointed.
> + */
> + if (!ret)
> + return;
> } while (transaction != last_transaction);
> -out:
> - return freed;
> }
>
> /*
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 0dae71e9971c..704b9a599b26 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1042,7 +1042,7 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> extern void jbd2_journal_commit_transaction(journal_t *);
>
> /* Checkpoint list management */
> -int __jbd2_journal_clean_checkpoint_list(journal_t *journal);
> +void __jbd2_journal_clean_checkpoint_list(journal_t *journal);
> int __jbd2_journal_remove_checkpoint(struct journal_head *);
> void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
>
> --
> 1.8.1.4
>