From: Yuanhan Liu Subject: Re: [PATCH] jbd2: remove __jbd2_journal_clean_checkpoint_list to solve scalability issue Date: Mon, 15 Sep 2014 11:38:52 +0800 Message-ID: <20140915033852.GL3052@yliu-dev.sh.intel.com> References: <1410430193-6112-1-git-send-email-yuanhan.liu@linux.intel.com> <20140911135147.GD30901@quack.suse.cz> <20140911154539.GK3052@yliu-dev.sh.intel.com> <20140912094054.GA5958@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-ext4@vger.kernel.org, Andi Kleen , Yuanhan Liu To: Jan Kara Return-path: Received: from mga14.intel.com ([192.55.52.115]:15365 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752282AbaIODis (ORCPT ); Sun, 14 Sep 2014 23:38:48 -0400 Content-Disposition: inline In-Reply-To: <20140912094054.GA5958@quack.suse.cz> Sender: linux-ext4-owner@vger.kernel.org List-ID: On Fri, Sep 12, 2014 at 11:40:54AM +0200, Jan Kara wrote: > On Thu 11-09-14 23:45:39, Yuanhan Liu wrote: > > On Thu, Sep 11, 2014 at 03:51:47PM +0200, Jan Kara wrote: > > > On Thu 11-09-14 18:09:53, Yuanhan Liu wrote: > > > > Firstly, if I read the code correctly(I have a feeling I missed something), > > > > __jbd2_journal_clean_checkpoint_list is not necessary. jbd2_log_do_checkpoint > > > > will remove them for the checkpointed transaction. In another word, > > > > __jbd2_journal_clean_checkpoint_list can't drop an transaction if it's not > > > > checkpointed. > > > Yes, that's correct. __jbd2_journal_clean_checkpoint_list() is there only > > > to free up buffers that were already checkpointed. > > > > Yes, and that might be something I missed: if a transaction is > > checkpointed, those buffers attached to the transaction should also be > > released, right? If that's ture, what __jbd2_journal_clean_checkpoint_list() > > is for? > They aren't released unless memory reclaim wants to free the > corresponding pagecache page. So they need to be detached from the > transaction somehow once they are written so that the transaction can be > eventually freed which frees the space in the journal. Currently we do this > detaching either in __jbd2_journal_clean_checkpoint_list() or in > jbd2_log_do_checkpoint(). However the latter function gets called only when > we really run out of space in the journal and thus everything in the > filesystem waits for some space to be reclaimed. So relying on that function > isn't good for performance either... > > One possibility is to remove buffer from transaction on IO completion. Yes, that's the same thing I thought of. b_end_io hook is the first thing I thought of, and I tried it. Badly, it was never being invoked. I then realised that it should be written by page cache(writeback or page reclaim as you mentioned), and that's the stuff I missed before. So, my patch was wrong and sorry for that. > However that's likely going to bounce j_list_lock between CPUs badly. So I checked __jbd2_journal_remove_checkpoint(jh) again, which is the entrance to free a journal buffer. It has two main parts: __buffer_unlink(jh) drop_transaction() if all jh are written The two are all protected by j_list_lock. IMO, we can split it, say let __buffer_unlink(jh) be protected by a per-transaction spin lock, and let drop_transaction() be protected by by j_list_lock as it was. As drop_transaction() doesn't happen frequently as __buffer_unlink(jh), it should alleviate the lock contention a bit. Thoughts? --yliu > I'd rather somehow batch the updates of the checkpointing lists... > > Honza > > > > > Secondly, I noticed severe scalability limit caused by this function with fsmark: > > > > $ fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096 > > > > > > > > It mainly writes tons of small files, each with 4K, and calls fsync after > > > > each write. It's one thread only, and it's tested on a 32G memory disk(brd). > > > > > > > > "perf top" shows that journal_clean_one_cp_list() is very hot, like 40%. > > > > However, that function is quite cheap, actually. But __jbd2_journal_clean_checkpoint_list() > > > > will repeatedly call it for each transaction on the j_checkpoint_transactions. > > > > The list becomes longer and longer in linear with time. It soon grows to > > > > a list with 3,000 and more transactions. What's worse, we will iterate all > > > > those transactions on this list every time before committing a transaction, > > > > aka, fsync a file in our case. > > > OK, this is kind of a pathological workload (generating lots of tiny > > > transactions) but it's not insane. I'd also note that we should eventually > > > reach a steady state once the journal fills up and we will be forced to > > > checkpoint transactions from the journal to make space for new ones. > > > However at that point we will have few thousands of transactions in the > > > journal and I agree it takes a while to scan them in > > > __jbd2_journal_clean_checkpoint_list(). > > > > > > I'm not yet convinced that just dropping > > > __jbd2_journal_clean_checkpoint_list() is the right thing to do. > > > > Yes, I feel the same. BTW, I should add a RFC tag before the subject. > > This patch is mainly for informing you guys there might be a scalability > > issue with current JBD2 code. > > > > --yliu > > > > > Some > > > periodic cleaning of checkpointed entries would seem reasonable so that we > > > don't save all that work for the moment we run out of space in the journal. > > > I'll think how to do that in a more efficient way... > > > > > > Honza > > > > > > > So, that's roughly the reason why journal_clean_one_cp_list() is hot. However, > > > > I'm wondering why the j_checkpoint_transactions list keeps growing. Is there > > > > a bug? I did more investigates, and I guess I found the root cause. > > > > > > > > In this case, we invoke fsync after each write, so, it basically means one > > > > transaction for one file. One transaction normally contains few blocks of meta > > > > data, like bitmap, inode and so on. All files in same group shares one bitmap > > > > block. But one inode table block could contains 16 files. Hence, it's possible > > > > that 2 different file points to same meta blocks. > > > > > > > > For example, if file A and B uses same meta blocks, and if we write A and B in > > > > following style: > > > > write(A); > > > > fsync(A); > > > > > > > > write(B); > > > > fsync(B); > > > > > > > > then, when we are about to commit transation for B, and assume transaction of A > > > > is not checkpointted yet, we can safely drop transaction A and replace it with > > > > transaction B. Hence, the j_checkpoint_transactions grows by 1 only. > > > > > > > > And then assume A is the last inode in one inode block, hence, B will use another > > > > inode table block. Thus transaction A and B is different. Hence, both A and B are > > > > inserted to the j_checkpoint_transactions; the list grows by 2. > > > > > > > > Here I also got the proves; I added a trace point in jbd2_log_do_checkpoint(), > > > > and here are some of them: > > > > > > > > fs_mark-2646 [000] .... 5.830990: jbd2_start_checkpoint: tid=20 > > > > fs_mark-2646 [000] .... 5.832391: jbd2_start_checkpoint: tid=36 > > > > fs_mark-2646 [000] .... 5.833794: jbd2_start_checkpoint: tid=52 > > > > fs_mark-2646 [000] .... 5.835153: jbd2_start_checkpoint: tid=68 > > > > fs_mark-2646 [000] .... 5.836517: jbd2_start_checkpoint: tid=84 > > > > fs_mark-2646 [000] .... 5.837982: jbd2_start_checkpoint: tid=100 > > > > fs_mark-2646 [000] .... 5.839464: jbd2_start_checkpoint: tid=116 > > > > fs_mark-2646 [000] .... 5.840820: jbd2_start_checkpoint: tid=132 > > > > > > > > As you can see, the tid jumps by 16 each time, and other transactions are just > > > > replaced by the way I described above. > > > > > > > > Step by step like above, that list grows. And that's why journal_clean_one_cp_list() is hot. > > > > > > > > It removes the scalability issue when I removed this function, and the fsmark > > > > result(Files/sec) jumps from 9000 to 16000, which is an increase about 80%. > > > > > > > > Thoughts? > > > > > > > > CC: Andi Kleen > > > > Signed-off-by: Yuanhan Liu > > > > --- > > > > fs/jbd2/checkpoint.c | 115 --------------------------------------------------- > > > > fs/jbd2/commit.c | 9 ---- > > > > include/linux/jbd2.h | 1 - > > > > 3 files changed, 125 deletions(-) > > > > > > > > diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c > > > > index 9ffb19c..31fce78 100644 > > > > --- a/fs/jbd2/checkpoint.c > > > > +++ b/fs/jbd2/checkpoint.c > > > > @@ -83,26 +83,6 @@ static inline void __buffer_relink_io(struct journal_head *jh) > > > > } > > > > > > > > /* > > > > - * Try to release a checkpointed buffer from its transaction. > > > > - * Returns 1 if we released it and 2 if we also released the > > > > - * whole transaction. > > > > - * > > > > - * Requires j_list_lock > > > > - */ > > > > -static int __try_to_free_cp_buf(struct journal_head *jh) > > > > -{ > > > > - int ret = 0; > > > > - struct buffer_head *bh = jh2bh(jh); > > > > - > > > > - if (jh->b_transaction == NULL && !buffer_locked(bh) && > > > > - !buffer_dirty(bh) && !buffer_write_io_error(bh)) { > > > > - JBUFFER_TRACE(jh, "remove from checkpoint list"); > > > > - ret = __jbd2_journal_remove_checkpoint(jh) + 1; > > > > - } > > > > - return ret; > > > > -} > > > > - > > > > -/* > > > > * __jbd2_log_wait_for_space: wait until there is space in the journal. > > > > * > > > > * Called under j-state_lock *only*. It will be unlocked if we have to wait > > > > @@ -412,101 +392,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal) > > > > > > > > /* Checkpoint list management */ > > > > > > > > -/* > > > > - * journal_clean_one_cp_list > > > > - * > > > > - * Find all the written-back checkpoint buffers in the given list and > > > > - * release them. > > > > - * > > > > - * Called with the journal locked. > > > > - * Called with j_list_lock held. > > > > - * Returns number of buffers reaped (for debug) > > > > - */ > > > > - > > > > -static int journal_clean_one_cp_list(struct journal_head *jh, int *released) > > > > -{ > > > > - struct journal_head *last_jh; > > > > - struct journal_head *next_jh = jh; > > > > - int ret, freed = 0; > > > > - > > > > - *released = 0; > > > > - if (!jh) > > > > - return 0; > > > > - > > > > - last_jh = jh->b_cpprev; > > > > - do { > > > > - jh = next_jh; > > > > - next_jh = jh->b_cpnext; > > > > - ret = __try_to_free_cp_buf(jh); > > > > - if (ret) { > > > > - freed++; > > > > - if (ret == 2) { > > > > - *released = 1; > > > > - return freed; > > > > - } > > > > - } > > > > - /* > > > > - * This function only frees up some memory > > > > - * if possible so we dont have an obligation > > > > - * to finish processing. Bail out if preemption > > > > - * requested: > > > > - */ > > > > - if (need_resched()) > > > > - return freed; > > > > - } while (jh != last_jh); > > > > - > > > > - return freed; > > > > -} > > > > - > > > > -/* > > > > - * journal_clean_checkpoint_list > > > > - * > > > > - * Find all the written-back checkpoint buffers in the journal and release them. > > > > - * > > > > - * Called with the journal locked. > > > > - * Called with j_list_lock held. > > > > - * Returns number of buffers reaped (for debug) > > > > - */ > > > > - > > > > -int __jbd2_journal_clean_checkpoint_list(journal_t *journal) > > > > -{ > > > > - transaction_t *transaction, *last_transaction, *next_transaction; > > > > - int ret = 0; > > > > - int released; > > > > - > > > > - transaction = journal->j_checkpoint_transactions; > > > > - if (!transaction) > > > > - goto out; > > > > - > > > > - last_transaction = transaction->t_cpprev; > > > > - next_transaction = transaction; > > > > - do { > > > > - transaction = next_transaction; > > > > - next_transaction = transaction->t_cpnext; > > > > - ret += journal_clean_one_cp_list(transaction-> > > > > - t_checkpoint_list, &released); > > > > - /* > > > > - * This function only frees up some memory if possible so we > > > > - * dont have an obligation to finish processing. Bail out if > > > > - * preemption requested: > > > > - */ > > > > - if (need_resched()) > > > > - goto out; > > > > - if (released) > > > > - continue; > > > > - /* > > > > - * It is essential that we are as careful as in the case of > > > > - * t_checkpoint_list with removing the buffer from the list as > > > > - * we can possibly see not yet submitted buffers on io_list > > > > - */ > > > > - ret += journal_clean_one_cp_list(transaction-> > > > > - t_checkpoint_io_list, &released); > > > > - if (need_resched()) > > > > - goto out; > > > > - } while (transaction != last_transaction); > > > > -out: > > > > - return ret; > > > > -} > > > > > > > > /* > > > > * journal_remove_checkpoint: called after a buffer has been committed > > > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c > > > > index b73e021..1ebdd70 100644 > > > > --- a/fs/jbd2/commit.c > > > > +++ b/fs/jbd2/commit.c > > > > @@ -504,15 +504,6 @@ void jbd2_journal_commit_transaction(journal_t *journal) > > > > jbd2_journal_refile_buffer(journal, jh); > > > > } > > > > > > > > - /* > > > > - * Now try to drop any written-back buffers from the journal's > > > > - * checkpoint lists. We do this *before* commit because it potentially > > > > - * frees some memory > > > > - */ > > > > - spin_lock(&journal->j_list_lock); > > > > - __jbd2_journal_clean_checkpoint_list(journal); > > > > - spin_unlock(&journal->j_list_lock); > > > > - > > > > jbd_debug(3, "JBD2: commit phase 1\n"); > > > > > > > > /* > > > > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h > > > > index 0dae71e..c41ab38 100644 > > > > --- a/include/linux/jbd2.h > > > > +++ b/include/linux/jbd2.h > > > > @@ -1042,7 +1042,6 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block); > > > > extern void jbd2_journal_commit_transaction(journal_t *); > > > > > > > > /* Checkpoint list management */ > > > > -int __jbd2_journal_clean_checkpoint_list(journal_t *journal); > > > > int __jbd2_journal_remove_checkpoint(struct journal_head *); > > > > void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *); > > > > > > > > -- > > > > 1.9.0 > > > > > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > > > > the body of a message to majordomo@vger.kernel.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > > > Jan Kara > > > SUSE Labs, CR > -- > Jan Kara > SUSE Labs, CR