2015-04-02 13:58:27

by Jan Kara

[permalink] [raw]
Subject: [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled

Hello,

this patch set improves do_get_write_access(), jbd2_journal_get_undo_access(),
and jbd2_journal_dirty_metadata() to be completely lockless in case buffer
is already part of an appropriate journalling list. First three patches
are independent small cleanups so they can go in right away I think.

The other two patches *should* improve the situation for frequent bitmap
or inode table block updates. But frankly, I haven't been able to come up
with a load where I'd see significant contention on update of a single buffer
(or it's hidden by a larger lock). Similarly we could see improvements when
do_get_write_access() would be waiting for buffer lock because buffer is
being written out by checkpointing code. But again I wasn't able to hit this
reliably.

Ted, you mentioned at Vault you had a setup where frequent
do_get_write_access() calls were contending in the revoke code. What was the
load exactly? These patches should improve that as well...

Honza


2015-04-02 13:58:27

by Jan Kara

[permalink] [raw]
Subject: [PATCH 2/5] jbd2: Simplify error path on allocation failure in do_get_write_access()

We were acquiring bh_state_lock when allocation of buffer failed in
do_get_write_access() only to be able to jump to a label that releases
the lock and does all other checks that don't make sense for this error
path. Just jump into the right label instead.

Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd2/transaction.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 2b75f0f109be..1995ea539e96 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -956,8 +956,7 @@ repeat:
__func__);
JBUFFER_TRACE(jh, "oom!");
error = -ENOMEM;
- jbd_lock_bh_state(bh);
- goto done;
+ goto out;
}
goto repeat;
}
--
2.1.4


2015-04-02 13:58:27

by Jan Kara

[permalink] [raw]
Subject: [PATCH 1/5] jbd2: Simplify code flow in do_get_write_access()

needs_copy is set only in one place in do_get_write_access(), just move
the frozen buffer copying into that place and factor it out to a
separate function to make do_get_write_access() slightly more readable.

Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd2/transaction.c | 49 +++++++++++++++++++++++++------------------------
1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 5f09370c90a8..2b75f0f109be 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -763,6 +763,30 @@ static void warn_dirty_buffer(struct buffer_head *bh)
bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr);
}

+/* Call t_frozen trigger and copy buffer data into jh->b_frozen_data. */
+static void jbd2_freeze_jh_data(struct journal_head *jh)
+{
+ struct page *page;
+ int offset;
+ char *source;
+ struct buffer_head *bh = jh2bh(jh);
+
+ J_EXPECT_JH(jh, buffer_uptodate(bh), "Possible IO failure.\n");
+ page = bh->b_page;
+ offset = offset_in_page(bh->b_data);
+ source = kmap_atomic(page);
+ /* Fire data frozen trigger just before we copy the data */
+ jbd2_buffer_frozen_trigger(jh, source + offset, jh->b_triggers);
+ memcpy(jh->b_frozen_data, source + offset, bh->b_size);
+ kunmap_atomic(source);
+
+ /*
+ * Now that the frozen data is saved off, we need to store any matching
+ * triggers.
+ */
+ jh->b_frozen_triggers = jh->b_triggers;
+}
+
/*
* If the buffer is already part of the current transaction, then there
* is nothing we need to do. If it is already part of a prior
@@ -782,7 +806,6 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
journal_t *journal;
int error;
char *frozen_buffer = NULL;
- int need_copy = 0;
unsigned long start_lock, time_lock;

WARN_ON(!transaction);
@@ -940,7 +963,7 @@ repeat:
}
jh->b_frozen_data = frozen_buffer;
frozen_buffer = NULL;
- need_copy = 1;
+ jbd2_freeze_jh_data(jh);
}
jh->b_next_transaction = transaction;
}
@@ -961,28 +984,6 @@ repeat:
}

done:
- if (need_copy) {
- struct page *page;
- int offset;
- char *source;
-
- J_EXPECT_JH(jh, buffer_uptodate(jh2bh(jh)),
- "Possible IO failure.\n");
- page = jh2bh(jh)->b_page;
- offset = offset_in_page(jh2bh(jh)->b_data);
- source = kmap_atomic(page);
- /* Fire data frozen trigger just before we copy the data */
- jbd2_buffer_frozen_trigger(jh, source + offset,
- jh->b_triggers);
- memcpy(jh->b_frozen_data, source+offset, jh2bh(jh)->b_size);
- kunmap_atomic(source);

2015-04-02 13:58:28

by Jan Kara

[permalink] [raw]
Subject: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()

jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
frequently called for buffers that are already part of the running
transaction - most frequently it is the case for bitmaps, inode table
blocks, and superblock. Since in such cases we have nothing to do, it is
unfortunate we still grab reference to journal head, lock the bh, lock
bh_state only to find out there's nothing to do.

Improving this is a bit subtle though since until we find out journal
head is attached to the running transaction, it can disappear from under
us because checkpointing / commit decided it's no longer needed. We deal
with this by protecting journal_head slab with RCU. We still have to be
careful about journal head being freed & reallocated within slab and
about exposing journal head in consistent state (in particular
b_modified and b_frozen_data must be in correct state before we allow
user to touch the buffer).

FIXME: Performance data.

Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd2/journal.c | 2 +-
fs/jbd2/transaction.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index b96bd8076b70..f29872ed4097 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2330,7 +2330,7 @@ static int jbd2_journal_init_journal_head_cache(void)
jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
sizeof(struct journal_head),
0, /* offset */
- SLAB_TEMPORARY, /* flags */
+ SLAB_TEMPORARY | SLAB_DESTROY_BY_RCU,
NULL); /* ctor */
retval = 0;
if (!jbd2_journal_head_cache) {
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 5207825d1038..a91f639af6c3 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -901,6 +901,12 @@ repeat:
JBUFFER_TRACE(jh, "no transaction");
J_ASSERT_JH(jh, !jh->b_next_transaction);
JBUFFER_TRACE(jh, "file as BJ_Reserved");
+ /*
+ * Make sure all stores to jh (b_modified, b_frozen_data) are
+ * visible before attaching it to the running transaction.
+ * Paired with barrier in jbd2_write_access_granted()
+ */
+ smp_wmb();
spin_lock(&journal->j_list_lock);
__jbd2_journal_file_buffer(jh, transaction, BJ_Reserved);
spin_unlock(&journal->j_list_lock);
@@ -913,8 +919,7 @@ repeat:
if (jh->b_frozen_data) {
JBUFFER_TRACE(jh, "has frozen data");
J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
- jh->b_next_transaction = transaction;
- goto done;
+ goto attach_next;
}

JBUFFER_TRACE(jh, "owned by older transaction");
@@ -968,6 +973,13 @@ repeat:
frozen_buffer = NULL;
jbd2_freeze_jh_data(jh);
}
+attach_next:
+ /*
+ * Make sure all stores to jh (b_modified, b_frozen_data) are visible
+ * before attaching it to the running transaction. Paired with barrier
+ * in jbd2_write_access_granted()
+ */
+ smp_wmb();
jh->b_next_transaction = transaction;

done:
@@ -987,6 +999,55 @@ out:
return error;
}

+/* Fast check whether buffer is already attached to the required transaction */
+static bool jbd2_write_access_granted(handle_t *handle, struct buffer_head *bh)
+{
+ struct journal_head *jh;
+ bool ret = false;
+
+ /* Dirty buffers require special handling... */
+ if (buffer_dirty(bh))
+ return false;
+
+ /*
+ * RCU protects us from dereferencing freed pages. So the checks we do
+ * are guaranteed not to oops. However the jh slab object can get freed
+ * & reallocated while we work with it. So we have to be careful. When
+ * we see jh attached to the running transaction, we know it must stay
+ * so until the transaction is committed. Thus jh won't be freed and
+ * will be attached to the same bh while we run. However it can
+ * happen jh gets freed, reallocated, and attached to the transaction
+ * just after we get pointer to it from bh. So we have to be careful
+ * and recheck jh still belongs to our bh before we return success.
+ */
+ rcu_read_lock();
+ if (!buffer_jbd(bh))
+ goto out;
+ /* This should be bh2jh() but that doesn't work with inline functions */
+ jh = READ_ONCE(bh->b_private);
+ if (!jh)
+ goto out;
+ if (jh->b_transaction != handle->h_transaction &&
+ jh->b_next_transaction != handle->h_transaction)
+ goto out;
+ /*
+ * There are two reasons for the barrier here:
+ * 1) Make sure to fetch b_bh after we did previous checks so that we
+ * detect when jh went through free, realloc, attach to transaction
+ * while we were checking. Paired with implicit barrier in that path.
+ * 2) So that access to bh done after jbd2_write_access_granted()
+ * doesn't get reordered and see inconsistent state of concurrent
+ * do_get_write_access().
+ */
+ smp_mb();
+ if (unlikely(jh->b_bh != bh))
+ goto out;
+ ret = true;
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
/**
* int jbd2_journal_get_write_access() - notify intent to modify a buffer for metadata (not data) update.
* @handle: transaction to add buffer modifications to
@@ -1000,9 +1061,13 @@ out:

int jbd2_journal_get_write_access(handle_t *handle, struct buffer_head *bh)
{
- struct journal_head *jh = jbd2_journal_add_journal_head(bh);
+ struct journal_head *jh;
int rc;

+ if (jbd2_write_access_granted(handle, bh))
+ return 0;
+
+ jh = jbd2_journal_add_journal_head(bh);
/* We do not want to get caught playing with fields which the
* log thread also manipulates. Make sure that the buffer
* completes any outstanding IO before proceeding. */
@@ -1133,11 +1198,14 @@ out:
int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
{
int err;
- struct journal_head *jh = jbd2_journal_add_journal_head(bh);
+ struct journal_head *jh;
char *committed_data = NULL;

JBUFFER_TRACE(jh, "entry");
+ if (jbd2_write_access_granted(handle, bh))
+ return 0;

+ jh = jbd2_journal_add_journal_head(bh);
/*
* Do this first --- it can drop the journal lock, so we want to
* make sure that obtaining the committed_data is done
--
2.1.4


2015-04-02 13:58:30

by Jan Kara

[permalink] [raw]
Subject: [PATCH 5/5] jbd2: Speedup jbd2_journal_dirty_metadata()

It is often the case that we mark buffer as having dirty metadata when
the buffer is already in that state (frequent for bitmaps, inode table
blocks, superblock). Thus it is unnecessary to contend on grabbing
journal head reference and bh_state lock. Avoid that by checking whether
any modification to the buffer is needed before grabbing any locks or
references.

Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd2/transaction.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a91f639af6c3..ad10ca8fb9ef 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1290,8 +1290,6 @@ void jbd2_buffer_abort_trigger(struct journal_head *jh,
triggers->t_abort(triggers, jh2bh(jh));
}

-
-
/**
* int jbd2_journal_dirty_metadata() - mark a buffer as containing dirty metadata
* @handle: transaction to add buffer to.
@@ -1325,12 +1323,25 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
WARN_ON(!transaction);
if (is_handle_aborted(handle))
return -EROFS;
- journal = transaction->t_journal;
- jh = jbd2_journal_grab_journal_head(bh);
- if (!jh) {
+ if (!buffer_jbd(bh)) {
ret = -EUCLEAN;
goto out;
}
+ /*
+ * We don't grab jh reference here since the buffer must be part
+ * of the running transaction.
+ */
+ jh = bh2jh(bh);
+ J_ASSERT_JH(jh, jh->b_transaction == transaction ||
+ jh->b_next_transaction == transaction);
+ if (jh->b_modified == 1) {
+ /* If it's in our transaction it must be in BJ_Metadata list */
+ J_ASSERT_JH(jh, jh->b_transaction != transaction ||
+ jh->b_jlist == BJ_Metadata);
+ goto out;
+ }
+
+ journal = transaction->t_journal;
jbd_debug(5, "journal_head %p\n", jh);
JBUFFER_TRACE(jh, "entry");

@@ -1421,7 +1432,6 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
spin_unlock(&journal->j_list_lock);
out_unlock_bh:
jbd_unlock_bh_state(bh);
- jbd2_journal_put_journal_head(jh);
out:
JBUFFER_TRACE(jh, "exit");
return ret;
--
2.1.4


2015-04-02 13:58:27

by Jan Kara

[permalink] [raw]
Subject: [PATCH 3/5] jbd2: Simplify code flow in do_get_write_access()

Check for the simple case of unjournaled buffer first, handle it and
bail out. This allows us to remove one if and unindent the difficult case
by one tab. The result is easier to read.

Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd2/transaction.c | 130 +++++++++++++++++++++++---------------------------
1 file changed, 59 insertions(+), 71 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 1995ea539e96..5207825d1038 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -893,6 +893,20 @@ repeat:
jh->b_modified = 0;

/*
+ * If the buffer is not journaled right now, we need to make sure it
+ * doesn't get written to disk before the caller actually commits the
+ * new data
+ */
+ if (!jh->b_transaction) {
+ JBUFFER_TRACE(jh, "no transaction");
+ J_ASSERT_JH(jh, !jh->b_next_transaction);
+ JBUFFER_TRACE(jh, "file as BJ_Reserved");
+ spin_lock(&journal->j_list_lock);
+ __jbd2_journal_file_buffer(jh, transaction, BJ_Reserved);
+ spin_unlock(&journal->j_list_lock);
+ goto done;
+ }
+ /*
* If there is already a copy-out version of this buffer, then we don't
* need to make another one
*/
@@ -903,84 +917,58 @@ repeat:
goto done;
}

- /* Is there data here we need to preserve? */
+ JBUFFER_TRACE(jh, "owned by older transaction");
+ J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
+ J_ASSERT_JH(jh, jh->b_transaction == journal->j_committing_transaction);

- if (jh->b_transaction && jh->b_transaction != transaction) {
- JBUFFER_TRACE(jh, "owned by older transaction");
- J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
- J_ASSERT_JH(jh, jh->b_transaction ==
- journal->j_committing_transaction);
+ /*
+ * There is one case we have to be very careful about. If the
+ * committing transaction is currently writing this buffer out to disk
+ * and has NOT made a copy-out, then we cannot modify the buffer
+ * contents at all right now. The essence of copy-out is that it is
+ * the extra copy, not the primary copy, which gets journaled. If the
+ * primary copy is already going to disk then we cannot do copy-out
+ * here.
+ */
+ if (buffer_shadow(bh)) {
+ JBUFFER_TRACE(jh, "on shadow: sleep");
+ jbd_unlock_bh_state(bh);
+ wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
+ goto repeat;
+ }

- /* There is one case we have to be very careful about.
- * If the committing transaction is currently writing
- * this buffer out to disk and has NOT made a copy-out,
- * then we cannot modify the buffer contents at all
- * right now. The essence of copy-out is that it is the
- * extra copy, not the primary copy, which gets
- * journaled. If the primary copy is already going to
- * disk then we cannot do copy-out here. */
-
- if (buffer_shadow(bh)) {
- JBUFFER_TRACE(jh, "on shadow: sleep");
+ /*
+ * Only do the copy if the currently-owning transaction still needs it.
+ * If buffer isn't on BJ_Metadata list, the committing transaction is
+ * past that stage (here we use the fact that BH_Shadow is set under
+ * bh_state lock together with refiling to BJ_Shadow list and at this
+ * point we know the buffer doesn't have BH_Shadow set).
+ *
+ * Subtle point, though: if this is a get_undo_access, then we will be
+ * relying on the frozen_data to contain the new value of the
+ * committed_data record after the transaction, so we HAVE to force the
+ * frozen_data copy in that case.
+ */
+ if (jh->b_jlist == BJ_Metadata || force_copy) {
+ JBUFFER_TRACE(jh, "generate frozen data");
+ if (!frozen_buffer) {
+ JBUFFER_TRACE(jh, "allocate memory for buffer");
jbd_unlock_bh_state(bh);
- wait_on_bit_io(&bh->b_state, BH_Shadow,
- TASK_UNINTERRUPTIBLE);
- goto repeat;
- }
-
- /*
- * Only do the copy if the currently-owning transaction still
- * needs it. If buffer isn't on BJ_Metadata list, the
- * committing transaction is past that stage (here we use the
- * fact that BH_Shadow is set under bh_state lock together with
- * refiling to BJ_Shadow list and at this point we know the
- * buffer doesn't have BH_Shadow set).
- *
- * Subtle point, though: if this is a get_undo_access,
- * then we will be relying on the frozen_data to contain
- * the new value of the committed_data record after the
- * transaction, so we HAVE to force the frozen_data copy
- * in that case.
- */
- if (jh->b_jlist == BJ_Metadata || force_copy) {
- JBUFFER_TRACE(jh, "generate frozen data");
+ frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
if (!frozen_buffer) {
- JBUFFER_TRACE(jh, "allocate memory for buffer");
- jbd_unlock_bh_state(bh);
- frozen_buffer =
- jbd2_alloc(jh2bh(jh)->b_size,
- GFP_NOFS);
- if (!frozen_buffer) {
- printk(KERN_ERR
- "%s: OOM for frozen_buffer\n",
- __func__);
- JBUFFER_TRACE(jh, "oom!");
- error = -ENOMEM;
- goto out;
- }
- goto repeat;
+ printk(KERN_ERR "%s: OOM for frozen_buffer\n",
+ __func__);
+ JBUFFER_TRACE(jh, "oom!");
+ error = -ENOMEM;
+ goto out;
}
- jh->b_frozen_data = frozen_buffer;
- frozen_buffer = NULL;
- jbd2_freeze_jh_data(jh);
+ goto repeat;
}
- jh->b_next_transaction = transaction;
- }
-

2015-04-02 14:23:53

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled

On Thu, Apr 02, 2015 at 03:58:15PM +0200, Jan Kara wrote:
>
> this patch set improves do_get_write_access(), jbd2_journal_get_undo_access(),
> and jbd2_journal_dirty_metadata() to be completely lockless in case buffer
> is already part of an appropriate journalling list. First three patches
> are independent small cleanups so they can go in right away I think.
>
> The other two patches *should* improve the situation for frequent bitmap
> or inode table block updates. But frankly, I haven't been able to come up
> with a load where I'd see significant contention on update of a single buffer
> (or it's hidden by a larger lock). Similarly we could see improvements when
> do_get_write_access() would be waiting for buffer lock because buffer is
> being written out by checkpointing code. But again I wasn't able to hit this
> reliably.
>
> Ted, you mentioned at Vault you had a setup where frequent
> do_get_write_access() calls were contending in the revoke code. What was the
> load exactly? These patches should improve that as well...

Use a 32-core Intel processor with 128GB memory; create a 32GB ram
disk, but ext4 on it, and then run your favorite scalability workload
on it. I used a random 4k write workload, and noted that we were
calling start_handle() all the time. This was fixed in dioread_nolock
since we check to see if it's an overwrite.

I'll have to look at this again, but I remember thinking that we could
push the overwrite check down a level, and with a few other tweaks,
end up fixing the AIO race condition you were worrying about it, as
well as skipping the start_handle() call in the case where we know
we're doing an overwrite in all cases, not just dioread_nolock.

Cheers,

- Ted

2015-04-12 10:09:18

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled


Jan Kara <[email protected]> writes:

> Hello,
>
> this patch set improves do_get_write_access(), jbd2_journal_get_undo_access(),
> and jbd2_journal_dirty_metadata() to be completely lockless in case buffer
> is already part of an appropriate journalling list. First three patches
> are independent small cleanups so they can go in right away I think.
>
> The other two patches *should* improve the situation for frequent bitmap
> or inode table block updates. But frankly, I haven't been able to come up
> with a load where I'd see significant contention on update of a single buffer
> (or it's hidden by a larger lock). Similarly we could see improvements when
> do_get_write_access() would be waiting for buffer lock because buffer is
> being written out by checkpointing code. But again I wasn't able to hit this
> reliably.
One of most annoying performance issues was unpredictable latency of aio submission
This is typical workload on chunk server (object storage, cloud storage,
ceph) where one some tasks performs aio/dio submission and other
performs fsync(). Some times we got this io_submit->touch_mtime()->
do_get_write_access() observes that jh->b_jlist == BJ_Shadow and wait
for transaction commit. So aio-dio submission can block (even if file
was previously allocated) for a long time(1-5sec) on ext3/4
But this was fixed by 'lazytime' option

#Simplified testcase
#BAD workload which provoke endless fsync->commit_transaction
while true; do
xfs_io -c "pwrite -b 1M 1M 32M" \
-f t{1,2,3,5,6,7,8,9,10};
xfs_io -c "pwrite -b 1M 1M 1M" -c \
"fsync" -f -d t11

# Measure aio-dio latency
[root@alice Z]# uname -a
Linux alice.qa.sw.ru 4.0.0-rc7+ #13 SMP Sun Apr 12 00:34:51 MSK 2015
x86_64 x86_64 x86_64 GNU/Linux
[root@alice Z]# ioping -A -C -D -WWW t
4.0 KiB from t (ext4 /dev/sdb1): request=1 time=441 us
...
4.0 KiB from t (ext4 /dev/sdb1): request=12 time=393 us
4.0 KiB from t (ext4 /dev/sdb1): request=13 time=2.7 s <---- too long
4.0 KiB from t (ext4 /dev/sdb1): request=14 time=397 us
4.0 KiB from t (ext4 /dev/sdb1): request=15 time=398 us
^C
--- t (ext4 /dev/sdb1) ioping statistics ---
15 requests completed in 17.2 s, 5 iops, 22.0 KiB/s
min/avg/max/mdev = 384 us / 182.2 ms / 2.7 s / 679.1 ms
>
> Ted, you mentioned at Vault you had a setup where frequent
> do_get_write_access() calls were contending in the revoke code. What was the
> load exactly? These patches should improve that as well...
>
> Honza
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-04-16 10:46:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled

On Sun 12-04-15 14:09:14, Dmitry Monakhov wrote:
>
> Jan Kara <[email protected]> writes:
>
> > Hello,
> >
> > this patch set improves do_get_write_access(), jbd2_journal_get_undo_access(),
> > and jbd2_journal_dirty_metadata() to be completely lockless in case buffer
> > is already part of an appropriate journalling list. First three patches
> > are independent small cleanups so they can go in right away I think.
> >
> > The other two patches *should* improve the situation for frequent bitmap
> > or inode table block updates. But frankly, I haven't been able to come up
> > with a load where I'd see significant contention on update of a single buffer
> > (or it's hidden by a larger lock). Similarly we could see improvements when
> > do_get_write_access() would be waiting for buffer lock because buffer is
> > being written out by checkpointing code. But again I wasn't able to hit this
> > reliably.
> One of most annoying performance issues was unpredictable latency of aio submission
> This is typical workload on chunk server (object storage, cloud storage,
> ceph) where one some tasks performs aio/dio submission and other
> performs fsync(). Some times we got this io_submit->touch_mtime()->
> do_get_write_access() observes that jh->b_jlist == BJ_Shadow and wait
> for transaction commit. So aio-dio submission can block (even if file
> was previously allocated) for a long time(1-5sec) on ext3/4
> But this was fixed by 'lazytime' option
Yeah, my patches don't really help this particular problem. They help the
case where buffer already is part of the running transaction but in your
case we need to move the buffer to the running transaction and that blocks
on buffer being under IO. There isn't much we can do to improve this (well,
we could unconditionally create a frozen buffer for journal IO which would
limit maximum latency but at the cost of memcpy for each journal write so
throughput would suffer in case of faster storage).

Honza
>
> #Simplified testcase
> #BAD workload which provoke endless fsync->commit_transaction
> while true; do
> xfs_io -c "pwrite -b 1M 1M 32M" \
> -f t{1,2,3,5,6,7,8,9,10};
> xfs_io -c "pwrite -b 1M 1M 1M" -c \
> "fsync" -f -d t11
>
> # Measure aio-dio latency
> [root@alice Z]# uname -a
> Linux alice.qa.sw.ru 4.0.0-rc7+ #13 SMP Sun Apr 12 00:34:51 MSK 2015
> x86_64 x86_64 x86_64 GNU/Linux
> [root@alice Z]# ioping -A -C -D -WWW t
> 4.0 KiB from t (ext4 /dev/sdb1): request=1 time=441 us
> ...
> 4.0 KiB from t (ext4 /dev/sdb1): request=12 time=393 us
> 4.0 KiB from t (ext4 /dev/sdb1): request=13 time=2.7 s <---- too long
> 4.0 KiB from t (ext4 /dev/sdb1): request=14 time=397 us
> 4.0 KiB from t (ext4 /dev/sdb1): request=15 time=398 us
> ^C
> --- t (ext4 /dev/sdb1) ioping statistics ---
> 15 requests completed in 17.2 s, 5 iops, 22.0 KiB/s
> min/avg/max/mdev = 384 us / 182.2 ms / 2.7 s / 679.1 ms
> >
> > Ted, you mentioned at Vault you had a setup where frequent
> > do_get_write_access() calls were contending in the revoke code. What was the
> > load exactly? These patches should improve that as well...
> >
> > Honza
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-08 16:39:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd2: Simplify code flow in do_get_write_access()

On Thu, Apr 02, 2015 at 03:58:16PM +0200, Jan Kara wrote:
> needs_copy is set only in one place in do_get_write_access(), just move
> the frozen buffer copying into that place and factor it out to a
> separate function to make do_get_write_access() slightly more readable.
>
> Signed-off-by: Jan Kara <[email protected]>

Thanks, applied.

- Ted

2015-06-08 16:42:14

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/5] jbd2: Simplify error path on allocation failure in do_get_write_access()

On Thu, Apr 02, 2015 at 03:58:17PM +0200, Jan Kara wrote:
> We were acquiring bh_state_lock when allocation of buffer failed in
> do_get_write_access() only to be able to jump to a label that releases
> the lock and does all other checks that don't make sense for this error
> path. Just jump into the right label instead.
>
> Signed-off-by: Jan Kara <[email protected]>

Thanks, applied.

- Ted

2015-06-08 16:45:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/5] jbd2: Simplify code flow in do_get_write_access()

On Thu, Apr 02, 2015 at 03:58:18PM +0200, Jan Kara wrote:
> Check for the simple case of unjournaled buffer first, handle it and
> bail out. This allows us to remove one if and unindent the difficult case
> by one tab. The result is easier to read.
>
> Signed-off-by: Jan Kara <[email protected]>

Applied, although I renamed the summary so we didn't have to commits
with the identical one-line summary in close proximity to each other.

- Ted

2015-06-08 16:47:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()

On Thu, Apr 02, 2015 at 03:58:19PM +0200, Jan Kara wrote:
> jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
> frequently called for buffers that are already part of the running
> transaction - most frequently it is the case for bitmaps, inode table
> blocks, and superblock. Since in such cases we have nothing to do, it is
> unfortunate we still grab reference to journal head, lock the bh, lock
> bh_state only to find out there's nothing to do.
>
> Improving this is a bit subtle though since until we find out journal
> head is attached to the running transaction, it can disappear from under
> us because checkpointing / commit decided it's no longer needed. We deal
> with this by protecting journal_head slab with RCU. We still have to be
> careful about journal head being freed & reallocated within slab and
> about exposing journal head in consistent state (in particular
> b_modified and b_frozen_data must be in correct state before we allow
> user to touch the buffer).
>
> FIXME: Performance data.
>
> Signed-off-by: Jan Kara <[email protected]>

Applied, so we can start getting some testing on this patch. Did you
ever get performance data?

Thanks,

- Ted

2015-06-08 16:50:39

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 5/5] jbd2: Speedup jbd2_journal_dirty_metadata()

On Thu, Apr 02, 2015 at 03:58:20PM +0200, Jan Kara wrote:
> It is often the case that we mark buffer as having dirty metadata when
> the buffer is already in that state (frequent for bitmaps, inode table
> blocks, superblock). Thus it is unnecessary to contend on grabbing
> journal head reference and bh_state lock. Avoid that by checking whether
> any modification to the buffer is needed before grabbing any locks or
> references.
>
> Signed-off-by: Jan Kara <[email protected]>

Thanks, applied.

- Ted

2015-06-08 22:32:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()

On Mon, Jun 08, 2015 at 12:47:26PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 02, 2015 at 03:58:19PM +0200, Jan Kara wrote:
> > jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
> > frequently called for buffers that are already part of the running
> > transaction - most frequently it is the case for bitmaps, inode table
> > blocks, and superblock. Since in such cases we have nothing to do, it is
> > unfortunate we still grab reference to journal head, lock the bh, lock
> > bh_state only to find out there's nothing to do.
> >
> > Improving this is a bit subtle though since until we find out journal
> > head is attached to the running transaction, it can disappear from under
> > us because checkpointing / commit decided it's no longer needed. We deal
> > with this by protecting journal_head slab with RCU. We still have to be
> > careful about journal head being freed & reallocated within slab and
> > about exposing journal head in consistent state (in particular
> > b_modified and b_frozen_data must be in correct state before we allow
> > user to touch the buffer).
> >
> > FIXME: Performance data.
> >
> > Signed-off-by: Jan Kara <[email protected]>
>
> Applied, so we can start getting some testing on this patch. Did you
> ever get performance data?

.... and this patch is causing generic/011 to fail.

generic/011 2s ... [18:26:52][ 13.085375] run fstests generic/011 at 2015-06-08 18:26:52
[ 13.698245] ------------[ cut here ]------------
[ 13.699093] kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1329!
[ 13.700354] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 13.701388] Modules linked in:
[ 13.701505] CPU: 0 PID: 3947 Comm: dirstress Not tainted 4.1.0-rc4-00034-g562bef4 #2758
[ 13.701505] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 13.701505] task: ee1bc110 ti: ec080000 task.ti: ec080000
[ 13.701505] EIP: 0060:[<c02eeb96>] EFLAGS: 00210206 CPU: 0
[ 13.701505] EIP is at jbd2_journal_dirty_metadata+0x5e/0x1da
[ 13.701505] EAX: 00000000 EBX: eccde090 ECX: f03fd580 EDX: f03fd580
[ 13.701505] ESI: eee85860 EDI: ed918cc0 EBP: ec081e14 ESP: ec081e00
[ 13.701505] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 13.701505] CR0: 8005003b CR2: b73fbb20 CR3: 2fd58700 CR4: 000006f0
[ 13.701505] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 13.701505] DR6: fffe0ff0 DR7: 00000400
[ 13.701505] Stack:
[ 13.701505] 000000f7 f03fd580 ed918cc0 eee85860 00000000 ec081e30 c02d7fc8 00001179
[ 13.701505] c0865c3c eae67c30 ee17f800 00000000 ec081e84 c02b6458 00000000 ed918cc0
[ 13.701505] c02ee941 00000000 00000000 00000000 eae67b20 00000000 00000000 eae67e78
[ 13.701505] Call Trace:
[ 13.701505] [<c02d7fc8>] __ext4_handle_dirty_metadata+0xd4/0x19d
[ 13.701505] [<c02b6458>] ext4_mark_iloc_dirty+0x458/0x577
[ 13.701505] [<c02ee941>] ? jbd2_journal_get_write_access+0x3d/0x48
[ 13.701505] [<c02b66e4>] ext4_mark_inode_dirty+0x105/0x252
[ 13.701505] [<c02b121b>] __ext4_new_inode+0xcb6/0xe9b
[ 13.701505] [<c02bd87d>] ext4_mknod+0x8b/0x11c
[ 13.701505] [<c025ffbe>] vfs_mknod+0x7e/0x9e
[ 13.701505] [<c0263a0a>] SyS_mknodat+0x119/0x15a
[ 13.701505] [<c0263a65>] SyS_mknod+0x1a/0x1c
[ 13.701505] [<c083292a>] syscall_call+0x7/0x7
[ 13.701505] Code: 00 00 8b 5f 24 8b 4b 18 39 d1 74 07 39 53 1c 74 02 0f 0b 83 7b 0c 01 75 14 39 d1 0f 85 7e 01 00 00 83 7b 08 01 0f 84 74 01 00 00 <0f> 0b 8b 02 53 68 16 2c ac c0 68 36 05 00 00 68 14 81 86 c0 68
[ 13.701505] EIP: [<c02eeb96>] jbd2_journal_dirty_metadata+0x5e/0x1da SS:ESP 0068:ec081e00
[ 13.734150] ---[ end trace eb359de3ec6c3af4 ]---

I will drop 4/5 and 5/5 from this patch series from the ext4 tree for
now. Could you take a look at this?

- Ted

2015-06-09 05:24:53

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()

On Mon, Jun 08, 2015 at 06:32:30PM -0400, Theodore Ts'o wrote:
>
> .... and this patch is causing generic/011 to fail.
>
> generic/011 2s ... [18:26:52][ 13.085375] run fstests generic/011 at 2015-06-08 18:26:52
> [ 13.698245] ------------[ cut here ]------------
> [ 13.699093] kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1329!
> [ 13.700354] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 13.701388] Modules linked in:
> [ 13.701505] CPU: 0 PID: 3947 Comm: dirstress Not tainted 4.1.0-rc4-00034-g562bef4 #2758
> [ 13.701505] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 13.701505] task: ee1bc110 ti: ec080000 task.ti: ec080000
> [ 13.701505] EIP: 0060:[<c02eeb96>] EFLAGS: 00210206 CPU: 0
> [ 13.701505] EIP is at jbd2_journal_dirty_metadata+0x5e/0x1da
> [ 13.701505] EAX: 00000000 EBX: eccde090 ECX: f03fd580 EDX: f03fd580
> [ 13.701505] ESI: eee85860 EDI: ed918cc0 EBP: ec081e14 ESP: ec081e00
> [ 13.701505] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> [ 13.701505] CR0: 8005003b CR2: b73fbb20 CR3: 2fd58700 CR4: 000006f0
> [ 13.701505] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [ 13.701505] DR6: fffe0ff0 DR7: 00000400
> [ 13.701505] Stack:
> [ 13.701505] 000000f7 f03fd580 ed918cc0 eee85860 00000000 ec081e30 c02d7fc8 00001179
> [ 13.701505] c0865c3c eae67c30 ee17f800 00000000 ec081e84 c02b6458 00000000 ed918cc0
> [ 13.701505] c02ee941 00000000 00000000 00000000 eae67b20 00000000 00000000 eae67e78
> [ 13.701505] Call Trace:
> [ 13.701505] [<c02d7fc8>] __ext4_handle_dirty_metadata+0xd4/0x19d
> [ 13.701505] [<c02b6458>] ext4_mark_iloc_dirty+0x458/0x577
> [ 13.701505] [<c02ee941>] ? jbd2_journal_get_write_access+0x3d/0x48
> [ 13.701505] [<c02b66e4>] ext4_mark_inode_dirty+0x105/0x252
> [ 13.701505] [<c02b121b>] __ext4_new_inode+0xcb6/0xe9b
> [ 13.701505] [<c02bd87d>] ext4_mknod+0x8b/0x11c
> [ 13.701505] [<c025ffbe>] vfs_mknod+0x7e/0x9e
> [ 13.701505] [<c0263a0a>] SyS_mknodat+0x119/0x15a
> [ 13.701505] [<c0263a65>] SyS_mknod+0x1a/0x1c
> [ 13.701505] [<c083292a>] syscall_call+0x7/0x7
> [ 13.701505] Code: 00 00 8b 5f 24 8b 4b 18 39 d1 74 07 39 53 1c 74 02 0f 0b 83 7b 0c 01 75 14 39 d1 0f 85 7e 01 00 00 83 7b 08 01 0f 84 74 01 00 00 <0f> 0b 8b 02 53 68 16 2c ac c0 68 36 05 00 00 68 14 81 86 c0 68
> [ 13.701505] EIP: [<c02eeb96>] jbd2_journal_dirty_metadata+0x5e/0x1da SS:ESP 0068:ec081e00
> [ 13.734150] ---[ end trace eb359de3ec6c3af4 ]---
>
> I will drop 4/5 and 5/5 from this patch series from the ext4 tree for
> now. Could you take a look at this?

It looks like the problem is really caused by the patch #5/5, not
patch #4/5. So I'll drop patch #5 and do more in-depth testing with
the first four patches in the patch series.

- Ted


2015-06-17 16:39:53

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()

On Tue 09-06-15 01:24:50, Ted Tso wrote:
> On Mon, Jun 08, 2015 at 06:32:30PM -0400, Theodore Ts'o wrote:
> >
> > .... and this patch is causing generic/011 to fail.
> >
> > generic/011 2s ... [18:26:52][ 13.085375] run fstests generic/011 at 2015-06-08 18:26:52
> > [ 13.698245] ------------[ cut here ]------------
> > [ 13.699093] kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1329!
> > [ 13.700354] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> > [ 13.701388] Modules linked in:
> > [ 13.701505] CPU: 0 PID: 3947 Comm: dirstress Not tainted 4.1.0-rc4-00034-g562bef4 #2758
> > [ 13.701505] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [ 13.701505] task: ee1bc110 ti: ec080000 task.ti: ec080000
> > [ 13.701505] EIP: 0060:[<c02eeb96>] EFLAGS: 00210206 CPU: 0
> > [ 13.701505] EIP is at jbd2_journal_dirty_metadata+0x5e/0x1da
> > [ 13.701505] EAX: 00000000 EBX: eccde090 ECX: f03fd580 EDX: f03fd580
> > [ 13.701505] ESI: eee85860 EDI: ed918cc0 EBP: ec081e14 ESP: ec081e00
> > [ 13.701505] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > [ 13.701505] CR0: 8005003b CR2: b73fbb20 CR3: 2fd58700 CR4: 000006f0
> > [ 13.701505] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > [ 13.701505] DR6: fffe0ff0 DR7: 00000400
> > [ 13.701505] Stack:
> > [ 13.701505] 000000f7 f03fd580 ed918cc0 eee85860 00000000 ec081e30 c02d7fc8 00001179
> > [ 13.701505] c0865c3c eae67c30 ee17f800 00000000 ec081e84 c02b6458 00000000 ed918cc0
> > [ 13.701505] c02ee941 00000000 00000000 00000000 eae67b20 00000000 00000000 eae67e78
> > [ 13.701505] Call Trace:
> > [ 13.701505] [<c02d7fc8>] __ext4_handle_dirty_metadata+0xd4/0x19d
> > [ 13.701505] [<c02b6458>] ext4_mark_iloc_dirty+0x458/0x577
> > [ 13.701505] [<c02ee941>] ? jbd2_journal_get_write_access+0x3d/0x48
> > [ 13.701505] [<c02b66e4>] ext4_mark_inode_dirty+0x105/0x252
> > [ 13.701505] [<c02b121b>] __ext4_new_inode+0xcb6/0xe9b
> > [ 13.701505] [<c02bd87d>] ext4_mknod+0x8b/0x11c
> > [ 13.701505] [<c025ffbe>] vfs_mknod+0x7e/0x9e
> > [ 13.701505] [<c0263a0a>] SyS_mknodat+0x119/0x15a
> > [ 13.701505] [<c0263a65>] SyS_mknod+0x1a/0x1c
> > [ 13.701505] [<c083292a>] syscall_call+0x7/0x7
> > [ 13.701505] Code: 00 00 8b 5f 24 8b 4b 18 39 d1 74 07 39 53 1c 74 02 0f 0b 83 7b 0c 01 75 14 39 d1 0f 85 7e 01 00 00 83 7b 08 01 0f 84 74 01 00 00 <0f> 0b 8b 02 53 68 16 2c ac c0 68 36 05 00 00 68 14 81 86 c0 68
> > [ 13.701505] EIP: [<c02eeb96>] jbd2_journal_dirty_metadata+0x5e/0x1da SS:ESP 0068:ec081e00
> > [ 13.734150] ---[ end trace eb359de3ec6c3af4 ]---
> >
> > I will drop 4/5 and 5/5 from this patch series from the ext4 tree for
> > now. Could you take a look at this?
>
> It looks like the problem is really caused by the patch #5/5, not
> patch #4/5. So I'll drop patch #5 and do more in-depth testing with
> the first four patches in the patch series.
Yeah, sorry. I had fixed this locally (the assertion I have added had
false positives) but I didn't resend. I have checked and I don't have any
other changes locally. New version of the patch 5/5 is attached.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (3.11 kB)
0005-jbd2-Speedup-jbd2_journal_dirty_metadata.patch (2.61 kB)
Download all attachments

2015-06-17 16:56:51

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()

On Mon 08-06-15 12:47:26, Ted Tso wrote:
> On Thu, Apr 02, 2015 at 03:58:19PM +0200, Jan Kara wrote:
> > jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
> > frequently called for buffers that are already part of the running
> > transaction - most frequently it is the case for bitmaps, inode table
> > blocks, and superblock. Since in such cases we have nothing to do, it is
> > unfortunate we still grab reference to journal head, lock the bh, lock
> > bh_state only to find out there's nothing to do.
> >
> > Improving this is a bit subtle though since until we find out journal
> > head is attached to the running transaction, it can disappear from under
> > us because checkpointing / commit decided it's no longer needed. We deal
> > with this by protecting journal_head slab with RCU. We still have to be
> > careful about journal head being freed & reallocated within slab and
> > about exposing journal head in consistent state (in particular
> > b_modified and b_frozen_data must be in correct state before we allow
> > user to touch the buffer).
> >
> > FIXME: Performance data.
> >
> > Signed-off-by: Jan Kara <[email protected]>
>
> Applied, so we can start getting some testing on this patch. Did you
> ever get performance data?

Yes. Here are results for reaim fserver workload for 32 core machine with
128 GB of ram with ext4 on ramdisk:
Procs Vanilla Patched
1 20420.688 21155.556
21 49684.704 178934.074
41 84630.364 196647.482
61 106344.284 204831.652
81 120751.370 214842.428
101 131585.450 208761.832
121 138092.078 212741.648
141 142271.578 212118.502
161 146008.364 213731.388
181 149569.494 216121.444

Numbers are operations per second so larger is better. You can see that
for 21 processes we get increase by 260% in the number operations. Also the
total maximum of operations the machine is able to achieve increases by
44% because of overall lower CPU overhead.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-18 08:52:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()

On Thu 18-06-15 15:59:33, Jerry Lee wrote:
> Hi:
>
> I found that there may be a typo in the attached patch 5/5:
>
> + /*
> + * If it's in our transaction it must be in BJ_Metadata list.
> + * The assertion is unreliable since we may see jh in
> + * inconsistent state unless we grab bh_state lock. But this
> + * is crutial to catch bugs so let's do a reliable check until
> + * the lockless handling is fully proven.
> + */
> + if (jh->b_transaction == transaction &&
> + jh->b_jlist != BJ_Metadata) {
> + jbd_lock_bh_state(bh);
> + J_ASSERT_JH(jh, jh->b_transaction != transaction ||
> + jh->b_jlist == BJ_Metadata);
> + jbd_lock_bh_state(bh);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> + }
> + goto out;
>
> Does the highlight part should be "jbd_unlock_bh_state(bh)"?
Yeah, thanks for catching this. I was apparently pretty lucky in this
passing all the tests... Attached is the new version of the patch.

Honza

> On Thu, Jun 18, 2015 at 12:56 AM, Jan Kara <[email protected]> wrote:
>
> > On Mon 08-06-15 12:47:26, Ted Tso wrote:
> > > On Thu, Apr 02, 2015 at 03:58:19PM +0200, Jan Kara wrote:
> > > > jbd2_journal_get_write_access() and jbd2_journal_get_create_access()
> > are
> > > > frequently called for buffers that are already part of the running
> > > > transaction - most frequently it is the case for bitmaps, inode table
> > > > blocks, and superblock. Since in such cases we have nothing to do, it
> > is
> > > > unfortunate we still grab reference to journal head, lock the bh, lock
> > > > bh_state only to find out there's nothing to do.
> > > >
> > > > Improving this is a bit subtle though since until we find out journal
> > > > head is attached to the running transaction, it can disappear from
> > under
> > > > us because checkpointing / commit decided it's no longer needed. We
> > deal
> > > > with this by protecting journal_head slab with RCU. We still have to be
> > > > careful about journal head being freed & reallocated within slab and
> > > > about exposing journal head in consistent state (in particular
> > > > b_modified and b_frozen_data must be in correct state before we allow
> > > > user to touch the buffer).
> > > >
> > > > FIXME: Performance data.
> > > >
> > > > Signed-off-by: Jan Kara <[email protected]>
> > >
> > > Applied, so we can start getting some testing on this patch. Did you
> > > ever get performance data?
> >
> > Yes. Here are results for reaim fserver workload for 32 core machine with
> > 128 GB of ram with ext4 on ramdisk:
> > Procs Vanilla Patched
> > 1 20420.688 21155.556
> > 21 49684.704 178934.074
> > 41 84630.364 196647.482
> > 61 106344.284 204831.652
> > 81 120751.370 214842.428
> > 101 131585.450 208761.832
> > 121 138092.078 212741.648
> > 141 142271.578 212118.502
> > 161 146008.364 213731.388
> > 181 149569.494 216121.444
> >
> > Numbers are operations per second so larger is better. You can see that
> > for 21 processes we get increase by 260% in the number operations. Also the
> > total maximum of operations the machine is able to achieve increases by
> > 44% because of overall lower CPU overhead.
> >
> > Honza
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (3.59 kB)
0005-jbd2-Speedup-jbd2_journal_dirty_metadata.patch (2.61 kB)
Download all attachments

2015-06-21 01:56:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()

On Thu, Jun 18, 2015 at 10:52:35AM +0200, Jan Kara wrote:
> Yeah, thanks for catching this. I was apparently pretty lucky in this
> passing all the tests... Attached is the new version of the patch.

Thanks, I've added this to the ext4 tree.

- Ted