On Thu 17-06-21 01:44:13, Andreas Dilger wrote:
> On Jun 16, 2021, at 4:56 AM, Jan Kara <[email protected]> wrote:
> >
> > Ext4 orphan inode handling is a bottleneck for workloads which heavily
> > truncate / unlink small files since it contends on the global
> > s_orphan_mutex lock (and generally it's difficult to improve scalability
> > of the ondisk linked list of orphaned inodes).
> >
> > This patch implements a new way of handling orphan inodes. Instead of
> > linking an orphaned inode into a linked list, we store its inode number in
> > a new special file which we call "orphan file". Currently we still
> > protect the orphan file with a spinlock for simplicity but even in this
> > setting we can substantially reduce the length of the critical section
> > and thus speed up some workloads.
>
> Is it a single spinlock for the whole file? Did you consider using
> a per-page lock or grouplock? With a page in the orphan file for each
> CPU core, it would basically be lockless.

See the next patch :) I've made this one simple in terms of locking:

a) to be able to evaluate how global spinlock performs
b) to make code simpler for review

> > +static int ext4_orphan_file_add(handle_t *handle, struct inode *inode)
> > +{
> > spin_lock(&oi->of_lock);
> > + for (i = 0; i < oi->of_blocks && !oi->of_binfo[i].ob_free_entries; i++);
> > + if (i == oi->of_blocks) {
> > + spin_unlock(&oi->of_lock);
> > + /*
> > + * For now we don't grow or shrink orphan file. We just use
> > + * whatever was allocated at mke2fs time. The additional
> > + * credits we would have to reserve for each orphan inode
> > + * operation just don't seem worth it.
> > + */
> > + return -ENOSPC;
> > + }
> > + oi->of_binfo[i].ob_free_entries--;
> > + spin_unlock(&oi->of_lock);
>
> How do we know how large to make the orphan file at mkfs time? What if it
> becomes full during use? It seems like reserving a fixed number of blocks
> will invariably be incorrect for the actual workload on the filesystem.

If the orphan file gets full (too many orphaned inodes at that moment), we
just fall back to using the good old orphan list. So only performance will
suffer.
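
To illustrate the fallback described above, here is a minimal userspace sketch, not the patch itself. All names (`orphan_model`, `model_file_add`, `model_orphan_add`, `SLOTS`) are hypothetical, the "orphan file" is a tiny fixed array, and the old orphan list is modelled as a plain array purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SLOTS 4	/* hypothetical size, tiny on purpose */

struct orphan_model {
	unsigned int file_slots[SLOTS];	/* 0 means "free slot" */
	unsigned int list[16];		/* stand-in for the old orphan list */
	size_t list_len;
};

/* Try to record ino in the fixed-size "orphan file". */
static int model_file_add(struct orphan_model *m, unsigned int ino)
{
	assert(ino != 0);	/* 0 is reserved to mark a free slot */
	for (size_t i = 0; i < SLOTS; i++) {
		if (m->file_slots[i] == 0) {
			m->file_slots[i] = ino;
			return 0;
		}
	}
	return -ENOSPC;
}

/*
 * Returns 0 if ino went into the file, 1 if it fell back to the list,
 * or a negative errno if neither had room.
 */
static int model_orphan_add(struct orphan_model *m, unsigned int ino)
{
	int err = model_file_add(m, ino);

	if (err != -ENOSPC)
		return err;
	if (m->list_len == sizeof(m->list) / sizeof(m->list[0]))
		return -ENOSPC;
	m->list[m->list_len++] = ino;
	return 1;
}
```

The point of the shape is that only -ENOSPC from the file path triggers the fallback; any other result is returned to the caller unchanged.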

In terms of number of blocks, for reasonably large filesystems we reserve
512 4k blocks for the orphan file, which allows for 523776 orphaned inodes.
Sure, it's possible to exhaust that, but frankly I don't find it likely, so
I'm not sure dynamic sizing is worth the hassle.
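
As a rough check of that figure: 523776 = 512 * 1023, which is consistent with 4-byte inode entries and one 4-byte slot per block kept for something other than an inode number. That per-block reservation is an inference from the arithmetic, not something stated in this mail, and the helper name below is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Capacity of an orphan file, ASSUMING 4-byte entries and one entry-sized
 * slot reserved per block (inferred from 523776 / 512 == 1023).
 */
static uint64_t orphan_file_capacity(uint32_t blocks, uint32_t block_size)
{
	uint32_t entry_size = sizeof(uint32_t);
	uint32_t per_block;

	assert(block_size >= 2 * entry_size);
	per_block = block_size / entry_size - 1;
	return (uint64_t)blocks * per_block;
}
```

Under these assumptions, `orphan_file_capacity(512, 4096)` gives 512 * 1023 = 523776.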

> > @@ -49,6 +95,16 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> > ASSERT((S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> > S_ISLNK(inode->i_mode)) || inode->i_nlink == 0);
> >
> > + if (sbi->s_orphan_info.of_blocks) {
> > + err = ext4_orphan_file_add(handle, inode);
> > + /*
> > + * Fall back to normal orphan list if orphan file is
> > + * out of space
> > + */
> > + if (err != -ENOSPC)
> > + return err;
> > + }
>
> This could schedule a task on a workqueue to allocate a few more blocks?
> That could easily reserve more credits for this action, without making
> the common case more expensive. Even if it isn't used with the current
> mount, it would be available for the next mount (which presumably would
> also need additional blocks).
>
> Whether it is worth the complexity to make this fully dynamic, at least
> it would auto-tune for the workload placed on this filesystem, and would
> not initially be worse than the old single-linked list.

Adding more blocks would not be that hard, as you say, but if we are growing
the file there may be a need to shrink it as well (e.g. a short-lived peak
in the number of orphaned inodes could have accumulated a bazillion blocks
for the orphan file), and that will be a bit more tricky. It can be done but
I don't think it's worth the complexity...

Thanks for the review!
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-06-17 08:25:07

by Jan Kara

Subject: Re: [PATCH 1/4] ext4: Support for checksumming from journal triggers

On Wed 16-06-21 13:56:30, Andreas Dilger wrote:
> On Jun 16, 2021, at 4:56 AM, Jan Kara <[email protected]> wrote:
> >
> > JBD2 layer supports triggers which are called when the journaling layer moves
> > buffer to a certain state. We can use the frozen trigger, which gets
> > called when buffer data is frozen and about to be written out to the
> > journal, to compute block checksums for some buffer types (as ocfs2
> > does). This avoids unnecessary repeated recomputation of the
> > checksum (at the cost of larger window where memory corruption won't be
> > caught by checksumming) and is even necessary when there are
> > unsynchronized updaters of the checksummed data.
> >
> > So add argument to ext4_journal_get_write_access() and
> > ext4_journal_get_create_access() which describes buffer type so that
> > triggers can be set accordingly. This patch is mostly only a change of
> > prototype of the above mentioned functions and a few small helpers. Real
> > checksumming will come later.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
>
> Comment inline.
>
> >
> > diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> > index be799040a415..f601e24b6015 100644
> > --- a/fs/ext4/ext4_jbd2.c
> > +++ b/fs/ext4/ext4_jbd2.c
> > @@ -229,11 +231,18 @@ int __ext4_journal_get_write_access(const char *where, unsigned int line,
> >
> > if (ext4_handle_valid(handle)) {
> > err = jbd2_journal_get_write_access(handle, bh);
> > - if (err)
> > + if (err) {
> > ext4_journal_abort_handle(where, line, __func__, bh,
> > handle, err);
> > + return err;
> > + }
> > }
> > - return err;
> > + if (trigger_type == EXT4_JTR_NONE || !ext4_has_metadata_csum(sb))
> > + return 0;
> > + WARN_ON_ONCE(trigger_type >= EXT4_JOURNAL_TRIGGER_COUNT);
>
> I'm not sure WARN_ON_ONCE() is enough here. This would essentially result
> in executing a random (or maybe NULL) function pointer later on. Either
> trigger_type should be checked early and return an error, or this should
> be a BUG_ON() so that the crash happens here instead of in jbd context.

Good point, I'll fix that.
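
One way the "check early and return an error" option could look, as a userspace sketch only. The enum, the table, and `model_get_trigger` are hypothetical stand-ins, not the ext4 definitions, and -EINVAL is an assumed error code; the actual fix may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*trigger_fn)(void *data);

/* Hypothetical trigger types; NOT the real ext4 enum. */
enum model_trigger_type {
	MODEL_JTR_NONE = -1,	/* caller wants no trigger */
	MODEL_JTR_ORPHAN_FILE,
	MODEL_JTR_COUNT		/* number of real trigger types */
};

static int orphan_trigger(void *data)
{
	(void)data;
	return 0;
}

static const trigger_fn model_triggers[MODEL_JTR_COUNT] = {
	[MODEL_JTR_ORPHAN_FILE] = orphan_trigger,
};

/*
 * Look up the trigger for type. The range check happens before the
 * table is indexed, so a bad type yields an error instead of a stray
 * function pointer being called later.
 */
static int model_get_trigger(int type, trigger_fn *out)
{
	assert(out != NULL);
	*out = NULL;
	if (type == MODEL_JTR_NONE)
		return 0;
	if (type < 0 || type >= MODEL_JTR_COUNT)
		return -EINVAL;
	*out = model_triggers[type];
	return 0;
}
```

The BUG_ON() alternative would crash at the same spot instead of returning; either way the bad index never reaches the table.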

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-06-17 14:06:55

On Wed 30-06-21 15:46:35, Lukas Czerner wrote:
> On Wed, Jun 16, 2021 at 12:56:55PM +0200, Jan Kara wrote:
> > @@ -28,28 +42,24 @@ static int ext4_orphan_file_add(handle_t *handle, struct inode *inode)
> > */
> > return -ENOSPC;
> > }
> > - oi->of_binfo[i].ob_free_entries--;
> > - spin_unlock(&oi->of_lock);
> >
> > - /*
> > - * Get access to orphan block. We have dropped of_lock but since we
> > - * have decremented number of free entries we are guaranteed free entry
> > - * in our block.
> > - */
> > ret = ext4_journal_get_write_access(handle, inode->i_sb,
> > oi->of_binfo[i].ob_bh, EXT4_JTR_ORPHAN_FILE);
> > if (ret)
> > return ret;
> >
> > bdata = (__le32 *)(oi->of_binfo[i].ob_bh->b_data);
> > - spin_lock(&oi->of_lock);
> > /* Find empty slot in a block */
> > - for (j = 0; j < inodes_per_ob && bdata[j]; j++);
> > - BUG_ON(j == inodes_per_ob);
> > - bdata[j] = cpu_to_le32(inode->i_ino);
> > + j = 0;
> > + do {
> > + while (bdata[j]) {
> > + if (++j >= inodes_per_ob)
> > + j = 0;
> > + }
> > + } while (cmpxchg(&bdata[j], 0, cpu_to_le32(inode->i_ino)) != 0);
>
> In case there is any sort of corruption on disk or in memory we can
> potentially get stuck here forever, right? Not sure if that matters
> all that much.
>
> Other than that it looks good and negates some of my comments on the
> previous patch, sorry about that ;)
>
> You can add
>
> Reviewed-by: Lukas Czerner <[email protected]>

Good point. I've added some limits (and cond_resched()) to the loop so
that we cannot loop indefinitely. Thanks for the review!
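
For illustration, a bounded variant of such a slot-claiming loop might look like the userspace sketch below. It uses C11 atomics in place of the kernel's cmpxchg(); `claim_slot`, the `max_passes` limit, and the -ENOSPC return are assumptions made for the example, not the code that went into the next version.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Claim a free (zero) slot for ino with compare-and-swap. Gives up after
 * max_passes sweeps over the block instead of spinning forever when no
 * slot ever becomes free (e.g. because the free count was corrupted).
 * Returns the slot index on success, -ENOSPC otherwise.
 */
static int claim_slot(_Atomic uint32_t *slots, unsigned int nslots,
		      uint32_t ino, unsigned int max_passes)
{
	assert(ino != 0);	/* 0 marks a free slot */
	for (unsigned int pass = 0; pass < max_passes; pass++) {
		for (unsigned int j = 0; j < nslots; j++) {
			uint32_t expected = 0;

			if (atomic_compare_exchange_strong(&slots[j],
							   &expected, ino))
				return (int)j;
		}
		/* kernel code would call cond_resched() around here */
	}
	return -ENOSPC;
}
```

Because each slot is taken with a compare-and-swap, a racing writer that grabs the same slot first simply makes this caller move on to the next one.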

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR