Hi Ted,
now that FAST'17 is behind us, is there any plan to land the ext4-lazy code
(SMR optimizations) to the upstream kernel? This looks like it improves
some workloads even without SMR disks, and doesn't have any noticeable
overhead for other workloads.
I'd guess the one thing that we might want to do is still allow the journal
to optionally checkpoint the metadata to the filesystem in the background,
when the filesystem is otherwise idle, so that in case of journal loss for
some reason the whole filesystem is not lost?
Cheers, Andreas
On Mon, Apr 10, 2017 at 09:06:23PM -0600, Andreas Dilger wrote:
> Hi Ted,
> now that FAST'17 is behind us, is there any plan to land the ext4-lazy code
> (SMR optimizations) to the upstream kernel? This looks like it improves
> some workloads even without SMR disks, and doesn't have any noticeable
> overhead for other workloads.
>
> I'd guess the one thing that we might want to do is still allow the journal
> to optionally checkpoint the metadata to the filesystem in the background,
> when the filesystem is otherwise idle, so that in case of journal loss for
> some reason the whole filesystem is not lost?
Don't forget to fix jbd2_bh_submit_read to behave when JBD2_FLAG_ESCAPE
is set on a journal data block.
--D
>
> Cheers, Andreas
On 4/10/17 10:06 PM, Andreas Dilger wrote:
> Hi Ted,
> now that FAST'17 is behind us, is there any plan to land the ext4-lazy code
> (SMR optimizations) to the upstream kernel? This looks like it improves
> some workloads even without SMR disks, and doesn't have any noticeable
> overhead for other workloads.
>
> I'd guess the one thing that we might want to do is still allow the journal
> to optionally checkpoint the metadata to the filesystem in the background,
> when the filesystem is otherwise idle, so that in case of journal loss for
> some reason the whole filesystem is not lost?
IIRC even the new larger default journal size was a big win by itself, yes?
-Eric
On Apr 17, 2017, at 8:18 AM, Eric Sandeen wrote:
>
> On 4/10/17 10:06 PM, Andreas Dilger wrote:
>> Hi Ted,
>> now that FAST'17 is behind us, is there any plan to land the ext4-lazy code
>> (SMR optimizations) to the upstream kernel? This looks like it improves
>> some workloads even without SMR disks, and doesn't have any noticeable
>> overhead for other workloads.
>>
>> I'd guess the one thing that we might want to do is still allow the journal
>> to optionally checkpoint the metadata to the filesystem in the background,
>> when the filesystem is otherwise idle, so that in case of journal loss for
>> some reason the whole filesystem is not lost?
>
> IIRC even the new larger default journal size was a big win by itself, yes?
For workloads with many threads modifying the filesystem, that is definitely a win. We've used
journal sizes up to 1GB for Lustre object targets and up to 4GB for
metadata targets, just because worst-case journal credit reservation
causes transaction stalls even if the transaction doesn't grow large.
That is especially true for fast devices like SSD metadata targets
that do tens of thousands of ops/sec with quotas, ACLs, xattrs, etc.
This is somewhat worse on Lustre because we store additional
xattrs and also update Lustre-specific transaction log files in the
same transaction as each filesystem-modifying operation.
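A rough back-of-the-envelope sketch of why worst-case credit reservation favors a larger journal. Everything numeric here except the 1 GiB and 4 GiB journal sizes is an assumption for illustration: a 4 KiB block size, a running transaction capped at one quarter of the journal (which, as far as I know, is how jbd2 sized transactions around the time of this thread), a 128 MiB baseline journal, and a made-up worst-case reservation of 100 credits per handle. The function name is hypothetical and is not a jbd2 API.

```python
BLOCK_SIZE = 4096  # assumed filesystem block size, in bytes
MiB = 1024 * 1024
GiB = 1024 * MiB


def max_concurrent_handles(journal_bytes, credits_per_handle, txn_divisor=4):
    """Estimate how many handles can hold worst-case reservations at once.

    Assumes one running transaction may use at most journal/txn_divisor
    blocks, and that each handle reserves credits_per_handle blocks up
    front whether or not it ends up dirtying that many.
    """
    journal_blocks = journal_bytes // BLOCK_SIZE
    max_txn_blocks = journal_blocks // txn_divisor
    return max_txn_blocks // credits_per_handle


if __name__ == "__main__":
    # 128 MiB is an assumed baseline; 1 GiB and 4 GiB are the sizes
    # mentioned above for Lustre object and metadata targets.
    for label, size in (("128 MiB", 128 * MiB), ("1 GiB", GiB), ("4 GiB", 4 * GiB)):
        print(label, max_concurrent_handles(size, credits_per_handle=100))
```

Under these assumptions, the number of handles that can run before new ones must wait scales linearly with journal size, even if each transaction ends up writing far fewer blocks than it reserved.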
IMHO, the ext4-lazy feature would also potentially be useful for non-SMR
devices, where we could do full data journaling (optimistically, small
files?) to a large flash journal device, and only write to the disk device
periodically (once the journal gets near full, or when the HDD is spun up
from sleep).
Cheers, Andreas
On Mon, Apr 17, 2017 at 09:18:11AM -0500, Eric Sandeen wrote:
> On 4/10/17 10:06 PM, Andreas Dilger wrote:
> > Hi Ted,
> > now that FAST'17 is behind us, is there any plan to land the ext4-lazy code
> > (SMR optimizations) to the upstream kernel? This looks like it improves
> > some workloads even without SMR disks, and doesn't have any noticeable
> > overhead for other workloads.
There is a plan to do this, but I've been crazy busy lately. A
colleague of mine, Tashin Erdogan, has been taking a look at it. It
looks like the fault may be mine, in that Abutalib's original patch
completely disabled the normal journalling paths, and for upstream
adoption we need to keep the original paths working until we're really
sure the new mode is always a win. It looks like I might not have
done a complete job suppressing the original checkpointing code,
resulting in some journal transactions getting trimmed when they
shouldn't have been. But we'll see.
> > to optionally checkpoint the metadata to the filesystem in the background,
> > when the filesystem is otherwise idle, so that in case of journal loss for
> > some reason the whole filesystem is not lost?
So long as this isn't an SMR disk, some kind of background trickle to
the final location is indeed something we can do. It's probably
better to focus on stabilizing the existing feature first, and then get
the cleaner to be smarter about its heuristics, though.
Checkpointing metadata to the file system only when the file system is
idle, or only when a laptop is not running on battery power, are both
examples of an advanced cleaner policy, and there are probably simpler
heuristics that we might want to do first.
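As a minimal sketch of what such a cleaner policy could look like, the predicate below combines the conditions mentioned in this thread: not an SMR disk, filesystem idle, not on battery, plus the "journal near full" trigger from earlier in the discussion. All names (`CleanerState`, `should_checkpoint`) and the 0.9 threshold are hypothetical; this is not ext4 or jbd2 code, only an illustration of the decision logic.

```python
from dataclasses import dataclass


@dataclass
class CleanerState:
    is_smr: bool          # device is a shingled (SMR) disk
    fs_idle: bool         # no foreground filesystem activity
    on_battery: bool      # system is running on battery power
    journal_used: float   # fraction of the journal in use, 0.0 to 1.0


def should_checkpoint(state, near_full=0.9):
    """Decide whether to write journalled metadata to its final location."""
    # A nearly full journal has to be cleaned regardless of other conditions.
    if state.journal_used >= near_full:
        return True
    # Background trickle is only considered for non-SMR devices.
    if state.is_smr:
        return False
    # Opportunistic case: idle filesystem and not draining a battery.
    return state.fs_idle and not state.on_battery
```

For example, `should_checkpoint(CleanerState(False, True, False, 0.2))` allows a background checkpoint, while the same state with `on_battery=True` defers it until the journal fills up.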
> IIRC even the new larger default journal size was a big win by itself, yes?
It's a big win for workloads that have a sufficiently heavy metadata
workload that the journal size was forcing blocking, synchronous
checkpoint operations. For many customer workloads it won't make any
difference at all, of course.
- Ted