Subject: Re: jbd2 bh_shadow issue
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org, fisherthepooh@protonmail.com,
        tytso@mit.edu
References: <871f4cbf-f7bc-35d8-bd42-32721b44c458@linux.alibaba.com>
 <20181129180439.GA11089@quack2.suse.cz>
 <ff170504-2c35-a8c6-69fc-a930e4bdec96@linux.alibaba.com>
 <20181211101953.GB19614@quack2.suse.cz>
From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Message-ID: <584323f6-484b-95fc-9a53-9c3c6fe21b3e@linux.alibaba.com>
Date: Wed, 12 Dec 2018 22:24:42 +0800
MIME-Version: 1.0
In-Reply-To: <20181211101953.GB19614@quack2.suse.cz>
Content-Type: text/plain; charset=gbk; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-ext4-owner@vger.kernel.org

hi,

> Hi,
> 
> On Tue 04-12-18 18:49:49, Xiaoguang Wang wrote:
>>> On Tue 27-11-18 19:48:28, Xiaoguang Wang wrote:
>>>> I also meet this issue in our server, sometimes we got big latency for
>>>> buffered writes. Using fio, I have done some tests to reproduce this issue,
>>>> Please see test results in this mail:
>>>> https://www.spinics.net/lists/linux-ext4/msg62885.html
>>>>
>>>> I agree that doing buffer copy-out for every journaled buffer wastes cpu
>>>> time, so here I wonder whether we can only do buffer copy-out when buffer
>>>> is BH_SHADOW state. The core idea is simple:
>>>>       In do_get_write_access(), if buffer has already been in BH_SHADOW state,
>>>>       we allocate a new page, make buffer's b_page point to this new page, and
>>>>       following journal dirty work can be done in this patch. When transaction
>>>>       commits, in jbd2_journal_write_metadata_buffer(), we copy new page's conten
>>>>       to primary page, make buffer point to primary page and free the temporary
>>>>       page.
>>>
>>> This isn't going to work. That are places in the kernel that assume that
>>> the page bh points to is the page stored in the page cache. Also you need
>>> to make sure that functions like sb_getblk() are going to return your new
>>> bh etc. So it is not really simple. On the other hand if we speak only
>>> about metadata buffers, the potential for breakage is certainly much
>>> smaller (but you'd have to then make sure data=journal case does not use
>>> your optimization) and there aren't that many ways how bh can be accessed
>>> so maybe you could make something like this working. But it certainly isn't
>>> going to work in the form you've presented here...
>> I had thought some of your mentioned cases and only just considered meta
>> buffer before, thanks you for pointing more cases. Do you have any suggestions
>> for this bh_shadow issue, sometimes it really caused app's long tail latency.
> 
> I was thinking for some time about this but I don't have a good solution to
> this problem. Additionally I have realized that you will face the same
> latency hit not only when the block is written out to the journal but also
> when it is being checkpointed. 
To be honest, we often see long tail latency caused by bh_shadow issue, but
don't see any long tail latency caused by checkpointed buffer...

> So the solution could not be just some
> special journalling hack but would need to provide a general way how to
> modify metadata buffer heads that are being written out.
Agree.

> 
> One solution that looks at least remotely sane to me is that you could do
> an equivalent of page migration (it is somewhat similar to what you've
> suggested) - allocate new page, copy there data from the old page, replace
> original page in the address space with this page, move buffer heads over
> from the old page to the new page. You can check what buffer_migrate_page()
> is doing. And then you can let processes modify this new page and when the
> IO completes the old page would get released.  But you'd have to be extra
> careful with the locking and also make sure buffer heads you migrate don't
> have extra references so that you do not move buffer heads under someone's
> hands (as that could cause data corruption)...
Thanks Jan, I'll have a try according to your suggestions.

> 
> 								Honza
>