Return-Path: Received: from out30-130.freemail.mail.aliyun.com ([115.124.30.130]:52727 "EHLO out30-130.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726440AbeLLO0S (ORCPT ); Wed, 12 Dec 2018 09:26:18 -0500 Subject: Re: jbd2 bh_shadow issue To: Jan Kara Cc: linux-ext4@vger.kernel.org, fisherthepooh@protonmail.com, tytso@mit.edu References: <871f4cbf-f7bc-35d8-bd42-32721b44c458@linux.alibaba.com> <20181129180439.GA11089@quack2.suse.cz> <20181211101953.GB19614@quack2.suse.cz> From: Xiaoguang Wang Message-ID: <584323f6-484b-95fc-9a53-9c3c6fe21b3e@linux.alibaba.com> Date: Wed, 12 Dec 2018 22:24:42 +0800 MIME-Version: 1.0 In-Reply-To: <20181211101953.GB19614@quack2.suse.cz> Content-Type: text/plain; charset=gbk; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-ext4-owner@vger.kernel.org List-ID: hi, > Hi, > > On Tue 04-12-18 18:49:49, Xiaoguang Wang wrote: >>> On Tue 27-11-18 19:48:28, Xiaoguang Wang wrote: >>>> I also meet this issue in our server, sometimes we got big latency for >>>> buffered writes. Using fio, I have done some tests to reproduce this issue, >>>> Please see test results in this mail: >>>> https://www.spinics.net/lists/linux-ext4/msg62885.html >>>> >>>> I agree that doing buffer copy-out for every journaled buffer wastes cpu >>>> time, so here I wonder whether we can only do buffer copy-out when buffer >>>> is BH_SHADOW state. The core idea is simple: >>>> In do_get_write_access(), if buffer has already been in BH_SHADOW state, >>>> we allocate a new page, make buffer's b_page point to this new page, and >>>> following journal dirty work can be done in this patch. When transaction >>>> commits, in jbd2_journal_write_metadata_buffer(), we copy new page's conten >>>> to primary page, make buffer point to primary page and free the temporary >>>> page. >>> >>> This isn't going to work. That are places in the kernel that assume that >>> the page bh points to is the page stored in the page cache. Also you need >>> to make sure that functions like sb_getblk() are going to return your new >>> bh etc. So it is not really simple. On the other hand if we speak only >>> about metadata buffers, the potential for breakage is certainly much >>> smaller (but you'd have to then make sure data=journal case does not use >>> your optimization) and there aren't that many ways how bh can be accessed >>> so maybe you could make something like this working. But it certainly isn't >>> going to work in the form you've presented here... >> I had thought some of your mentioned cases and only just considered meta >> buffer before, thanks you for pointing more cases. Do you have any suggestions >> for this bh_shadow issue, sometimes it really caused app's long tail latency. > > I was thinking for some time about this but I don't have a good solution to > this problem. Additionally I have realized that you will face the same > latency hit not only when the block is written out to the journal but also > when it is being checkpointed. To be honest, we often see long tail latency caused by bh_shadow issue, but don't see any long tail latency caused by checkpointed buffer... > So the solution could not be just some > special journalling hack but would need to provide a general way how to > modify metadata buffer heads that are being written out. Agree. > > One solution that looks at least remotely sane to me is that you could do > an equivalent of page migration (it is somewhat similar to what you've > suggested) - allocate new page, copy there data from the old page, replace > original page in the address space with this page, move buffer heads over > from the old page to the new page. You can check what buffer_migrate_page() > is doing. And then you can let processes modify this new page and when the > IO completes the old page would get released. But you'd have to be extra > careful with the locking and also make sure buffer heads you migrate don't > have extra references so that you do not move buffer heads under someone's > hands (as that could cause data corruption)... Thanks Jan, I'll have a try according to your suggestions. > > Honza >