Subject: Re: [PATCH] bcache: add REQ_FUA to avoid data lost in writeback mode
To: kungf
Cc: kent.overstreet@gmail.com, linux-bcache@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20191202102409.3980-1-wings.wyang@gmail.com> <785fe04f-f841-3083-66db-53fab7bc0577@suse.de>
From: Coly Li
Organization: SUSE Labs
Message-ID: <74b8fcf6-b5e0-7cae-d860-0ed894bfe938@suse.de>
Date: Tue, 3 Dec 2019 22:09:32 +0800
X-Mailing-List: linux-kernel@vger.kernel.org

On 2019/12/3 3:16 PM, kungf wrote:
>
>
> On Mon, 2 Dec 2019 at 19:09, Coly Li wrote:
>>
>> On 2019/12/2 6:24 PM, kungf wrote:
>> > Data may be lost in the following writeback-mode scenario:
>> > 1. client writes data1 to bcache
>> > 2. client calls fdatasync
>> > 3. bcache flushes the cache set and the backing device;
>> >    if data1 has not been written back to the backing device at this
>> >    point, it is only guaranteed to be safe in the cache.
>> > 4. the cache then writes data1 back to the backing device with only
>> >    REQ_OP_WRITE
>> > So data1 is not guaranteed to be in non-volatile storage; it may be
>> > lost on a power interruption.
>> >
>>
>> Hi,
>>
>> Do you encounter such a problem in a real workload? With the bcache
>> journal, I don't see the possibility of data loss with your description.
>>
>> Correct me if I am wrong.
>>
>> Coly Li
>>
> Hi Coly,
>
> Sorry to confuse you. As far as I know, the write_dirty function writes
> dirty data to the backing device without FUA, and write_dirty_finish
> marks the dirty key clean, which means the data indexed by the key will
> not be written back again. Am I wrong?

Yes, you are right. This is the behavior by design. We don't guarantee
that the data will always be on the platter; this is what most storage
systems do.

> I only find that the backing device is flushed when bcache gets a
> PREFLUSH bio. Is there any other place where it is flushed, in the
> journal for example?
>

A storage system flushes its buffer when the upper layer requires it.
That means if the application wants its written data flushed to the
platter, it should explicitly issue a flush request.

What you observed and tested is all as designed, IMHO. The I/O stack
does not guarantee that any data is persistent on the storage media
unless an explicit flush request is received from the upper layer and
completed back to the upper layer.

Coly Li

> I made a test that writes to bcache with dd and then detaches it, while
> running blktrace on the cache and backing device at the same time.
> 1. disable writeback
> # echo 0 > /sys/block/bcache0/bcache/writeback_running
> 2. write data with a fdatasync
> # dd if=/dev/zero of=/dev/bcache0 bs=16k count=1 oflag=direct
> 3. detach and trigger writeback
> # echo b1f40ca5-37a3-4852-9abf-6abed96d71db > /sys/block/bcache0/bcache/detach
>
> The text below is the blkparse result.
> From the cache blktrace below, we can see 16k of data written to the
> cache set, followed by a flush with op FWFSM
> (PREFLUSH|WRITE|FUA|SYNC|META):
> ```
>   8,160 33        1     0.000000000 222844  A   W 630609920 + 32 <- (8,167) 1464320
>   8,167 33        2     0.000000478 222844  Q   W 630609920 + 32 [dd]
>   8,167 33        3     0.000006167 222844  G   W 630609920 + 32 [dd]
>   8,167 33        5     0.000011385 222844  I   W 630609920 + 32 [dd]
>   8,167 33        6     0.000023890    948  D   W 630609920 + 32 [kworker/33:1H]
>   8,167 33        7     0.000111203      0  C   W 630609920 + 32 [0]
>   8,160 34        1     0.000167029 215616  A FWFSM 629153808 + 8 <- (8,167) 8208
>   8,167 34        2     0.000167490 215616  Q FWFSM 629153808 + 8 [kworker/34:2]
>   8,167 34        3     0.000169061 215616  G FWFSM 629153808 + 8 [kworker/34:2]
>   8,167 34        4     0.000301308    949  D WFSM 629153808 + 8 [kworker/34:1H]
>   8,167 34        5     0.000348832      0  C WFSM 629153808 + 8 [0]
>   8,167 34        6     0.000349612      0  C WFSM 629153808 [0]
> ```
>
> From the backing blktrace below, the backing device first gets a flush
> op FWS (PREFLUSH|WRITE|SYNC) because we stopped writeback, and then gets
> a W op after the detach, when the 16k of data is written back to the
> backing device. After this, the backing device never gets a flush op,
> which means that the 16k of data we wrote is not safe on the backing
> device, even though we wrote it with dd and fdatasync.
> ```
>   8,144 33        1     0.000000000 222844  Q WSM 8 + 8 [dd]
>   8,144 33        2     0.000016609 222844  G WSM 8 + 8 [dd]
>   8,144 33        5     0.000020710 222844  I WSM 8 + 8 [dd]
>   8,144 33        6     0.000031967    948  D WSM 8 + 8 [kworker/33:1H]
>   8,144 33        7     0.000152945  88631  C  WS 16 + 32 [0]
>   8,144 34        1     0.000186127 215616  Q FWS [kworker/34:2]
>   8,144 34        2     0.000187006 215616  G FWS [kworker/34:2]
>   8,144 33        8     0.000326761      0  C WSM 8 + 8 [0]
>   8,144 34        3     0.020195027      0  C  WS 16 [0]
>   8,144 34        4     0.020195904      0  C FWS 16 [0]
>   8,144 23        1    19.415130395 215884  Q   W 16 + 32 [kworker/23:2]
>   8,144 23        2    19.415132072 215884  G   W 16 + 32 [kworker/23:2]
>   8,144 23        3    19.415133134 215884  I   W 16 + 32 [kworker/23:2]
>   8,144 23        4    19.415137776   1215  D   W 16 + 32 [kworker/23:1H]
>   8,144 23        5    19.416607260      0  C   W 16 + 32 [0]
>   8,144 24        1    19.416640754 222593  Q WSM 8 + 8 [bcache_writebac]
>   8,144 24        2    19.416642698 222593  G WSM 8 + 8 [bcache_writebac]
>   8,144 24        3    19.416643505 222593  I WSM 8 + 8 [bcache_writebac]
>   8,144 24        4    19.416650589   1107  D WSM 8 + 8 [kworker/24:1H]
>   8,144 24        5    19.416865258      0  C WSM 8 + 8 [0]
>   8,144 24        6    19.416871350 221889  Q WSM 8 + 8 [kworker/24:1]
>   8,144 24        7    19.416872201 221889  G WSM 8 + 8 [kworker/24:1]
>   8,144 24        8    19.416872542 221889  I WSM 8 + 8 [kworker/24:1]
>   8,144 24        9    19.416875458   1107  D WSM 8 + 8 [kworker/24:1H]
>   8,144 24       10    19.417076935      0  C WSM 8 + 8 [0]
> ```
>
>
>
> On Mon, 2 Dec 2019 at 19:09, Coly Li wrote:
>
> On 2019/12/2 6:24 PM, kungf wrote:
> > Data may be lost in the following writeback-mode scenario:
> > 1. client writes data1 to bcache
> > 2. client calls fdatasync
> > 3. bcache flushes the cache set and the backing device;
> >    if data1 has not been written back to the backing device at this
> >    point, it is only guaranteed to be safe in the cache.
> > 4. the cache then writes data1 back to the backing device with only
> >    REQ_OP_WRITE
> > So data1 is not guaranteed to be in non-volatile storage; it may be
> > lost on a power interruption.
> >
>
> Hi,
>
> Do you encounter such a problem in a real workload? With the bcache
> journal, I don't see the possibility of data loss with your description.
>
> Correct me if I am wrong.
>
> Coly Li
>
> > Signed-off-by: kungf
> > ---
> >  drivers/md/bcache/writeback.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > index 4a40f9eadeaf..e5cecb60569e 100644
> > --- a/drivers/md/bcache/writeback.c
> > +++ b/drivers/md/bcache/writeback.c
> > @@ -357,7 +357,7 @@ static void write_dirty(struct closure *cl)
> >        */
> >       if (KEY_DIRTY(&w->key)) {
> >               dirty_init(w);
> > -             bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
> > +             bio_set_op_attrs(&io->bio, REQ_OP_WRITE | REQ_FUA, 0);
> >               io->bio.bi_iter.bi_sector = KEY_START(&w->key);
> >               bio_set_dev(&io->bio, io->dc->bdev);
> >               io->bio.bi_end_io       = dirty_endio;
> >
>
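
For readers following the thread, here is a minimal userspace sketch in C of what "an explicit flush request from the upper layer" can look like on the application side. The helper name write_durably and the file path in the usage note are hypothetical and are not part of bcache or of the patch above. The sketch only shows the write-then-fdatasync pattern; it makes no claim about what bcache does during later writeback.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from buf to path and return only
 * after fdatasync() has completed.
 * Returns 0 on success, -1 on any error.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	size_t done = 0;
	int fd;

	assert(path != NULL && buf != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* write() may be short, so loop until all bytes are submitted. */
	while (done < len) {
		ssize_t n = write(fd, p + done, len - done);

		if (n < 0) {
			close(fd);
			return -1;
		}
		done += (size_t)n;
	}

	/*
	 * The explicit flush request from the application. Until this
	 * call returns successfully, the I/O stack makes no promise that
	 * the data has reached stable storage.
	 */
	if (fdatasync(fd) < 0) {
		close(fd);
		return -1;
	}

	return close(fd);
}
```

A caller would use it as write_durably("/tmp/example.bin", "data1", 5) and treat a return value of 0 as "the flush request was issued and completed". What the layers below do with that request is exactly the question discussed in this thread.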