Message-ID: <2082148f-890f-e5f4-c304-b99212aa377e@suse.de>
Date: Wed, 27 Apr 2022 12:29:15 +0200
From: Hannes Reinecke
To: Nitesh Shetty
Cc: chaitanyak@nvidia.com, linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
    dm-devel@redhat.com, linux-nvme@lists.infradead.org, linux-fsdevel@vger.kernel.org,
    axboe@kernel.dk, msnitzer@redhat.com, bvanassche@acm.org, martin.petersen@oracle.com,
    kbusch@kernel.org, hch@lst.de, Frederick.Knight@netapp.com, osandov@fb.com,
    lsf-pc@lists.linux-foundation.org, djwong@kernel.org, josef@toxicpanda.com, clm@fb.com,
    dsterba@suse.com, tytso@mit.edu, jack@suse.com, nitheshshetty@gmail.com,
    gost.dev@samsung.com, Arnav Dawn, Alasdair Kergon, Mike Snitzer, Sagi Grimberg,
    James Smart, Chaitanya Kulkarni, Damien Le Moal, Naohiro Aota, Johannes Thumshirn,
    Alexander Viro, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 02/10] block: Add copy offload support infrastructure
In-Reply-To: <20220426101241.30100-3-nj.shetty@samsung.com>
References: <20220426101241.30100-1-nj.shetty@samsung.com>
 <20220426101241.30100-3-nj.shetty@samsung.com>

On 4/26/22 12:12, Nitesh Shetty wrote:
> Introduce blkdev_issue_copy which supports source and destination bdevs,
> and an array of (source, destination and copy length) tuples.
> Introduce REQ_COPY copy offload operation flag. Create a read-write
> bio pair with a token as payload and submitted to the device in order.
> Read request populates token with source specific information which
> is then passed with write request.
> This design is courtesy Mikulas Patocka's token based copy
>
> Larger copy will be divided, based on max_copy_sectors,
> max_copy_range_sector limits.
>
> Signed-off-by: Nitesh Shetty
> Signed-off-by: Arnav Dawn
> ---
>  block/blk-lib.c           | 232 ++++++++++++++++++++++++++++++++++++++
>  block/blk.h               |   2 +
>  include/linux/blk_types.h |  21 ++++
>  include/linux/blkdev.h    |   2 +
>  include/uapi/linux/fs.h   |  14 +++
>  5 files changed, 271 insertions(+)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 09b7e1200c0f..ba9da2d2f429 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -117,6 +117,238 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  }
>  EXPORT_SYMBOL(blkdev_issue_discard);
>
> +/*
> + * Wait on and process all in-flight BIOs. This must only be called once
> + * all bios have been issued so that the refcount can only decrease.
> + * This just waits for all bios to make it through bio_copy_end_io. IO
> + * errors are propagated through cio->io_error.
> + */
> +static int cio_await_completion(struct cio *cio)
> +{
> +	int ret = 0;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cio->lock, flags);
> +	if (cio->refcount) {
> +		cio->waiter = current;
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		spin_unlock_irqrestore(&cio->lock, flags);
> +		blk_io_schedule();
> +		/* wake up sets us TASK_RUNNING */
> +		spin_lock_irqsave(&cio->lock, flags);
> +		cio->waiter = NULL;
> +		ret = cio->io_err;
> +	}
> +	spin_unlock_irqrestore(&cio->lock, flags);
> +	kvfree(cio);
> +
> +	return ret;
> +}
> +
> +static void bio_copy_end_io(struct bio *bio)
> +{
> +	struct copy_ctx *ctx = bio->bi_private;
> +	struct cio *cio = ctx->cio;
> +	sector_t clen;
> +	int ri = ctx->range_idx;
> +	unsigned long flags;
> +	bool wake = false;
> +
> +	if (bio->bi_status) {
> +		cio->io_err = bio->bi_status;
> +		clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - ctx->start_sec;
> +		cio->rlist[ri].comp_len = min_t(sector_t, clen, cio->rlist[ri].comp_len);
> +	}
> +	__free_page(bio->bi_io_vec[0].bv_page);
> +	kfree(ctx);
> +	bio_put(bio);
> +
> +	spin_lock_irqsave(&cio->lock, flags);
> +	if (((--cio->refcount) <= 0) && cio->waiter)
> +		wake = true;
> +	spin_unlock_irqrestore(&cio->lock, flags);
> +	if (wake)
> +		wake_up_process(cio->waiter);
> +}
> +
> +/*
> + * blk_copy_offload	- Use device's native copy offload feature
> + * Go through user provide payload, prepare new payload based on device's copy offload limits.
> + */
> +int blk_copy_offload(struct block_device *src_bdev, int nr_srcs,
> +		struct range_entry *rlist, struct block_device *dst_bdev, gfp_t gfp_mask)
> +{
> +	struct request_queue *sq = bdev_get_queue(src_bdev);
> +	struct request_queue *dq = bdev_get_queue(dst_bdev);
> +	struct bio *read_bio, *write_bio;
> +	struct copy_ctx *ctx;
> +	struct cio *cio;
> +	struct page *token;
> +	sector_t src_blk, copy_len, dst_blk;
> +	sector_t remaining, max_copy_len = LONG_MAX;
> +	unsigned long flags;
> +	int ri = 0, ret = 0;
> +
> +	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
> +	if (!cio)
> +		return -ENOMEM;
> +	cio->rlist = rlist;
> +	spin_lock_init(&cio->lock);
> +
> +	max_copy_len = min_t(sector_t, sq->limits.max_copy_sectors, dq->limits.max_copy_sectors);
> +	max_copy_len = min3(max_copy_len, (sector_t)sq->limits.max_copy_range_sectors,
> +			(sector_t)dq->limits.max_copy_range_sectors) << SECTOR_SHIFT;
> +
> +	for (ri = 0; ri < nr_srcs; ri++) {
> +		cio->rlist[ri].comp_len = rlist[ri].len;
> +		src_blk = rlist[ri].src;
> +		dst_blk = rlist[ri].dst;
> +		for (remaining = rlist[ri].len; remaining > 0; remaining -= copy_len) {
> +			copy_len = min(remaining, max_copy_len);
> +
> +			token = alloc_page(gfp_mask);
> +			if (unlikely(!token)) {
> +				ret = -ENOMEM;
> +				goto err_token;
> +			}
> +
> +			ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
> +			if (!ctx) {
> +				ret = -ENOMEM;
> +				goto err_ctx;
> +			}
> +			ctx->cio = cio;
> +			ctx->range_idx = ri;
> +			ctx->start_sec = dst_blk;
> +
> +			read_bio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY | REQ_NOMERGE,
> +					gfp_mask);
> +			if (!read_bio) {
> +				ret = -ENOMEM;
> +				goto err_read_bio;
> +			}
> +			read_bio->bi_iter.bi_sector = src_blk >> SECTOR_SHIFT;
> +			__bio_add_page(read_bio, token, PAGE_SIZE, 0);
> +			/*__bio_add_page increases bi_size by len, so overwrite it with copy len*/
> +			read_bio->bi_iter.bi_size = copy_len;
> +			ret = submit_bio_wait(read_bio);
> +			bio_put(read_bio);
> +			if (ret)
> +				goto err_read_bio;
> +
> +			write_bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE | REQ_COPY | REQ_NOMERGE,
> +					gfp_mask);
> +			if (!write_bio) {
> +				ret = -ENOMEM;
> +				goto err_read_bio;
> +			}
> +			write_bio->bi_iter.bi_sector = dst_blk >> SECTOR_SHIFT;
> +			__bio_add_page(write_bio, token, PAGE_SIZE, 0);
> +			/*__bio_add_page increases bi_size by len, so overwrite it with copy len*/
> +			write_bio->bi_iter.bi_size = copy_len;
> +			write_bio->bi_end_io = bio_copy_end_io;
> +			write_bio->bi_private = ctx;
> +
> +			spin_lock_irqsave(&cio->lock, flags);
> +			++cio->refcount;
> +			spin_unlock_irqrestore(&cio->lock, flags);
> +
> +			submit_bio(write_bio);
> +			src_blk += copy_len;
> +			dst_blk += copy_len;
> +		}
> +	}
> +

Hmm. I'm not sure if I like the copy loop.
What I definitely would do is to allocate the write bio before reading the
data; after all, if we can't allocate the write bio the read is pretty much
pointless.

But the real issue I have with this is that it's doing synchronous reads,
thereby limiting performance. Can't you submit the write bio from the
end_io function of the read bio? That would disentangle things, and we
should get better performance.

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
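
A rough, untested sketch of the asynchronous structure suggested above (not
code from the posted series): it reuses struct copy_ctx and struct cio from
the quoted patch and assumes a new, hypothetical write_bio member in
struct copy_ctx so the read completion can chain the write.

/*
 * Sketch only: allocate both bios up front, then let the read bio's
 * completion submit the write bio instead of using submit_bio_wait().
 * Assumes "struct bio *write_bio;" has been added to struct copy_ctx.
 */
static void bio_copy_read_end_io(struct bio *read_bio)
{
	struct copy_ctx *ctx = read_bio->bi_private;

	if (read_bio->bi_status) {
		/*
		 * The read failed, so the write bio will never be submitted.
		 * Complete it by hand so bio_copy_end_io() still does the
		 * refcount and cleanup bookkeeping for this range.
		 */
		ctx->write_bio->bi_status = read_bio->bi_status;
		bio_endio(ctx->write_bio);
	} else {
		/* Token page now holds the source info; chain the write. */
		submit_bio(ctx->write_bio);
	}
	bio_put(read_bio);
}

	/* ...and in the copy loop, roughly: */
	write_bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE | REQ_COPY | REQ_NOMERGE,
			gfp_mask);
	if (!write_bio) {
		ret = -ENOMEM;
		goto err_write_bio;	/* hypothetical error label */
	}
	/* set up write_bio as before (sector, token page, bi_end_io, ...) */
	ctx->write_bio = write_bio;

	read_bio->bi_end_io = bio_copy_read_end_io;
	read_bio->bi_private = ctx;
	/* take the cio refcount before issuing, as the write path does today */
	submit_bio(read_bio);

With that structure the final cio_await_completion() would still provide the
synchronous wait for the whole set of ranges, but the per-chunk reads and
writes would be in flight concurrently.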