Received: by 2002:a25:1506:0:0:0:0:0 with SMTP id 6csp1178133ybv; Thu, 20 Feb 2020 14:54:43 -0800 (PST) X-Google-Smtp-Source: APXvYqybtFkEE+MVm5f/zD1f0Mx0zcz5Vw3gMA/Kj1r+qwnwDLwo2GyWuQPTE33cqD5cGSF5rKLJ X-Received: by 2002:a9d:1ca3:: with SMTP id l35mr25007540ota.271.1582239283146; Thu, 20 Feb 2020 14:54:43 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1582239283; cv=none; d=google.com; s=arc-20160816; b=HGSWsbMTd8umFZG/97IiImjz0MDU8t6CECQ5T2kC4OsGYZPIL8/p7PHR2Po5oZrK4l dGKOC9AJvsh5wYpc0lDseYn7P0uMNHbRZC3c17j4iO0q+KaOjDUXqjA4k+RC8keVzyZg axnmDJb57Xk+VZHhfNrPHOXFEUvIlRZNnKFbAwtZfe4efO9MKk9pkDlnbkqAGapZzMmR QNzbISqlHenpi4WFLDKdY07OEWn8RJKNiCi6P18dCFXeuuevFrC6b1KZrRB2ENIwXWZ4 2q4lQQx+Nx5U8N9i4L9/AXx0FrQfAhBChZ8Gr44WOxaaxQdxVVWrMa77C72CawrlgBgl yl7w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:date:cc:to:from:subject :message-id:dkim-signature; bh=iGHaCmxjzflNYFUBQGwlm53aD7hIa4iuT3KqbWHBGDw=; b=Zwr/sO7ez9wfgIFNt/3GViI0fD+zBfe6T1KY0tSTzP+gZtUtKsHngtCussDuMJ1wNi oTwLaQ8aCRkLYoss6kZECVSJn6CX6nZnBw4R3o5i8ITnehev8BCImePXYiHt7bRP9bQu xyadel6eqDVyzeW3/rGyGjXx6/i5zHs/mZjsC50td6yQ6Dmpefb9J63qSfTFBsRcJd6U qH+rd4Z/06hnl8MTWS2EU1hPNdKoxhFT5Hpv0pPIOAUDho72p7aD5UTyaJaPugiVzS1F tDjyDVLG+srfw9wYozflpK6fz9BDCW2Ojp/fuX12ryPvruOZQTE+qVvdaDBLgWsegsPq VM5w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=uncKOvqN; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id w142si353382oia.132.2020.02.20.14.54.29; Thu, 20 Feb 2020 14:54:43 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=uncKOvqN; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729258AbgBTWyR (ORCPT + 99 others); Thu, 20 Feb 2020 17:54:17 -0500 Received: from mail.kernel.org ([198.145.29.99]:48218 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729130AbgBTWyQ (ORCPT ); Thu, 20 Feb 2020 17:54:16 -0500 Received: from tleilax.poochiereds.net (68-20-15-154.lightspeed.rlghnc.sbcglobal.net [68.20.15.154]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id B657C206F4; Thu, 20 Feb 2020 22:54:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1582239255; bh=R6eSMc5PEQgxhx/tlA0MnSDDRu61U4/oSExAr1zJ7GU=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=uncKOvqN8bwYkq4Yipcl0PjQzzuTR5u4GskDFghYkREICuwUa8dJjumXJxMrhUMz8 A8P/1jSfLn0Ytvmuold2xRLi9vfrEpryXYaTxZ87RDyj6hMUlGZC6kgV6NqZAyysAl 7KrnEl5DD9/uyGnD8+Bv+F9wimMuBjmdSfwdCMo0= Message-ID: <98baa0940d2ffea3017a08ce31358f3326e38a1e.camel@kernel.org> Subject: Re: [PATCH] ceph: re-org copy_file_range and fix error handling paths From: Jeff Layton To: Luis Henriques Cc: Sage Weil , Ilya Dryomov , "Yan, Zheng" , ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org Date: Thu, 20 Feb 2020 17:54:03 -0500 In-Reply-To: <20200220223630.GA9748@suse.com> References: <20200217123649.12316-1-lhenriques@suse.com> <9282539b8efcaccdd8d19197edb289828803d39c.camel@kernel.org> <20200220223630.GA9748@suse.com> Content-Type: text/plain; charset="UTF-8" User-Agent: Evolution 3.34.4 (3.34.4-1.fc31) MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2020-02-20 at 22:36 +0000, Luis Henriques wrote: > On Thu, Feb 20, 2020 at 03:41:14PM -0500, Jeff Layton wrote: > > On Mon, 2020-02-17 at 12:36 +0000, Luis Henriques wrote: > > > This patch re-organizes copy_file_range, trying to fix a few issues in > > > error handling. Here's the summary: > > > > > > - Abort copy if initial do_splice_direct() returns fewer bytes than > > > requested. > > > > > > - Move the 'size' initialization (with i_size_read()) further down in the > > > code, after the initial call to do_splice_direct(). This avoids issues > > > with a possibly stale value if a manual copy is done. > > > > > > - Move the object copy loop into a separate function. This makes it > > > easier to handle errors (e.g, dirtying caps and updating the MDS > > > metadata if only some objects have been copied before an error has > > > occurred). > > > > > > - Added calls to ceph_oloc_destroy() to avoid leaking memory with src_oloc > > > and dst_oloc > > > > > > - After the object copy loop, the new file size to be reported to the MDS > > > (if there's file size change) is now the actual file size, and not the > > > size after an eventual extra manual copy. > > > > > > - Added a few dout() to show the number of bytes copied in the two manual > > > copies and in the object copy loop. > > > > > > Signed-off-by: Luis Henriques > > > --- > > > Hi! > > > > > > Initially I was going to have this patch split in a series, but then I > > > decided not to do that as this big patch allows (IMO) to better see the > > > whole picture. But please let me know if you think otherwise and I can > > > split it in a few smaller patches. > > > > > > I tried to cover all the issues that have been pointed out by Ilya, but I > > > may have missed something or, more likely, introduced new bugs ;-) > > > > > > Cheers, > > > -- > > > Luis > > > > > > > Sorry for the delay in review! > > No worries, I appreciate the feedback but I obviously don't expect it to > happen immediately :-) > > > > fs/ceph/file.c | 169 ++++++++++++++++++++++++++++--------------------- > > > 1 file changed, 96 insertions(+), 73 deletions(-) > > > > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c > > > index c3b8e8e0bf17..4d90a275f9a5 100644 > > > --- a/fs/ceph/file.c > > > +++ b/fs/ceph/file.c > > > @@ -1931,6 +1931,71 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode, > > > return 0; > > > } > > > > > > +static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off, > > > + struct ceph_inode_info *dst_ci, u64 *dst_off, > > > + struct ceph_fs_client *fsc, > > > + size_t len, unsigned int flags) > > > +{ > > > + struct ceph_object_locator src_oloc, dst_oloc; > > > + struct ceph_object_id src_oid, dst_oid; > > > + size_t bytes = 0; > > > + u64 src_objnum, src_objoff, dst_objnum, dst_objoff; > > > + u32 src_objlen, dst_objlen; > > > + u32 object_size = src_ci->i_layout.object_size; > > > + int ret; > > > + > > > + src_oloc.pool = src_ci->i_layout.pool_id; > > > + src_oloc.pool_ns = ceph_try_get_string(src_ci->i_layout.pool_ns); > > > + dst_oloc.pool = dst_ci->i_layout.pool_id; > > > + dst_oloc.pool_ns = ceph_try_get_string(dst_ci->i_layout.pool_ns); > > > + > > > + while (len >= object_size) { > > > + ceph_calc_file_object_mapping(&src_ci->i_layout, *src_off, > > > + object_size, &src_objnum, > > > + &src_objoff, &src_objlen); > > > + ceph_calc_file_object_mapping(&dst_ci->i_layout, *dst_off, > > > + object_size, &dst_objnum, > > > + &dst_objoff, &dst_objlen); > > > + ceph_oid_init(&src_oid); > > > + ceph_oid_printf(&src_oid, "%llx.%08llx", > > > + src_ci->i_vino.ino, src_objnum); > > > + ceph_oid_init(&dst_oid); > > > + ceph_oid_printf(&dst_oid, "%llx.%08llx", > > > + dst_ci->i_vino.ino, dst_objnum); > > > + /* Do an object remote copy */ > > > + ret = ceph_osdc_copy_from(&fsc->client->osdc, > > > + src_ci->i_vino.snap, 0, > > > + &src_oid, &src_oloc, > > > + CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL | > > > + CEPH_OSD_OP_FLAG_FADVISE_NOCACHE, > > > + &dst_oid, &dst_oloc, > > > + CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL | > > > + CEPH_OSD_OP_FLAG_FADVISE_DONTNEED, > > > + dst_ci->i_truncate_seq, > > > + dst_ci->i_truncate_size, > > > + CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ); > > > + if (ret) { > > > + if (ret == -EOPNOTSUPP) { > > > + fsc->have_copy_from2 = false; > > > + pr_notice("OSDs don't support copy-from2; disabling copy offload\n"); > > > + } > > > + dout("ceph_osdc_copy_from returned %d\n", ret); > > > + if (!bytes) > > > + bytes = ret; > > > + goto out; > > > + } > > > + len -= object_size; > > > + bytes += object_size; > > > + *src_off += object_size; > > > + *dst_off += object_size; > > > + } > > > + > > > +out: > > > + ceph_oloc_destroy(&src_oloc); > > > + ceph_oloc_destroy(&dst_oloc); > > > + return bytes; > > > +} > > > + > > > static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, > > > struct file *dst_file, loff_t dst_off, > > > size_t len, unsigned int flags) > > > @@ -1941,14 +2006,11 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, > > > struct ceph_inode_info *dst_ci = ceph_inode(dst_inode); > > > struct ceph_cap_flush *prealloc_cf; > > > struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode); > > > - struct ceph_object_locator src_oloc, dst_oloc; > > > - struct ceph_object_id src_oid, dst_oid; > > > - loff_t endoff = 0, size; > > > - ssize_t ret = -EIO; > > > + loff_t size; > > > + ssize_t ret = -EIO, bytes; > > > u64 src_objnum, dst_objnum, src_objoff, dst_objoff; > > > - u32 src_objlen, dst_objlen, object_size; > > > + u32 src_objlen, dst_objlen; > > > int src_got = 0, dst_got = 0, err, dirty; > > > - bool do_final_copy = false; > > > > > > if (src_inode->i_sb != dst_inode->i_sb) { > > > struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode); > > > @@ -2026,22 +2088,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, > > > if (ret < 0) > > > goto out_caps; > > > > > > - size = i_size_read(dst_inode); > > > - endoff = dst_off + len; > > > - > > > /* Drop dst file cached pages */ > > > ret = invalidate_inode_pages2_range(dst_inode->i_mapping, > > > dst_off >> PAGE_SHIFT, > > > - endoff >> PAGE_SHIFT); > > > + (dst_off + len) >> PAGE_SHIFT); > > > if (ret < 0) { > > > dout("Failed to invalidate inode pages (%zd)\n", ret); > > > ret = 0; /* XXX */ > > > } > > > - src_oloc.pool = src_ci->i_layout.pool_id; > > > - src_oloc.pool_ns = ceph_try_get_string(src_ci->i_layout.pool_ns); > > > - dst_oloc.pool = dst_ci->i_layout.pool_id; > > > - dst_oloc.pool_ns = ceph_try_get_string(dst_ci->i_layout.pool_ns); > > > - > > > ceph_calc_file_object_mapping(&src_ci->i_layout, src_off, > > > src_ci->i_layout.object_size, > > > &src_objnum, &src_objoff, &src_objlen); > > > @@ -2060,6 +2114,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, > > > * starting at the src_off > > > */ > > > if (src_objoff) { > > > + dout("Initial partial copy of %u bytes\n", src_objlen); > > > + > > > /* > > > * we need to temporarily drop all caps as we'll be calling > > > * {read,write}_iter, which will get caps again. > > > @@ -2067,8 +2123,9 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, > > > put_rd_wr_caps(src_ci, src_got, dst_ci, dst_got); > > > ret = do_splice_direct(src_file, &src_off, dst_file, > > > &dst_off, src_objlen, flags); > > > - if (ret < 0) { > > > - dout("do_splice_direct returned %d\n", err); > > > + /* Abort on short copies or on error */ > > > + if (ret < src_objlen) { > > > + dout("Failed partial copy (%zd)\n", ret); > > > goto out; > > > } > > > len -= ret; > > > @@ -2081,62 +2138,29 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, > > > if (err < 0) > > > goto out_caps; > > > } > > > - object_size = src_ci->i_layout.object_size; > > > - while (len >= object_size) { > > > - ceph_calc_file_object_mapping(&src_ci->i_layout, src_off, > > > - object_size, &src_objnum, > > > - &src_objoff, &src_objlen); > > > - ceph_calc_file_object_mapping(&dst_ci->i_layout, dst_off, > > > - object_size, &dst_objnum, > > > - &dst_objoff, &dst_objlen); > > > - ceph_oid_init(&src_oid); > > > - ceph_oid_printf(&src_oid, "%llx.%08llx", > > > - src_ci->i_vino.ino, src_objnum); > > > - ceph_oid_init(&dst_oid); > > > - ceph_oid_printf(&dst_oid, "%llx.%08llx", > > > - dst_ci->i_vino.ino, dst_objnum); > > > - /* Do an object remote copy */ > > > - err = ceph_osdc_copy_from( > > > - &src_fsc->client->osdc, > > > - src_ci->i_vino.snap, 0, > > > - &src_oid, &src_oloc, > > > - CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL | > > > - CEPH_OSD_OP_FLAG_FADVISE_NOCACHE, > > > - &dst_oid, &dst_oloc, > > > - CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL | > > > - CEPH_OSD_OP_FLAG_FADVISE_DONTNEED, > > > - dst_ci->i_truncate_seq, dst_ci->i_truncate_size, > > > - CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ); > > > - if (err) { > > > - if (err == -EOPNOTSUPP) { > > > - src_fsc->have_copy_from2 = false; > > > - pr_notice("OSDs don't support copy-from2; disabling copy offload\n"); > > > - } > > > - dout("ceph_osdc_copy_from returned %d\n", err); > > > - if (!ret) > > > - ret = err; > > > - goto out_caps; > > > - } > > > - len -= object_size; > > > - src_off += object_size; > > > - dst_off += object_size; > > > - ret += object_size; > > > - } > > > > > > - if (len) > > > - /* We still need one final local copy */ > > > - do_final_copy = true; > > > + size = i_size_read(dst_inode); > > > + bytes = ceph_do_objects_copy(src_ci, &src_off, dst_ci, &dst_off, > > > + src_fsc, len, flags); > > > + if (bytes <= 0) { > > > + if (!ret) > > > + ret = bytes; > > > + goto out_caps; > > > + } > > > > Suppose we did the front part with do_splice_direct (ret > 0), but then > > ceph_do_objects_copy fails (bytes < 0). We "goto out_caps" but then... > > > > [...] > > > > > @@ -2152,15 +2176,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, > > > out_caps: > > > put_rd_wr_caps(src_ci, src_got, dst_ci, dst_got); > > > > > > - if (do_final_copy) { > > > - err = do_splice_direct(src_file, &src_off, dst_file, > > > - &dst_off, len, flags); > > > - if (err < 0) { > > > - dout("do_splice_direct returned %d\n", err); > > > - goto out; > > > - } > > > - len -= err; > > > - ret += err; > > > + if (len && (ret >= 0)) { > > > > ...len is still positive and we do_splice_direct again (probably at the > > wrong offset?), instead of just returning a short copy. I think we > > probably want to just stop any copying if it fails at any point along > > the way, right? > > Well... To be honest I deliberately wanted to try a do_splice_direct in > case the remote object copies fail (basically, reverting to something > similar to the generic_copy_file_range). Now, I've been staring at this > code for some time and I may be missing the obvious (again!). But: > > - the offsets should be OK because ceph_do_objects_copy() only updates > them after each successful object copy > > - len should also be consistent: > * If 'bytes' was <= 0, it should contain what was written by > do_splice_direct > * if 'bytes' was > 0, but possibly < expected (e.g. an OSD returned an > error after a few object copies), len should still be consistent > > Anyway, I'm not too attached to this approach, and if you rather have this > function to return in the scenario you've described (and eventually have > the user to retry the operation) I'm OK with that. > Yes, sorry. I wasn't disputing whether this would fall over, but whether it was intended behavior. I'm not sure we gain anything by doing a second splice once we've had a failure though. I think if this isn't going to (largely) use copy offloading then we probably ought to stop and just return to userland as quickly as we can. > > > + dout("Final partial copy of %zu bytes\n", len); > > > + bytes = do_splice_direct(src_file, &src_off, dst_file, > > > + &dst_off, len, flags); > > > + if (bytes > 0) > > > + ret += bytes; > > > + else > > > + dout("Failed partial copy (%zd)\n", bytes); > > > } > > > > > > out: > > > > -- > > Jeff Layton > > -- Jeff Layton