From: Luis Henriques
To: Jeff Layton, Sage Weil, Ilya Dryomov, "Yan, Zheng"
Cc: ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org, Luis Henriques
Subject: [PATCH v2] ceph: re-org copy_file_range and fix some error paths
Date: Mon, 24 Feb 2020 13:44:32 +0000
Message-Id: <20200224134432.25888-1-lhenriques@suse.com>
X-Mailing-List: linux-kernel@vger.kernel.org

This patch re-organizes copy_file_range, trying to fix a few issues in the
error handling.  Here's the summary:

- Abort the copy if the initial do_splice_direct() returns fewer bytes than
  requested.

- Move the 'size' initialization (with i_size_read()) further down in the
  code, after the initial call to do_splice_direct().  This avoids issues
  with a possibly stale value if a manual copy is done.

- Move the object copy loop into a separate function.  This makes it
  easier to handle errors (e.g., dirtying caps and updating the MDS
  metadata if only some objects have been copied before an error has
  occurred).

- Add calls to ceph_oloc_destroy() to avoid leaking memory with src_oloc
  and dst_oloc.

- After the object copy loop, the new file size to be reported to the MDS
  (if there is a file size change) is now the actual file size, and not
  the size after an eventual extra manual copy.

- Add a few dout() calls to show the number of bytes copied in the two
  manual copies and in the object copy loop.

Signed-off-by: Luis Henriques
---
Hi,

Just a respin including Jeff's suggestions from the initial post.

Changes since v1:

- Don't bother trying a second splice once we fail during the remote
  object copies; let user-space retry instead.

Cheers,
--
Luis

 fs/ceph/file.c | 173 ++++++++++++++++++++++++++++---------------------
 1 file changed, 100 insertions(+), 73 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index c3b8e8e0bf17..e0bae6b71d7b 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1931,6 +1931,71 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
 	return 0;
 }
 
+static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off,
+				    struct ceph_inode_info *dst_ci, u64 *dst_off,
+				    struct ceph_fs_client *fsc,
+				    size_t len, unsigned int flags)
+{
+	struct ceph_object_locator src_oloc, dst_oloc;
+	struct ceph_object_id src_oid, dst_oid;
+	size_t bytes = 0;
+	u64 src_objnum, src_objoff, dst_objnum, dst_objoff;
+	u32 src_objlen, dst_objlen;
+	u32 object_size = src_ci->i_layout.object_size;
+	int ret;
+
+	src_oloc.pool = src_ci->i_layout.pool_id;
+	src_oloc.pool_ns = ceph_try_get_string(src_ci->i_layout.pool_ns);
+	dst_oloc.pool = dst_ci->i_layout.pool_id;
+	dst_oloc.pool_ns = ceph_try_get_string(dst_ci->i_layout.pool_ns);
+
+	while (len >= object_size) {
+		ceph_calc_file_object_mapping(&src_ci->i_layout, *src_off,
+					      object_size, &src_objnum,
+					      &src_objoff, &src_objlen);
+		ceph_calc_file_object_mapping(&dst_ci->i_layout, *dst_off,
+					      object_size, &dst_objnum,
+					      &dst_objoff, &dst_objlen);
+		ceph_oid_init(&src_oid);
+		ceph_oid_printf(&src_oid, "%llx.%08llx",
+				src_ci->i_vino.ino, src_objnum);
+		ceph_oid_init(&dst_oid);
+		ceph_oid_printf(&dst_oid, "%llx.%08llx",
+				dst_ci->i_vino.ino, dst_objnum);
+		/* Do an object remote copy */
+		ret = ceph_osdc_copy_from(&fsc->client->osdc,
+					  src_ci->i_vino.snap, 0,
+					  &src_oid, &src_oloc,
+					  CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
+					  CEPH_OSD_OP_FLAG_FADVISE_NOCACHE,
+					  &dst_oid, &dst_oloc,
+					  CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
+					  CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
+					  dst_ci->i_truncate_seq,
+					  dst_ci->i_truncate_size,
+					  CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
+		if (ret) {
+			if (ret == -EOPNOTSUPP) {
+				fsc->have_copy_from2 = false;
+				pr_notice("OSDs don't support copy-from2; disabling copy offload\n");
+			}
+			dout("ceph_osdc_copy_from returned %d\n", ret);
+			if (!bytes)
+				bytes = ret;
+			goto out;
+		}
+		len -= object_size;
+		bytes += object_size;
+		*src_off += object_size;
+		*dst_off += object_size;
+	}
+
+out:
+	ceph_oloc_destroy(&src_oloc);
+	ceph_oloc_destroy(&dst_oloc);
+	return bytes;
+}
+
 static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 				      struct file *dst_file, loff_t dst_off,
 				      size_t len, unsigned int flags)
@@ -1941,14 +2006,11 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
 	struct ceph_cap_flush *prealloc_cf;
 	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
-	struct ceph_object_locator src_oloc, dst_oloc;
-	struct ceph_object_id src_oid, dst_oid;
-	loff_t endoff = 0, size;
-	ssize_t ret = -EIO;
+	loff_t size;
+	ssize_t ret = -EIO, bytes;
 	u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
-	u32 src_objlen, dst_objlen, object_size;
+	u32 src_objlen, dst_objlen;
 	int src_got = 0, dst_got = 0, err, dirty;
-	bool do_final_copy = false;
 
 	if (src_inode->i_sb != dst_inode->i_sb) {
 		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
@@ -2026,22 +2088,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	if (ret < 0)
 		goto out_caps;
 
-	size = i_size_read(dst_inode);
-	endoff = dst_off + len;
-
 	/* Drop dst file cached pages */
 	ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
 					    dst_off >> PAGE_SHIFT,
-					    endoff >> PAGE_SHIFT);
+					    (dst_off + len) >> PAGE_SHIFT);
 	if (ret < 0) {
 		dout("Failed to invalidate inode pages (%zd)\n", ret);
 		ret = 0; /* XXX */
 	}
-	src_oloc.pool = src_ci->i_layout.pool_id;
-	src_oloc.pool_ns = ceph_try_get_string(src_ci->i_layout.pool_ns);
-	dst_oloc.pool = dst_ci->i_layout.pool_id;
-	dst_oloc.pool_ns = ceph_try_get_string(dst_ci->i_layout.pool_ns);
-
 	ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
 				      src_ci->i_layout.object_size,
 				      &src_objnum, &src_objoff, &src_objlen);
@@ -2060,6 +2114,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	 * starting at the src_off
 	 */
 	if (src_objoff) {
+		dout("Initial partial copy of %u bytes\n", src_objlen);
+
 		/*
 		 * we need to temporarily drop all caps as we'll be calling
 		 * {read,write}_iter, which will get caps again.
@@ -2067,8 +2123,9 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 		put_rd_wr_caps(src_ci, src_got, dst_ci, dst_got);
 		ret = do_splice_direct(src_file, &src_off, dst_file,
 				       &dst_off, src_objlen, flags);
-		if (ret < 0) {
-			dout("do_splice_direct returned %d\n", err);
+		/* Abort on short copies or on error */
+		if (ret < src_objlen) {
+			dout("Failed partial copy (%zd)\n", ret);
 			goto out;
 		}
 		len -= ret;
@@ -2081,62 +2138,29 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 		if (err < 0)
 			goto out_caps;
 	}
-	object_size = src_ci->i_layout.object_size;
-	while (len >= object_size) {
-		ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
-					      object_size, &src_objnum,
-					      &src_objoff, &src_objlen);
-		ceph_calc_file_object_mapping(&dst_ci->i_layout, dst_off,
-					      object_size, &dst_objnum,
-					      &dst_objoff, &dst_objlen);
-		ceph_oid_init(&src_oid);
-		ceph_oid_printf(&src_oid, "%llx.%08llx",
-				src_ci->i_vino.ino, src_objnum);
-		ceph_oid_init(&dst_oid);
-		ceph_oid_printf(&dst_oid, "%llx.%08llx",
-				dst_ci->i_vino.ino, dst_objnum);
-		/* Do an object remote copy */
-		err = ceph_osdc_copy_from(
-			&src_fsc->client->osdc,
-			src_ci->i_vino.snap, 0,
-			&src_oid, &src_oloc,
-			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
-			CEPH_OSD_OP_FLAG_FADVISE_NOCACHE,
-			&dst_oid, &dst_oloc,
-			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
-			CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
-			dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
-			CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
-		if (err) {
-			if (err == -EOPNOTSUPP) {
-				src_fsc->have_copy_from2 = false;
-				pr_notice("OSDs don't support copy-from2; disabling copy offload\n");
-			}
-			dout("ceph_osdc_copy_from returned %d\n", err);
-			if (!ret)
-				ret = err;
-			goto out_caps;
-		}
-		len -= object_size;
-		src_off += object_size;
-		dst_off += object_size;
-		ret += object_size;
-	}
 
-	if (len)
-		/* We still need one final local copy */
-		do_final_copy = true;
+	size = i_size_read(dst_inode);
+	bytes = ceph_do_objects_copy(src_ci, &src_off, dst_ci, &dst_off,
+				     src_fsc, len, flags);
+	if (bytes <= 0) {
+		if (!ret)
+			ret = bytes;
+		goto out_caps;
+	}
+	dout("Copied %zu bytes out of %zu\n", bytes, len);
+	len -= bytes;
+	ret += bytes;
 
 	file_update_time(dst_file);
 	inode_inc_iversion_raw(dst_inode);
 
-	if (endoff > size) {
+	if (dst_off > size) {
 		int caps_flags = 0;
 
 		/* Let the MDS know about dst file size change */
-		if (ceph_quota_is_max_bytes_approaching(dst_inode, endoff))
+		if (ceph_quota_is_max_bytes_approaching(dst_inode, dst_off))
 			caps_flags |= CHECK_CAPS_NODELAY;
-		if (ceph_inode_set_size(dst_inode, endoff))
+		if (ceph_inode_set_size(dst_inode, dst_off))
 			caps_flags |= CHECK_CAPS_AUTHONLY;
 		if (caps_flags)
 			ceph_check_caps(dst_ci, caps_flags, NULL);
@@ -2152,15 +2176,18 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 out_caps:
 	put_rd_wr_caps(src_ci, src_got, dst_ci, dst_got);
 
-	if (do_final_copy) {
-		err = do_splice_direct(src_file, &src_off, dst_file,
-				       &dst_off, len, flags);
-		if (err < 0) {
-			dout("do_splice_direct returned %d\n", err);
-			goto out;
-		}
-		len -= err;
-		ret += err;
+	/*
+	 * Do the final manual copy if we still have some bytes left, unless
+	 * there were errors in remote object copies (len >= object_size).
+	 */
+	if (len && (len < src_ci->i_layout.object_size)) {
+		dout("Final partial copy of %zu bytes\n", len);
+		bytes = do_splice_direct(src_file, &src_off, dst_file,
+					 &dst_off, len, flags);
+		if (bytes > 0)
+			ret += bytes;
+		else
+			dout("Failed partial copy (%zd)\n", bytes);
 	}
 
 out:
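
For anyone following the control flow rather than the diff itself: the copy
is split into an initial manual copy up to the next object boundary, a run
of whole-object remote copies, and a final manual copy of whatever is left.
Below is a rough user-space sketch of that arithmetic only. It is not
kernel code: struct copy_plan and plan_copy() are hypothetical names made
up for this illustration, it assumes a layout where each object holds
object_size contiguous file bytes, and the clamp of the head to len is an
assumption of the sketch rather than something the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type for illustration only; not part of fs/ceph. */
struct copy_plan {
	uint64_t head;    /* bytes copied manually up to the next object boundary */
	uint64_t objects; /* whole objects handed to the remote object copy */
	uint64_t tail;    /* bytes left for the final manual copy */
};

/*
 * Split a request of 'len' bytes starting at 'src_off' into the three
 * phases described above. Assumes object_size > 0.
 */
static struct copy_plan plan_copy(uint64_t src_off, uint64_t len,
				  uint64_t object_size)
{
	struct copy_plan p = { 0, 0, 0 };
	uint64_t objoff;

	assert(object_size > 0);
	objoff = src_off % object_size;

	if (objoff) {
		/* Unaligned start: copy up to the end of the current object. */
		p.head = object_size - objoff;
		if (p.head > len)
			p.head = len; /* assumption: never plan past the request */
		len -= p.head;
	}

	/* Mirrors the 'while (len >= object_size)' loop. */
	p.objects = len / object_size;

	/* Mirrors the 'len && len < object_size' check for the final copy. */
	p.tail = len % object_size;

	return p;
}
```

With a 4 MiB object size, a 9 MiB request starting at offset 1 MiB would be
planned as a 3 MiB head, one whole object, and a 2 MiB tail. If a remote
copy fails part-way, len stays at or above object_size, which is why the
patch skips the final manual copy in that case and leaves the retry to
user-space.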