From: Ilya Dryomov
Date: Fri, 8 Nov 2019 17:59:12 +0100
Subject: Re: [RFC PATCH 0/2] ceph: safely use 'copy-from' Op on Octopus OSDs
To: Luis Henriques
Cc: Jeff Layton, Sage Weil, "Yan, Zheng", Ceph Development, LKML
References: <20191108141555.31176-1-lhenriques@suse.com> <20191108164758.GA1760@hermes.olymp>
In-Reply-To: <20191108164758.GA1760@hermes.olymp>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Nov 8, 2019 at 5:48 PM Luis Henriques wrote:
>
> On Fri, Nov 08, 2019 at 04:15:35PM +0100, Ilya Dryomov wrote:
> > On Fri, Nov 8, 2019 at 3:15 PM Luis Henriques wrote:
> > >
> > > Hi!
> > >
> > > (Sorry for the long cover letter!)
> >
> > This is exactly what cover letters are for!
> >
> > >
> > > Since the fix for [1] has finally been merged and should be available in
> > > the next (Octopus) ceph release, I'm trying to clean-up my kernel client
> > > patch that tries to find out whether or not it's safe to use the
> > > 'copy-from' RADOS operation for copy_file_range.
> > >
> > > So, the fix for [1] was to modify the 'copy-from' operation to allow
> > > clients to optionally (using the CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ
> > > flag) send the extra truncate_seq and truncate_size parameters.
> > > Since only Octopus will have this fix (no backports planned), the client
> > > simply needs to ensure the OSDs being used have SERVER_OCTOPUS in their
> > > features.
> > >
> > > My initial solution was to add an extra test in __submit_request,
> > > looping over all the request ops and checking if the connection has the
> > > required features for that operation. Obviously, at the moment only the
> > > copy-from operation has a restriction but I guess others may be added in
> > > the future. I believe that doing this at this point (__submit_request)
> > > makes it possible to cover cases where a cluster is being upgraded to
> > > Octopus and we have different OSDs running with different feature bits.
> > >
> > > Unfortunately, this solution is racy because the connection state
> > > machine may be changing and the peer_features field isn't yet set. For
> > > example: if the connection to an OSD is being re-opened when we're about
> > > to check the features, the con->state will be CON_STATE_PREOPEN and the
> > > con->peer_features will be 0. I tried to find ways to move the feature
> > > check further down in the stack, but that can't be easily done without
> > > adding more infrastructure. A solution that came to my mind was to add
> > > a new con->ops, invoked in the context of ceph_con_workfn, under the
> > > con->mutex. This callback could then verify the available features,
> > > aborting the operation if needed.
> > >
> > > Note that the race in this patchset doesn't seem to be a huge problem,
> > > other than occasionally reverting to a VFS generic copy_file_range, as
> > > -EOPNOTSUPP will be returned here. But it's still a race, and there are
> > > probably other cases that I'm missing.
> > >
> > > Anyway, maybe I'm missing an obvious solution for checking these OSD
> > > features, but I'm open to any suggestions on other options (or some
> > > feedback on the new callback in ceph_connection_operations option).
> > >
> > > [1] https://tracker.ceph.com/issues/37378
> >
> > If the OSD checked for unknown flags, like newer syscalls do, it would
> > be super easy, but it looks like it doesn't.
> >
> > An obvious solution is to look at require_osd_release in osdmap, but we
> > don't decode that in the kernel because it lives in the OSD portion of
> > the osdmap. We could add that and consider the fact that the client now
> > needs to decode more than just the client portion a design mistake.
> > I'm not sure what can of worms that opens and if copy-from alone is
> > worth it though. Perhaps that field could be moved to (or a copy of it
> > be replicated in) the client portion of the osdmap starting with
> > octopus? We seem to be running into it on the client side more and
> > more...
>
> I can't say I'm thrilled with the idea of going back to hack into the
> OSD code again; I was hoping to be able to handle this with the
> information we already have on the connection peer_features field. It
> took me *months* to have the OSD fix merged in so I'm not really
> convinced a change to the osdmap would make it into Octopus :-)
>
> (But I'll have a look at this and see if I can understand what moving or
> replicating the field in the osdmap would really entail.)

Just to be clear: I'm not suggesting that you do it ;) More of an
observation that something that is buried deep in the OSD portion of
the osdmap is being needed increasingly by the clients.

Thanks,

                Ilya
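
The race described in the quoted cover letter can be illustrated with a minimal user-space sketch. Everything here is a hypothetical stand-in (struct mock_con, the MOCK_* constants, mock_check_copy_from, mock_demo and the chosen feature bit value); it is not kernel code and is not taken from the patchset. It only shows why a feature test made while peer_features is still 0 looks the same as talking to a pre-Octopus OSD, so the caller would fall back with -EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical feature bit; the real SERVER_OCTOPUS bit is defined by the
 * kernel's Ceph feature headers and may differ. */
#define MOCK_FEATURE_SERVER_OCTOPUS (1ULL << 3)

/* Hypothetical, reduced model of the connection states mentioned above. */
enum mock_con_state { MOCK_CON_PREOPEN, MOCK_CON_OPEN };

struct mock_con {
	enum mock_con_state state;
	uint64_t peer_features;	/* stays 0 until negotiation completes */
};

/*
 * Naive check made at submit time: returns 0 when copy-from may be sent,
 * -EOPNOTSUPP otherwise. It cannot tell "features not negotiated yet"
 * apart from "peer lacks the feature", which is the race in question.
 */
static int mock_check_copy_from(const struct mock_con *con)
{
	if (!(con->peer_features & MOCK_FEATURE_SERVER_OCTOPUS))
		return -EOPNOTSUPP;
	return 0;
}

/* Exercises both cases: a negotiated Octopus peer and a re-opening one. */
static void mock_demo(void)
{
	struct mock_con upgraded = { MOCK_CON_OPEN, MOCK_FEATURE_SERVER_OCTOPUS };
	struct mock_con reopening = { MOCK_CON_PREOPEN, 0 };

	assert(mock_check_copy_from(&upgraded) == 0);
	/* Spurious failure: the peer may well be Octopus, we just can't see it yet. */
	assert(mock_check_copy_from(&reopening) == -EOPNOTSUPP);
}
```

Calling mock_demo() from a main() runs both cases. In this model the spurious result is harmless, matching the observation above that the only cost is an occasional fallback to the generic copy path.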