Date: Thu, 19 Oct 2023 14:00:52 +0200
From: Willy Tarreau
To: Alexey Dobriyan
Cc: Kees Cook, Christoph Hellwig, Justin Stitt, Keith Busch, Jens Axboe,
    Sagi Grimberg, linux-nvme@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-hardening@vger.kernel.org,
    ksummit@lists.linux.dev
Subject: Re: the nul-terminated string helper desk chair rearrangement

On Thu, Oct 19, 2023 at 02:40:52PM +0300, Alexey Dobriyan wrote:
> On Thu, Oct 19, 2023 at 09:01:53AM +0200, Willy Tarreau wrote:
> > On Wed, Oct 18, 2023 at 11:01:54PM -0700, Kees Cook wrote:
> > > On Thu, Oct 19, 2023 at 07:46:42AM +0200, Christoph Hellwig wrote:
> > > > On Wed, Oct 18, 2023 at 10:48:49PM +0000, Justin Stitt wrote:
> > > > > strncpy() is deprecated for use on NUL-terminated destination strings
> > > > > [1] and as such we should prefer more robust and less ambiguous string
> > > > > interfaces.
> > > >
> > > > If we want that we need to stop pretending direct manipulation of
> > > > nul-terminated strings is a good idea. I suspect the churn of replacing
> > > > one helper with another, maybe slightly better, one probably
> > > > introduces more bugs than it fixes.
> > > >
> > > > If we want to attack the issue for real we need to use something
> > > > better.
> > > >
> > > > lib/seq_buf.c is a good start for a lot of simple cases that just
> > > > append to strings including creating complex ones.
> > > > Kent had a bunch of good ideas on how to improve it, but couldn't
> > > > be convinced to contribute to it instead of duplicating the
> > > > functionality which is a bit sad, but I think we need to switch to
> > > > something like seq_buf that actually has a counted string instead
> > > > of all this messing around with the null-terminated strings.
> > >
> > > When doing more complex string creation, I agree. I spent some time
> > > doing this while I was looking at removing strcat() and strlcat(); this
> > > is where seq_buf shines. (And seq_buf is actually both: it maintains its
> > > %NUL termination _and_ does the length counting.) The only thing clunky
> > > about it was initialization, but all the conversions I experimented with
> > > were way cleaner using seq_buf.
> > (...)
> >
> > I also agree. I'm using several other schemes based on pointer+length in
> > other projects and despite not being complete in terms of API (due to the
> > slow migration of old working code), over time it proves much easier to
> > use and requires far fewer checks.
> >
> > With NUL-terminated strings you need to perform checks for each and every
> > operation. When the length is known and controlled, most often you can
> > get rid of many tests on intermediate operations and perform a check at
> > the end, thus you end up with fewer "if" and "goto fail" in the code,
> > because the checks are no longer for "not crashing nor introducing
> > vulnerabilities", but just "returning a correct result", which can often
> > be detected more easily.
> >
> > Another benefit I found by accident is that when you need to compare some
> > tokens against multiple ones (say some keywords for example), it becomes
> > much faster than strcmp()-based if/else series because in this case you
> > start by comparing lengths instead of comparing contents.
> > And when your macros allow you to constify string constants, the
> > compiler will replace long "if" series with checks against constant
> > values, and may even arrange them as a tree since all are constants,
> > sometimes mixing with the first char as the discriminator. Typically
> > on the test below I observe a 10x speedup at -O3 and ~5x at -O2 when
> > I convert this:
> >
> >     if (!strcmp(name, "host") ||
> >         !strcmp(name, "content-length") ||
> >         !strcmp(name, "connection") ||
> >         !strcmp(name, "proxy-connection") ||
> >         !strcmp(name, "keep-alive") ||
> >         !strcmp(name, "upgrade") ||
> >         !strcmp(name, "te") ||
> >         !strcmp(name, "transfer-encoding"))
> >             return 1;
> >
> > to this:
> >
> >     if (isteq(name, ist("host")) ||
> >         isteq(name, ist("content-length")) ||
> >         isteq(name, ist("connection")) ||
> >         isteq(name, ist("proxy-connection")) ||
> >         isteq(name, ist("keep-alive")) ||
> >         isteq(name, ist("upgrade")) ||
> >         isteq(name, ist("te")) ||
> >         isteq(name, ist("transfer-encoding")))
> >             return 1;
> >
> > The code is larger but when compiled at -Os, it instead becomes smaller.
> >
> > Another interesting property I'm using in the API above, that might or
> > might not apply there, is that for most archs we care about, functions
> > can take a struct of two words passed as registers, and can return
> > such a struct as a pair of registers as well. This allows chaining
> > functions by passing one function's return as the argument to another
> > one, which is what users often want to do to avoid intermediate
> > variables.
>
> Chaining should be a nice cherry on top for very specific cases but certainly
> not promoted or advertised. Deleting intermediate variables promotes
> implementation-defined behaviour because of unspecified order of evaluation
> of function arguments. Second, debuggers still operate with lines in mind,
> so jumping to the next statement written like this
>
>     f(g(), h())
>
> can be problematic.

It obviously depends on what these functions do, but that remains true for
lots of other use cases applying to a shared memory location, if that's
the case. Also it happens that a lot of string functions that are used
as arguments to other ones are in fact lookups, skip, trim etc. which only
manipulate the pointer and the length and not the contents.

> Intermediate variables are much less of a problem now
> that -Wdeclaration-after-statement has been finally abolished! They don't
> consume LOC anymore.

Intermediate variables declared after statements remain an abomination
which turns a visual lookup from O(indent_levels) to O(lines) because
normally you only have to quickly glance at the previous opening brace
and if you don't find it, you repeat, but with them you have to visually
scan every single line. They're now allowed for macros and iterators
which can make good use of them but it's not a reason for abusing them
in code supposed to be reviewable by humans.

> > All this to say that length-based strings do offer quite a lot of
> > benefits over the long term.
>
> As long as they are named kstring :-) Or std_string, he-he.

That point is the last of my concerns ;-)

Willy
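
For readers who want to see the length-first comparison and the chaining in
concrete form, here is a minimal sketch. The thread does not show how ist()
and isteq() are defined, so the definitions below are assumptions made for
illustration only. ist_ltrim(), ist_rtrim() and is_special_header() are
hypothetical names that do not come from the thread.

```c
#include <stddef.h>
#include <string.h>

/* Assumed pointer+length string: two words, so on common 64-bit ABIs it
 * can typically be passed and returned in a pair of registers. */
struct ist {
	const char *ptr;
	size_t len;
};

/* Assumed constructor. With a string literal argument, optimizing
 * compilers usually fold strlen() into a constant. */
static inline struct ist ist(const char *s)
{
	return (struct ist){ .ptr = s, .len = strlen(s) };
}

/* Compare lengths first; contents are only read when lengths match. */
static inline int isteq(struct ist a, struct ist b)
{
	return a.len == b.len &&
	       (a.len == 0 || memcmp(a.ptr, b.ptr, a.len) == 0);
}

/* Hypothetical helper: skip leading spaces. Only the pointer and the
 * length change, the contents are never written. */
static inline struct ist ist_ltrim(struct ist s)
{
	while (s.len && *s.ptr == ' ') {
		s.ptr++;
		s.len--;
	}
	return s;
}

/* Hypothetical helper: drop trailing spaces by shrinking the length. */
static inline struct ist ist_rtrim(struct ist s)
{
	while (s.len && s.ptr[s.len - 1] == ' ')
		s.len--;
	return s;
}

/* Hypothetical wrapper around the keyword test quoted in the thread. */
static inline int is_special_header(struct ist name)
{
	if (isteq(name, ist("host")) ||
	    isteq(name, ist("content-length")) ||
	    isteq(name, ist("connection")) ||
	    isteq(name, ist("proxy-connection")) ||
	    isteq(name, ist("keep-alive")) ||
	    isteq(name, ist("upgrade")) ||
	    isteq(name, ist("te")) ||
	    isteq(name, ist("transfer-encoding")))
		return 1;
	return 0;
}
```

With these definitions, a chained call such as
is_special_header(ist_rtrim(ist_ltrim(ist("  upgrade ")))) needs no
intermediate variable, and because each call has a single argument there is
no unspecified evaluation order between arguments.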