Message-ID: <74126e6e301d2f4a0e5a546caa54961dbc2d492c.camel@perches.com>
Subject: Re: [PATCH] checkpatch: use utf-8 match for spell checking
From: Joe Perches
To: Antonio Borneo, Andy Whitcroft, Dwaipayan Ray, Lukas Bulwahn, Andrew Morton
Cc: linux-kernel@vger.kernel.org, Clément Léger, Clément Le Goffic, linux-stm32@st-md-mailman.stormreply.com
Date: Tue, 12 Dec 2023 11:07:22 -0800
In-Reply-To: <20231212094310.3633-1-antonio.borneo@foss.st.com>
References: <20231212094310.3633-1-antonio.borneo@foss.st.com>

On Tue, 2023-12-12 at 10:43 +0100, Antonio Borneo wrote:
> The current code that checks for misspelling verifies, in a more
> complex regex, if $rawline matches [^\w]($misspellings)[^\w]
>
> Being $rawline a byte-string, a utf-8 character in $rawline can
> match the non-word-char [^\w].
> E.g.:
>     ./script/checkpatch.pl --git 81c2f059ab9
>     WARNING: 'ment' may be misspelled - perhaps 'meant'?
>     #36: FILE: MAINTAINERS:14360:
>     +M: Clément Léger
>            ^^^^
>
> Use a utf-8 version of $rawline for spell checking.
>
> Signed-off-by: Antonio Borneo
> Reported-by: Clément Le Goffic

Seems sensible, thanks, but:

> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
[]
> @@ -3477,7 +3477,8 @@ sub process {
>  # Check for various typo / spelling mistakes
>          if (defined($misspellings) &&
>              ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
> -            while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> +            my $rawline_utf8 = decode("utf8", $rawline);
> +            while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
>                  my $typo = $1;
>                  my $blank = copy_spacing($rawline);

Maybe this needs to use $rawline_utf8 ?

>                  my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);

And maybe now the $fix bit will not always work properly.
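
To make the two points above concrete, here is a minimal standalone sketch.
It is not checkpatch code. The helper typo_offset() and both sample lines
are made up for illustration; only the regex shape and the decode() call
are taken from the quoted patch.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes, as they would be read from a patch file.
# "é" is the two bytes C3 A9. Both lines are hypothetical samples.
my $name_line = "+M:\tCl\xc3\xa9ment L\xc3\xa9ger";
my $typo_line = "+ Cl\xc3\xa9ment fixed teh bug";

# Hypothetical helper: offset of $word when it stands alone as a word in
# $str, or -1 if it does not. The regex has the same shape as the one in
# the quoted diff, with a single word in place of $misspellings.
sub typo_offset {
    my ($str, $word) = @_;
    return $str =~ /(?:^|[^\w\-'`])(\Q$word\E)(?:[^\w\-'`]|$)/i ? $-[1] : -1;
}

# Byte string: the bytes of "é" are not \w, so "ment" looks like a word.
my $name_bytes = typo_offset($name_line, "ment");                   # 8
# Decoded string: "é" is \w, so "ment" is no longer isolated.
my $name_chars = typo_offset(decode("utf8", $name_line), "ment");   # -1

# A real typo after a non-ASCII character: the byte offset and the
# character offset of the same match differ.
my $typo_bytes = typo_offset($typo_line, "teh");                    # 17
my $typo_chars = typo_offset(decode("utf8", $typo_line), "teh");    # 16

printf "ment: bytes=%d chars=%d\n", $name_bytes, $name_chars;
printf "teh:  bytes=%d chars=%d\n", $typo_bytes, $typo_chars;
```

The first pair shows the false positive going away once the line is decoded.
The second pair is the concern about $blank: after the patch, $-[1] is a
character offset, so applying it with substr() to a string derived from the
byte-string $rawline would presumably misplace the "^" marker by one column
for each multi-byte character before the typo.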