Subject: Re: [PATCH v2] get_maintainer: correctly parse UTF-8 encoded names in files
From: Joe Perches
To: Alvin Šipraga, Linus Torvalds
Cc: Duje Mihanović, Konstantin Ryabitsev,
    linux-kernel@vger.kernel.org, Alvin Šipraga, Shawn Guo, Andrew Morton
Date: Thu, 14 Dec 2023 07:57:54 -0800
In-Reply-To: <20231214-get-maintainers-utf8-v2-1-b188dc7042a4@bang-olufsen.dk>
References: <20231214-get-maintainers-utf8-v2-1-b188dc7042a4@bang-olufsen.dk>

On Thu, 2023-12-14 at 16:06 +0100, Alvin Šipraga wrote:
> From: Alvin Šipraga
>
> While the script correctly extracts UTF-8 encoded names from the
> MAINTAINERS file, the regular expressions damage my name when parsing
> from .yaml files. Fix this by replacing the Latin-1-compatible regular
> expressions with the unicode property matcher \p{L}, which matches on
> any letter according to the Unicode General Category of letters.

OK

> It's also necessary to instruct Perl to open all files with UTF-8 encoding.

I doubt this.

> ---
> Changes in v2:
> - use '\p{L}' rather than '\p{Latin}', so that matching is even more
>   inclusive (i.e. match also Greek letters, CJK, etc.)
> - fix commit message to refer to tools mailing list, not b4 mailing list
> - Link to v1: https://lore.kernel.org/r/20231014-get-maintainers-utf8-v1-1-3af8c7aeb239@bang-olufsen.dk

OK

> diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
[]
> @@ -20,6 +20,7 @@ use Getopt::Long qw(:config no_auto_abbrev);
>  use Cwd;
>  use File::Find;
>  use File::Spec::Functions;
> +use open qw(:std :encoding(UTF-8));

I think this global use is unnecessary.

> @@ -442,7 +443,7 @@ sub maintainers_in_file {
>  	my $text = do { local($/) ; <$f> };
>  	close($f);
>
> -	my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> +	my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
>  	push(@file_emails, clean_file_emails(@poss_addr));
>      }
>  }

Rather than open _all_ files in utf-8, perhaps the block
that opens a specific file to find maintainers

sub maintainers_in_file {
    my ($file) = @_;

    return if ($file =~ m@\bMAINTAINERS$@);

    if (-f $file && ($email_file_emails || $file =~ /\.yaml$/)) {
	open(my $f, '<', $file)
	    or die "$P: Can't open $file: $!\n";
	my $text = do { local($/) ; <$f> };
	close($f);
	...

should change the

	open(my $f...

to

	use open qw(:std :encoding(UTF-8));
	open(my $f...

And unrelated and secondarily, perhaps the

	$file =~ /\.yaml$/

test should be

	$file =~ /\.(?:yaml|dtsi?)$/

to also find any maintainer address in the dts* files

https://lore.kernel.org/lkml/20231028174656.GA3310672@bill-the-cat/T/
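
As an illustration only (not taken from the patch or from get_maintainer.pl), the two suggestions above could be sketched roughly as follows. Note the assumptions: instead of a block-scoped "use open", this sketch decodes just the one file by putting an explicit :encoding(UTF-8) layer on the three-argument open, which is a different but related technique. The helpers slurp_utf8, names_before_addresses and wants_file_emails, the simplified name regex, and the example.com address are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                        # this source file contains a UTF-8 literal
use File::Temp qw(tempfile);

# Hypothetical helper: slurp one file, decoding it as UTF-8 through a
# per-handle I/O layer rather than a script-wide "use open".
sub slurp_utf8 {
    my ($file) = @_;
    open(my $f, '<:encoding(UTF-8)', $file)
        or die "Can't open $file: $!\n";
    my $text = do { local($/); <$f> };
    close($f);
    return $text;
}

# Hypothetical, much simplified stand-in for the script's address regex:
# capture a run of letters (\p{L}), spaces, dots, quotes and hyphens that
# directly precedes an <address>.
sub names_before_addresses {
    my ($text) = @_;
    return $text =~
        m/([\p{L}][\p{L} .'-]*?)\s*<[A-Za-z0-9_.+-]+\@[A-Za-z0-9.-]+\.[A-Za-z0-9]+>/g;
}

# The extended file-name test suggested above: .yaml, .dts and .dtsi.
sub wants_file_emails {
    my ($file) = @_;
    return $file =~ /\.(?:yaml|dtsi?)$/ ? 1 : 0;
}

# Write a small UTF-8 encoded sample file, then read it back decoded.
my ($fh, $path) = tempfile(SUFFIX => '.yaml', UNLINK => 1);
binmode($fh, ':encoding(UTF-8)');
print $fh "maintainers:\n  - Alvin Šipraga <alsi\@example.com>\n";
close($fh);

my @names = names_before_addresses(slurp_utf8($path));

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @names;
```

With the layer on the open call, only the handle for that one file is decoded, so STDIN, STDOUT and every other file the script opens keep their current behaviour. Whether that is preferable to a lexically scoped "use open" inside the block is a judgment call.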