2023-12-19 01:25:59

by Alvin Šipraga

Subject: [PATCH v3 0/2] get_maintainer: correctly parse UTF-8 encoded names in files

Signed-off-by: Alvin Šipraga <[email protected]>
---
Changes in v3:
- add more rationale for opening everything with UTF-8 encoding
- fix a separate issue identified when introducing UTF-8 names, namely
  that they would not get escaped with quotes as expected, due to Perl's
  default behaviour being to match UTF-8 characters with \w
- add a second patch to fix an unrelated issue mentioned by Joe whereby
  a mailing list might get the display name '-'
- Link to v2: https://lore.kernel.org/r/[email protected]

Changes in v2:
- use '\p{L}' rather than '\p{Latin}', so that matching is even more
  inclusive (i.e. match also Greek letters, CJK, etc.)
- fix commit message to refer to tools mailing list, not b4 mailing list
- Link to v1: https://lore.kernel.org/r/[email protected]
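
To illustrate the two regex points from the changelogs above, here is a
minimal standalone sketch. It is not taken from get_maintainer.pl: the
helper names and sample strings are made up for illustration. It assumes
Perl 5.14 or later for the /a modifier, which is shown only for contrast
and is not necessarily what the patch itself uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # literals below are UTF-8 source text
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helpers, for illustration only
sub word_unicode { return $_[0] =~ /^\w+$/         ? 1 : 0; }  # default \w
sub word_ascii   { return $_[0] =~ /^\w+$/a        ? 1 : 0; }  # \w limited to ASCII
sub is_latin     { return $_[0] =~ /^\p{Latin}+$/  ? 1 : 0; }  # Latin script only
sub is_letters   { return $_[0] =~ /^\p{L}+$/      ? 1 : 0; }  # any Unicode letter

# Illustrative sample names, not taken from any kernel file
foreach my $s ('Šipraga', 'Σοφια', '山田') {
    printf("%s: \\w=%d \\w/a=%d \\p{Latin}=%d \\p{L}=%d\n",
           $s, word_unicode($s), word_ascii($s), is_latin($s), is_letters($s));
}
```

With decoded character strings, the default \w accepts all three samples,
which is why a \w-based "does this name need quoting?" check can behave
differently once names are read as UTF-8. Only \p{L} accepts all three as
letters, whereas \p{Latin} accepts just the first.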

---
Alvin Šipraga (2):
  get_maintainer: correctly parse UTF-8 encoded names in files
  get_maintainer: remove stray punctuation when cleaning file emails

 scripts/get_maintainer.pl | 48 +++++++++++++++++++++++++++--------------------
 1 file changed, 28 insertions(+), 20 deletions(-)
---
base-commit: 2cf4f94d8e8646803f8fb0facf134b0cd7fb691a
change-id: 20231014-get-maintainers-utf8-32c65c4d6f8a



2023-12-19 01:25:59

by Alvin Šipraga

Subject: [PATCH v3 2/2] get_maintainer: remove stray punctuation when cleaning file emails

From: Alvin Šipraga <[email protected]>

When parsing emails from .yaml files in particular, stray punctuation
such as a leading '-' can end up in the name. For example, consider a
common YAML section such as:

    maintainers:
      - [email protected]

This would previously be processed by get_maintainer.pl as:

    - <[email protected]>

Make the logic in clean_file_emails more robust by deleting any name
parts that consist of a single common punctuation mark before
proceeding to the best-effort name extraction logic. The output is then
correct:

    [email protected]

Some additional comments are added to the function to make things
clearer to future readers.

Link: https://lore.kernel.org/all/[email protected]/
Suggested-by: Joe Perches <[email protected]>
Signed-off-by: Alvin Šipraga <[email protected]>
---
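For reviewers who want to try the new filtering step in isolation, here
is a minimal sketch. clean_name is a hypothetical helper, not a function
in get_maintainer.pl. It mirrors only the quote stripping, split and grep
steps from the diff below, and deliberately omits the "last two or three
names" heuristic and the leading/trailing [,.] trimming.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # so that \p{L} sees 'Š' as one letter
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: only the name-cleaning steps added by this patch
sub clean_name {
    my ($name) = @_;

    # Strip surrounding quotes for easier processing
    $name =~ s/^"(.*)"$/$1/;

    # Split on anything that is not a letter or one of ' , . + -
    my @nw = split(/[^\p{L}\'\,\.\+-]/, $name);

    # Drop parts that are a single punctuation mark, e.g. a YAML list '-'
    @nw = grep(!/^[\'\,\.\+-]$/, @nw);

    # Same join as the new else branch in the patch
    return "@nw";
}

# A stray YAML list marker is dropped, so the name comes back empty
print "[", clean_name('-'), "]\n";              # prints "[]"

# An ordinary two-part name passes through unchanged
print "[", clean_name('Alvin Šipraga'), "]\n";
```
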
scripts/get_maintainer.pl | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index dac38c6e3b1c..ee1aed7e090c 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -2462,11 +2462,17 @@ sub clean_file_emails {
foreach my $email (@file_emails) {
$email =~ s/[\(\<\{]{0,1}([A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+)[\)\>\}]{0,1}/\<$1\>/g;
my ($name, $address) = parse_email($email);
- if ($name eq '"[,\.]"') {
- $name = "";
- }

+ # Strip quotes for easier processing, format_email will add them back
+ $name =~ s/^"(.*)"$/$1/;
+
+ # Split into name-like parts and remove stray punctuation particles
my @nw = split(/[^\p{L}\'\,\.\+-]/, $name);
+ @nw = grep(!/^[\'\,\.\+-]$/, @nw);
+
+ # Make a best effort to extract the name, and only the name, by taking
+ # only the last two names, or in the case of obvious initials, the last
+ # three names.
if (@nw > 2) {
my $first = $nw[@nw - 3];
my $middle = $nw[@nw - 2];
@@ -2480,18 +2486,16 @@ sub clean_file_emails {
} else {
$name = "$middle $last";
}
+ } else {
+ $name = "@nw";
}

if (substr($name, -1) =~ /[,\.]/) {
$name = substr($name, 0, length($name) - 1);
- } elsif (substr($name, -2) =~ /[,\.]"/) {
- $name = substr($name, 0, length($name) - 2) . '"';
}

if (substr($name, 0, 1) =~ /[,\.]/) {
$name = substr($name, 1, length($name) - 1);
- } elsif (substr($name, 0, 2) =~ /"[,\.]/) {
- $name = '"' . substr($name, 2, length($name) - 2);
}

my $fmt_email = format_email($name, $address, $email_usename);

--
2.43.0