From: Alvin Šipraga <[email protected]>
While the script correctly extracts UTF-8 encoded names from the
MAINTAINERS file, the regular expressions damage my name when parsing
from .yaml files. Fix this by replacing the Latin-1-compatible regular
expressions with the unicode property matcher \p{L}, which matches on
any letter according to the Unicode General Category of letters. It's
also necessary to instruct Perl to open all files with UTF-8 encoding.
The issue was also identified on the tools mailing list [1]. This should
solve the observed side effects there as well.
Link: https://lore.kernel.org/tools/20230726-gush-slouching-a5cd41@meerkat/ [1]
Signed-off-by: Alvin Šipraga <[email protected]>
---
Changes in v2:
- use '\p{L}' rather than '\p{Latin}', so that matching is even more
inclusive (i.e. match also Greek letters, CJK, etc.)
- fix commit message to refer to tools mailing list, not b4 mailing list
- Link to v1: https://lore.kernel.org/r/[email protected]
---
scripts/get_maintainer.pl | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index ab123b498fd9..344d0cda9854 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -20,6 +20,7 @@ use Getopt::Long qw(:config no_auto_abbrev);
use Cwd;
use File::Find;
use File::Spec::Functions;
+use open qw(:std :encoding(UTF-8));
my $cur_path = fastgetcwd() . '/';
my $lk_path = "./";
@@ -442,7 +443,7 @@ sub maintainers_in_file {
my $text = do { local($/) ; <$f> };
close($f);
- my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
+ my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
push(@file_emails, clean_file_emails(@poss_addr));
}
}
@@ -2460,13 +2461,13 @@ sub clean_file_emails {
$name = "";
}
- my @nw = split(/[^A-Za-zÀ-ÿ\'\,\.\+-]/, $name);
+ my @nw = split(/[^\p{L}\'\,\.\+-]/, $name);
if (@nw > 2) {
my $first = $nw[@nw - 3];
my $middle = $nw[@nw - 2];
my $last = $nw[@nw - 1];
- if (((length($first) == 1 && $first =~ m/[A-Za-z]/) ||
+ if (((length($first) == 1 && $first =~ m/\p{L}/) ||
(length($first) == 2 && substr($first, -1) eq ".")) ||
(length($middle) == 1 ||
(length($middle) == 2 && substr($middle, -1) eq "."))) {
---
base-commit: 70f8c6f8f8800d970b10676cceae42bba51a4899
change-id: 20231014-get-maintainers-utf8-32c65c4d6f8a
On Thu, 2023-12-14 at 16:06 +0100, Alvin Šipraga wrote:
> From: Alvin Šipraga <[email protected]>
>
> While the script correctly extracts UTF-8 encoded names from the
> MAINTAINERS file, the regular expressions damage my name when parsing
> from .yaml files. Fix this by replacing the Latin-1-compatible regular
> expressions with the unicode property matcher \p{L}, which matches on
> any letter according to the Unicode General Category of letters.
OK
> It's also necessary to instruct Perl to open all files with UTF-8 encoding.
I doubt this.
> ---
> Changes in v2:
> - use '\p{L}' rather than '\p{Latin}', so that matching is even more
> inclusive (i.e. match also Greek letters, CJK, etc.)
> - fix commit message to refer to tools mailing list, not b4 mailing list
> - Link to v1: https://lore.kernel.org/r/[email protected]
OK
> diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
[]
> @@ -20,6 +20,7 @@ use Getopt::Long qw(:config no_auto_abbrev);
> use Cwd;
> use File::Find;
> use File::Spec::Functions;
> +use open qw(:std :encoding(UTF-8));
I think this global use is unnecessary.
> @@ -442,7 +443,7 @@ sub maintainers_in_file {
> my $text = do { local($/) ; <$f> };
> close($f);
>
> - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> + my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> push(@file_emails, clean_file_emails(@poss_addr));
> }
> }
Rather than open _all_ files in utf-8, perhaps the block
that opens a specific file to find maintainers
sub maintainers_in_file {
my ($file) = @_;
return if ($file =~ m@\bMAINTAINERS$@);
if (-f $file && ($email_file_emails || $file =~ /\.yaml$/)) {
open(my $f, '<', $file)
or die "$P: Can't open $file: $!\n";
my $text = do { local($/) ; <$f> };
close($f);
...
should change the
open(my $f...
to
use open qw(:std :encoding(UTF-8));
open(my $f...
And unrelated and secondarily, perhaps the
$file =~ /\.yaml$/
test should be
$file =~ /\.(?:yaml|dtsi?)$/
to also find any maintainer address in the dts* files
https://lore.kernel.org/lkml/20231028174656.GA3310672@bill-the-cat/T/
On Thu, Dec 14, 2023 at 07:57:54AM -0800, Joe Perches wrote:
> On Thu, 2023-12-14 at 16:06 +0100, Alvin Šipraga wrote:
> > @@ -442,7 +443,7 @@ sub maintainers_in_file {
> > my $text = do { local($/) ; <$f> };
> > close($f);
> >
> > - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > + my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > push(@file_emails, clean_file_emails(@poss_addr));
> > }
> > }
>
> Rather than open _all_ files in utf-8, perhaps the block
> that opens a specific file to find maintainers
>
> sub maintainers_in_file {
> my ($file) = @_;
>
> return if ($file =~ m@\bMAINTAINERS$@);
>
> if (-f $file && ($email_file_emails || $file =~ /\.yaml$/)) {
> open(my $f, '<', $file)
> or die "$P: Can't open $file: $!\n";
> my $text = do { local($/) ; <$f> };
> close($f);
> ...
>
> should change the
>
> open(my $f...
> to
> use open qw(:std :encoding(UTF-8));
> open(my $f...
Yes, this also works for parsing the name in an arbitrary file. But with the
change you suggest above, the script then corrupts my name when it is lifted
from MAINTAINERS (!?):
$ ./scripts/get_maintainer.pl -f drivers/net/dsa/realtek/ | grep alsi
"Alvin Å ipraga" <[email protected]> (maintainer:REALTEK RTL83xx SMI DSA ROUTER CHIPS)
I'm not entirely sure why that happens, since my name doesn't get corrupted when
coming from MAINTAINERS with the upstream state of the script. But anyway, with
your suggestion I would then also have to add it here:
@@ -347,6 +346,7 @@ my @mfiles = ();
my @self_test_info = ();
sub read_maintainer_file {
+ use open qw(:std :encoding(UTF-8));
my ($file) = @_;
open (my $maint, '<', "$file")
... and I guess there might be other cases too.
Rather than scattering it all over, don't you think it's more robust to open all
files in UTF-8? I tried to show in one of my replies to v1 [1] that this should
be compatible with basically all of the source tree.
[1] https://lore.kernel.org/all/dzn6uco4c45oaa3ia4u37uo5mlt33obecv7gghj2l756fr4hdh@mt3cprft3tmq/
If you are still unconvinced then I will gladly send a v3 patching the two cases
we have discussed (read_maintainer_file() and maintainers_in_file()).
>
>
> And unrelated and secondarily, perhaps the
> $file =~ /\.yaml$/
> test should be
> $file =~ /\.(?:yaml|dtsi?)$/
>
> to also find any maintainer address in the dts* files
>
> https://lore.kernel.org/lkml/20231028174656.GA3310672@bill-the-cat/T/
Is this supposed to parse the "Copyright (c) 20xx John Doe <[email protected]>" in
the .dts* files? But sure, I can do a resend of Shawn's original patch
separately if you like.
Kind regards,
Alvin
On Fri, 2023-12-15 at 10:30 +0000, Alvin Šipraga wrote:
> On Thu, Dec 14, 2023 at 07:57:54AM -0800, Joe Perches wrote:
> > On Thu, 2023-12-14 at 16:06 +0100, Alvin Šipraga wrote:
> > > @@ -442,7 +443,7 @@ sub maintainers_in_file {
> > > my $text = do { local($/) ; <$f> };
> > > close($f);
> > >
> > > - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > > + my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > > push(@file_emails, clean_file_emails(@poss_addr));
Hi again Alvin.
Separate issue, but on the one .yaml file I tried:
$ ./scripts/get_maintainer.pl Documentation/devicetree/bindings/serial/8250.yaml
Greg Kroah-Hartman <[email protected]> (supporter:TTY LAYER AND SERIAL DRIVERS)
Jiri Slaby <[email protected]> (supporter:TTY LAYER AND SERIAL DRIVERS)
Rob Herring <[email protected]> (maintainer:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS)
Krzysztof Kozlowski <[email protected]> (maintainer:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS)
Conor Dooley <[email protected]> (maintainer:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS)
Lubomir Rintel <[email protected]> (in file)
- <[email protected]> (in file)
[email protected] (open list:TTY LAYER AND SERIAL DRIVERS)
[email protected] (open list:TTY LAYER AND SERIAL DRIVERS)
[email protected] (open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS)
Note the single '-' in the "name" portion of [email protected]
Maybe clean_file_emails needs some better name cleansing code.
> > Rather than open _all_ files in utf-8, perhaps the block
> > that opens a specific file to find maintainers
> >
> > sub maintainers_in_file {
> > my ($file) = @_;
> >
> > return if ($file =~ m@\bMAINTAINERS$@);
> >
> > if (-f $file && ($email_file_emails || $file =~ /\.yaml$/)) {
> > open(my $f, '<', $file)
> > or die "$P: Can't open $file: $!\n";
> > my $text = do { local($/) ; <$f> };
> > close($f);
> > ...
> >
> > should change the
> >
> > open(my $f...
> > to
> > use open qw(:std :encoding(UTF-8));
> > open(my $f...
>
> Yes, this also works for parsing the name in an arbitrary file. But with the
> change you suggest above, the script then corrupts my name when it is lifted
> from MAINTAINERS (!?):
>
> $ ./scripts/get_maintainer.pl -f drivers/net/dsa/realtek/ | grep alsi
> "Alvin Å ipraga" <[email protected]> (maintainer:REALTEK RTL83xx SMI DSA ROUTER CHIPS)
Curious. Let me see if I can figure out why that happens.
> If you are still unconvinced then I will gladly send a v3 patching the two cases
> we have discussed (read_maintainer_file() and maintainers_in_file()).
No rush.
> > And unrelated and secondarily, perhaps the
> > $file =~ /\.yaml$/
> > test should be
> > $file =~ /\.(?:yaml|dtsi?)$/
> >
> > to also find any maintainer address in the dts* files
> >
> > https://lore.kernel.org/lkml/20231028174656.GA3310672@bill-the-cat/T/
>
> Is this supposed to parse the "Copyright (c) 20xx John Doe <[email protected]>" in
> the .dts* files?
Yes, just as it would and does for .yaml files.
$ git grep -P -i 'copy.*\<\w+\@\w+\.\w+\>' -- '*.yaml'
Documentation/devicetree/bindings/display/bridge/chrontel,ch7033.yaml:# Copyright (C) 2019,2020 Lubomir Rintel <[email protected]>
Documentation/devicetree/bindings/media/marvell,mmp2-ccic.yaml:# Copyright 2019,2020 Lubomir Rintel <[email protected]>
Documentation/devicetree/bindings/misc/olpc,xo1.75-ec.yaml:# Copyright (C) 2019,2020 Lubomir Rintel <[email protected]>
Documentation/devicetree/bindings/phy/allwinner,sun50i-h6-usb3-phy.yaml:# Copyright 2019 Ondrej Jirman <[email protected]>
Documentation/devicetree/bindings/phy/marvell,mmp3-hsic-phy.yaml:# Copyright 2019 Lubomir Rintel <[email protected]>
Documentation/devicetree/bindings/phy/marvell,mmp3-usb-phy.yaml:# Copyright 2019,2020 Lubomir Rintel <[email protected]>
Documentation/devicetree/bindings/reset/bitmain,bm1880-reset.yaml:# Copyright 2019 Manivannan Sadhasivam <[email protected]>
Documentation/devicetree/bindings/reset/marvell,berlin2-reset.yaml:# Copyright 2015 Antoine Tenart <[email protected]>
Documentation/devicetree/bindings/reset/qca,ar7100-reset.yaml:# Copyright 2015 Alban Bedel <[email protected]>
Documentation/devicetree/bindings/serial/8250.yaml:# Copyright 2020 Lubomir Rintel <[email protected]>
Documentation/devicetree/bindings/spi/marvell,mmp2-ssp.yaml:# Copyright 2019,2020 Lubomir Rintel <[email protected]>
Documentation/devicetree/bindings/usb/marvell,pxau2o-ehci.yaml:# Copyright 2019,2020 Lubomir Rintel <[email protected]>
> But sure, I can do a resend of Shawn's original patch
> separately if you like.
Yes please. Make sure to cc Andrew Morton.
+Shawn, please have a look at the bottom of this mail about your patch
we would like to resend.
On Fri, Dec 15, 2023 at 10:30:52AM -0800, Joe Perches wrote:
> On Fri, 2023-12-15 at 10:30 +0000, Alvin Šipraga wrote:
> > On Thu, Dec 14, 2023 at 07:57:54AM -0800, Joe Perches wrote:
> > > On Thu, 2023-12-14 at 16:06 +0100, Alvin Šipraga wrote:
> > > > @@ -442,7 +443,7 @@ sub maintainers_in_file {
> > > > my $text = do { local($/) ; <$f> };
> > > > close($f);
> > > >
> > > > - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > > > + my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > > > push(@file_emails, clean_file_emails(@poss_addr));
>
> Hi again Alvin.
>
> Separate issue, but on the one .yaml file I tried:
>
> $ ./scripts/get_maintainer.pl Documentation/devicetree/bindings/serial/8250.yaml
> Greg Kroah-Hartman <[email protected]> (supporter:TTY LAYER AND SERIAL DRIVERS)
> Jiri Slaby <[email protected]> (supporter:TTY LAYER AND SERIAL DRIVERS)
> Rob Herring <[email protected]> (maintainer:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS)
> Krzysztof Kozlowski <[email protected]> (maintainer:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS)
> Conor Dooley <[email protected]> (maintainer:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS)
> Lubomir Rintel <[email protected]> (in file)
> - <[email protected]> (in file)
> [email protected] (open list:TTY LAYER AND SERIAL DRIVERS)
> [email protected] (open list:TTY LAYER AND SERIAL DRIVERS)
> [email protected] (open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS)
>
> Note the single '-' in the "name" portion of [email protected]
>
> Maybe clean_file_emails needs some better name cleansing code.
OK, I had a look and made a patch to fix this as well. Please see v3
which is on its way to your inbox.
Regarding the .dts* patch, I need some feedback below before sending
anything, so I did not include it in the series.
>
> > > Rather than open _all_ files in utf-8, perhaps the block
> > > that opens a specific file to find maintainers
> > >
> > > sub maintainers_in_file {
> > > my ($file) = @_;
> > >
> > > return if ($file =~ m@\bMAINTAINERS$@);
> > >
> > > if (-f $file && ($email_file_emails || $file =~ /\.yaml$/)) {
> > > open(my $f, '<', $file)
> > > or die "$P: Can't open $file: $!\n";
> > > my $text = do { local($/) ; <$f> };
> > > close($f);
> > > ...
> > >
> > > should change the
> > >
> > > open(my $f...
> > > to
> > > use open qw(:std :encoding(UTF-8));
> > > open(my $f...
> >
> > Yes, this also works for parsing the name in an arbitrary file. But with the
> > change you suggest above, the script then corrupts my name when it is lifted
> > from MAINTAINERS (!?):
> >
> > $ ./scripts/get_maintainer.pl -f drivers/net/dsa/realtek/ | grep alsi
> > "Alvin Å ipraga" <[email protected]> (maintainer:REALTEK RTL83xx SMI DSA ROUTER CHIPS)
>
> Curious. Let me see if I can figure out why that happens.
>
>
> > If you are still unconvinced then I will gladly send a v3 patching the two cases
> > we have discussed (read_maintainer_file() and maintainers_in_file()).
>
> No rush.
>
> > > And unrelated and secondarily, perhaps the
> > > $file =~ /\.yaml$/
> > > test should be
> > > $file =~ /\.(?:yaml|dtsi?)$/
> > >
> > > to also find any maintainer address in the dts* files
> > >
> > > https://lore.kernel.org/lkml/20231028174656.GA3310672@bill-the-cat/T/
> >
> > Is this supposed to parse the "Copyright (c) 20xx John Doe <[email protected]>" in
> > the .dts* files?
>
> Yes, just as it would and does for .yaml files.
>
> $ git grep -P -i 'copy.*\<\w+\@\w+\.\w+\>' -- '*.yaml'
> Documentation/devicetree/bindings/display/bridge/chrontel,ch7033.yaml:# Copyright (C) 2019,2020 Lubomir Rintel <[email protected]>
> Documentation/devicetree/bindings/media/marvell,mmp2-ccic.yaml:# Copyright 2019,2020 Lubomir Rintel <[email protected]>
> Documentation/devicetree/bindings/misc/olpc,xo1.75-ec.yaml:# Copyright (C) 2019,2020 Lubomir Rintel <[email protected]>
> Documentation/devicetree/bindings/phy/allwinner,sun50i-h6-usb3-phy.yaml:# Copyright 2019 Ondrej Jirman <[email protected]>
> Documentation/devicetree/bindings/phy/marvell,mmp3-hsic-phy.yaml:# Copyright 2019 Lubomir Rintel <[email protected]>
> Documentation/devicetree/bindings/phy/marvell,mmp3-usb-phy.yaml:# Copyright 2019,2020 Lubomir Rintel <[email protected]>
> Documentation/devicetree/bindings/reset/bitmain,bm1880-reset.yaml:# Copyright 2019 Manivannan Sadhasivam <[email protected]>
> Documentation/devicetree/bindings/reset/marvell,berlin2-reset.yaml:# Copyright 2015 Antoine Tenart <[email protected]>
> Documentation/devicetree/bindings/reset/qca,ar7100-reset.yaml:# Copyright 2015 Alban Bedel <[email protected]>
> Documentation/devicetree/bindings/serial/8250.yaml:# Copyright 2020 Lubomir Rintel <[email protected]>
> Documentation/devicetree/bindings/spi/marvell,mmp2-ssp.yaml:# Copyright 2019,2020 Lubomir Rintel <[email protected]>
> Documentation/devicetree/bindings/usb/marvell,pxau2o-ehci.yaml:# Copyright 2019,2020 Lubomir Rintel <[email protected]>
Hmm, I ran this over the arch/ directory on all .dts* files and got some
false positives:
[email protected]
This one comes from a lore link with a Message-ID. So deleting URLs
before parsing should fix it.
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
These can also be avoided by strengthening the email parsing regex
to more strictly validate the TLD, which may not begin with a digit.
So something like this:
diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index faa96801897a..e253053da967 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -446,7 +446,10 @@ sub maintainers_in_file {
my $text = do { local($/) ; <$f> };
close($f);
- my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
+ # Avoid mistaking URLs with Message-IDs as emails
+ $text =~ s/https?:[^\s]+//g;
+
+ my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z][A-Za-z0-9]+[\)\>\}]{0,1}$g;
push(@file_emails, clean_file_emails(@poss_addr));
}
}
>
> > But sure, I can do a resend of Shawn's original patch
> > separately if you like.
>
> Yes please. Make sure to cc Andrew Morton.
Given that some changes are needed, I want your input on how to send:
1. Should I send the above diff as a patch preceding Shawn's patch?
2. Should I send the above diff as a patch following Shawn's patch? Here
the justification is arguably more clear in the history.
3. Or should I roll it into Shawn's patch? If so, Shawn, may I have your
Signed-off-by?
Kind regards,
Alvin