2022-09-16 09:19:07

by Janne Grunau

Subject: [PATCH] get_maintainer: Extend matched name characters in maintainers_in_file()

Extend the regexp matching name characters to cover Unicode blocks Latin
Extended-A and Extended-B.
Fixes 'scripts/get_maintainer.pl -f' for
'Documentation/devicetree/bindings/clock/apple,nco.yaml'.

Signed-off-by: Janne Grunau <[email protected]>

---
This still excludes Greek and Cyrillic characters which should be
expected in names as well. I tried to use '\p{L}' to match all Unicode
letters but couldn't get it to work. Feel free to understand this as a bug
report with an incomplete fix.

best regards,
Janne

---
scripts/get_maintainer.pl | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index ab123b498fd9..7c06f06dcbfa 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -442,7 +442,7 @@ sub maintainers_in_file {
my $text = do { local($/) ; <$f> };
close($f);

- my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
+ my @poss_addr = $text =~ m$[A-Za-zÀ-ɏ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
push(@file_emails, clean_file_emails(@poss_addr));
}
}
@@ -2460,7 +2460,7 @@ sub clean_file_emails {
$name = "";
}

- my @nw = split(/[^A-Za-zÀ-ÿ\'\,\.\+-]/, $name);
+ my @nw = split(/[^A-Za-zÀ-ɏ\'\,\.\+-]/, $name);
if (@nw > 2) {
my $first = $nw[@nw - 3];
my $middle = $nw[@nw - 2];
--
2.35.1


2022-09-16 15:58:29

by Martin Povišer

Subject: Re: [PATCH] get_maintainer: Extend matched name characters in maintainers_in_file()


> On 16. 9. 2022, at 10:47, Janne Grunau <[email protected]> wrote:
>
> Extend the regexp matching name characters to cover Unicode blocks Latin
> Extended-A and Extended-B.
> Fixes 'scripts/get_maintainer.pl -f' for
> 'Documentation/devicetree/bindings/clock/apple,nco.yaml'.
>
> Signed-off-by: Janne Grunau <[email protected]>

Applauded-and-tested-by: Martin Povišer <[email protected]>

On behalf of those not wanting to mangle our names to appease software,
let me thank you.

> This still excludes Greek and Cyrillic characters which should be
> expected in names as well. I tried to use '\p{L}' to match all Unicode
> letters but couldn't get it to work. Feel free to understand this as a bug
> report with an incomplete fix.
>
> best regards,
> Janne

2022-09-17 14:27:28

by Joe Perches

Subject: Re: [PATCH] get_maintainer: Extend matched name characters in maintainers_in_file()

On Fri, 2022-09-16 at 10:47 +0200, Janne Grunau wrote:
> Extend the regexp matching name characters to cover Unicode blocks Latin
> Extended-A and Extended-B.
> Fixes 'scripts/get_maintainer.pl -f' for
> 'Documentation/devicetree/bindings/clock/apple,nco.yaml'.
>
> Signed-off-by: Janne Grunau <[email protected]>
>
> ---
> This still excludes Greek and Cyrillic characters which should be
> expected in names as well. I tried to use '\p{L}' to match all Unicode
> letters but couldn't get it to work. Feel free to understand this as a bug
> report with an incomplete fix.

Maybe use \p{XPosixAlpha} ?

but I don't know what version of perl introduced this.

> diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
[]
> @@ -442,7 +442,7 @@ sub maintainers_in_file {
> my $text = do { local($/) ; <$f> };
> close($f);
>
> - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> + my @poss_addr = $text =~ m$[A-Za-zÀ-ɏ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;

my @poss_addr = $text =~ m$[\p{XPosixAlpha}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;

?

> push(@file_emails, clean_file_emails(@poss_addr));
> }
> }
> @@ -2460,7 +2460,7 @@ sub clean_file_emails {
> $name = "";
> }
>
> - my @nw = split(/[^A-Za-zÀ-ÿ\'\,\.\+-]/, $name);
> + my @nw = split(/[^A-Za-zÀ-ɏ\'\,\.\+-]/, $name);

Maybe here too

> + my @nw = split(/[^\p{XPosixAlpha}\'\,\.\+-]/, $name);

Dunno, haven't tested. Maybe you care to test?
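
A standalone way to try the suggested property outside get_maintainer.pl
is sketched below. It rests on an assumption that this thread does not
confirm: that the file contents reach the regexp as undecoded UTF-8 bytes.
Under that assumption a multi-byte letter is seen as two separate code
points, so a Unicode property such as \p{XPosixAlpha} splits the name.
The helper alpha_runs() and the sample name are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes of a name containing U+0161 (s with caron): 0xC5 0xA1.
my $bytes = "Povi\xC5\xA1er";
# The same name decoded into Perl characters.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: return the runs matched by \p{XPosixAlpha}.
sub alpha_runs {
    my ($s) = @_;
    my @runs = $s =~ /(\p{XPosixAlpha}+)/g;
    return @runs;
}

my @from_bytes = alpha_runs($bytes);
my @from_chars = alpha_runs($chars);

# Print counts only, to avoid wide-character output on STDOUT.
printf "undecoded bytes: %d run(s)\n", scalar @from_bytes;
printf "decoded chars:   %d run(s)\n", scalar @from_chars;
```

If the undecoded input yields more runs than the decoded one, the
property itself is fine and the input encoding is the thing to look at.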

2022-09-18 17:47:49

by Joe Perches

Subject: Re: [PATCH] get_maintainer: Extend matched name characters in maintainers_in_file()

On Sat, 2022-09-17 at 07:11 -0700, Joe Perches wrote:
> On Fri, 2022-09-16 at 10:47 +0200, Janne Grunau wrote:
> > Extend the regexp matching name characters to cover Unicode blocks Latin
> > Extended-A and Extended-B.
> > Fixes 'scripts/get_maintainer.pl -f' for
> > 'Documentation/devicetree/bindings/clock/apple,nco.yaml'.
> >
> > Signed-off-by: Janne Grunau <[email protected]>
> >
> > ---
> > This still excludes Greek and Cyrillic characters which should be
> > expected in names as well. I tried to use '\p{L}' to match all Unicode
> > letters but couldn't get it to work. Feel free to understand this as a bug
> > report with an incomplete fix.
>
> Maybe use \p{XPosixAlpha} ?
>
> but I don't know what version of perl introduced this.
>
> > diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
> []
> > @@ -442,7 +442,7 @@ sub maintainers_in_file {
> > my $text = do { local($/) ; <$f> };
> > close($f);
> >
> > - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > + my @poss_addr = $text =~ m$[A-Za-zÀ-ɏ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
>
> my @poss_addr = $text =~ m$[\p{XPosixAlpha}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;

Using variations of \p{posix} doesn't seem to work for at least perl 5.34.

\p{print} seems to work for Documentation/devicetree/bindings/clock/apple,nco.yaml,
but I don't know how fragile it is.

\p{print} might be too greedy...

---
scripts/get_maintainer.pl | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index ab123b498fd9..790112c3e1d7 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -442,7 +442,7 @@ sub maintainers_in_file {
my $text = do { local($/) ; <$f> };
close($f);

- my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
+ my @poss_addr = $text =~ m$[\p{print}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
push(@file_emails, clean_file_emails(@poss_addr));
}
}
@@ -2456,11 +2456,12 @@ sub clean_file_emails {
foreach my $email (@file_emails) {
$email =~ s/[\(\<\{]{0,1}([A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+)[\)\>\}]{0,1}/\<$1\>/g;
my ($name, $address) = parse_email($email);
+ $name =~ s/^\p{space}*\p{punct}*\p{space}*//;
if ($name eq '"[,\.]"') {
$name = "";
}

- my @nw = split(/[^A-Za-zÀ-ÿ\'\,\.\+-]/, $name);
+ my @nw = split(/[^\p{print}\'\,\.\+-]/, $name);
if (@nw > 2) {
my $first = $nw[@nw - 3];
my $middle = $nw[@nw - 2];

2022-09-18 21:17:48

by Janne Grunau

Subject: Re: [PATCH] get_maintainer: Extend matched name characters in maintainers_in_file()

On 2022-09-18 10:03:17 -0700, Joe Perches wrote:
> On Sat, 2022-09-17 at 07:11 -0700, Joe Perches wrote:
> > On Fri, 2022-09-16 at 10:47 +0200, Janne Grunau wrote:
> > > Extend the regexp matching name characters to cover Unicode blocks Latin
> > > Extended-A and Extended-B.
> > > Fixes 'scripts/get_maintainer.pl -f' for
> > > 'Documentation/devicetree/bindings/clock/apple,nco.yaml'.
> > >
> > > Signed-off-by: Janne Grunau <[email protected]>
> > >
> > > ---
> > > This still excludes Greek and Cyrillic characters which should be
> > > expected in names as well. I tried to use '\p{L}' to match all Unicode
> > > letters but couldn't get it to work. Feel free to understand this as a bug
> > > report with an incomplete fix.
> >
> > Maybe use \p{XPosixAlpha} ?
> >
> > but I don't know what version of perl introduced this.
> >
> > > diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
> > []
> > > @@ -442,7 +442,7 @@ sub maintainers_in_file {
> > > my $text = do { local($/) ; <$f> };
> > > close($f);
> > >
> > > - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > > + my @poss_addr = $text =~ m$[A-Za-zÀ-ɏ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> >
> > my @poss_addr = $text =~ m$[\p{XPosixAlpha}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
>
> Using variations of \p{posix} doesn't seem to work for at least perl 5.34.
>
> \p{print} seems to work for Documentation/devicetree/bindings/clock/apple,nco.yaml,
> but I don't know how fragile it is.
>
> \p{print} might be too greedy...

It is: it produces the following diff (checking all files in
Documentation/devicetree/bindings):
-Lubomir Rintel <[email protected]> (in file)
+"Copyright 2019,2020 Lubomir Rintel" <[email protected]> (in file)

There are multiple hits of this form. The main issue is that \p{print}
includes space. That however fixes many names with 3 parts.

It still fails for "Rafał Miłecki <[email protected]>" which my change
handles correctly.

I'm testing with perl 5.36
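
The effect of \p{print} including space can be reproduced in isolation
with the split() from clean_file_emails(). This is a minimal sketch, not
the script itself: the character class is reduced to ASCII letters for
the "old" case, and the name string is taken from the diff above.

```perl
use strict;
use warnings;

my $name = 'Copyright 2019,2020 Lubomir Rintel';

# Old-style class (ASCII-only here): spaces and digits act as separators.
my @old = split(/[^A-Za-z\'\,\.\+-]/, $name);

# \p{print} class: space and digits are printable, so nothing separates.
my @new = split(/[^\p{print}\'\,\.\+-]/, $name);

printf "%s: %d field(s), last two: '%s' '%s'\n",
    'old class', scalar @old, $old[-2], $old[-1];
printf "%s: %d field(s): '%s'\n",
    'print class', scalar @new, $new[0];
```

With the old class the trailing words can still be picked out as the
name; with \p{print} the whole string stays in a single field.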

> ---
> scripts/get_maintainer.pl | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
> index ab123b498fd9..790112c3e1d7 100755
> --- a/scripts/get_maintainer.pl
> +++ b/scripts/get_maintainer.pl
> @@ -442,7 +442,7 @@ sub maintainers_in_file {
> my $text = do { local($/) ; <$f> };
> close($f);
>
> - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> + my @poss_addr = $text =~ m$[\p{print}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> push(@file_emails, clean_file_emails(@poss_addr));
> }
> }
> @@ -2456,11 +2456,12 @@ sub clean_file_emails {
> foreach my $email (@file_emails) {
> $email =~ s/[\(\<\{]{0,1}([A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+)[\)\>\}]{0,1}/\<$1\>/g;
> my ($name, $address) = parse_email($email);
> + $name =~ s/^\p{space}*\p{punct}*\p{space}*//;

This change is useful independently of the name regexp as it rejects
'- <[email protected]>' (yaml list items) as a valid name/email combination.

Janne

2022-09-18 22:13:28

by Joe Perches

Subject: Re: [PATCH] get_maintainer: Extend matched name characters in maintainers_in_file()

On Sun, 2022-09-18 at 22:32 +0200, Janne Grunau wrote:
> On 2022-09-18 10:03:17 -0700, Joe Perches wrote:
> > On Sat, 2022-09-17 at 07:11 -0700, Joe Perches wrote:
> > > On Fri, 2022-09-16 at 10:47 +0200, Janne Grunau wrote:
> > > > Extend the regexp matching name characters to cover Unicode blocks Latin
> > > > Extended-A and Extended-B.
> > > > Fixes 'scripts/get_maintainer.pl -f' for
> > > > 'Documentation/devicetree/bindings/clock/apple,nco.yaml'.
[]
> > > > diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
> > > []
> > > > @@ -442,7 +442,7 @@ sub maintainers_in_file {
> > > > my $text = do { local($/) ; <$f> };
> > > > close($f);
> > > >
> > > > - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > > > + my @poss_addr = $text =~ m$[A-Za-zÀ-ɏ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > >
> > > my @poss_addr = $text =~ m$[\p{XPosixAlpha}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> >
> > Using variations of \p{posix} doesn't seem to work for at least perl 5.34.
> >
> > \p{print} seems to work for Documentation/devicetree/bindings/clock/apple,nco.yaml,
> > but I don't know how fragile it is.
> >
> > \p{print} might be too greedy...
>
> It is: it produces the following diff (checking all files in
> Documentation/devicetree/bindings):
> -Lubomir Rintel <[email protected]> (in file)
> +"Copyright 2019,2020 Lubomir Rintel" <[email protected]> (in file)
>
> There are multiple hits of this form. The main issue is that \p{print}
> includes space. That however fixes many names with 3 parts.

right

> > diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
[]
> > @@ -2456,11 +2456,12 @@ sub clean_file_emails {
> > foreach my $email (@file_emails) {
> > $email =~ s/[\(\<\{]{0,1}([A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+)[\)\>\}]{0,1}/\<$1\>/g;
> > my ($name, $address) = parse_email($email);
> > + $name =~ s/^\p{space}*\p{punct}*\p{space}*//;
>
> This change is useful independently of the name regexp as it rejects
> '- <[email protected]>' (yaml list items) as a valid name/email combination.

Good. The below might be a bit better too:

$name =~ s/(?:\p{space}|\p{punct})*//;
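
The two substitutions can be compared side by side. The sketch below
wraps each in a hypothetical helper (strip_v1, strip_v2) and uses made-up
sample names. The second form has no '^', but because the starred group
can match the empty string at position 0 and there is no /g, it still
only ever acts at the start of the string.

```perl
use strict;
use warnings;

# First form: optional spaces, then punctuation, then spaces, anchored.
sub strip_v1 {
    my ($n) = @_;
    $n =~ s/^\p{space}*\p{punct}*\p{space}*//;
    return $n;
}

# Second form: any leading mix of spaces and punctuation.
sub strip_v2 {
    my ($n) = @_;
    $n =~ s/(?:\p{space}|\p{punct})*//;
    return $n;
}

# Illustrative inputs; '- ' mimics the name left over from a yaml list item.
for my $name ('- ', ' - Jane Doe', '- , Jane Doe', 'Jane Doe') {
    printf "in=[%s] v1=[%s] v2=[%s]\n",
        $name, strip_v1($name), strip_v2($name);
}
```

The forms differ only when spaces and punctuation alternate more than
once at the start, as in '- , Jane Doe'.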