2023-10-14 17:23:38

by Alvin Šipraga

Subject: [PATCH] get_maintainer: correctly parse UTF-8 encoded names in files

From: Alvin Šipraga <[email protected]>

While the script correctly extracts UTF-8 encoded names from the
MAINTAINERS file, the regular expressions damage my name when parsing
from .yaml files. Fix this by replacing the Latin-1-compatible regular
expressions with the unicode property matcher \p{Latin}. It's also
necessary to instruct Perl to open all files with UTF-8 encoding.

The issue was also identified on the b4 mailing list [1]. This should
solve the observed side effects there as well.

Link: https://lore.kernel.org/all/20230726-gush-slouching-a5cd41@meerkat/ [1]
Signed-off-by: Alvin Šipraga <[email protected]>
---
scripts/get_maintainer.pl | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index ab123b498fd9..cb78e11623a6 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -20,6 +20,7 @@ use Getopt::Long qw(:config no_auto_abbrev);
use Cwd;
use File::Find;
use File::Spec::Functions;
+use open qw(:std :encoding(UTF-8));

my $cur_path = fastgetcwd() . '/';
my $lk_path = "./";
@@ -442,7 +443,7 @@ sub maintainers_in_file {
my $text = do { local($/) ; <$f> };
close($f);

- my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
+ my @poss_addr = $text =~ m$[\p{Latin}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
push(@file_emails, clean_file_emails(@poss_addr));
}
}
@@ -2460,13 +2461,13 @@ sub clean_file_emails {
$name = "";
}

- my @nw = split(/[^A-Za-zÀ-ÿ\'\,\.\+-]/, $name);
+ my @nw = split(/[^\p{Latin}\'\,\.\+-]/, $name);
if (@nw > 2) {
my $first = $nw[@nw - 3];
my $middle = $nw[@nw - 2];
my $last = $nw[@nw - 1];

- if (((length($first) == 1 && $first =~ m/[A-Za-z]/) ||
+ if (((length($first) == 1 && $first =~ m/\p{Latin}/) ||
(length($first) == 2 && substr($first, -1) eq ".")) ||
(length($middle) == 1 ||
(length($middle) == 2 && substr($middle, -1) eq "."))) {

---
base-commit: 70f8c6f8f8800d970b10676cceae42bba51a4899
change-id: 20231014-get-maintainers-utf8-32c65c4d6f8a
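
Editor's note: the commit message above replaces the `À-ÿ` range with `\p{Latin}`. The following is a minimal sketch of why that matters, not code from get_maintainer.pl. The helper names `run_old_class` and `run_latin_class` are hypothetical, and the character classes are shortened versions of the ones in the patch.

```perl
use strict;
use warnings;
use utf8;                                  # this source text is UTF-8
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helpers: return the leading run of $name accepted by each class.
sub run_old_class {
    my ($name) = @_;
    my ($run) = $name =~ /^([A-Za-zÀ-ÿ ]*)/;   # Latin-1 range, ends at U+00FF
    return $run;
}

sub run_latin_class {
    my ($name) = @_;
    my ($run) = $name =~ /^([\p{Latin} ]*)/;   # any Latin-script character
    return $run;
}

my $name = "Alvin Šipraga";                # Š is U+0160, above U+00FF
print "old class: '", run_old_class($name), "'\n";
print "new class: '", run_latin_class($name), "'\n";
```

On a decoded string, the old class stops right before `Š` because U+0160 lies outside the range U+00C0 to U+00FF, while `\p{Latin}` accepts the whole name.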


2023-10-16 14:37:57

by Duje Mihanović

Subject: Re: [PATCH] get_maintainer: correctly parse UTF-8 encoded names in files

On Saturday, October 14, 2023 7:22:44 PM CEST Alvin Šipraga wrote:
> From: Alvin Šipraga <[email protected]>
>
> While the script correctly extracts UTF-8 encoded names from the
> MAINTAINERS file, the regular expressions damage my name when parsing
> from .yaml files. Fix this by replacing the Latin-1-compatible regular
> expressions with the unicode property matcher \p{Latin}. It's also
> necessary to instruct Perl to open all files with UTF-8 encoding.
>
> The issue was also identified on the b4 mailing list [1]. This should
> solve the observed side effects there as well.
>
> Link: https://lore.kernel.org/all/20230726-gush-slouching-a5cd41@meerkat/ [1]
> Signed-off-by: Alvin Šipraga <[email protected]>
> ---
> scripts/get_maintainer.pl | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)

Tested-by: Duje Mihanović <[email protected]>



2023-10-16 22:18:11

by Joe Perches

Subject: Re: [PATCH] get_maintainer: correctly parse UTF-8 encoded names in files

On Mon, 2023-10-16 at 16:37 +0200, Duje Mihanović wrote:
> On Saturday, October 14, 2023 7:22:44 PM CEST Alvin Šipraga wrote:
> > From: Alvin Šipraga <[email protected]>
> >
> > While the script correctly extracts UTF-8 encoded names from the
> > MAINTAINERS file, the regular expressions damage my name when parsing
> > from .yaml files. Fix this by replacing the Latin-1-compatible regular
> > expressions with the unicode property matcher \p{Latin}.

Well, OK

> > It's also
> > necessary to instruct Perl to open all files with UTF-8 encoding.

But I'm not at all sure this is actually desired.

2023-10-16 23:56:59

by Alvin Šipraga

Subject: Re: [PATCH] get_maintainer: correctly parse UTF-8 encoded names in files

Hi Joe,

On Mon, Oct 16, 2023 at 03:17:56PM -0700, Joe Perches wrote:
> On Mon, 2023-10-16 at 16:37 +0200, Duje Mihanović wrote:
> > On Saturday, October 14, 2023 7:22:44 PM CEST Alvin Šipraga wrote:
> > > From: Alvin Šipraga <[email protected]>
> > >
> > > While the script correctly extracts UTF-8 encoded names from the
> > > MAINTAINERS file, the regular expressions damage my name when parsing
> > > from .yaml files. Fix this by replacing the Latin-1-compatible regular
> > > expressions with the unicode property matcher \p{Latin}.
>
> Well, OK
>
> > > It's also
> > > necessary to instruct Perl to open all files with UTF-8 encoding.
>
> But I'm not at all sure this is actually desired.

The whole patch, or just this last part?

Regarding the last part, it's necessary because Perl defaults to opening files
with (I think) Latin-1/ISO-8859-1, and this prevents the script from correctly
parsing UTF-8 encoded strings. It seemed the most practical solution was to just
open everything as UTF-8, including stdin/out.

Are you worried that this will cause breakage elsewhere? Indeed, while Latin-1
and UTF-8 both have the same encoding for printable ASCII, the former is not a
strict subset of the latter. But I assumed that UTF-8 would be used
everywhere in the source tree.

Now I did a check to see if that is the case using the encguess tool. See below.
It is a basic test but it seems that the vast majority of the tree is ASCII or
UTF-8.

For your reference, below is also a test sequence that shows the different results
with/without my patch, and with modifications to the encoding Perl uses when
opening files. I hope you reconsider.

Kind regards,
Alvin

----8<--------- FILE ENCODINGS IN THE TREE -------8<-------------

linux $ make mrproper
linux $ find . -type f -not -path './.git/*' \
| parallel encguess \
| grep -v -e US-ASCII -e UTF-8 \
> out.txt
linux $ head -n 2 out.txt # output is <file> <detected encoding>
./tools/include/linux/nmi.h unknown
./tools/testing/selftests/tc-testing/plugins/__init__.py unknown
linux $ cat out.txt | cut -f1 | xargs wc
0 0 0 ./tools/include/linux/nmi.h
# comment: this file is empty so encguess says unknown; ditto the others
0 0 0 ./tools/testing/selftests/tc-testing/plugins/__init__.py
0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/processor.h
0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/ppc-opcode.h
0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/firmware.h
0 0 0 ./tools/testing/selftests/powerpc/primitives/linux/stringify.h
0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/processor.h
0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/kasan.h
0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/feature-fixups.h
0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/asm-compat.h
0 0 0 ./tools/testing/kunit/test_data/test_insufficient_memory.log
66 168 1668 ./tools/perf/util/top.h
# comment: has a console escape sequence in macro CONSOLE_CLEAR
0 0 0 ./tools/perf/util/help-unknown-cmd.h
334 1950 141644 ./tools/perf/tests/pe-file.exe.debug
58 594 75595 ./tools/perf/tests/pe-file.exe
# comment: these are binary files
0 0 0 ./tools/virtio/linux/hrtimer.h
0 0 0 ./tools/virtio/generated/autoconf.h
0 0 0 ./tools/virtio/crypto/hash.h
0 0 0 ./tools/build/tests/ex/empty/Build
252 1088 5563 ./arch/m68k/hp300/hp300map.map
# comment: seems deliberately crafted, probably OK to ignore
0 0 0 ./arch/riscv/Kconfig.debug
0 0 0 ./drivers/s390/crypto/zcrypt_cex2c.h
0 0 0 ./drivers/s390/crypto/zcrypt_cex2c.c
0 0 0 ./drivers/s390/crypto/zcrypt_cex2a.h
0 0 0 ./drivers/s390/crypto/zcrypt_cex2a.c
0 0 0 ./drivers/staging/axis-fifo/README
358 1709 12218 ./drivers/tty/vt/defkeymap.map
# comment: seems deliberately crafted, probably OK to ignore
0 0 0 ./drivers/gpu/drm/ci/xfails/virtio_gpu-none-flakes.txt
0 0 0 ./drivers/gpu/drm/ci/xfails/mediatek-mt8173-flakes.txt
89 482 16335 ./Documentation/images/logo.gif
# comment: this is an image
0 0 0 ./Documentation/devicetree/bindings/media/s5p-mfc.txt
0 0 0 ./scripts/dummy-tools/dummy-plugin-dir/include/plugin-version.h
1190 6057 254726 total


----8<--------- TEST SEQUENCE FOR THIS PATCH -----8<-------------

# fetch reference patch which exhibits this issue
# => name is corrupted
linux $ git checkout master
linux $ b4 shazam -P _ 20231014-alvin-clk-si5351-no-pll-reset-v4-1-a3567024007d@bang-olufsen.dk
...
Applying: dt-bindings: clock: si5351: convert to yaml
linux $ git format-patch HEAD^
0001-dt-bindings-clock-si5351-convert-to-yaml.patch
linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi
grep: (standard input): binary file matches
linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
" ipraga" <[email protected]> (in file)


# apply my patch to get_maintainer.pl
# => name is OK
linux $ b4 shazam [email protected]
...
Applying: get_maintainer: correctly parse UTF-8 encoded names in files
linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
Alvin Šipraga <[email protected]> (in file)


# remove 'use open qw(:std :encoding(UTF-8))'
# => name is still corrupted, slightly differently
linux $ sed -i '/^use open/d' -i ./scripts/get_maintainer.pl
linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
ipraga <[email protected]> (in file)


# remove only the :std part
# => name is OK(?), but perl complains about wide char
linux $ git restore .
linux $ sed -i 's/:std //' -i ./scripts/get_maintainer.pl
linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
Wide character in print at ./scripts/get_maintainer.pl line 2522.
Alvin Šipraga <[email protected]> (in file)
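
Editor's note: the message above argues that files must be read through a UTF-8 layer before the regular expressions can see `Š` as one character. The following is a minimal sketch of that difference, assuming the name arrives as raw UTF-8 bytes. It uses `Encode::decode` to stand in for the decoding that `use open qw(:std :encoding(UTF-8))` applies at the I/O layer. The helper `describe` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: report the string length and whether U+0160 (Š) occurs.
sub describe {
    my ($s) = @_;
    return (length($s), ($s =~ /\x{160}/ ? 1 : 0));
}

my $raw     = "\xC5\xA0ipraga";            # UTF-8 bytes of "Šipraga"
my $decoded = decode('UTF-8', $raw);       # what an :encoding(UTF-8) layer yields

printf "raw:     length=%d has_U+0160=%d\n", describe($raw);
printf "decoded: length=%d has_U+0160=%d\n", describe($decoded);
```

Without decoding, the two bytes 0xC5 and 0xA0 are treated as two separate characters, so no pattern written in terms of U+0160 can match them. After decoding, the string is one character shorter and contains U+0160.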

2023-12-14 01:06:32

by Alvin Šipraga

Subject: Re: [PATCH] get_maintainer: correctly parse UTF-8 encoded names in files

Hi again,

Sorry to be a nuisance, but could you please have another look below and
reconsider this patch? Otherwise NAK is fine, but I wanted to follow up
on this as it solves an actual, albeit minor, issue for people with
unusual names when sending and receiving patches.

Thanks!

Kind regards,
Alvin

On Mon, Oct 16, 2023 at 11:56:32PM +0000, Alvin Šipraga wrote:
> Hi Joe,
>
> On Mon, Oct 16, 2023 at 03:17:56PM -0700, Joe Perches wrote:
> > On Mon, 2023-10-16 at 16:37 +0200, Duje Mihanović wrote:
> > > On Saturday, October 14, 2023 7:22:44 PM CEST Alvin Šipraga wrote:
> > > > From: Alvin Šipraga <[email protected]>
> > > >
> > > > While the script correctly extracts UTF-8 encoded names from the
> > > > MAINTAINERS file, the regular expressions damage my name when parsing
> > > > from .yaml files. Fix this by replacing the Latin-1-compatible regular
> > > > expressions with the unicode property matcher \p{Latin}.
> >
> > Well, OK
> >
> > > > It's also
> > > > necessary to instruct Perl to open all files with UTF-8 encoding.
> >
> > But I'm not at all sure this is actually desired.
>
> The whole patch, or just this last part?
>
> Regarding the last part, it's necessary because Perl defaults to opening files
> with (I think) Latin-1/ISO-8859-1, and this prevents the script from correctly
> parsing UTF-8 encoded strings. It seemed the most practical solution was to just
> open everything as UTF-8, including stdin/out.
>
> Are you worried that this will cause breakage elsewhere? Indeed, while Latin-1
> and UTF-8 both have the same encoding for printable ASCII, the former is not a
> strict subset of the latter. But I assumed that UTF-8 would be used
> everywhere in the source tree.
>
> Now I did a check to see if that is the case using the encguess tool. See below.
> It is a basic test but it seems that the vast majority of the tree is ASCII or
> UTF-8.
>
> For your reference, below is also a test sequence that shows the different results
> with/without my patch, and with modifications to the encoding Perl uses when
> opening files. I hope you reconsider.
>
> Kind regards,
> Alvin
>
> ----8<--------- FILE ENCODINGS IN THE TREE -------8<-------------
>
> linux $ make mrproper
> linux $ find . -type f -not -path './.git/*' \
> | parallel encguess \
> | grep -v -e US-ASCII -e UTF-8 \
> > out.txt
> linux $ head -n 2 out.txt # output is <file> <detected encoding>
> ./tools/include/linux/nmi.h unknown
> ./tools/testing/selftests/tc-testing/plugins/__init__.py unknown
> linux $ cat out.txt | cut -f1 | xargs wc
> 0 0 0 ./tools/include/linux/nmi.h
> # comment: this file is empty so encguess says unknown; ditto the others
> 0 0 0 ./tools/testing/selftests/tc-testing/plugins/__init__.py
> 0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/processor.h
> 0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/ppc-opcode.h
> 0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/firmware.h
> 0 0 0 ./tools/testing/selftests/powerpc/primitives/linux/stringify.h
> 0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/processor.h
> 0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/kasan.h
> 0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/feature-fixups.h
> 0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/asm-compat.h
> 0 0 0 ./tools/testing/kunit/test_data/test_insufficient_memory.log
> 66 168 1668 ./tools/perf/util/top.h
> # comment: has a console escape sequence in macro CONSOLE_CLEAR
> 0 0 0 ./tools/perf/util/help-unknown-cmd.h
> 334 1950 141644 ./tools/perf/tests/pe-file.exe.debug
> 58 594 75595 ./tools/perf/tests/pe-file.exe
> # comment: these are binary files
> 0 0 0 ./tools/virtio/linux/hrtimer.h
> 0 0 0 ./tools/virtio/generated/autoconf.h
> 0 0 0 ./tools/virtio/crypto/hash.h
> 0 0 0 ./tools/build/tests/ex/empty/Build
> 252 1088 5563 ./arch/m68k/hp300/hp300map.map
> # comment: seems deliberately crafted, probably OK to ignore
> 0 0 0 ./arch/riscv/Kconfig.debug
> 0 0 0 ./drivers/s390/crypto/zcrypt_cex2c.h
> 0 0 0 ./drivers/s390/crypto/zcrypt_cex2c.c
> 0 0 0 ./drivers/s390/crypto/zcrypt_cex2a.h
> 0 0 0 ./drivers/s390/crypto/zcrypt_cex2a.c
> 0 0 0 ./drivers/staging/axis-fifo/README
> 358 1709 12218 ./drivers/tty/vt/defkeymap.map
> # comment: seems deliberately crafted, probably OK to ignore
> 0 0 0 ./drivers/gpu/drm/ci/xfails/virtio_gpu-none-flakes.txt
> 0 0 0 ./drivers/gpu/drm/ci/xfails/mediatek-mt8173-flakes.txt
> 89 482 16335 ./Documentation/images/logo.gif
> # comment: this is an image
> 0 0 0 ./Documentation/devicetree/bindings/media/s5p-mfc.txt
> 0 0 0 ./scripts/dummy-tools/dummy-plugin-dir/include/plugin-version.h
> 1190 6057 254726 total
>
>
> ----8<--------- TEST SEQUENCE FOR THIS PATCH -----8<-------------
>
> # fetch reference patch which exhibits this issue
> # => name is corrupted
> linux $ git checkout master
> linux $ b4 shazam -P _ 20231014-alvin-clk-si5351-no-pll-reset-v4-1-a3567024007d@bang-olufsen.dk
> ...
> Applying: dt-bindings: clock: si5351: convert to yaml
> linux $ git format-patch HEAD^
> 0001-dt-bindings-clock-si5351-convert-to-yaml.patch
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi
> grep: (standard input): binary file matches
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> " ipraga" <[email protected]> (in file)
>
>
> # apply my patch to get_maintainer.pl
> # => name is OK
> linux $ b4 shazam [email protected]
> ...
> Applying: get_maintainer: correctly parse UTF-8 encoded names in files
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> Alvin Šipraga <[email protected]> (in file)
>
>
> # remove 'use open qw(:std :encoding(UTF-8))'
> # => name is still corrupted, slightly differently
> linux $ sed -i '/^use open/d' -i ./scripts/get_maintainer.pl
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> ipraga <[email protected]> (in file)
>
>
> # remove only the :std part
> # => name is OK(?), but perl complains about wide char
> linux $ git restore .
> linux $ sed -i 's/:std //' -i ./scripts/get_maintainer.pl
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> Wide character in print at ./scripts/get_maintainer.pl line 2522.
> Alvin Šipraga <[email protected]> (in file)

2023-12-14 01:42:29

by Linus Torvalds

Subject: Re: [PATCH] get_maintainer: correctly parse UTF-8 encoded names in files

On Wed, 13 Dec 2023 at 17:06, Alvin Šipraga <[email protected]> wrote:
>
> Sorry to be a nuisance, but could you please have another look below and
> reconsider this patch? Otherwise NAK is fine, but I wanted to follow up
> on this as it solves an actual, albeit minor, issue for people with
> unusual names when sending and receiving patches.

The patch seems bogus, because it shouldn't have any "Latin" encoding
issues at all.

Opening as utf8 makes sense, but the "Latin" part of the regular
expressions seems bogus.

IOW, isn't '\p{L}' the right pattern for a "letter"? Isn't that what
we actually care about here?

Replacing one locale bug with just another locale bug seems pointless.

Linus
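
Editor's note: the point above is that `\p{Latin}` is a script property while `\p{L}` is the general "letter" category. The following is a minimal sketch of the difference, loosely modelled on the `split` in `clean_file_emails`. The helpers `words_latin` and `words_letter` are hypothetical, and the second name is an illustrative example that does not come from the thread.

```perl
use strict;
use warnings;
use utf8;                                  # this source text is UTF-8
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helpers: split a name into words, keeping non-empty fields.
sub words_latin {
    my ($name) = @_;
    return grep { length } split(/[^\p{Latin}'\,\.\+-]/, $name);
}

sub words_letter {
    my ($name) = @_;
    return grep { length } split(/[^\p{L}'\,\.\+-]/, $name);
}

for my $name ("Alvin Šipraga", "Иван Петров") {
    my @latin  = words_latin($name);
    my @letter = words_letter($name);
    printf "%s: Latin=%d L=%d\n", $name, scalar @latin, scalar @letter;
}
```

With `\p{Latin}`, every Cyrillic character is treated as a separator, so the second name yields no words at all. With `\p{L}`, both names split into two words.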

2023-12-14 14:58:05

by Alvin Šipraga

Subject: Re: [PATCH] get_maintainer: correctly parse UTF-8 encoded names in files

On Wed, Dec 13, 2023 at 05:41:59PM -0800, Linus Torvalds wrote:
> On Wed, 13 Dec 2023 at 17:06, Alvin Šipraga <[email protected]> wrote:
> >
> > Sorry to be a nuisance, but could you please have another look below and
> > reconsider this patch? Otherwise NAK is fine, but I wanted to follow up
> > on this as it solves an actual, albeit minor, issue for people with
> > unusual names when sending and receiving patches.
>
> The patch seems bogus, because it shouldn't have any "Latin" encoding
> issues at all.
>
> Opening as utf8 makes sense, but the "Latin" part of the regular
> expressions seems bogus.
>
> IOW, isn't '\p{L}' the right pattern for a "letter"? Isn't that what
> we actually care about here?

Yes, you have a point, I was being too conservative with the choice of
'\p{Latin}'. I will send a v2 using '\p{L}'.

>
> Replacing one locale bug with just another locale bug seems pointless.

Thanks for the review!

Kind regards,
Alvin