2023-12-12 09:43:36

by Antonio Borneo

Subject: [PATCH] checkpatch: use utf-8 match for spell checking

The current code that checks for misspellings verifies, as part of a
more complex regex, whether $rawline matches [^\w]($misspellings)[^\w].

Because $rawline is a byte string, a byte of a multi-byte utf-8
character in $rawline can match the non-word-char class [^\w].
E.g.:
        ./script/checkpatch.pl --git 81c2f059ab9
        WARNING: 'ment' may be misspelled - perhaps 'meant'?
        #36: FILE: MAINTAINERS:14360:
        +M:     Clément Léger <[email protected]>
                    ^^^^

Use a utf-8 version of $rawline for spell checking.

Signed-off-by: Antonio Borneo <[email protected]>
Reported-by: Clément Le Goffic <[email protected]>
---
scripts/checkpatch.pl | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 25fdb7fda112..58646bd6ef56 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3477,7 +3477,8 @@ sub process {
 # Check for various typo / spelling mistakes
                if (defined($misspellings) &&
                    ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
-                       while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
+                       my $rawline_utf8 = decode("utf8", $rawline);
+                       while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
                                my $typo = $1;
                                my $blank = copy_spacing($rawline);
                                my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);

base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
--
2.42.0
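
The false positive described in the commit message can be reproduced outside checkpatch with a minimal Perl sketch. It is an illustration under stated assumptions, not code from the patch: $misspellings is reduced to a hypothetical one-word list (the real alternation is built from scripts/spelling.txt), has_typo is a hypothetical helper, the sample line is written with explicit bytes, and neither "use locale" nor the unicode_strings feature is in effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical one-word stand-in for the alternation built from spelling.txt.
my $misspellings = "ment";

# Byte string, as a raw line would arrive: "é" is the two bytes 0xC3 0xA9.
my $rawline = "+M:\tCl\xC3\xA9ment L\xC3\xA9ger";

# Same boundary regex as in the patch, without the /g loop.
sub has_typo {
    my ($text) = @_;
    return $text =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/i ? 1 : 0;
}

# Character string: the two bytes become the single character U+00E9.
my $rawline_utf8 = decode("utf8", $rawline);

my $bytes_hit = has_typo($rawline);       # byte 0xA9 is not \w here
my $chars_hit = has_typo($rawline_utf8);  # U+00E9 is \w in a character string

print "byte string match:    $bytes_hit\n";
print "decoded string match: $chars_hit\n";
```

In the byte string the trailing byte of "é" acts as a word boundary in front of "ment"; after decoding, "é" is a word character and the boundary disappears.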


2023-12-12 19:07:41

by Joe Perches

Subject: Re: [PATCH] checkpatch: use utf-8 match for spell checking

On Tue, 2023-12-12 at 10:43 +0100, Antonio Borneo wrote:
> The current code that checks for misspellings verifies, as part of a
> more complex regex, whether $rawline matches [^\w]($misspellings)[^\w].
>
> Because $rawline is a byte string, a byte of a multi-byte utf-8
> character in $rawline can match the non-word-char class [^\w].
> E.g.:
>         ./script/checkpatch.pl --git 81c2f059ab9
>         WARNING: 'ment' may be misspelled - perhaps 'meant'?
>         #36: FILE: MAINTAINERS:14360:
>         +M:     Clément Léger <[email protected]>
>                     ^^^^
>
> Use a utf-8 version of $rawline for spell checking.
>
> Signed-off-by: Antonio Borneo <[email protected]>
> Reported-by: Clément Le Goffic <[email protected]>

Seems sensible, thanks, but:

> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
[]
> @@ -3477,7 +3477,8 @@ sub process {
>  # Check for various typo / spelling mistakes
>                 if (defined($misspellings) &&
>                     ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
> -                       while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> +                       my $rawline_utf8 = decode("utf8", $rawline);
> +                       while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
>                                 my $typo = $1;
>                                 my $blank = copy_spacing($rawline);

Maybe this needs to use $rawline_utf8 ?

>                                 my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);

And maybe now the $fix bit will not always work properly.
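
The question about copy_spacing() can be illustrated with a small Perl sketch: the offset in $-[1] counts bytes when the regex ran on a byte string and characters when it ran on the decoded string, so the pointer line comes out differently depending on which string it is built from. Everything here is an assumption made for illustration: the one-word $misspellings, the sample line, the typo_pointer helper, and the local copy_spacing, which is a stand-in assumed to behave like checkpatch's helper (every character except a tab becomes a space).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical one-word stand-in for the alternation built from spelling.txt.
my $misspellings = "recieve";

# Stand-in assumed to mirror checkpatch's copy_spacing():
# every character except a tab becomes a space.
sub copy_spacing {
    (my $res = shift) =~ tr/\t/ /c;
    return $res;
}

# Build the "^^^^" pointer line from the same string the regex ran on.
sub typo_pointer {
    my ($text) = @_;
    return undef
        unless $text =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/i;
    my ($start, $typo) = ($-[1], $1);
    return substr(copy_spacing($text), 0, $start) . "^" x length($typo);
}

my $rawline      = "+\tCl\xC3\xA9ment will recieve it";  # bytes
my $rawline_utf8 = decode("utf8", $rawline);              # characters

my $ptr_bytes = typo_pointer($rawline);
my $ptr_chars = typo_pointer($rawline_utf8);

printf "pointer from byte string:    %d characters\n", length($ptr_bytes);
printf "pointer from decoded string: %d characters\n", length($ptr_chars);
```

With this input the byte-based pointer is one character longer than the character-based one, because the two-byte "é" sits in front of the typo. Taking the offset from one string and the spacing from the other mixes the two units.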

2024-01-02 16:23:45

by Antonio Borneo

Subject: Re: [PATCH] checkpatch: use utf-8 match for spell checking

On Tue, 2023-12-12 at 11:07 -0800, Joe Perches wrote:
> On Tue, 2023-12-12 at 10:43 +0100, Antonio Borneo wrote:
> > The current code that checks for misspellings verifies, as part of a
> > more complex regex, whether $rawline matches [^\w]($misspellings)[^\w].
> >
> > Because $rawline is a byte string, a byte of a multi-byte utf-8
> > character in $rawline can match the non-word-char class [^\w].
> > E.g.:
> >         ./script/checkpatch.pl --git 81c2f059ab9
> >         WARNING: 'ment' may be misspelled - perhaps 'meant'?
> >         #36: FILE: MAINTAINERS:14360:
> >         +M:     Clément Léger <[email protected]>
> >                     ^^^^
> >
> > Use a utf-8 version of $rawline for spell checking.
> >
> > Signed-off-by: Antonio Borneo <[email protected]>
> > Reported-by: Clément Le Goffic <[email protected]>
>
> Seems sensible, thanks, but:
>
> > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
> > @@ -3477,7 +3477,8 @@ sub process {
> >  # Check for various typo / spelling mistakes
> >                 if (defined($misspellings) &&
> >                     ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
> > -                       while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> > +                       my $rawline_utf8 = decode("utf8", $rawline);
> > +                       while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> >                                 my $typo = $1;
> >                                 my $blank = copy_spacing($rawline);
>
> Maybe this needs to use $rawline_utf8 ?

Correct, I will send a v2!

>
> >                                 my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);
>
> And maybe now the $fix bit will not always work properly.

I have run some tests and it looks OK with the current ASCII file scripts/spelling.txt.

I have also tested adding some utf-8 strings to the spelling file, but checkpatch reads it as
ASCII, and extending it to utf-8 would require further modifications in checkpatch, well beyond
this simple fix.

Thanks for the review.
Antonio

2024-01-02 16:47:10

by Antonio Borneo

Subject: [PATCH v2] checkpatch: use utf-8 match for spell checking

The current code that checks for misspellings verifies, as part of a
more complex regex, whether $rawline matches [^\w]($misspellings)[^\w].

Because $rawline is a byte string, a byte of a multi-byte utf-8
character in $rawline can match the non-word-char class [^\w].
E.g.:
        ./scripts/checkpatch.pl --git 81c2f059ab9
        WARNING: 'ment' may be misspelled - perhaps 'meant'?
        #36: FILE: MAINTAINERS:14360:
        +M:     Clément Léger <[email protected]>
                    ^^^^

Use a utf-8 version of $rawline for spell checking.

Signed-off-by: Antonio Borneo <[email protected]>
Reported-by: Clément Le Goffic <[email protected]>
---
Changes in v2:
- use $rawline_utf8 also in the while-loop's body;
- fix path of checkpatch in the commit message.
---
scripts/checkpatch.pl | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 25fdb7fda112..2d122d232c6d 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3477,9 +3477,10 @@ sub process {
 # Check for various typo / spelling mistakes
                if (defined($misspellings) &&
                    ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
-                       while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
+                       my $rawline_utf8 = decode("utf8", $rawline);
+                       while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
                                my $typo = $1;
-                               my $blank = copy_spacing($rawline);
+                               my $blank = copy_spacing($rawline_utf8);
                                my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);
                                my $hereptr = "$hereline$ptr\n";
                                my $typo_fix = $spelling_fix{lc($typo)};

base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
--
2.42.0