2021-06-11 07:19:56

Subject: [PATCH 1/3] scripts: add spelling_sanitizer.sh script

The file scripts/spelling.txt recorded a large number of
"mistake||correction" pairs. These entries are currently maintained in
order, but the results are not strict. In addition, when someone wants to
add some new pairs, he either sort them manually or write a script, which
is clearly a waste of labor. So add this script. It removes the duplicates
first, then sort by correctly spelled words. Sorting based on misspelled
words is not chose because it is uncontrollable.

Signed-off-by: Zhen Lei <[email protected]>
---
scripts/spelling_sanitizer.sh | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
create mode 100755 scripts/spelling_sanitizer.sh

diff --git a/scripts/spelling_sanitizer.sh b/scripts/spelling_sanitizer.sh
new file mode 100755
index 000000000000..4936c4191653
--- /dev/null
+++ b/scripts/spelling_sanitizer.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+
+src=spelling.txt
+tmp=spelling_mistake_correction_pairs.txt
+
+cd `dirname $0`
+
+# Convert the format of 'codespell' to the current
+sed -r -i 's/ ==> /||/' $src
+
+# Move the spelling "mistake||correction" pairs into file $tmp
+# There are currently 9 lines of comments in $src, so the text starts at line 10
+sed -n '10,$p' $src > $tmp
+sed -i '10,$d' $src
+
+# Remove duplicates first, then sort by correctly spelled words
+sort -u $tmp -o $tmp
+sort -t '|' -k 3 $tmp -o $tmp
+
+# Append sorted results to comments
+cat $tmp >> $src
+
+# Delete the temporary file
+rm -f $tmp
+
+cd - > /dev/null
--
2.25.1

2021-06-11 08:01:29

by Andy Shevchenko

Subject: Re: [PATCH 1/3] scripts: add spelling_sanitizer.sh script

On Fri, Jun 11, 2021 at 10:19 AM Zhen Lei <[email protected]> wrote:
>
> The file scripts/spelling.txt recorded a large number of
> "mistake||correction" pairs. These entries are currently maintained in
> order, but the results are not strict. In addition, when someone wants to
> add some new pairs, he either sort them manually or write a script, which
> is clearly a waste of labor. So add this script. It removes the duplicates
> first, then sort by correctly spelled words. Sorting based on misspelled
> words is not chose because it is uncontrollable.

chosen

...

> +#!/bin/sh

If you want to have stricter rules applied, use
#!/bin/sh -efu
in all your shell scripts; it will show you a lot of problems.
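
For illustration only, here is a small sketch of what those three options change. The variable names and sample commands are hypothetical and not part of the patch: `-e` exits on a failing command, `-f` disables pathname expansion, and `-u` treats expansion of an unset variable as an error.

```shell
#!/bin/sh
# Each check runs in a child shell so that the failure stays contained.

# -u: expanding an unset variable is an error instead of an empty string.
if sh -u -c 'echo "$undefined_name"' 2>/dev/null; then
    u_result="no error"
else
    u_result="error"
fi

# -e: the shell exits as soon as a simple command fails,
# so the echo after 'false' is never reached.
e_result=$(sh -e -c 'false; echo "still running"' || true)

# -f: globbing is off, so the pattern is passed through literally.
f_result=$(sh -f -c 'echo /*')

echo "-u: $u_result"
echo "-e: '$e_result'"
echo "-f: $f_result"
```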

The SPDX license identifier is missing.

> +src=spelling.txt

> +tmp=spelling_mistake_correction_pairs.txt

This will pollute the source tree, so use `mktemp` or honor O=. If no
O= (or whatever equivalent describes the output folder) is supplied,
the file lands in the source tree, so it would need to be
Git-ignored.
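
A minimal sketch of the `mktemp` approach, assuming the utility is installed (GNU coreutils and the BSDs provide it). The sample pairs are made up for the example.

```shell
#!/bin/sh
# Create the scratch file outside the source tree; mktemp prints its path.
tmp=$(mktemp) || exit 1

# Hypothetical sample data standing in for the pairs from spelling.txt.
printf '%s\n' 'teh||the' 'recieve||receive' > "$tmp"

# Count the lines; tr strips the padding some wc implementations add.
lines=$(wc -l < "$tmp" | tr -d ' ')
echo "scratch file holds $lines lines"

# Remove the scratch file when done.
rm -f "$tmp"
```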

> +cd `dirname $0`

Useless use of dirname. Check for %, %%, #, and ## substitutions (`man sh`).
IIRC dirname equivalent is ${0%/*}.
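
A short sketch comparing the two on hypothetical paths. One difference worth keeping in mind: when the path contains no slash, `${path%/*}` leaves it unchanged, whereas `dirname` prints `.`. The helper name `script_dir` is invented for this example.

```shell
#!/bin/sh
# Compare the parameter expansion with dirname on two sample paths.
for path in scripts/spelling_sanitizer.sh spelling_sanitizer.sh; do
    printf '%s: expansion=%s dirname=%s\n' \
        "$path" "${path%/*}" "$(dirname "$path")"
done

# Guarded variant: fall back to "." when there is no slash,
# and to "/" when the only slash is the leading one.
script_dir() {
    case $1 in
    */*)
        d=${1%/*}
        printf '%s\n' "${d:-/}"
        ;;
    *)
        printf '%s\n' .
        ;;
    esac
}

script_dir scripts/spelling_sanitizer.sh
script_dir spelling_sanitizer.sh
```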

> +# Convert the format of 'codespell' to the current
> +sed -r -i 's/ ==> /||/' $src
> +
> +# Move the spelling "mistake||correction" pairs into file $tmp

> +# There are currently 9 lines of comments in $src, so the text starts at line 10
> +sed -n '10,$p' $src > $tmp
> +sed -i '10,$d' $src

This is fragile, use proper comment line detection.
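
One possible form of comment detection, sketched under the assumption that every header line in spelling.txt starts with `#` and that no pair line does. The miniature file below is invented for the example.

```shell
#!/bin/sh
work=$(mktemp -d) || exit 1
src=$work/spelling.txt

# Hypothetical miniature of spelling.txt: comment header, then pairs.
cat > "$src" <<'EOF'
# Header comment one
# Header comment two
#
teh||the
recieve||receive
EOF

# Select lines by content rather than by a hard-coded line number.
header_count=$(grep -c '^#' "$src")
pairs=$(grep -v '^#' "$src")

echo "header lines: $header_count"
printf '%s\n' "$pairs"

rm -rf "$work"
```

If another comment line is added to the header later, nothing in the script needs to change.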

> +# Remove duplicates first, then sort by correctly spelled words
> +sort -u $tmp -o $tmp
> +sort -t '|' -k 3 $tmp -o $tmp

Can be one pipeline
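
A sketch of the combined pipeline on made-up input. `LC_ALL=C` is set here only to make the ordering reproducible; the patch does not do that.

```shell
#!/bin/sh
export LC_ALL=C

# Hypothetical input: one exact duplicate, and two mistakes that
# share the same correction.
sorted=$(printf '%s\n' \
    'teh||the' \
    'recieve||receive' \
    'teh||the' \
    'hte||the' |
    sort -u | sort -t '|' -k 3)

printf '%s\n' "$sorted"
```

Folding both steps into a single `sort -u -t '|' -k 3` would not be equivalent: with a key given, `-u` compares only the key, so `hte||the` and `teh||the` would collapse into one line.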

> +# Append sorted results to comments
> +cat $tmp >> $src

I believe it can be done in a better way, but I have not thought about how.
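
One possible alternative, sketched on the assumption that header lines start with `#`: write the header and the sorted pairs into a new file with a single grouped command, then rename it over the original. The file contents and the `.new` suffix are invented for the example.

```shell
#!/bin/sh
export LC_ALL=C
work=$(mktemp -d) || exit 1
src=$work/spelling.txt

# Hypothetical miniature of spelling.txt.
cat > "$src" <<'EOF'
# Header comment
teh||the
recieve||receive
teh||the
EOF

# Emit the header unchanged, then the de-duplicated, sorted pairs.
# The original is replaced only if the group succeeds.
{
    grep '^#' "$src"
    grep -v '^#' "$src" | sort -u | sort -t '|' -k 3
} > "$src.new" && mv "$src.new" "$src"

result=$(cat "$src")
printf '%s\n' "$result"

rm -rf "$work"
```

This avoids both the in-place `sed -i` deletion and the later append, at the cost of one extra file next to the original.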

> +# Delete the temporary file
> +rm -f $tmp

What if the script is interrupted? It would be good to handle SIGHUP, I
suppose, so that we don't leave garbage behind.
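
A minimal sketch of such cleanup, assuming a scratch file created with `mktemp`. The function name `cleanup` is made up, and the choice of signals is only one reasonable option.

```shell
#!/bin/sh
tmp=$(mktemp) || exit 1

# Remove the scratch file however the script ends.
cleanup() {
    rm -f "$tmp"
}
trap cleanup EXIT

# Turn common termination signals into a normal exit, so that the EXIT
# trap also runs in shells that would otherwise skip it on a signal.
trap 'exit 1' HUP INT TERM

printf '%s\n' 'teh||the' > "$tmp"
# ... work on "$tmp" here ...
```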

> +cd - > /dev/null

--
With Best Regards,
Andy Shevchenko

2021-06-11 09:32:18

by Zhen Lei

Subject: Re: [PATCH 1/3] scripts: add spelling_sanitizer.sh script

On 2021/6/11 15:58, Andy Shevchenko wrote:
> On Fri, Jun 11, 2021 at 10:19 AM Zhen Lei <[email protected]> wrote:
>>
>> The file scripts/spelling.txt recorded a large number of
>> "mistake||correction" pairs. These entries are currently maintained in
>> order, but the results are not strict. In addition, when someone wants to
>> add some new pairs, he either sort them manually or write a script, which
>> is clearly a waste of labor. So add this script. It removes the duplicates
>> first, then sort by correctly spelled words. Sorting based on misspelled
>> words is not chose because it is uncontrollable.
>
> chosen

OK

>
> ...
>
>> +#!/bin/sh
>
> If you want to have stricter rules applied, use
> #!/bin/sh -efu
> in all your shell scripts, it will show you a lot of problems.
>
> Missed SPDX.

OK, I will add it.

>
>> +src=spelling.txt
>
>> +tmp=spelling_mistake_correction_pairs.txt
>
> It will pollute the source tree, so use `mktemp` or utilize O=. In
> case there is no O= supplied (or whatever equivalent to describe
> output folder) you will get it in the source tree, so it needs to be
> Git-ignored.

OK, I will use mktemp to generate the tmp file.

>
>> +cd `dirname $0`
>
> Useless use of dirname. Check for %, %%, #, and ## substitutions (`man sh`).
> IIRC dirname equivalent is ${0%/*}.

I just tried it. It works.

>
>> +# Convert the format of 'codespell' to the current
>> +sed -r -i 's/ ==> /||/' $src
>> +
>> +# Move the spelling "mistake||correction" pairs into file $tmp
>
>> +# There are currently 9 lines of comments in $src, so the text starts at line 10
>> +sed -n '10,$p' $src > $tmp
>> +sed -i '10,$d' $src
>
> This is fragile, use proper comment line detection.

I've thought about that too, but I wondered whether it needs to be that
complicated.

On reflection, this is not a throwaway script for personal use, so it
should be robust. I'll switch to computing the boundary dynamically.

>
>> +# Remove duplicates first, then sort by correctly spelled words
>> +sort -u $tmp -o $tmp
>> +sort -t '|' -k 3 $tmp -o $tmp
>
> Can be one pipeline

OK, I will combine it.

>
>> +# Append sorted results to comments
>> +cat $tmp >> $src
>
> I believe it can be done in a better way, but I was not thinking about it.

I'll keep searching.

>
>> +# Delete the temporary file
>> +rm -f $tmp
>
> What if the script will be trapped? It's good to handle SIGHUP I
> suppose, so we won't leave garbage behind us.

That's very thorough of you. I'll take care of it.

>
>> +cd - > /dev/null
>

2021-06-11 09:46:20

by Andy Shevchenko

Subject: Re: [PATCH 1/3] scripts: add spelling_sanitizer.sh script

On Fri, Jun 11, 2021 at 12:30 PM Leizhen (ThunderTown)
<[email protected]> wrote:
> On 2021/6/11 15:58, Andy Shevchenko wrote:
> > On Fri, Jun 11, 2021 at 10:19 AM Zhen Lei <[email protected]> wrote:

...

> >> +# Convert the format of 'codespell' to the current
> >> +sed -r -i 's/ ==> /||/' $src
> >> +
> >> +# Move the spelling "mistake||correction" pairs into file $tmp
> >
> >> +# There are currently 9 lines of comments in $src, so the text starts at line 10
> >> +sed -n '10,$p' $src > $tmp
> >> +sed -i '10,$d' $src
> >
> > This is fragile, use proper comment line detection.
>
> I've thought about that too. But I'm wondering if it needs to be that
> complicated.
>
> Think about it. It's not something for personal temporary use, so it
> should be perfect. I'll change to dynamic computing.

sed can select the lines between two anchors.

Search for `sed -e '/anchor 1/,/anchor 2/'` expressions. That will
be less complicated than the current code.
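
As a rough sketch, here is a variant of that address-range form in which the second address is `$` (the last line). It assumes header lines start with `#` and pair lines do not; the sample file is invented.

```shell
#!/bin/sh
work=$(mktemp -d) || exit 1
src=$work/spelling.txt

# Hypothetical miniature of spelling.txt.
cat > "$src" <<'EOF'
# Header comment one
# Header comment two
teh||the
recieve||receive
EOF

# From the first line that does not start with '#' through the last line.
pairs=$(sed -n '/^[^#]/,$p' "$src")
# Everything before that range, i.e. the comment header.
header=$(sed '/^[^#]/,$d' "$src")

printf 'header:\n%s\npairs:\n%s\n' "$header" "$pairs"

rm -rf "$work"
```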

--
With Best Regards,
Andy Shevchenko

2021-06-11 10:01:03

by Zhen Lei

Subject: Re: [PATCH 1/3] scripts: add spelling_sanitizer.sh script

On 2021/6/11 17:41, Andy Shevchenko wrote:
> On Fri, Jun 11, 2021 at 12:30 PM Leizhen (ThunderTown)
> <[email protected]> wrote:
>> On 2021/6/11 15:58, Andy Shevchenko wrote:
>>> On Fri, Jun 11, 2021 at 10:19 AM Zhen Lei <[email protected]> wrote:
>
> ...
>
>>>> +# Convert the format of 'codespell' to the current
>>>> +sed -r -i 's/ ==> /||/' $src
>>>> +
>>>> +# Move the spelling "mistake||correction" pairs into file $tmp
>>>
>>>> +# There are currently 9 lines of comments in $src, so the text starts at line 10
>>>> +sed -n '10,$p' $src > $tmp
>>>> +sed -i '10,$d' $src
>>>
>>> This is fragile, use proper comment line detection.
>>
>> I've thought about that too. But I'm wondering if it needs to be that
>> complicated.
>>
>> Think about it. It's not something for personal temporary use, so it
>> should be perfect. I'll change to dynamic computing.
>
> sed has a possibility to choose between two anchors.
>
> Google for `sed -e '/anchor 1/,/anchor 2/'` expressions. So, it will
> be less complicated than current code.

OK, thanks. I'm off work. I'll post the v2 next week.


2021-06-11 15:38:57

by Joe Perches

Subject: Re: [PATCH 1/3] scripts: add spelling_sanitizer.sh script

On Fri, 2021-06-11 at 15:12 +0800, Zhen Lei wrote:
> The file scripts/spelling.txt recorded a large number of
> "mistake||correction" pairs. These entries are currently maintained in
> order, but the results are not strict. In addition, when someone wants to
> add some new pairs, he either sort them manually or write a script, which
> is clearly a waste of labor.

Try using lintian's make sort

https://salsa.debian.org/lintian/lintian

2021-06-15 07:02:37

by Zhen Lei

Subject: Re: [PATCH 1/3] scripts: add spelling_sanitizer.sh script

On 2021/6/11 23:36, Joe Perches wrote:
> On Fri, 2021-06-11 at 15:12 +0800, Zhen Lei wrote:
>> The file scripts/spelling.txt recorded a large number of
>> "mistake||correction" pairs. These entries are currently maintained in
>> order, but the results are not strict. In addition, when someone wants to
>> add some new pairs, he either sort them manually or write a script, which
>> is clearly a waste of labor.
>
> Try using lintian's make sort
>
> https://salsa.debian.org/lintian/lintian
>
>

Okay, I'll try it


2021-06-16 13:21:29

by Zhen Lei

Subject: Re: [PATCH 1/3] scripts: add spelling_sanitizer.sh script

On 2021/6/15 15:01, Leizhen (ThunderTown) wrote:
>
>
> On 2021/6/11 23:36, Joe Perches wrote:
>> On Fri, 2021-06-11 at 15:12 +0800, Zhen Lei wrote:
>>> The file scripts/spelling.txt recorded a large number of
>>> "mistake||correction" pairs. These entries are currently maintained in
>>> order, but the results are not strict. In addition, when someone wants to
>>> add some new pairs, he either sort them manually or write a script, which
>>> is clearly a waste of labor.
>>
>> Try using lintian's make sort
>>
>> https://salsa.debian.org/lintian/lintian

I installed lintian and found no option that supports sorting. Can anyone give me
more specific instructions on how to use it?

Although I don't know Perl, after reading commit 66b47b4a9dad
("checkpatch: look for common misspellings"), it seems to match from top to bottom.
So, as Andy Shevchenko says, the entries should be sorted by word-usage frequency.

I really don't know the implementation details of
scripts/checkpatch.pl --types=typo_spelling. Are only the misspelled words in
spelling.txt involved in matching? If correctly spelled words are also traversed,
sorting by frequency makes no sense, because correctly spelled words far outnumber
misspelled ones. If that's the case, then my modified script could come in handy.

And if only the misspelled words in spelling.txt are involved in matching, do we really
need spelling.txt? Just outputting the misspelled words would be enough. I don't think
anyone needs to follow the suggestions to complete the fix.


2021-06-22 08:48:49

by Zhen Lei

Subject: Re: [PATCH 1/3] scripts: add spelling_sanitizer.sh script

On 2021/6/16 19:58, Leizhen (ThunderTown) wrote:
>
>
> On 2021/6/15 15:01, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2021/6/11 23:36, Joe Perches wrote:
>>> On Fri, 2021-06-11 at 15:12 +0800, Zhen Lei wrote:
>>>> The file scripts/spelling.txt recorded a large number of
>>>> "mistake||correction" pairs. These entries are currently maintained in
>>>> order, but the results are not strict. In addition, when someone wants to
>>>> add some new pairs, he either sort them manually or write a script, which
>>>> is clearly a waste of labor.
>>>
>>> Try using lintian's make sort
>>>
>>> https://salsa.debian.org/lintian/lintian
>
> I installed lintian and found no option to support sort. Can anyone give me more
> specific instructions on how to use it?
>
> Although I don't understand the perl language, after reading commit 66b47b4a9dad
> ("checkpatch: look for common misspellings"), it seems to match from top to bottom.
> So, as Andy Shevchenko says, they should be sorted by frequency of the word usage.
>
> I really don't know the details of the implementation of
> scripts/checkpatch.pl --types=typo_spelling. Are only misspelled words involved in
> spelling.txt matching? Otherwise, if correctly spelled words are also traversed,
> sorting by frequency makes no sense. Because the correct number of words is far more
> than the wrong number of words. If that's the case, then my modified script could
> come in handy.
>
> And if only misspelled words involved in spelling.txt matching, do we really need
> spelling.txt? Just output the misspelled words is enough. I don't think anyone needs
> to follow the tips to complete the fix.

Hi all,

I did a little test:
git rm -r drivers/usb --> then revert to generate patch 'usb', 553988 insertions(+)
git rm -r mm/ --> then revert to generate patch 'mm', 157606 insertions(+)

Two stages (each tested twice; unit: seconds):

Before sorting with this patch:
mm     264   264
usb   1049  1047

After sorting with this patch:
mm     264   265
usb   1047  1045

According to these results, performance before and after sorting is basically the same.

The test method is as follows:
start=$(date +%s)
scripts/checkpatch.pl --types=TYPO_SPELLING 0001-Revert-usb-remove.patch > /dev/null
end=$(date +%s)
seconds=$((end - start))
echo $seconds
