LinuxLists.cc - [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

2021-05-10 10:36:25

Subject: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

There are several UTF-8 characters in the kernel's documentation.

Several of them came from the process of converting files from
DocBook, LaTeX, HTML and Markdown. They were probably introduced
by the conversion tools used at that time.

Other UTF-8 characters were added over time, but they're easily
replaceable by ASCII chars.

As Linux developers are all around the globe, and not everybody has UTF-8
as their default charset, it is better to use UTF-8 only in cases where it
is really needed.

The first 3 patches in this series were manually written, in order to solve
a few special cases.

The remaining patches in the series address such cases in *.rst files and
inside Documentation/ABI, using this Perl map table to do the
charset conversion:

my %char_map = (
    0x2010 => '-', # HYPHEN
    0xad => '-', # SOFT HYPHEN
    0x2013 => '-', # EN DASH
    0x2014 => '-', # EM DASH

    0x2018 => "'", # LEFT SINGLE QUOTATION MARK
    0x2019 => "'", # RIGHT SINGLE QUOTATION MARK
    0xb4 => "'", # ACUTE ACCENT

    0x201c => '"', # LEFT DOUBLE QUOTATION MARK
    0x201d => '"', # RIGHT DOUBLE QUOTATION MARK

    0x2212 => '-', # MINUS SIGN
    0x2217 => '*', # ASTERISK OPERATOR
    0xd7 => 'x', # MULTIPLICATION SIGN

    0xbb => '>', # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

    0xa0 => ' ', # NO-BREAK SPACE
    0xfeff => '', # ZERO WIDTH NO-BREAK SPACE
);
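
For illustration only, here is a minimal sketch of how a map like the one
above could be applied to text. It is written in Python rather than Perl,
and the names CHAR_MAP and asciify are hypothetical; the actual conversion
script is not shown in this cover letter.

```python
# Hypothetical re-creation of the map above; keys are Unicode code points.
CHAR_MAP = {
    0x2010: "-",   # HYPHEN
    0x00AD: "-",   # SOFT HYPHEN
    0x2013: "-",   # EN DASH
    0x2014: "-",   # EM DASH
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x00B4: "'",   # ACUTE ACCENT
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
    0x2212: "-",   # MINUS SIGN
    0x2217: "*",   # ASTERISK OPERATOR
    0x00D7: "x",   # MULTIPLICATION SIGN
    0x00BB: ">",   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    0x00A0: " ",   # NO-BREAK SPACE
    0xFEFF: "",    # ZERO WIDTH NO-BREAK SPACE (mapped to nothing)
}


def asciify(text: str) -> str:
    """Replace every mapped character; leave all other characters untouched."""
    return text.translate(CHAR_MAP)


# Made-up sample line: BOM, no-break spaces, multiplication sign, en dash,
# curly quotes, and a MICRO SIGN that is not in the map and must survive.
sample = "\ufeff640\u00a0\u00d7\u00a0480 \u2013 \u201cdefault\u201d \u00b5s"
print(asciify(sample))
```

Characters that are absent from the map, such as the MICRO SIGN in the
sample, pass through unchanged, which matches the list of retained
characters below.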

After the conversion, these UTF-8 chars will be kept:

- U+00a9 ('©'): COPYRIGHT SIGN
- U+00ac ('¬'): NOT SIGN # only at Documentation/powerpc/transactional_memory.rst
- U+00ae ('®'): REGISTERED SIGN
- U+00b0 ('°'): DEGREE SIGN
- U+00b1 ('±'): PLUS-MINUS SIGN
- U+00b2 ('²'): SUPERSCRIPT TWO
- U+00b5 ('µ'): MICRO SIGN
- U+00b7 ('·'): MIDDLE DOT # See below
- U+00bd ('½'): VULGAR FRACTION ONE HALF
- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
- U+00df ('ß'): LATIN SMALL LETTER SHARP S
- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
- U+00e6 ('æ'): LATIN SMALL LETTER AE
- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE
- U+03bc ('μ'): GREEK SMALL LETTER MU
- U+2026 ('…'): HORIZONTAL ELLIPSIS
- U+2122 ('™'): TRADE MARK SIGN
- U+2191 ('↑'): UPWARDS ARROW
- U+2192 ('→'): RIGHTWARDS ARROW
- U+2193 ('↓'): DOWNWARDS ARROW
- U+2264 ('≤'): LESS-THAN OR EQUAL TO
- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT
- U+2b0d ('⬍'): UP DOWN BLACK ARROW

PS.: maintainers were bcc'ed on patch 00/53, in order to reduce the
risk of patch 00 being rejected by list servers.

-

For U+00b7 ('·'): MIDDLE DOT, I opted to keep it in a few places:

- Documentation/devicetree/bindings/clock/qcom,rpmcc.txt

As this file will some day be converted to YAML, where the
MIDDLE DOT will be removed, I guess it is not worth touching it.

- Documentation/scheduler/sched-deadline.rst

There, it is used in math expressions, so it is better to keep it.

- Documentation/devicetree/bindings/media/video-interface-devices.yaml

There, it is part of an ASCII artwork.

- translations/zh_CN

I prefer not to touch it, as it might have some special meaning in Simplified Chinese.

Mauro Carvalho Chehab (53):
docs: cdrom-standard.rst: get rid of uneeded UTF-8 chars
docs: ABI: remove a meaningless UTF-8 character
docs: ABI: remove some spurious characters
docs: index.rst: avoid using UTF-8 chars
docs: hwmon: avoid using UTF-8 chars
docs: admin-guide: avoid using UTF-8 chars
docs: admin-guide: media: ipu3.rst: avoid using UTF-8 chars
docs: admin-guide: sysctl: kernel.rst: avoid using UTF-8 chars
docs: admin-guide: perf: imx-ddr.rst: avoid using UTF-8 chars
docs: admin-guide: pm: avoid using UTF-8 chars
docs: trace: coresight: coresight-etm4x-reference.rst: avoid using
UTF-8 chars
docs: driver-api: avoid using UTF-8 chars
docs: driver-api: fpga: avoid using UTF-8 chars
docs: driver-api: iio: avoid using UTF-8 chars
docs: driver-api: thermal: avoid using UTF-8 chars
docs: driver-api: media: drivers: avoid using UTF-8 chars
docs: driver-api: firmware: other_interfaces.rst: avoid using UTF-8
chars
docs: driver-api: nvdimm: btt.rst: avoid using UTF-8 chars
docs: fault-injection: nvme-fault-injection.rst: avoid using UTF-8
chars
docs: usb: avoid using UTF-8 chars
docs: process: avoid using UTF-8 chars
docs: block: data-integrity.rst: avoid using UTF-8 chars
docs: userspace-api: media: fdl-appendix.rst: avoid using UTF-8 chars
docs: userspace-api: media: v4l: avoid using UTF-8 chars
docs: userspace-api: media: dvb: avoid using UTF-8 chars
docs: vm: zswap.rst: avoid using UTF-8 chars
docs: filesystems: f2fs.rst: avoid using UTF-8 chars
docs: filesystems: ext4: avoid using UTF-8 chars
docs: kernel-hacking: avoid using UTF-8 chars
docs: hid: avoid using UTF-8 chars
docs: security: tpm: avoid using UTF-8 chars
docs: security: keys: trusted-encrypted.rst: avoid using UTF-8 chars
docs: riscv: vm-layout.rst: avoid using UTF-8 chars
docs: networking: scaling.rst: avoid using UTF-8 chars
docs: networking: devlink: devlink-dpipe.rst: avoid using UTF-8 chars
docs: networking: device_drivers: avoid using UTF-8 chars
docs: x86: avoid using UTF-8 chars
docs: scheduler: sched-deadline.rst: avoid using UTF-8 chars
docs: dev-tools: testing-overview.rst: avoid using UTF-8 chars
docs: power: powercap: powercap.rst: avoid using UTF-8 chars
docs: ABI: avoid using UTF-8 chars
docs: doc-guide: contributing.rst: avoid using UTF-8 chars
docs: PCI: acpi-info.rst: avoid using UTF-8 chars
docs: gpu: avoid using UTF-8 chars
docs: sound: kernel-api: writing-an-alsa-driver.rst: avoid using UTF-8
chars
docs: arm64: arm-acpi.rst: avoid using UTF-8 chars
docs: infiniband: tag_matching.rst: avoid using UTF-8 chars
docs: timers: no_hz.rst: avoid using UTF-8 chars
docs: misc-devices: ibmvmc.rst: avoid using UTF-8 chars
docs: firmware-guide: acpi: lpit.rst: avoid using UTF-8 chars
docs: firmware-guide: acpi: dsd: graph.rst: avoid using UTF-8 chars
docs: virt: kvm: avoid using UTF-8 chars
docs: RCU: avoid using UTF-8 chars

.../obsolete/sysfs-kernel-fadump_registered | 2 +-
.../obsolete/sysfs-kernel-fadump_release_mem | 2 +-
...sfs-class-chromeos-driver-cros-ec-lightbar | 2 +-
.../ABI/testing/sysfs-class-net-cdc_ncm | 2 +-
.../ABI/testing/sysfs-devices-platform-ipmi | 2 +-
.../testing/sysfs-devices-platform-trackpoint | 2 +-
Documentation/ABI/testing/sysfs-devices-soc | 4 +-
Documentation/ABI/testing/sysfs-module | 4 +-
Documentation/PCI/acpi-info.rst | 26 +-
.../Data-Structures/Data-Structures.rst | 52 ++--
.../Expedited-Grace-Periods.rst | 40 +--
.../Tree-RCU-Memory-Ordering.rst | 10 +-
.../RCU/Design/Requirements/Requirements.rst | 126 ++++-----
Documentation/admin-guide/index.rst | 2 +-
Documentation/admin-guide/media/ipu3.rst | 2 +-
Documentation/admin-guide/module-signing.rst | 4 +-
Documentation/admin-guide/perf/imx-ddr.rst | 2 +-
Documentation/admin-guide/pm/intel_idle.rst | 4 +-
Documentation/admin-guide/pm/intel_pstate.rst | 4 +-
Documentation/admin-guide/ras.rst | 94 +++----
.../admin-guide/reporting-issues.rst | 12 +-
Documentation/admin-guide/sysctl/kernel.rst | 2 +-
Documentation/arm64/arm-acpi.rst | 8 +-
Documentation/block/data-integrity.rst | 2 +-
Documentation/cdrom/cdrom-standard.rst | 30 +--
Documentation/dev-tools/testing-overview.rst | 4 +-
Documentation/doc-guide/contributing.rst | 2 +-
.../driver-api/firmware/other_interfaces.rst | 2 +-
Documentation/driver-api/fpga/fpga-bridge.rst | 10 +-
Documentation/driver-api/fpga/fpga-mgr.rst | 12 +-
.../driver-api/fpga/fpga-programming.rst | 8 +-
Documentation/driver-api/fpga/fpga-region.rst | 20 +-
Documentation/driver-api/iio/buffers.rst | 8 +-
Documentation/driver-api/iio/hw-consumer.rst | 10 +-
.../driver-api/iio/triggered-buffers.rst | 6 +-
Documentation/driver-api/iio/triggers.rst | 10 +-
Documentation/driver-api/index.rst | 2 +-
Documentation/driver-api/ioctl.rst | 8 +-
.../media/drivers/sh_mobile_ceu_camera.rst | 8 +-
.../driver-api/media/drivers/vidtv.rst | 4 +-
.../driver-api/media/drivers/zoran.rst | 2 +-
Documentation/driver-api/nvdimm/btt.rst | 2 +-
.../driver-api/thermal/cpu-idle-cooling.rst | 14 +-
.../driver-api/thermal/intel_powerclamp.rst | 6 +-
.../thermal/x86_pkg_temperature_thermal.rst | 2 +-
.../fault-injection/nvme-fault-injection.rst | 2 +-
Documentation/filesystems/ext4/attributes.rst | 20 +-
Documentation/filesystems/ext4/bigalloc.rst | 6 +-
Documentation/filesystems/ext4/blockgroup.rst | 8 +-
Documentation/filesystems/ext4/blocks.rst | 2 +-
Documentation/filesystems/ext4/directory.rst | 16 +-
Documentation/filesystems/ext4/eainode.rst | 2 +-
Documentation/filesystems/ext4/inlinedata.rst | 6 +-
Documentation/filesystems/ext4/inodes.rst | 6 +-
Documentation/filesystems/ext4/journal.rst | 8 +-
Documentation/filesystems/ext4/mmp.rst | 2 +-
.../filesystems/ext4/special_inodes.rst | 4 +-
Documentation/filesystems/ext4/super.rst | 10 +-
Documentation/filesystems/f2fs.rst | 6 +-
.../firmware-guide/acpi/dsd/graph.rst | 2 +-
Documentation/firmware-guide/acpi/lpit.rst | 2 +-
Documentation/gpu/i915.rst | 2 +-
Documentation/gpu/komeda-kms.rst | 2 +-
Documentation/hid/hid-sensor.rst | 70 ++---
Documentation/hid/intel-ish-hid.rst | 246 +++++++++---------
Documentation/hwmon/ir36021.rst | 2 +-
Documentation/hwmon/ltc2992.rst | 2 +-
Documentation/hwmon/pm6764tr.rst | 2 +-
Documentation/hwmon/tmp103.rst | 4 +-
Documentation/index.rst | 4 +-
Documentation/infiniband/tag_matching.rst | 8 +-
Documentation/kernel-hacking/hacking.rst | 2 +-
Documentation/kernel-hacking/locking.rst | 2 +-
Documentation/misc-devices/ibmvmc.rst | 8 +-
.../device_drivers/ethernet/intel/i40e.rst | 12 +-
.../device_drivers/ethernet/intel/iavf.rst | 6 +-
.../device_drivers/ethernet/netronome/nfp.rst | 12 +-
.../networking/devlink/devlink-dpipe.rst | 2 +-
Documentation/networking/scaling.rst | 18 +-
Documentation/power/powercap/powercap.rst | 210 +++++++--------
Documentation/process/code-of-conduct.rst | 2 +-
.../process/kernel-enforcement-statement.rst | 2 +-
Documentation/riscv/vm-layout.rst | 2 +-
Documentation/scheduler/sched-deadline.rst | 4 +-
.../security/keys/trusted-encrypted.rst | 4 +-
Documentation/security/tpm/tpm_event_log.rst | 2 +-
Documentation/security/tpm/xen-tpmfront.rst | 2 +-
.../kernel-api/writing-an-alsa-driver.rst | 68 ++---
Documentation/timers/no_hz.rst | 2 +-
.../coresight/coresight-etm4x-reference.rst | 16 +-
Documentation/usb/ehci.rst | 2 +-
Documentation/usb/gadget_printer.rst | 2 +-
Documentation/usb/mass-storage.rst | 36 +--
Documentation/usb/mtouchusb.rst | 2 +-
Documentation/usb/usb-serial.rst | 2 +-
.../media/dvb/audio-set-bypass-mode.rst | 2 +-
.../userspace-api/media/dvb/audio.rst | 2 +-
.../userspace-api/media/dvb/dmx-fopen.rst | 2 +-
.../userspace-api/media/dvb/dmx-fread.rst | 2 +-
.../media/dvb/dmx-set-filter.rst | 2 +-
.../userspace-api/media/dvb/intro.rst | 6 +-
.../userspace-api/media/dvb/video.rst | 2 +-
.../userspace-api/media/fdl-appendix.rst | 64 ++---
.../userspace-api/media/v4l/biblio.rst | 8 +-
.../userspace-api/media/v4l/crop.rst | 16 +-
.../userspace-api/media/v4l/dev-decoder.rst | 6 +-
.../userspace-api/media/v4l/diff-v4l.rst | 2 +-
.../userspace-api/media/v4l/open.rst | 2 +-
.../media/v4l/vidioc-cropcap.rst | 4 +-
Documentation/virt/kvm/api.rst | 28 +-
.../virt/kvm/running-nested-guests.rst | 12 +-
Documentation/vm/zswap.rst | 4 +-
Documentation/x86/resctrl.rst | 2 +-
Documentation/x86/sgx.rst | 4 +-
114 files changed, 807 insertions(+), 807 deletions(-)

--
2.30.2

2021-05-10 11:33:09

by David Woodhouse

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Mon, 2021-05-10 at 12:26 +0200, Mauro Carvalho Chehab wrote:
> There are several UTF-8 characters in the kernel's documentation.
>
> Several of them came from the process of converting files from
> DocBook, LaTeX, HTML and Markdown. They were probably introduced
> by the conversion tools used at that time.
>
> Other UTF-8 characters were added over time, but they're easily
> replaceable by ASCII chars.
>
> As Linux developers are all around the globe, and not everybody has UTF-8
> as their default charset, it is better to use UTF-8 only in cases where it
> is really needed.

No, that is absolutely the wrong approach.

If someone has a local setup which makes bogus assumptions about text
encodings, that is their own mistake.

We don't do them any favours by trying to *hide* it in the common case
so that they don't notice it for longer.

There really isn't much excuse for such brokenness, this far into the
21st century.

Even *before* UTF-8 came along in the final decade of the last
millennium, it was important to know which character set a given piece
of text was encoded in.

In fact it was even *more* important back then, we couldn't just assume
UTF-8 everywhere like we can in modern times.

Git can already do things like CRLF conversion on checking files out to
match local conventions; if you want to teach it to do character set
conversions too then I suppose that might be useful to a few developers
who've fallen through a time warp and still need it. But nobody's ever
bothered before because it just isn't necessary these days.

Please *don't* attempt to address this anachronistic and esoteric
"requirement" by dragging the kernel source back in time by three
decades.

2021-05-10 11:33:20

by Mauro Carvalho Chehab

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Mon, 10 May 2021 12:52:44 +0200
Thorsten Leemhuis <[email protected]> wrote:

> On 10.05.21 12:26, Mauro Carvalho Chehab wrote:
> >
> > As Linux developers are all around the globe, and not everybody has UTF-8
> > as their default charset, it is better to use UTF-8 only in cases where it
> > is really needed.
> > […]
> > The remaining patches in the series address such cases in *.rst files and
> > inside Documentation/ABI, using this Perl map table to do the
> > charset conversion:
> >
> > my %char_map = (
> > […]
> > 0x2013 => '-', # EN DASH
> > 0x2014 => '-', # EM DASH

> I might be performing bike shedding here, but wouldn't it be better to
> replace those two with "--", as explained in
> https://en.wikipedia.org/wiki/Dash#Approximating_the_em_dash_with_two_or_three_hyphens
>
> For EM DASH there seems to be even "---", but I'd say that is a bit too
> much.

Yeah, we can do, instead:

0x2013 => '--', # EN DASH
0x2014 => '---', # EM DASH

I was actually in doubt about those ;-)

Btw, when producing HTML documentation, Sphinx should convert:
-- into EN DASH
and:
--- into EM DASH

So, the resulting html will be identical.
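
As a rough illustration of why the rendered output could end up identical,
here is a small Python sketch of the round trip. DASH_MAP mirrors the two
revised entries above; naive_smart_dashes is a hypothetical, simplified
stand-in for a smart-punctuation pass and is NOT Sphinx's actual code.

```python
# Revised dash entries, as proposed in this reply.
DASH_MAP = {0x2013: "--", 0x2014: "---"}


def to_ascii_dashes(text: str) -> str:
    """Replace EN DASH / EM DASH with two / three ASCII hyphens."""
    return text.translate(DASH_MAP)


def naive_smart_dashes(text: str) -> str:
    """Simplified stand-in for a smart-punctuation pass.

    The longer run is handled first; otherwise "---" would become an
    en dash followed by a stray hyphen.
    """
    return text.replace("---", "\u2014").replace("--", "\u2013")


original = "pages 10\u201320 \u2014 see below"
ascii_form = to_ascii_dashes(original)
print(ascii_form)
print(naive_smart_dashes(ascii_form) == original)
```

The sketch ignores literal blocks and table borders, where runs of hyphens
have a different meaning; those are the cases that need extra care.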

> Or do you fear the extra work as some lines then might break the
> 80-character limit then?

No, I suspect that the line size won't be an issue. Some care should
be taken when EN DASH and EM DASH are used inside tables.

Thanks,
Mauro

2021-05-10 11:53:16

by Thorsten Leemhuis

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On 10.05.21 12:26, Mauro Carvalho Chehab wrote:
>
> As Linux developers are all around the globe, and not everybody has UTF-8
> as their default charset, it is better to use UTF-8 only in cases where it
> is really needed.
> […]
> The remaining patches in the series address such cases in *.rst files and
> inside Documentation/ABI, using this Perl map table to do the
> charset conversion:
>
> my %char_map = (
> […]
> 0x2013 => '-', # EN DASH
> 0x2014 => '-', # EM DASH

I might be performing bike shedding here, but wouldn't it be better to
replace those two with "--", as explained in
https://en.wikipedia.org/wiki/Dash#Approximating_the_em_dash_with_two_or_three_hyphens

For EM DASH there seems to be even "---", but I'd say that is a bit too
much.

Or do you fear the extra work as some lines then might break the
80-character limit then?

Ciao, Thorsten

2021-05-10 13:42:49

by Mauro Carvalho Chehab

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Mon, 10 May 2021 14:16:16 +0100
Edward Cree <[email protected]> wrote:

> On 10/05/2021 12:55, Mauro Carvalho Chehab wrote:
> > The main point on this series is to replace just the occurrences
> > where ASCII represents the symbol equally well
>
> > - U+2014 ('—'): EM DASH
> Em dash is not the same thing as hyphen-minus, and the latter does not
> serve 'equally well'. People use em dashes because — even in
> monospace fonts — they make text easier to read and comprehend, when
> used correctly.

True, but if you look at the diff, in several places, IMHO a single
hyphen would make more sense. Maybe those places came from a converted
doc.

> I accept that some of the other distinctions — like en dashes — are
> needlessly pedantic (though I don't doubt there is someone out there
> who will gladly defend them with the same fervour with which I argue
> for the em dash) and I wouldn't take the trouble to use them myself;
> but I think there is a reasonable assumption that when someone goes
> to the effort of using a Unicode punctuation mark that is semantic
> (rather than merely typographical), they probably had a reason for
> doing so.
>
> > - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
> > - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
> > - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
> > - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
> (These are purely typographic, I have no problem with dumping them.)
>
> > - U+00d7 ('×'): MULTIPLICATION SIGN
> Presumably this is appearing in mathematical formulae, in which case
> changing it to 'x' loses semantic information.
>
> > Using the above symbols will just trick tools like grep for no good
> > reason.
> NBSP, sure. That one's probably an artefact of some document format
> conversion somewhere along the line, anyway.
> But what kinds of things with × or — in are going to be grept for?

Actually, in almost all places, those aren't used inside math formulae;
instead, they describe some video resolutions:

$ git grep × Documentation/
Documentation/devicetree/bindings/display/panel/asus,z00t-tm5p5-nt35596.yaml:title: ASUS Z00T TM5P5 NT35596 5.5" 1080×1920 LCD Panel
Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml: # LG ACX467AKM-7 4.95" 1080×1920 LCD Panel
Documentation/devicetree/bindings/sound/tlv320adcx140.yaml: 1 - Mic bias is set to VREF × 1.096
Documentation/userspace-api/media/v4l/crop.rst:of 16 × 16 pixels. The source cropping rectangle is set to defaults,
Documentation/userspace-api/media/v4l/crop.rst:which are also the upper limit in this example, of 640 × 400 pixels at
Documentation/userspace-api/media/v4l/crop.rst:offset 0, 0. An application requests an image size of 300 × 225 pixels,
Documentation/userspace-api/media/v4l/crop.rst:The driver sets the image size to the closest possible values 304 × 224,
Documentation/userspace-api/media/v4l/crop.rst:is 608 × 224 (224 × 2:1 would exceed the limit 400). The offset 0, 0 is
Documentation/userspace-api/media/v4l/crop.rst:rectangle of 608 × 456 pixels. The present scaling factors limit
Documentation/userspace-api/media/v4l/crop.rst:cropping to 640 × 384, so the driver returns the cropping size 608 × 384
Documentation/userspace-api/media/v4l/crop.rst:and adjusts the image size to closest possible 304 × 192.
Documentation/userspace-api/media/v4l/diff-v4l.rst:size bitmap of 1024 × 625 bits. Struct :c:type:`v4l2_window`
Documentation/userspace-api/media/v4l/vidioc-cropcap.rst: Assuming pixel aspect 1/1 this could be for example a 640 × 480
Documentation/userspace-api/media/v4l/vidioc-cropcap.rst: rectangle for NTSC, a 768 × 576 rectangle for PAL and SECAM

It is way more likely that, if someone wants to grep, they would be
doing something like this, in order to get video resolutions:

$ git grep -E "\b[1-9][0-9]+\s*x\s*[0-9]+\b" Documentation/
Documentation/ABI/obsolete/sysfs-driver-hid-roccat-koneplus:Description: When read the mouse returns a 30x30 pixel image of the
Documentation/ABI/obsolete/sysfs-driver-hid-roccat-konepure:Description: When read the mouse returns a 30x30 pixel image of the
Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7: Provides access to the binary "24x7 catalog" provided by the
Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7: https://raw.githubusercontent.com/jmesmon/catalog-24x7/master/hv-24x7-catalog.h
Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7: Exposes the "version" field of the 24x7 catalog. This is also
Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7: HCALLs to retrieve hv-24x7 pmu event counter data.
Documentation/ABI/testing/sysfs-bus-vfio-mdev: "2 heads, 512M FB, 2560x1600 maximum resolution"
Documentation/ABI/testing/sysfs-driver-wacom: of the device. The image is a 64x32 pixel 4-bit gray image. The
Documentation/ABI/testing/sysfs-driver-wacom: 1024 byte binary is split up into 16x 64 byte chunks. Each 64
Documentation/ABI/testing/sysfs-driver-wacom: image has to contain 256 bytes (64x32 px 1 bit colour).
Documentation/admin-guide/edid.rst:commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200,
Documentation/admin-guide/edid.rst:1680x1050, 1920x1080) as binary blobs, but the kernel source tree does
Documentation/admin-guide/edid.rst:If you want to create your own EDID file, copy the file 1024x768.S,
Documentation/admin-guide/kernel-parameters.txt: edid/1024x768.bin, edid/1280x1024.bin,
Documentation/admin-guide/kernel-parameters.txt: edid/1680x1050.bin, or edid/1920x1080.bin is given
Documentation/admin-guide/kernel-parameters.txt: 2 - The VGA Shield is attached (1024x768)
Documentation/admin-guide/media/dvb_intro.rst:signal encoded at a resolution of 768x576 24-bit color pixels over 25
Documentation/admin-guide/media/imx.rst:1280x960 input frame to 640x480, and then /2 downscale in both
Documentation/admin-guide/media/imx.rst:dimensions to 320x240 (assumes ipu1_csi0 is linked to ipu1_csi0_mux):
Documentation/admin-guide/media/imx.rst: media-ctl -V "'ipu1_csi0_mux':2[fmt:UYVY2X8/1280x960]"

which won't get the above, due to the usage of the UTF-8 alternative.
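
The effect can be reproduced outside grep with a short Python sketch.
ASCII_ONLY mirrors the expression used above; EITHER_SIGN is a hypothetical
variant that also accepts U+00D7, shown only for comparison. The two sample
lines are taken from the grep outputs above.

```python
import re

# ASCII-only pattern, equivalent in spirit to the git grep -E expression.
ASCII_ONLY = re.compile(r"\b[1-9][0-9]+\s*x\s*[0-9]+\b")
# Hypothetical variant that also accepts U+00D7 MULTIPLICATION SIGN.
EITHER_SIGN = re.compile(r"\b[1-9][0-9]+\s*[x\u00d7]\s*[0-9]+\b")

lines = [
    "commonly used screen resolutions (800x600, 1024x768)",
    "which are also the upper limit in this example, of 640 \u00d7 400 pixels at",
]

for line in lines:
    # First list: matches of the ASCII-only pattern; second: either sign.
    print(ASCII_ONLY.findall(line), EITHER_SIGN.findall(line))
```

The ASCII-only pattern finds nothing in the second line, while the variant
that accepts both signs matches in both lines.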

In any case, replacing all the above by 'x' seems to be the right thing,
at least to my eyes.

> If there are em dashes lying around that semantically _should_ be
> hyphen-minus (one of your patches I've seen, for instance, fixes an
> *en* dash moonlighting as the option character in an `ethtool`
> command line), then sure, convert them.
> But any time someone is using a Unicode character to *express
> semantics*, even if you happen to think the semantic distinction
> involved is a pedantic or unimportant one, I think you need an
> explicit grep case to justify ASCIIfying it.

Yeah, in the case of hyphen/dash it seems to make sense to double check
it.

Thanks,
Mauro

2021-05-10 14:07:46

by Ben Boeckel

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Mon, May 10, 2021 at 13:55:18 +0200, Mauro Carvalho Chehab wrote:
> $ git grep "CPU 0 has been" Documentation/RCU/
> Documentation/RCU/Design/Data-Structures/Data-Structures.rst:| #. CPU 0 has been in dyntick-idle mode for quite some time. When it |
> Documentation/RCU/Design/Data-Structures/Data-Structures.rst:| notices that CPU 0 has been in dyntick idle mode, which qualifies |

The kernel documentation uses hard line wraps, so such a naive grep is
going to always fail unless such line wraps are taken into account. Not
saying this isn't an improvement in and of itself, but smarter searching
strategies are likely needed anyways.

--Ben

2021-05-10 14:10:03

by David Woodhouse

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> This patch series is doing conversion only when using ASCII makes
> more sense than using UTF-8.
>
> See, a number of converted documents ended with weird characters
> like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> character doesn't do any good.
>
> Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> someone tries to use grep[1].

Replacing those makes sense. But replacing emdashes — which are a
distinct character that has no direct replacement in ASCII and which
people do *deliberately* use instead of hyphen-minus — does not.

Perhaps stick to those two, and any cases where an emdash or endash has
been used where U+002D HYPHEN-MINUS *should* have been used.

And please fix your cover letter which made no reference to 'grep', and
only presented a completely bogus argument for the change instead.
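
To make the narrower scope concrete, here is a Python sketch that only
reports the two characters agreed on above, without rewriting anything.
The names SUSPECT and find_suspects and the sample text are hypothetical.

```python
# The two characters whose replacement is uncontroversial in this thread.
SUSPECT = {0x00A0: "NO-BREAK SPACE", 0xFEFF: "ZERO WIDTH NO-BREAK SPACE"}


def find_suspects(text: str):
    """Yield (line_number, column, name) for each suspect character."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            name = SUSPECT.get(ord(ch))
            if name:
                yield lineno, col, name


sample = "\ufeffCPU\u00a00 has been idle\nplain ASCII line\n"
print("CPU 0 has been" in sample)   # the invisible NBSP defeats the search
print(list(find_suspects(sample)))
```

Dashes are deliberately absent from SUSPECT: deciding whether a given dash
should have been a hyphen-minus needs a human reading the sentence.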

2021-05-10 14:37:46

by Edward Cree

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On 10/05/2021 14:38, Mauro Carvalho Chehab wrote:
> On Mon, 10 May 2021 14:16:16 +0100
> Edward Cree <[email protected]> wrote:
>> But what kinds of things with × or — in are going to be grept for?
>
> Actually, in almost all places, those aren't used inside math formulae;
> instead, they describe some video resolutions:
Ehh, those are also proper uses of ×. It's still a multiplication,
after all.

> It is way more likely that, if someone wants to grep, they would be
> doing something like this, in order to get video resolutions:
Why would someone be grepping for "all video resolutions mentioned in
the documentation"? That seems contrived to me.

-ed

2021-05-10 14:40:04

by Matthew Wilcox

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Mon, May 10, 2021 at 02:16:16PM +0100, Edward Cree wrote:
> On 10/05/2021 12:55, Mauro Carvalho Chehab wrote:
> > The main point on this series is to replace just the occurrences
> > where ASCII represents the symbol equally well
>
> > - U+2014 ('—'): EM DASH
> Em dash is not the same thing as hyphen-minus, and the latter does not
> serve 'equally well'. People use em dashes because — even in
> monospace fonts — they make text easier to read and comprehend, when
> used correctly.
> I accept that some of the other distinctions — like en dashes — are
> needlessly pedantic (though I don't doubt there is someone out there
> who will gladly defend them with the same fervour with which I argue
> for the em dash) and I wouldn't take the trouble to use them myself;
> but I think there is a reasonable assumption that when someone goes
> to the effort of using a Unicode punctuation mark that is semantic
> (rather than merely typographical), they probably had a reason for
> doing so.

I think you're overestimating the amount of care and typographical
knowledge that your average kernel developer has. Most of these
UTF-8 characters come from latex conversions and really aren't
necessary (and are being used incorrectly).

You seem quite knowledgeable about the various differences. Perhaps
you'd be willing to write a document for Documentation/doc-guide/
that provides guidance for when to use which kinds of horizontal
line? https://www.punctuationmatters.com/hyphen-dash-n-dash-and-m-dash/
talks about it in the context of publications, but I think we need
something more suited to our needs for kernel documentation.

2021-05-10 15:16:26

by Edward Cree

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On 10/05/2021 14:59, Matthew Wilcox wrote:
> Most of these
> UTF-8 characters come from latex conversions and really aren't
> necessary (and are being used incorrectly).
I fully agree with fixing those.
The cover-letter, however, gave the impression that that was not the
main purpose of this series; just, perhaps, a happy side-effect.

> You seem quite knowledgeable about the various differences. Perhaps
> you'd be willing to write a document for Documentation/doc-guide/
> that provides guidance for when to use which kinds of horizontal
> line?
I have Opinions about the proper usage of punctuation, but I also know
that other people have differing opinions. For instance, I place
spaces around an em dash, which is nonstandard according to most
style guides. Really this is an individual enough thing that I'm not
sure we could have a "kernel style guide" that would be more useful
than general-purpose guidance like the page you linked.
Moreover, such a guide could make non-native speakers needlessly self-
conscious about their writing and discourage them from contributing
documentation at all. I'm not advocating here for trying to push
kernel developers towards an eats-shoots-and-leaves level of
linguistic pedantry; rather, I merely think that existing correct
usages should be left intact (and therefore, excising incorrect usage
should only be attempted by someone with both the expertise and time
to check each case).

But if you really want such a doc I wouldn't mind contributing to it.

-ed

2021-05-10 19:23:13

by Theodore Ts'o

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Mon, May 10, 2021 at 02:49:44PM +0100, David Woodhouse wrote:
> On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> > This patch series is doing conversion only when using ASCII makes
> > more sense than using UTF-8.
> >
> > See, a number of converted documents ended with weird characters
> > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > character doesn't do any good.
> >
> > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > someone tries to use grep[1].
>
> Replacing those makes sense. But replacing emdashes — which are a
> distinct character that has no direct replacement in ASCII and which
> people do *deliberately* use instead of hyphen-minus — does not.

I regularly use --- for em-dashes and -- for en-dashes. Markdown will
automatically translate 3 ASCII hyphens to em-dashes, and 2 ASCII
hyphens to en-dashes. It's much, much easier for me to type 2 or 3
hyphens into my text editor of choice than trying to enter the UTF-8
characters. If we can make sphinx do this translation, maybe that's
the best way of dealing with these two characters?

Cheers,

- Ted

2021-05-10 22:37:05

by Adam Borowski

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Mon, May 10, 2021 at 12:26:12PM +0200, Mauro Carvalho Chehab wrote:
> There are several UTF-8 characters in the kernel's documentation.
[...]
> Other UTF-8 characters were added over time, but they're easily
> replaceable by ASCII chars.
>
> As Linux developers are all around the globe, and not everybody has UTF-8
> as their default charset

I'm not aware of a distribution that still allows selecting a non-UTF-8
charset in a normal flow in their installer. And if they haven't purged
support for ancient encodings, that support is thoroughly bitrotten.
Thus, I disagree that this is a legitimate concern.

What _could_ be a legitimate reason is that someone is on a _terminal_
that can't display a wide enough set of glyphs. Such terminals are:
• Linux console (because of vgacon limitations; patchsets to improve
other cons haven't been mainlined)
• some Windows terminals (putty, old Windows console) that can't borrow
glyphs from other fonts like fontconfig can

For the former, it's whatever your distribution ships in
/usr/share/consolefonts/ or an equivalent, which is based on historic
ISO-8859 and VT100 traditions.

For the latter, the near-guaranteed character set is WGL4.

Thus, at least two of your choices seem to disagree with the above:
[dropped]
> 0xd7 => 'x', # MULTIPLICATION SIGN
[retained]
> - U+2b0d ('⬍'): UP DOWN BLACK ARROW

× is present in ISO-8859, VT100, WGL4; I've found no font in
/usr/share/consolefonts/ on my Debian unstable box that lacks this
character.

⬍ is not found in any of the above. You might want to at least
convert it to ↕ which is at least present in WGL4, and thus likely
to be supported in fonts heeding Windows/Mac/OpenType recommendations.
That still won't make it work on VT.
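
A rough way to see the difference between these characters is to test
whether each one can be encoded in a legacy 8-bit charset. The sketch
below uses ISO-8859-1 as a stand-in for a limited console character
set; that is an assumption made for illustration only. It says nothing
about actual font coverage or WGL4, which would require inspecting the
fonts themselves. The helper name fits_latin1 is hypothetical.

```python
import unicodedata


def fits_latin1(ch: str) -> bool:
    """Return True if the character can be encoded in ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# MULTIPLICATION SIGN, UP DOWN BLACK ARROW, UP DOWN ARROW
for ch in ("\u00d7", "\u2b0d", "\u2195"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {fits_latin1(ch)}")
```

Only the multiplication sign passes this check; both arrows fall
outside ISO-8859-1.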

Meow!
--
⢀⣴⠾⠻⢶⣦⠀ .--[ Makefile ]
⣾⠁⢠⠒⠀⣿⡁ # beware of races
⢿⡄⠘⠷⠚⠋⠀ all: pillage burn
⠈⠳⣄⠀⠀⠀⠀ `----

2021-05-11 09:20:33

by David Woodhouse

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Tue, 2021-05-11 at 11:00 +0200, Mauro Carvalho Chehab wrote:
> Yet, this series has two positive side effects:
>
> - it helps people needing to touch the documents using non-utf8 locales[1];
> - it makes it easier to grep for text;
>
> [1] There are still some widely used distros nowadays (LTS ones?) that
> don't set UTF-8 as default. Last time I installed a Debian machine
> I had to explicitly set the UTF-8 charset after install, as the default
> was ASCII encoding (can't remember if it was Debian 10 or an
> older version).

This whole line of thinking is fundamentally wrong.

A given set of characters in a "text file" are encoded with a specific
character set / encoding. To interpret that file and convert the bytes
back to characters, we need to use the *same* charset.

That charset is a property of the text file, and each text file or
piece of text in a system (like this email, which will contain a
Content-Type: header indicating the charset) might be encoded with a
*different* character set.
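
As a small illustration of why the charset has to travel with the
text, the following sketch encodes a string as UTF-8 and then decodes
the same bytes twice: once with the matching charset and once with
ISO-8859-1. The sample string and variable names are made up for the
example.

```python
# The same bytes yield different text depending on the decoding charset.
text = "em dash \u2014 and caf\u00e9"
raw = text.encode("utf-8")

right = raw.decode("utf-8")       # matching charset: round-trips cleanly
wrong = raw.decode("iso-8859-1")  # wrong charset: garbage, but no error

print(right == text)
print(repr(wrong))

# Undoing the wrong decode recovers the text only because ISO-8859-1
# maps every byte value to a character; other charsets may lose data.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == text)
```

Note that the wrong decode does not fail; it silently produces
different characters, which is what makes this class of bug easy to
miss.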

In the days before you could connect computers together — or before you
could exchange data between computers in different countries, at least
— perhaps it made sense to store 'text' files without explicitly noting
their encoding. And to interpret them using some kind of "default"
character set.

Those days are long gone. You're trying to work around an egregiously
stupid bug, if you're trying to pander to "default" encodings. There
*is* no default encoding that even makes sense, except perhaps UTF-8.
To *speak* of them as you did shows a misunderstanding of how broken
they are. It's *precisely* that kind of half-baked thinking which
always used to lead to stupid assumptions and double conversions and
Mojibake. Before we just standardised on UTF-8 everywhere and it
stopped mattering so much.

Just don't.

Now, you *can* make this work if you really insist on it, even for
systems with EBCDIC as their default encoding. Just make git do the
"convert to local charset" on checkout, precisely the same way as it
does CRLF for Windows systems. But it's stupid and anachronistic, so I
don't really see the point.

2021-05-11 09:25:49

by Mauro Carvalho Chehab

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Mon, 10 May 2021 14:49:44 +0100
David Woodhouse <[email protected]> wrote:

> On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> > This patch series is doing conversion only when using ASCII makes
> > more sense than using UTF-8.
> >
> > See, a number of converted documents ended with weird characters
> > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > character doesn't do any good.
> >
> > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > someone tries to use grep[1].
>
> Replacing those makes sense. But replacing emdashes — which are a
> distinct character that has no direct replacement in ASCII and which
> people do *deliberately* use instead of hyphen-minus — does not.
>
> Perhaps stick to those two, and any cases where an emdash or endash has
> been used where U+002D HYPHEN-MINUS *should* have been used.

OK. I'll rework the series, excluding EM/EN DASH chars from it.
I'll then manually apply the changes for EM/EN DASH chars
(probably in a separate series) where they seem to fit. That should
make it easier to discuss such replacements.

> And please fix your cover letter which made no reference to 'grep', and
> only presented a completely bogus argument for the change instead.

OK!
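
For reference, the grep problem with NO-BREAK SPACE can be reproduced
in a few lines. This is only a sketch: the map below is a hypothetical
reduced version of the cover letter's table, holding just the two
replacements nobody objected to, and the sample line is invented.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE
BOM = "\ufeff"   # ZERO WIDTH NO-BREAK SPACE

# Hypothetical reduced map: only the two uncontroversial replacements.
char_map = str.maketrans({NBSP: " ", BOM: ""})

line = BOM + "config" + NBSP + "option"

# A plain-ASCII search misses the text while NBSP is in place...
print("config option" in line)
# ...and finds it once the two characters have been mapped.
print("config option" in line.translate(char_map))
```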

Regards,
Mauro

2021-05-11 09:37:54

by Mauro Carvalho Chehab

Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

On Mon, 10 May 2021 15:22:02 -0400
"Theodore Ts'o" <[email protected]> wrote:

> On Mon, May 10, 2021 at 02:49:44PM +0100, David Woodhouse wrote:
> > On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> > > This patch series is doing conversion only when using ASCII makes
> > > more sense than using UTF-8.
> > >
> > > See, a number of converted documents ended with weird characters
> > > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > > character doesn't do any good.
> > >
> > > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > > someone tries to use grep[1].
> >
> > Replacing those makes sense. But replacing emdashes — which are a
> > distinct character that has no direct replacement in ASCII and which
> > people do *deliberately* use instead of hyphen-minus — does not.
>
> I regularly use --- for em-dashes and -- for en-dashes. Markdown will
> automatically translate 3 ASCII hyphens to em-dashes, and 2 ASCII
> hyphens to en-dashes. It's much, much easier for me to type 2 or 3
> hyphens into my text editor of choice than trying to enter the UTF-8
> characters.

Yeah, typing those UTF-8 chars is a lot harder than typing -- and ---
in several text editors ;-)

Here, I only type UTF-8 chars for accents (my US-layout keyboards are
all set to US international, so typing those is easy).

> If we can make sphinx do this translation, maybe that's
> the best way of dealing with these two characters?

Sphinx already does that by default[1], using smartquotes:

https://docutils.sourceforge.io/docs/user/smartquotes.html

Those are the conversions that are done there:

- Straight quotes (" and ') turned into "curly" quote characters;
- dashes (-- and ---) turned into en- and em-dash entities;
- three consecutive dots (... or . . .) turned into an ellipsis char.

So, we can simply use ASCII single/double quotes, hyphens and dots in
the source and let Sphinx produce curly quotes, dashes and ellipses.

[1] There's a way to disable it in conf.py, but in the Kernel this is
kept at its default: to automatically do such conversions.
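
To make the dash and ellipsis conversions concrete, here is a very
simplified imitation of them. It is not the docutils implementation:
the function name is hypothetical, quote handling is left out, and the
real transform also skips literal blocks and inline code, which this
sketch does not.

```python
import re


def smart_dashes_and_dots(text: str) -> str:
    """Simplified imitation of the dash and ellipsis conversions."""
    # Replace the longer run first, so "--" does not consume part of "---".
    text = text.replace("---", "\u2014")  # EM DASH
    text = text.replace("--", "\u2013")   # EN DASH
    # Three dots, with or without single spaces between them.
    text = re.sub(r"\.\.\.|\. \. \.", "\u2026", text)
    return text


print(smart_dashes_and_dots("pages 10--20 --- roughly... done"))
```

The ordering of the two replace calls matters: doing the two-hyphen
replacement first would turn "---" into an en dash followed by a stray
hyphen.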

Thanks,
Mauro