2021-05-12 12:53:05

by Mauro Carvalho Chehab

Subject: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

This series contains basically a cleanup from all those years of converting
files to ReST.

During the conversion period, several tools like LaTeX, pandoc, DocBook
and some specially-written scripts were used in order to convert
existing documents.

Such conversion tools - plus some text editors like LibreOffice or similar - have
a set of rules that turn some typed ASCII characters into UTF-8 alternatives,
for instance converting straight quotes into curly quotes and adding non-breaking
spaces. All of those are meant to produce better results when the text is
displayed in HTML or PDF formats.

While it is perfectly fine to use UTF-8 characters in Linux, and especially in
the documentation, it is better to stick to the ASCII subset in this
particular case, for a couple of reasons:

1. it makes life easier for tools like grep;
2. they are easier to edit with some commonly used text/source
code editors.
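To make reason (1) concrete, here is a quick sketch (hypothetical file name):

```shell
# A sample line containing curly double quotes (U+201C/U+201D):
printf 'names begin with \xe2\x80\x9cuser\xe2\x80\x9d\n' > sample.rst

# Searching with plain ASCII quotes silently finds nothing:
grep -c '"user"' sample.rst || true    # prints 0

# Only the exact UTF-8 sequence matches:
grep -c '“user”' sample.rst            # prints 1
```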

Also, Sphinx already does such conversion automatically outside
literal blocks, as described at:

https://docutils.sourceforge.io/docs/user/smartquotes.html
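For reference, that Sphinx behaviour is controlled from the documentation's
conf.py; a minimal sketch (the option names are the stock Sphinx ones, and the
values shown are the defaults):

```python
# conf.py (sketch): Sphinx's SmartQuotes transform rewrites ASCII outside
# literal blocks: straight quotes -> curly quotes, '--'/'---' -> en/em
# dashes, '...' -> ellipsis. Enabled by default; set False to disable.
smartquotes = True

# Which conversions run: 'q' = quotes, 'D' = dashes, 'e' = ellipses.
smartquotes_action = "qDe"
```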

In this series, the following UTF-8 symbols are replaced:

- U+00a0 (' '): NO-BREAK SPACE
- U+00ad ('­'): SOFT HYPHEN
- U+00b4 ('´'): ACUTE ACCENT
- U+00d7 ('×'): MULTIPLICATION SIGN
- U+2010 ('‐'): HYPHEN
- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
- U+2212 ('−'): MINUS SIGN
- U+2217 ('∗'): ASTERISK OPERATOR
- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
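As a rough sketch of the mechanical part of the conversion (the replacement
chosen for each symbol in the actual patches is context-dependent; this table
is only an approximation):

```python
# Approximate ASCII fallbacks for the UTF-8 symbols listed above.
# NOTE: the real patches pick replacements by hand; e.g. a NO-BREAK SPACE
# sometimes becomes a plain space and sometimes is dropped entirely.
ASCII_FALLBACKS = {
    "\u00a0": " ",    # NO-BREAK SPACE        -> space
    "\u00ad": "-",    # SOFT HYPHEN           -> hyphen (or nothing)
    "\u00b4": "'",    # ACUTE ACCENT          -> apostrophe
    "\u00d7": "x",    # MULTIPLICATION SIGN   -> letter x
    "\u2010": "-",    # HYPHEN                -> hyphen-minus
    "\u2018": "'",    # LEFT SINGLE QUOTE     -> apostrophe
    "\u2019": "'",    # RIGHT SINGLE QUOTE    -> apostrophe
    "\u201c": '"',    # LEFT DOUBLE QUOTE     -> straight quote
    "\u201d": '"',    # RIGHT DOUBLE QUOTE    -> straight quote
    "\u2212": "-",    # MINUS SIGN            -> hyphen-minus
    "\u2217": "*",    # ASTERISK OPERATOR     -> asterisk
    "\ufeff": "",     # ZERO WIDTH NO-BREAK SPACE (BOM) -> removed
}

def to_ascii_subset(text: str) -> str:
    """Replace the known UTF-8 look-alikes with ASCII equivalents."""
    return text.translate(str.maketrans(ASCII_FALLBACKS))

print(to_ascii_subset("2^21 \u2217 2^27 = 2^48"))  # 2^21 * 2^27 = 2^48
```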

---

v2:
- removed EM/EN DASH conversion from this patchset;
- removed a few fixes, as those were addressed in a separate series.

PS.:
The first version of this series was posted with a different name:

https://lore.kernel.org/lkml/[email protected]/

I also changed the patch texts, in order to better describe the patches' goals.

Mauro Carvalho Chehab (40):
docs: hwmon: Use ASCII subset instead of UTF-8 alternate symbols
docs: admin-guide: Use ASCII subset instead of UTF-8 alternate symbols
docs: admin-guide: media: ipu3.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: admin-guide: perf: imx-ddr.rst: Use ASCII subset instead of
UTF-8 alternate symbols
docs: admin-guide: pm: Use ASCII subset instead of UTF-8 alternate
symbols
docs: trace: coresight: coresight-etm4x-reference.rst: Use ASCII
subset instead of UTF-8 alternate symbols
docs: driver-api: ioctl.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: driver-api: thermal: Use ASCII subset instead of UTF-8 alternate
symbols
docs: driver-api: media: drivers: Use ASCII subset instead of UTF-8
alternate symbols
docs: driver-api: firmware: other_interfaces.rst: Use ASCII subset
instead of UTF-8 alternate symbols
docs: fault-injection: nvme-fault-injection.rst: Use ASCII subset
instead of UTF-8 alternate symbols
docs: usb: Use ASCII subset instead of UTF-8 alternate symbols
docs: process: code-of-conduct.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: userspace-api: media: fdl-appendix.rst: Use ASCII subset instead
of UTF-8 alternate symbols
docs: userspace-api: media: v4l: Use ASCII subset instead of UTF-8
alternate symbols
docs: userspace-api: media: dvb: Use ASCII subset instead of UTF-8
alternate symbols
docs: vm: zswap.rst: Use ASCII subset instead of UTF-8 alternate
symbols
docs: filesystems: f2fs.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: filesystems: ext4: Use ASCII subset instead of UTF-8 alternate
symbols
docs: kernel-hacking: Use ASCII subset instead of UTF-8 alternate
symbols
docs: hid: Use ASCII subset instead of UTF-8 alternate symbols
docs: security: tpm: tpm_event_log.rst: Use ASCII subset instead of
UTF-8 alternate symbols
docs: security: keys: trusted-encrypted.rst: Use ASCII subset instead
of UTF-8 alternate symbols
docs: networking: scaling.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: networking: devlink: devlink-dpipe.rst: Use ASCII subset instead
of UTF-8 alternate symbols
docs: networking: device_drivers: Use ASCII subset instead of UTF-8
alternate symbols
docs: x86: Use ASCII subset instead of UTF-8 alternate symbols
docs: scheduler: sched-deadline.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: power: powercap: powercap.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: ABI: Use ASCII subset instead of UTF-8 alternate symbols
docs: PCI: acpi-info.rst: Use ASCII subset instead of UTF-8 alternate
symbols
docs: gpu: Use ASCII subset instead of UTF-8 alternate symbols
docs: sound: kernel-api: writing-an-alsa-driver.rst: Use ASCII subset
instead of UTF-8 alternate symbols
docs: arm64: arm-acpi.rst: Use ASCII subset instead of UTF-8 alternate
symbols
docs: infiniband: tag_matching.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: misc-devices: ibmvmc.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: firmware-guide: acpi: lpit.rst: Use ASCII subset instead of
UTF-8 alternate symbols
docs: firmware-guide: acpi: dsd: graph.rst: Use ASCII subset instead
of UTF-8 alternate symbols
docs: virt: kvm: api.rst: Use ASCII subset instead of UTF-8 alternate
symbols
docs: RCU: Use ASCII subset instead of UTF-8 alternate symbols

...sfs-class-chromeos-driver-cros-ec-lightbar | 2 +-
.../ABI/testing/sysfs-devices-platform-ipmi | 2 +-
.../testing/sysfs-devices-platform-trackpoint | 2 +-
Documentation/ABI/testing/sysfs-devices-soc | 4 +-
Documentation/PCI/acpi-info.rst | 22 +-
.../Data-Structures/Data-Structures.rst | 52 ++--
.../Expedited-Grace-Periods.rst | 40 +--
.../Tree-RCU-Memory-Ordering.rst | 10 +-
.../RCU/Design/Requirements/Requirements.rst | 122 ++++-----
Documentation/admin-guide/media/ipu3.rst | 2 +-
Documentation/admin-guide/perf/imx-ddr.rst | 2 +-
Documentation/admin-guide/pm/intel_idle.rst | 4 +-
Documentation/admin-guide/pm/intel_pstate.rst | 4 +-
Documentation/admin-guide/ras.rst | 86 +++---
.../admin-guide/reporting-issues.rst | 2 +-
Documentation/arm64/arm-acpi.rst | 8 +-
.../driver-api/firmware/other_interfaces.rst | 2 +-
Documentation/driver-api/ioctl.rst | 8 +-
.../media/drivers/sh_mobile_ceu_camera.rst | 8 +-
.../driver-api/media/drivers/zoran.rst | 2 +-
.../driver-api/thermal/cpu-idle-cooling.rst | 14 +-
.../driver-api/thermal/intel_powerclamp.rst | 6 +-
.../thermal/x86_pkg_temperature_thermal.rst | 2 +-
.../fault-injection/nvme-fault-injection.rst | 2 +-
Documentation/filesystems/ext4/attributes.rst | 20 +-
Documentation/filesystems/ext4/bigalloc.rst | 6 +-
Documentation/filesystems/ext4/blockgroup.rst | 8 +-
Documentation/filesystems/ext4/blocks.rst | 2 +-
Documentation/filesystems/ext4/directory.rst | 16 +-
Documentation/filesystems/ext4/eainode.rst | 2 +-
Documentation/filesystems/ext4/inlinedata.rst | 6 +-
Documentation/filesystems/ext4/inodes.rst | 6 +-
Documentation/filesystems/ext4/journal.rst | 8 +-
Documentation/filesystems/ext4/mmp.rst | 2 +-
.../filesystems/ext4/special_inodes.rst | 4 +-
Documentation/filesystems/ext4/super.rst | 10 +-
Documentation/filesystems/f2fs.rst | 4 +-
.../firmware-guide/acpi/dsd/graph.rst | 2 +-
Documentation/firmware-guide/acpi/lpit.rst | 2 +-
Documentation/gpu/i915.rst | 2 +-
Documentation/gpu/komeda-kms.rst | 2 +-
Documentation/hid/hid-sensor.rst | 70 ++---
Documentation/hid/intel-ish-hid.rst | 246 +++++++++---------
Documentation/hwmon/ir36021.rst | 2 +-
Documentation/hwmon/ltc2992.rst | 2 +-
Documentation/hwmon/pm6764tr.rst | 2 +-
Documentation/infiniband/tag_matching.rst | 4 +-
Documentation/kernel-hacking/hacking.rst | 2 +-
Documentation/kernel-hacking/locking.rst | 2 +-
Documentation/misc-devices/ibmvmc.rst | 8 +-
.../device_drivers/ethernet/intel/i40e.rst | 8 +-
.../device_drivers/ethernet/intel/iavf.rst | 4 +-
.../device_drivers/ethernet/netronome/nfp.rst | 12 +-
.../networking/devlink/devlink-dpipe.rst | 2 +-
Documentation/networking/scaling.rst | 18 +-
Documentation/power/powercap/powercap.rst | 210 +++++++--------
Documentation/process/code-of-conduct.rst | 2 +-
Documentation/scheduler/sched-deadline.rst | 2 +-
.../security/keys/trusted-encrypted.rst | 4 +-
Documentation/security/tpm/tpm_event_log.rst | 2 +-
.../kernel-api/writing-an-alsa-driver.rst | 68 ++---
.../coresight/coresight-etm4x-reference.rst | 16 +-
Documentation/usb/ehci.rst | 2 +-
Documentation/usb/gadget_printer.rst | 2 +-
Documentation/usb/mass-storage.rst | 36 +--
.../media/dvb/audio-set-bypass-mode.rst | 2 +-
.../userspace-api/media/dvb/audio.rst | 2 +-
.../userspace-api/media/dvb/dmx-fopen.rst | 2 +-
.../userspace-api/media/dvb/dmx-fread.rst | 2 +-
.../media/dvb/dmx-set-filter.rst | 2 +-
.../userspace-api/media/dvb/intro.rst | 6 +-
.../userspace-api/media/dvb/video.rst | 2 +-
.../userspace-api/media/fdl-appendix.rst | 64 ++---
.../userspace-api/media/v4l/crop.rst | 16 +-
.../userspace-api/media/v4l/dev-decoder.rst | 6 +-
.../userspace-api/media/v4l/diff-v4l.rst | 2 +-
.../userspace-api/media/v4l/open.rst | 2 +-
.../media/v4l/vidioc-cropcap.rst | 4 +-
Documentation/virt/kvm/api.rst | 28 +-
Documentation/vm/zswap.rst | 4 +-
Documentation/x86/resctrl.rst | 2 +-
Documentation/x86/sgx.rst | 4 +-
82 files changed, 693 insertions(+), 693 deletions(-)

--
2.30.2



2021-05-12 12:54:16

by Mauro Carvalho Chehab

Subject: [PATCH v2 19/40] docs: filesystems: ext4: Use ASCII subset instead of UTF-8 alternate symbols

The conversion tools used during the DocBook/LaTeX/Markdown->ReST conversion,
and some automatic rules which exist in certain text editors like
LibreOffice, turned ASCII characters into some UTF-8 alternatives that
are better displayed in HTML and PDF.

While it is OK to use UTF-8 characters in Linux, it is better to
use the ASCII subset instead of an UTF-8 equivalent character,
as it makes life easier for tools like grep, and such text is easier
to edit with some commonly used text/source code editors.

Also, Sphinx already does such conversion automatically outside literal blocks:
https://docutils.sourceforge.io/docs/user/smartquotes.html

So, replace the occurrences of the following UTF-8 characters:

- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
- U+2217 ('∗'): ASTERISK OPERATOR
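Mechanically, the replacement amounts to something like this (demo file name
hypothetical; the actual patch was of course reviewed by hand):

```shell
# Demo file standing in for one of the ext4 .rst documents:
printf 'size to 2^21 \xe2\x88\x97 2^27 bytes (\xe2\x80\x9cbigalloc\xe2\x80\x9d)\n' > demo.rst

# Swap curly double quotes and the asterisk operator for ASCII (GNU sed):
sed -i 's/“/"/g; s/”/"/g; s/∗/*/g' demo.rst

cat demo.rst   # size to 2^21 * 2^27 bytes ("bigalloc")
```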

Acked-by: Theodore Ts'o <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
---
Documentation/filesystems/ext4/attributes.rst | 20 +++++++++----------
Documentation/filesystems/ext4/bigalloc.rst | 6 +++---
Documentation/filesystems/ext4/blockgroup.rst | 8 ++++----
Documentation/filesystems/ext4/blocks.rst | 2 +-
Documentation/filesystems/ext4/directory.rst | 16 +++++++--------
Documentation/filesystems/ext4/eainode.rst | 2 +-
Documentation/filesystems/ext4/inlinedata.rst | 6 +++---
Documentation/filesystems/ext4/inodes.rst | 6 +++---
Documentation/filesystems/ext4/journal.rst | 8 ++++----
Documentation/filesystems/ext4/mmp.rst | 2 +-
.../filesystems/ext4/special_inodes.rst | 4 ++--
Documentation/filesystems/ext4/super.rst | 10 +++++-----
12 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/Documentation/filesystems/ext4/attributes.rst b/Documentation/filesystems/ext4/attributes.rst
index 54386a010a8d..39e695678c01 100644
--- a/Documentation/filesystems/ext4/attributes.rst
+++ b/Documentation/filesystems/ext4/attributes.rst
@@ -8,7 +8,7 @@ block on the disk and referenced from inodes via ``inode.i_file_acl*``.
The first use of extended attributes seems to have been for storing file
ACLs and other security data (selinux). With the ``user_xattr`` mount
option it is possible for users to store extended attributes so long as
-all attribute names begin with “user”; this restriction seems to have
+all attribute names begin with "user"; this restriction seems to have
disappeared as of Linux 3.0.

There are two places where extended attributes can be found. The first
@@ -165,22 +165,22 @@ the key name. Here is a map of name index values to key prefixes:
* - 0
- (no prefix)
* - 1
- - “user.”
+ - "user."
* - 2
- - “system.posix\_acl\_access”
+ - "system.posix\_acl\_access"
* - 3
- - “system.posix\_acl\_default”
+ - "system.posix\_acl\_default"
* - 4
- - “trusted.”
+ - "trusted."
* - 6
- - “security.”
+ - "security."
* - 7
- - “system.” (inline\_data only?)
+ - "system." (inline\_data only?)
* - 8
- - “system.richacl” (SuSE kernels only?)
+ - "system.richacl" (SuSE kernels only?)

-For example, if the attribute key is “user.fubar”, the attribute name
-index is set to 1 and the “fubar” name is recorded on disk.
+For example, if the attribute key is "user.fubar", the attribute name
+index is set to 1 and the "fubar" name is recorded on disk.

POSIX ACLs
~~~~~~~~~~
diff --git a/Documentation/filesystems/ext4/bigalloc.rst b/Documentation/filesystems/ext4/bigalloc.rst
index 72075aa608e4..897e1b284c97 100644
--- a/Documentation/filesystems/ext4/bigalloc.rst
+++ b/Documentation/filesystems/ext4/bigalloc.rst
@@ -27,8 +27,8 @@ stored in the s\_log\_cluster\_size field in the superblock); from then
on, the block bitmaps track clusters, not individual blocks. This means
that block groups can be several gigabytes in size (instead of just
128MiB); however, the minimum allocation unit becomes a cluster, not a
-block, even for directories. TaoBao had a patchset to extend the “use
-units of clusters instead of blocks” to the extent tree, though it is
+block, even for directories. TaoBao had a patchset to extend the "use
+units of clusters instead of blocks" to the extent tree, though it is
not clear where those patches went-- they eventually morphed into
-“extent tree v2” but that code has not landed as of May 2015.
+"extent tree v2" but that code has not landed as of May 2015.

diff --git a/Documentation/filesystems/ext4/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst
index 3da156633339..99aa1f330bd1 100644
--- a/Documentation/filesystems/ext4/blockgroup.rst
+++ b/Documentation/filesystems/ext4/blockgroup.rst
@@ -41,8 +41,8 @@ across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
-when the filesystem is freshly formatted, mkfs will allocate “reserve
-GDT block” space after the block group descriptors and before the start
+when the filesystem is freshly formatted, mkfs will allocate "reserve
+GDT block" space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to increase in size by a factor of
1024x over the original filesystem size.
@@ -84,7 +84,7 @@ Without the option META\_BG, for safety concerns, all block group
descriptors copies are kept in the first block group. Given the default
128MiB(2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
-filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
+filesystem size to 2^21 * 2^27 = 2^48bytes or 256TiB.

The solution to this problem is to use the metablock group feature
(META\_BG), which is already in ext3 for all 2.6 releases. With the
@@ -131,5 +131,5 @@ rely on the kernel to initialize the inode tables in the background.

By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note the feature flag is RO\_COMPAT\_GDT\_CSUM,
-but the dumpe2fs output prints this as “uninit\_bg”. They are the same
+but the dumpe2fs output prints this as "uninit\_bg". They are the same
thing.
diff --git a/Documentation/filesystems/ext4/blocks.rst b/Documentation/filesystems/ext4/blocks.rst
index bd722ecd92d6..ca16435d469e 100644
--- a/Documentation/filesystems/ext4/blocks.rst
+++ b/Documentation/filesystems/ext4/blocks.rst
@@ -3,7 +3,7 @@
Blocks
------

-ext4 allocates storage space in units of “blocks”. A block is a group of
+ext4 allocates storage space in units of "blocks". A block is a group of
sectors between 1KiB and 64KiB, and the number of sectors must be an
integral power of 2. Blocks are in turn grouped into larger units called
block groups. Block size is specified at mkfs time and typically is
diff --git a/Documentation/filesystems/ext4/directory.rst b/Documentation/filesystems/ext4/directory.rst
index 55f618b37144..317e672cd457 100644
--- a/Documentation/filesystems/ext4/directory.rst
+++ b/Documentation/filesystems/ext4/directory.rst
@@ -15,8 +15,8 @@ is desired.
Linear (Classic) Directories
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-By default, each directory lists its entries in an “almost-linear”
-array. I write “almost” because it's not a linear array in the memory
+By default, each directory lists its entries in an "almost-linear"
+array. I write "almost" because it's not a linear array in the memory
sense because directory entries are not split across filesystem blocks.
Therefore, it is more accurate to say that a directory is a series of
data blocks and that each block contains a linear array of directory
@@ -26,7 +26,7 @@ takes it all the way to the end of the block. The end of the entire
directory is of course signified by reaching the end of the file. Unused
directory entries are signified by inode = 0. By default the filesystem
uses ``struct ext4_dir_entry_2`` for directory entries unless the
-“filetype” feature flag is not set, in which case it uses
+"filetype" feature flag is not set, in which case it uses
``struct ext4_dir_entry``.

The original directory entry format is ``struct ext4_dir_entry``, which
@@ -197,7 +197,7 @@ balanced tree keyed off a hash of the directory entry name. If the
EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
hashed btree (htree) to organize and find directory entries. For
backwards read-only compatibility with ext2, this tree is actually
-hidden inside the directory file, masquerading as “empty” directory data
+hidden inside the directory file, masquerading as "empty" directory data
blocks! It was stated previously that the end of the linear directory
entry table was signified with an entry pointing to inode 0; this is
(ab)used to fool the old linear-scan algorithm into thinking that the
@@ -263,7 +263,7 @@ of a data block:
* - 0x8
- char
- dot.name[4]
- - “.\\0\\0\\0”
+ - ".\\0\\0\\0"
* - 0xC
- \_\_le32
- dotdot.inode
@@ -284,7 +284,7 @@ of a data block:
* - 0x14
- char
- dotdot\_name[4]
- - “..\\0\\0”
+ - "..\\0\\0"
* - 0x18
- \_\_le32
- struct dx\_root\_info.reserved\_zero
@@ -372,11 +372,11 @@ also the full length of a data block:
* - 0x6
- u8
- name\_len
- - Zero. There is no name for this “unused” directory entry.
+ - Zero. There is no name for this "unused" directory entry.
* - 0x7
- u8
- file\_type
- - Zero. There is no file type for this “unused” directory entry.
+ - Zero. There is no file type for this "unused" directory entry.
* - 0x8
- \_\_le16
- limit
diff --git a/Documentation/filesystems/ext4/eainode.rst b/Documentation/filesystems/ext4/eainode.rst
index ecc0d01a0a72..71e64aadaa89 100644
--- a/Documentation/filesystems/ext4/eainode.rst
+++ b/Documentation/filesystems/ext4/eainode.rst
@@ -6,7 +6,7 @@ Large Extended Attribute Values
To enable ext4 to store extended attribute values that do not fit in the
inode or in the single extended attribute block attached to an inode,
the EA\_INODE feature allows us to store the value in the data blocks of
-a regular file inode. This “EA inode” is linked only from the extended
+a regular file inode. This "EA inode" is linked only from the extended
attribute name index and must not appear in a directory entry. The
inode's i\_atime field is used to store a checksum of the xattr value;
and i\_ctime/i\_version store a 64-bit reference count, which enables
diff --git a/Documentation/filesystems/ext4/inlinedata.rst b/Documentation/filesystems/ext4/inlinedata.rst
index d1075178ce0b..8efa8a1cf273 100644
--- a/Documentation/filesystems/ext4/inlinedata.rst
+++ b/Documentation/filesystems/ext4/inlinedata.rst
@@ -9,7 +9,7 @@ data is so tiny that it readily fits inside the inode, which
file is smaller than 60 bytes, then the data are stored inline in
``inode.i_block``. If the rest of the file would fit inside the extended
attribute space, then it might be found as an extended attribute
-“system.data” within the inode body (“ibody EA”). This of course
+"system.data" within the inode body ("ibody EA"). This of course
constrains the amount of extended attributes one can attach to an inode.
If the data size increases beyond i\_block + ibody EA, a regular block
is allocated and the contents moved to that block.
@@ -20,14 +20,14 @@ inline data, one ought to be able to store 160 bytes of data in a
that, the limit was 156 bytes due to inefficient use of inode space.

The inline data feature requires the presence of an extended attribute
-for “system.data”, even if the attribute value is zero length.
+for "system.data", even if the attribute value is zero length.

Inline Directories
~~~~~~~~~~~~~~~~~~

The first four bytes of i\_block are the inode number of the parent
directory. Following that is a 56-byte space for an array of directory
-entries; see ``struct ext4_dir_entry``. If there is a “system.data”
+entries; see ``struct ext4_dir_entry``. If there is a "system.data"
attribute in the inode body, the EA value is an array of
``struct ext4_dir_entry`` as well. Note that for inline directories, the
i\_block and EA space are treated as separate dirent blocks; directory
diff --git a/Documentation/filesystems/ext4/inodes.rst b/Documentation/filesystems/ext4/inodes.rst
index a65baffb4ebf..cd3bbc3c1e33 100644
--- a/Documentation/filesystems/ext4/inodes.rst
+++ b/Documentation/filesystems/ext4/inodes.rst
@@ -90,7 +90,7 @@ The inode table entry is laid out in ``struct ext4_inode``.
* - 0x1C
- \_\_le32
- i\_blocks\_lo
- - Lower 32-bits of “block” count. If the huge\_file feature flag is not
+ - Lower 32-bits of "block" count. If the huge\_file feature flag is not
set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in
``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi
@@ -109,7 +109,7 @@ The inode table entry is laid out in ``struct ext4_inode``.
* - 0x28
- 60 bytes
- i\_block[EXT4\_N\_BLOCKS=15]
- - Block map or extent tree. See the section “The Contents of inode.i\_block”.
+ - Block map or extent tree. See the section "The Contents of inode.i\_block".
* - 0x64
- \_\_le32
- i\_generation
@@ -507,7 +507,7 @@ orphaned inode, or zero if there are no more orphans.
If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_inode_extra`` field is large enough to encompass the
respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
-inode fields are widened to 64 bits. Within this “extra” 32-bit field,
+inode fields are widened to 64 bits. Within this "extra" 32-bit field,
the lower two bits are used to extend the 32-bit seconds field to be 34
bit wide; the upper 30 bits are used to provide nanosecond timestamp
accuracy. Therefore, timestamps should not overflow until May 2446.
diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index cdbfec473167..9e12d5366ad6 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -6,7 +6,7 @@ Journal (jbd2)
Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against corruption in the case of a system crash. A small
continuous region of disk (default 128MiB) is reserved inside the
-filesystem as a place to land “important” data writes on-disk as quickly
+filesystem as a place to land "important" data writes on-disk as quickly
as possible. Once the important data transaction is fully written to the
disk and flushed from the disk write cache, a record of the data being
committed is also written to the journal. At some later point in time,
@@ -507,7 +507,7 @@ Data Block
In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
-number then those four bytes are replaced with zeroes and the “escaped”
+number then those four bytes are replaced with zeroes and the "escaped"
flag is set in the descriptor block tag.

Revocation Block
@@ -520,8 +520,8 @@ block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

-**NOTE**: This mechanism is NOT used to express “this journal block is
-superseded by this other journal block”, as the author (djwong)
+**NOTE**: This mechanism is NOT used to express "this journal block is
+superseded by this other journal block", as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

diff --git a/Documentation/filesystems/ext4/mmp.rst b/Documentation/filesystems/ext4/mmp.rst
index 25660981d93c..20631883a32b 100644
--- a/Documentation/filesystems/ext4/mmp.rst
+++ b/Documentation/filesystems/ext4/mmp.rst
@@ -42,7 +42,7 @@ The MMP structure (``struct mmp_struct``) is as follows:
* - 0x0
- \_\_le32
- mmp\_magic
- - Magic number for MMP, 0x004D4D50 (“MMP”).
+ - Magic number for MMP, 0x004D4D50 ("MMP").
* - 0x4
- \_\_le32
- mmp\_seq
diff --git a/Documentation/filesystems/ext4/special_inodes.rst b/Documentation/filesystems/ext4/special_inodes.rst
index 9061aabba827..407537be8fe5 100644
--- a/Documentation/filesystems/ext4/special_inodes.rst
+++ b/Documentation/filesystems/ext4/special_inodes.rst
@@ -26,11 +26,11 @@ ext4 reserves some inode for special features, as follows:
* - 6
- Undelete directory.
* - 7
- - Reserved group descriptors inode. (“resize inode”)
+ - Reserved group descriptors inode. ("resize inode")
* - 8
- Journal inode.
* - 9
- - The “exclude” inode, for snapshots(?)
+ - The "exclude" inode, for snapshots(?)
* - 10
- Replica inode, used for some non-upstream feature?
* - 11
diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst
index 2eb1ab20498d..8c52ccc6dd04 100644
--- a/Documentation/filesystems/ext4/super.rst
+++ b/Documentation/filesystems/ext4/super.rst
@@ -572,7 +572,7 @@ following:
* - 0x1
- Directory preallocation (COMPAT\_DIR\_PREALLOC).
* - 0x2
- - “imagic inodes”. Not clear from the code what this does
+ - "imagic inodes". Not clear from the code what this does
(COMPAT\_IMAGIC\_INODES).
* - 0x4
- Has a journal (COMPAT\_HAS\_JOURNAL).
@@ -584,12 +584,12 @@ following:
* - 0x20
- Has directory indices (COMPAT\_DIR\_INDEX).
* - 0x40
- - “Lazy BG”. Not in Linux kernel, seems to have been for uninitialized
+ - "Lazy BG". Not in Linux kernel, seems to have been for uninitialized
block groups? (COMPAT\_LAZY\_BG)
* - 0x80
- - “Exclude inode”. Not used. (COMPAT\_EXCLUDE\_INODE).
+ - "Exclude inode". Not used. (COMPAT\_EXCLUDE\_INODE).
* - 0x100
- - “Exclude bitmap”. Seems to be used to indicate the presence of
+ - "Exclude bitmap". Seems to be used to indicate the presence of
snapshot-related exclude bitmaps? Not defined in kernel or used in
e2fsprogs (COMPAT\_EXCLUDE\_BITMAP).
* - 0x200
@@ -695,7 +695,7 @@ the following:
* - 0x100
- `Quota <Quota>`__ (RO\_COMPAT\_QUOTA).
* - 0x200
- - This filesystem supports “bigalloc”, which means that file extents are
+ - This filesystem supports "bigalloc", which means that file extents are
tracked in units of clusters (of blocks) instead of blocks
(RO\_COMPAT\_BIGALLOC).
* - 0x400
--
2.30.2

2021-05-12 14:17:43

by Theodore Ts'o

Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> v2:
> - removed EM/EN DASH conversion from this patchset;

Are you still thinking about doing the

EN DASH --> "--"
EM DASH --> "---"

conversion? That's not going to change what the documentation will
look like in the HTML and PDF output forms, and I think it would make
life easier for people who are reading and editing the Documentation/*
files in text form.

- Ted

2021-05-12 15:50:36

by Mauro Carvalho Chehab

Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

On Wed, 12 May 2021 10:14:44 -0400
"Theodore Ts'o" <[email protected]> wrote:

> On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> > v2:
> > - removed EM/EN DASH conversion from this patchset;
>
> Are you still thinking about doing the
>
> EN DASH --> "--"
> EM DASH --> "---"
>
> conversion?

Yes, but I intend to submit it as a separate patch series, probably after
having this one merged. Let's first clean up the large part of the
conversion-generated UTF-8 char noise ;-)

> That's not going to change what the documentation will
> look like in the HTML and PDF output forms, and I think it would make
> life easier for people who are reading and editing the Documentation/*
> files in text form.

Agreed. I'm also considering adding a couple of cases of this char:

- U+2026 ('…'): HORIZONTAL ELLIPSIS

As Sphinx also replaces "..." with HORIZONTAL ELLIPSIS.


Anyway, I'm opting to submit those separately because it seems
that at least some maintainers added EM/EN DASH intentionally.

So, it may generate case-by-case discussions.

Also, IMO, at least a couple of EN/EM DASH cases would be better served
with a single hyphen.

Thanks,
Mauro

2021-05-12 18:20:15

by David Woodhouse

Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

Your title 'Use ASCII subset' is now at least a bit *closer* to
describing what the patches are actually doing, but it's still a bit
misleading because you're only doing it for *some* characters.

And the wording is still indicative of a fundamentally *misguided*
motivation for doing any of this. Your commit comments should be about
fixing a specific thing, nothing to do with "use ASCII subset", which
is pointless in itself.

On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> Such conversion tools - plus some text editors like LibreOffice or similar - have
> a set of rules that turn some typed ASCII characters into UTF-8 alternatives,
> for instance converting straight quotes into curly quotes and adding non-breaking
> spaces. All of those are meant to produce better results when the text is
> displayed in HTML or PDF formats.

And don't we render our documentation into HTML or PDF formats? Are
some of those non-breaking spaces not actually *useful* for their
intended purpose?

> While it is perfectly fine to use UTF-8 characters in Linux, and especially in
> the documentation, it is better to stick to the ASCII subset in this
> particular case, for a couple of reasons:
>
> 1. it makes life easier for tools like grep;

Barely, as noted, because of things like line feeds.

> 2. they are easier to edit with some commonly used text/source
> code editors.

That is nonsense. Any but the most broken and/or anachronistic
environments and editors will be just fine.



2021-05-12 18:20:30

by David Woodhouse

Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

On Wed, 2021-05-12 at 17:17 +0200, Mauro Carvalho Chehab wrote:
> On Wed, 12 May 2021 10:14:44 -0400
> "Theodore Ts'o" <[email protected]> wrote:
>
> > On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> > > v2:
> > > - removed EM/EN DASH conversion from this patchset;
> >
> > Are you still thinking about doing the
> >
> > EN DASH --> "--"
> > EM DASH --> "---"
> >
> > conversion?
>
> Yes, but I intend to submit it as a separate patch series, probably after
> having this one merged. Let's first clean up the large part of the
> conversion-generated UTF-8 char noise ;-)
>
> > That's not going to change what the documentation will
> > look like in the HTML and PDF output forms, and I think it would make
> > life easier for people who are reading and editing the Documentation/*
> > files in text form.
>
> Agreed. I'm also considering adding a couple of cases of this char:
>
> - U+2026 ('…'): HORIZONTAL ELLIPSIS
>
> as Sphinx also replaces "..." with HORIZONTAL ELLIPSIS.

Er, what?

The *only* part of this whole enterprise that actually seemed to make
even a tiny bit of sense — rather than seeming like a thinly veiled
retrospective excuse for dragging us back in time by 30 years — was the
bit about making it easier to grep.

But if I understand you correctly, you're talking about using something
like C trigraphs to represent the perfectly reasonable text emdash
character ("—") as two hyphen-minuses ("--") in the source code of the
documentation? Isn't that going to achieve precisely the *opposite*? If
I select some text in the HTML output of the docs and then search for
it in the source code, that's going to *stop* it matching my search?



2021-05-14 10:14:48

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

On Wed, 12 May 2021 18:07:04 +0100,
David Woodhouse <[email protected]> wrote:

> On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > Such conversion tools - plus some text editor like LibreOffice or similar - have
> > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > for instance converting commas into curly commas and adding non-breakable
> > spaces. All of those are meant to produce better results when the text is
> > displayed in HTML or PDF formats.
>
> And don't we render our documentation into HTML or PDF formats?

Yes.

> Are
> some of those non-breaking spaces not actually *useful* for their
> intended purpose?

No.

The thing is: non-breaking space can cause a lot of problems.

We even had to disable Sphinx's use of non-breaking spaces for
PDF output, as it was producing bad LaTeX/PDF results.

See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")

The aforementioned patch disables Sphinx's default behavior of
using NON-BREAKABLE SPACE on literal blocks and strings, using this
special setting: "parsedliteralwraps=true".

When NON-BREAKABLE SPACE was used on PDF outputs, several parts of
the media uAPI docs violated the document margins by far, causing
text to be truncated.

So, please **don't add NON-BREAKABLE SPACE**, unless you test
(and keep testing from time to time) that the output in all
formats properly supports it on different Sphinx versions.
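For context, the setting mentioned above is passed to the LaTeX builder through Sphinx's "sphinxsetup" mechanism in conf.py; a sketch of the relevant fragment (an assumption about how the kernel's Documentation/conf.py wires it up; the real option string may carry additional options):

```python
# conf.py sketch: pass parsedliteralwraps=true to the LaTeX builder so
# long lines in parsed-literal blocks wrap, instead of being glued
# together with NO-BREAK SPACEs that overflow the page margin.
latex_elements = {
    'sphinxsetup': 'parsedliteralwraps=true',
}
```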

-

Also, most of those came from conversion tools, together with other
eccentricities, like the usage of the U+FEFF (BOM) character at the
start of some documents. The remaining ones seem to have come from
cut-and-paste.

For instance, bibliographic references (there are a couple of
those on media) sometimes have a NON-BREAKABLE SPACE. I'm pretty
sure those came from cut-and-pasting the document titles
from the original PDF documents or web pages that are
referenced.

> > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > the documentation, it is better to stick to the ASCII subset on such
> > particular case, due to a couple of reasons:
> >
> > 1. it makes life easier for tools like grep;
>
> Barely, as noted, because of things like line feeds.

You can use grep with "-z" to search for multi-line strings(*), like:

$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
Documentation/RCU/Design/Data-Structures/Data-Structures.rst

(*) Unfortunately, while "git grep" also has a "-z" flag, it
seems that this is (currently?) broken with regard to handling multi-line matches:

$ git grep -Pzl 'grace period started,\s*then'
$
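As an aside, the stray characters this series removes can themselves be located with a plain single-line grep; a minimal sketch (GNU grep with -P support assumed; the sample file path is made up):

```shell
# Build a two-line sample: line 1 is pure ASCII, line 2 contains a
# NO-BREAK SPACE (UTF-8 bytes C2 A0) of the kind this series removes.
printf 'plain ascii line\nbad\xc2\xa0space\n' > /tmp/utf8-demo.txt

# -P enables Perl regexes; [^\x00-\x7F] matches any non-ASCII byte,
# so only line 2 is reported.
grep -nP '[^\x00-\x7F]' /tmp/utf8-demo.txt
```

The same pattern, pointed at Documentation/, is a quick way to re-check a tree after a cleanup like this one.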

> > 2. they easier to edit with the some commonly used text/source
> > code editors.
>
> That is nonsense. Any but the most broken and/or anachronistic
> environments and editors will be just fine.

Not really.

I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
on the US-intl keyboard settings, which allow me to type "'a" for á.
However, there's no shortcut for non-Latin code points, as far as I know.

So, if I needed to type a curly quote in the text editors I normally
use for development (vim, nano, kate), I would need to cut-and-paste
it from somewhere[1].

[1] If I have a table with UTF-8 codes handy, I could type the UTF-8
number manually... However, it seems that this is currently broken
at least on Fedora 33 (with Mate Desktop and US intl keyboard with
dead keys).

Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
tested it for *years*, as I didn't see any reason why I would
need to type UTF-8 characters by number until we started
this thread.

In practice, on the very rare occasions when I needed to write
non-Latin UTF-8 chars (maybe once a year or so, like when I
needed to use a Greek letter or some weird symbol), the chances
are high that I wouldn't remember its UTF-8 code.

So, if I need to spend time seeking a specific symbol, after
finding it, I just cut-and-paste it.

But even in the best-case scenario, where I know the UTF-8 code and
<CTRL><SHIFT>U works, if I wanted to use, for instance, curly
quotes, the keystroke sequence would be:

<CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d

That's a lot harder to type, and has a higher chance of
mistakenly adding a wrong symbol, than just typing:

"some string"

Knowing that both will produce *exactly* the same output, why
should I bother doing it the hard way?

-

Now, I'm not arguing that you can't use whatever UTF-8 symbol you
want on your docs. I'm just saying that, now that the conversion
is over and a lot of documents ended up getting some UTF-8 characters
by accident, it is time for a cleanup.

Thanks,
Mauro

2021-05-14 10:36:03

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
> On Wed, 12 May 2021 18:07:04 +0100,
> David Woodhouse <[email protected]> wrote:
>
> > On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > > Such conversion tools - plus some text editor like LibreOffice or similar - have
> > > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > > for instance converting commas into curly commas and adding non-breakable
> > > spaces. All of those are meant to produce better results when the text is
> > > displayed in HTML or PDF formats.
> >
> > And don't we render our documentation into HTML or PDF formats?
>
> Yes.
>
> > Are
> > some of those non-breaking spaces not actually *useful* for their
> > intended purpose?
>
> No.
>
> The thing is: non-breaking space can cause a lot of problems.
>
> We even had to disable Sphinx usage of non-breaking space for
> PDF outputs, as this was causing bad LaTeX/PDF outputs.
>
> See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")
>
> The afore mentioned patch disables Sphinx default behavior of
> using NON-BREAKABLE SPACE on literal blocks and strings, using this
> special setting: "parsedliteralwraps=true".
>
> When NON-BREAKABLE SPACE were used on PDF outputs, several parts of
> the media uAPI docs were violating the document margins by far,
> causing texts to be truncated.
>
> So, please **don't add NON-BREAKABLE SPACE**, unless you test
> (and keep testing it from time to time) if outputs on all
> formats are properly supporting it on different Sphinx versions.

And there you have a specific change with a specific fix. Nothing to do
with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to
do with the fact that, like *every* character in every kernel file
except the *binary* files, it's representable in UTF-8.

By all means fix the specific characters which are typographically
wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering
the documentation.


> Also, most of those came from conversion tools, together with other
> eccentricities, like the usage of U+FEFF (BOM) character at the
> start of some documents. The remaining ones seem to came from
> cut-and-paste.

... or which are just entirely redundant and gratuitous, like a BOM in
an environment where all files are UTF-8 and never 16-bit encodings
anyway.

> > > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > > the documentation, it is better to stick to the ASCII subset on such
> > > particular case, due to a couple of reasons:
> > >
> > > 1. it makes life easier for tools like grep;
> >
> > Barely, as noted, because of things like line feeds.
>
> You can use grep with "-z" to seek for multi-line strings(*), Like:
>
> $ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
> Documentation/RCU/Design/Data-Structures/Data-Structures.rst

Yeah, right. That works if you don't just use the text that you'll have
seen in the HTML/PDF "grace period started, then", and if you instead
craft a *regex* for it, replacing the spaces with '\s*'. Or is that
[[:space:]]* if you don't want to use the experimental Perl regex
feature?

$ grep -zlr 'grace[[:space:]]\+period[[:space:]]\+started,[[:space:]]\+then' Documentation/RCU
Documentation/RCU/Design/Data-Structures/Data-Structures.rst

And without '-l' it'll obviously just give you the whole file. No '-A5
-B5' to see the surroundings... it's hardly a useful thing, is it?

> (*) Unfortunately, while "git grep" also has a "-z" flag, it
> seems that this is (currently?) broken with regards of handling multilines:
>
> $ git grep -Pzl 'grace period started,\s*then'
> $

Even better. So no, multiline grep isn't really a commonly usable
feature at all.

This is why we prefer to put user-visible strings on one line in C
source code, even if it takes the lines over 80 characters — to allow
for grep to find them.

> > > 2. they easier to edit with the some commonly used text/source
> > > code editors.
> >
> > That is nonsense. Any but the most broken and/or anachronistic
> > environments and editors will be just fine.
>
> Not really.
>
> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
> on the US-intl keyboard settings, that allow me to type as "'a" for á.
> However, there's no shortcut for non-Latin UTF-codes, as far as I know.
>
> So, if would need to type a curly comma on the text editors I normally
> use for development (vim, nano, kate), I would need to cut-and-paste
> it from somewhere[1].

That's entirely irrelevant. You don't need to be able to *type* every
character that you see in front of you, as long as your editor will
render it correctly and perhaps let you cut/paste it as you're editing
the document if you're moving things around.

> [1] If I have a table with UTF-8 codes handy, I could type the UTF-8
> number manually... However, it seems that this is currently broken
> at least on Fedora 33 (with Mate Desktop and US intl keyboard with
> dead keys).
>
> Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
> test it for *years*, as I din't see any reason why I would
> need to type UTF-8 characters by numbers until we started
> this thread.

Please provide the bug number for this; I'd like to track it.

> But even in the best case scenario where I know the UTF-8 and
> <CTRL><SHIFT>U works, if I wanted to use, for instance, a curly
> comma, the keystroke sequence would be:
>
> <CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d
>
> That's a lot harder than typing and has a higher chances of
> mistakenly add a wrong symbol than just typing:
>
> "some string"
>
> Knowing that both will produce *exactly* the same output, why
> should I bother doing it the hard way?

Nobody's asked you to do it the "hard way". That's completely
irrelevant to the discussion we were having.

> Now, I'm not arguing that you can't use whatever UTF-8 symbol you
> want on your docs. I'm just saying that, now that the conversion
> is over and a lot of documents ended getting some UTF-8 characters
> by accident, it is time for a cleanup.

All text documents are *full* of UTF-8 characters. If there is a file
in the source code which has *any* non-UTF8, we call that a 'binary
file'.

Again, if you want to make specific fixes like removing non-breaking
spaces and byte order marks, with specific reasons, then those make
sense. But it's got very little to do with UTF-8 and how easy it is to
type them. And the excuse you've put in the commit comment for your
patches is utterly bogus.



2021-05-14 11:17:47

by Edward Cree

[permalink] [raw]
Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

> On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
>> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
>> on the US-intl keyboard settings, that allow me to type as "'a" for á.
>> However, there's no shortcut for non-Latin UTF-codes, as far as I know.
>>
>> So, if would need to type a curly comma on the text editors I normally
>> use for development (vim, nano, kate), I would need to cut-and-paste
>> it from somewhere

For anyone who doesn't know about it: X has this wonderful thing called
the Compose key[1]. For instance, type ⎄--- to get —, or ⎄<" for “.
Much more mnemonic than Unicode codepoints; and you can extend it with
user-defined sequences in your ~/.XCompose file.
(I assume Wayland supports all this too, but don't know the details.)
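For anyone wanting to try it, a user-defined sequence in ~/.XCompose looks roughly like this (a sketch; the include line pulls in the system defaults, and the mu sequence shown is an invented example):

```
# ~/.XCompose (sketch)
include "%L"            # keep the system compose table

# Invented sequence: Compose, m, u produces GREEK SMALL LETTER MU
<Multi_key> <m> <u> : "μ" U03BC
```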

On 14/05/2021 10:06, David Woodhouse wrote:
> Again, if you want to make specific fixes like removing non-breaking
> spaces and byte order marks, with specific reasons, then those make
> sense. But it's got very little to do with UTF-8 and how easy it is to
> type them. And the excuse you've put in the commit comment for your
> patches is utterly bogus.

+1

-ed

[1] https://en.wikipedia.org/wiki/Compose_key

2021-05-14 19:13:05

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

On Fri, 14 May 2021 12:08:36 +0100,
Edward Cree <[email protected]> wrote:

> For anyone who doesn't know about it: X has this wonderful thing called
> the Compose key[1]. For instance, type ⎄--- to get —, or ⎄<" for “.
> Much more mnemonic than Unicode codepoints; and you can extend it with
> user-defined sequences in your ~/.XCompose file.

Good tip. I haven't used Compose for years, as US-intl with dead keys is
enough for 99.999% of my needs.

Btw, at least on Fedora with Mate, the Compose key is disabled by default.
It has to be enabled first, using the same tool that allows changing the
keyboard layout[1].

Yet, typing an EN DASH for example would be "<compose>--.", which is 4
keystrokes instead of just two ('--'). It means twice the effort ;-)

[1] KDE, GNOME, Mate, ... have different ways to enable it and to
select which key is treated as <compose>:

https://dry.sailingissues.com/us-international-keyboard-layout.html
https://help.ubuntu.com/community/ComposeKey

Thanks,
Mauro

2021-05-15 11:29:36

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

On Fri, 14 May 2021 10:06:01 +0100,
David Woodhouse <[email protected]> wrote:

> On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
> > On Wed, 12 May 2021 18:07:04 +0100,
> > David Woodhouse <[email protected]> wrote:
> >
> > > On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > > > Such conversion tools - plus some text editor like LibreOffice or similar - have
> > > > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > > > for instance converting commas into curly commas and adding non-breakable
> > > > spaces. All of those are meant to produce better results when the text is
> > > > displayed in HTML or PDF formats.
> > >
> > > And don't we render our documentation into HTML or PDF formats?
> >
> > Yes.
> >
> > > Are
> > > some of those non-breaking spaces not actually *useful* for their
> > > intended purpose?
> >
> > No.
> >
> > The thing is: non-breaking space can cause a lot of problems.
> >
> > We even had to disable Sphinx usage of non-breaking space for
> > PDF outputs, as this was causing bad LaTeX/PDF outputs.
> >
> > See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")
> >
> > The afore mentioned patch disables Sphinx default behavior of
> > using NON-BREAKABLE SPACE on literal blocks and strings, using this
> > special setting: "parsedliteralwraps=true".
> >
> > When NON-BREAKABLE SPACE were used on PDF outputs, several parts of
> > the media uAPI docs were violating the document margins by far,
> > causing texts to be truncated.
> >
> > So, please **don't add NON-BREAKABLE SPACE**, unless you test
> > (and keep testing it from time to time) if outputs on all
> > formats are properly supporting it on different Sphinx versions.
>
> And there you have a specific change with a specific fix. Nothing to do
> with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to
> do with the fact that, like *every* character in every kernel file
> except the *binary* files, it's representable in UTF-8.
>
> By all means fix the specific characters which are typographically
> wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering
> the documentation.
>
>
> > Also, most of those came from conversion tools, together with other
> > eccentricities, like the usage of U+FEFF (BOM) character at the
> > start of some documents. The remaining ones seem to came from
> > cut-and-paste.
>
> ... or which are just entirely redundant and gratuitous, like a BOM in
> an environment where all files are UTF-8 and never 16-bit encodings
> anyway.

Agreed.

>
> > > > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > > > the documentation, it is better to stick to the ASCII subset on such
> > > > particular case, due to a couple of reasons:
> > > >
> > > > 1. it makes life easier for tools like grep;
> > >
> > > Barely, as noted, because of things like line feeds.
> >
> > You can use grep with "-z" to seek for multi-line strings(*), Like:
> >
> > $ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
> > Documentation/RCU/Design/Data-Structures/Data-Structures.rst
>
> Yeah, right. That works if you don't just use the text that you'll have
> seen in the HTML/PDF "grace period started, then", and if you instead
> craft a *regex* for it, replacing the spaces with '\s*'. Or is that
> [[:space:]]* if you don't want to use the experimental Perl regex
> feature?
>
> $ grep -zlr 'grace[[:space:]]\+period[[:space:]]\+started,[[:space:]]\+then' Documentation/RCU
> Documentation/RCU/Design/Data-Structures/Data-Structures.rst
>
> And without '-l' it'll obviously just give you the whole file. No '-A5
> -B5' to see the surroundings... it's hardly a useful thing, is it?
>
> > (*) Unfortunately, while "git grep" also has a "-z" flag, it
> > seems that this is (currently?) broken with regards of handling multilines:
> >
> > $ git grep -Pzl 'grace period started,\s*then'
> > $
>
> Even better. So no, multiline grep isn't really a commonly usable
> feature at all.
>
> This is why we prefer to put user-visible strings on one line in C
> source code, even if it takes the lines over 80 characters — to allow
> for grep to find them.

Makes sense, but in the case of documentation, this is a little more
complex than that.

Btw, the theme used by default when building html[1] has a search
box (written in JavaScript) that may be able to find multi-line
patterns, working somewhat like "git grep foo -a bar".

[1] https://github.com/readthedocs/sphinx_rtd_theme

> > [1] If I have a table with UTF-8 codes handy, I could type the UTF-8
> > number manually... However, it seems that this is currently broken
> > at least on Fedora 33 (with Mate Desktop and US intl keyboard with
> > dead keys).
> >
> > Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
> > test it for *years*, as I din't see any reason why I would
> > need to type UTF-8 characters by numbers until we started
> > this thread.
>
> Please provide the bug number for this; I'd like to track it.

Just opened a BZ and added you as c/c.

> > Now, I'm not arguing that you can't use whatever UTF-8 symbol you
> > want on your docs. I'm just saying that, now that the conversion
> > is over and a lot of documents ended getting some UTF-8 characters
> > by accident, it is time for a cleanup.
>
> All text documents are *full* of UTF-8 characters. If there is a file
> in the source code which has *any* non-UTF8, we call that a 'binary
> file'.
>
> Again, if you want to make specific fixes like removing non-breaking
> spaces and byte order marks, with specific reasons, then those make
> sense. But it's got very little to do with UTF-8 and how easy it is to
> type them. And the excuse you've put in the commit comment for your
> patches is utterly bogus.

Let's take one step back, in order to return to the intent of this
UTF-8 cleanup, as the discussions here are not centered on the patches,
but instead on what to do and why.

-

This discussion started originally at linux-doc ML.

While discussing an issue where a machine's locale was not set
to UTF-8 on a build VM, we discovered that some converted docs ended
up with BOM characters. Those specific changes were introduced by some
of my conversion patches, probably converted via pandoc.

So, I went ahead to check what other possibly weird things
were introduced by the conversion, where several scripts and tools
were used on files that already had a different markup.

I actually checked the current UTF-8 issues, and asked people at
linux-doc to comment on which of those are valid use cases, and which
should be replaced by plain ASCII.

Basically, the current situation (at docs/docs-next) for the
ReST files under Documentation/, excluding translations, is:

1. Spaces and BOM

- U+00a0 (' '): NO-BREAK SPACE
- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)

Based on the discussions there and on this thread, those should be
dropped, as the BOM is useless and NO-BREAK SPACE can cause problems
in the html/pdf output;

2. Symbols

- U+00a9 ('©'): COPYRIGHT SIGN
- U+00ac ('¬'): NOT SIGN
- U+00ae ('®'): REGISTERED SIGN
- U+00b0 ('°'): DEGREE SIGN
- U+00b1 ('±'): PLUS-MINUS SIGN
- U+00b2 ('²'): SUPERSCRIPT TWO
- U+00b5 ('µ'): MICRO SIGN
- U+03bc ('μ'): GREEK SMALL LETTER MU
- U+00b7 ('·'): MIDDLE DOT
- U+00bd ('½'): VULGAR FRACTION ONE HALF
- U+2122 ('™'): TRADE MARK SIGN
- U+2264 ('≤'): LESS-THAN OR EQUAL TO
- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
- U+2b0d ('⬍'): UP DOWN BLACK ARROW

Those seem OK on my eyes.

On a side note, both MICRO SIGN and GREEK SMALL LETTER MU are
used in several docs to represent microseconds, microvolts and
microamperes. If we write an orientation document, it probably
makes sense to recommend using MICRO SIGN in such cases.

3. Latin

- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
- U+00df ('ß'): LATIN SMALL LETTER SHARP S
- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
- U+00e6 ('æ'): LATIN SMALL LETTER AE
- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE

Those should be kept as well, as they're used for non-English names.

4. arrows and box drawing symbols:
- U+2191 ('↑'): UPWARDS ARROW
- U+2192 ('→'): RIGHTWARDS ARROW
- U+2193 ('↓'): DOWNWARDS ARROW

- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT

Also should be kept.

In summary, based on the discussions we have so far, I suspect that
there's not much to be discussed for the above cases.

So, I'll post a v3 of this series, changing only:

- U+00a0 (' '): NO-BREAK SPACE
- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
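Both characters are mechanical to remove; the BOM case, for instance, can be handled with a one-liner along these lines (a sketch; GNU sed assumed, demo file name invented, and LC_ALL=C makes sed operate on raw bytes):

```shell
# Create a file whose first three bytes are the UTF-8 BOM (EF BB BF).
printf '\xef\xbb\xbfSome document title\n' > /tmp/bom-demo.rst

# Strip the BOM from the first line, in place.
LC_ALL=C sed -i '1s/^\xef\xbb\xbf//' /tmp/bom-demo.rst
```

After the edit, the file starts directly with the text, which is what UTF-8 tooling expects anyway, since UTF-8 has no byte order to mark.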

---

Now, this specific patch series also addresses this extra case:

5. curly quotes:

- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK

IMO, those should be replaced by the ASCII quotes: ' and ".

The rationale is simple:

- most were introduced during the conversion from DocBook,
markdown and LaTeX;
- they don't add any extra value, as using "foo" or “foo” means
the same thing;
- Sphinx already uses "fancy" quotes in the output.
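For reference, the automatic substitution Sphinx applies is docutils' "smart quotes" transform, controlled by two conf.py options; a minimal sketch (option names as in Sphinx 1.6.6+; the values shown are the defaults):

```python
# conf.py sketch: smartquotes (on by default) turns plain "...", '...',
# --, --- and ... into typographic equivalents outside literal blocks.
smartquotes = True

# Which substitutions run: 'q' = quotes, 'D' = en/em dashes,
# 'e' = ellipses. The default enables all three.
smartquotes_action = 'qDe'
```

This is why keeping plain ASCII quotes in the source costs nothing in the rendered output.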

I guess I will put this in a separate series, as this is not a bug
fix, but just a cleanup from the conversion work.

I'll re-post those cleanups in a separate series, for patch-by-patch
review.

---

The remaining cases are future work, outside the scope of this v2:

6. Hyphen/Dashes and ellipsis

- U+2212 ('−'): MINUS SIGN
- U+00ad ('­'): SOFT HYPHEN
- U+2010 ('‐'): HYPHEN

Those three are used in places where a normal ASCII hyphen/minus
should be used instead. There are even a couple of C files which
use them instead of '-' in comments.

IMO, those are fixes/cleanups from conversions and bad cut-and-paste.

- U+2013 ('–'): EN DASH
- U+2014 ('—'): EM DASH
- U+2026 ('…'): HORIZONTAL ELLIPSIS

Those are auto-replaced by Sphinx from "--", "---" and "...",
respectively.

I guess those are a matter of personal preference about
whether to use ASCII or UTF-8.

My personal preference (and Ted seems to have a similar
opinion) is to let Sphinx do the conversion.

For those, I intend to post a separate series, to be
reviewed patch by patch, as this is really a matter
of personal taste. We'll hardly reach a consensus here.

7. math symbols:

- U+00d7 ('×'): MULTIPLICATION SIGN

This one is used mostly to describe video resolutions, but it appears
in a smaller changeset than the ones that use the letter "x".

- U+2217 ('∗'): ASTERISK OPERATOR

This is used only here:
Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.

Probably added by some conversion tool. IMO, this one should
also be replaced by an ASCII asterisk.
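A replacement like that can be sketched as a sed one-liner (GNU sed assumed; the sample file reproduces the ext4 line quoted above, and LC_ALL=C makes sed operate on raw bytes, E2 88 97 being the UTF-8 encoding of U+2217):

```shell
# Reproduce the offending line, then swap the ASTERISK OPERATOR for a
# plain ASCII '*'.
printf 'filesystem size to 2^21 \xe2\x88\x97 2^27 = 2^48 bytes\n' > /tmp/ast-demo.rst
LC_ALL=C sed -i 's/\xe2\x88\x97/*/g' /tmp/ast-demo.rst
```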

I guess I'll post a patch for the ASTERISK OPERATOR.

Thanks,
Mauro

2021-05-15 11:54:11

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

On Sat, 2021-05-15 at 10:22 +0200, Mauro Carvalho Chehab wrote:
> > > Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
> > > test it for *years*, as I din't see any reason why I would
> > > need to type UTF-8 characters by numbers until we started
> > > this thread.
> >
> > Please provide the bug number for this; I'd like to track it.
>
> Just opened a BZ and added you as c/c.

Thanks.

> Let's take one step back, in order to return to the intents of this
> UTF-8, as the discussions here are not centered into the patches, but
> instead, on what to do and why.
>
> -
>
> This discussion started originally at linux-doc ML.
>
> While discussing about an issue when machine's locale was not set
> to UTF-8 on a build VM,

Stop. Stop *right* there before you go any further.

The machine's locale should have *nothing* to do with anything.

When you view this email, it comes with a Content-Type: header which
explicitly tells you the character set that the message is encoded in,
which I think I've set to UTF-7.

When showing you the mail, your system has to interpret the bytes of
the content using *that* character set encoding. Anything else is just
fundamentally broken. Your system locale has *nothing* to do with it.

If your local system is running EBCDIC that doesn't *matter*.

Now, the character set encoding of the kernel source and documentation
text files is UTF-8. It isn't EBCDIC, it isn't ISO8859-15 or any of the
legacy crap. It isn't system locale either, unless your system locale
*happens* to be UTF-8.

UTF-8 *happens* to be compatible with ASCII for the limited subset of
characters which ASCII contains, sure — just as *many*, but not all, of
the legacy 8-bit character sets are also a superset of ASCII's 7 bits.

But if the docs contain *any* characters which aren't ASCII, and you
build them with a broken build system which assumes ASCII, you are
going to produce wrong output. There is *no* substitute for fixing the
*actual* bug which started all this, and ensuring your build system (or
whatever) uses the *actual* encoding of the text files it's processing,
instead of making stupid and bogus assumptions based on a system
default.

You concede keeping U+00a9 © COPYRIGHT SIGN. And that's encoded in UTF-
8 as two bytes 0xC2 0xA9. If some broken build system *assumes* those
bytes are ISO8859-15 it'll take them to mean two separate characters

U+00C2 Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
U+00A9 © COPYRIGHT SIGN

Your broken build system that started all this is never going to be
*anything* other than broken. You can only paper over the cracks and
make it slightly less likely that people will notice in the common
case, perhaps? That's all you do by *reducing* the use of non-ASCII,
unless you're going to drag us all the way back to the 1980s and
strictly limit us to pure ASCII, using the equivalent of trigraphs for
*anything* outside the 0-127 character ranges.

And even if you did that, systems which use EBCDIC as their local
encoding would *still* be broken, if they have the same bug you started
from. Because EBCDIC isn't compatible with ASCII *even* for the first 7
bits.


> we discovered that some converted docs ended
> with BOM characters. Those specific changes were introduced by some
> of my convert patches, probably converted via pandoc.
>
> So, I went ahead in order to check what other possible weird things
> were introduced by the conversion, where several scripts and tools
> were used on files that had already a different markup.
>
> I actually checked the current UTF-8 issues, and asked people at
> linux-doc to comment what of those are valid usecases, and what
> should be replaced by plain ASCII.

No, these aren't "UTF-8 issues". Those are *conversion* issues, and
would still be there if the output of the conversion had been UTF-7,
UCS-16, etc. Or *even* if the output of the conversion had been
trigraph-like stuff like '--' for emdash. It's *nothing* to do with the
encoding that we happen to be using.

Fixing the conversion issues makes a lot of sense. Try to do it without
making *any* mention of UTF-8 at all.

> In summary, based on the discussions we have so far, I suspect that
> there's not much to be discussed for the above cases.
>
> So, I'll post a v3 of this series, changing only:
>
> - U+00a0 (' '): NO-BREAK SPACE
> - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)

Ack, as long as those make *no* mention of UTF-8. Except perhaps to
note that BOM is redundant because UTF-8 doesn't have a byteorder.

> ---
>
> Now, this specific patch series address also this extra case:
>
> 5. curly commas:
>
> - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
> - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
> - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
> - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
>
> IMO, those should be replaced by ASCII commas: ' and ".
>
> The rationale is simple:
>
> - most were introduced during the conversion from Docbook,
> markdown and LaTex;
> - they don't add any extra value, as using "foo" of “foo” means
> the same thing;
> - Sphinx already use "fancy" commas at the output.
>
> I guess I will put this on a separate series, as this is not a bug
> fix, but just a cleanup from the conversion work.
>
> I'll re-post those cleanups on a separate series, for patch per patch
> review.

Makes sense.

The left/right quotation marks exist to make human-readable text much
easier to read, but the key point here is that they are redundant
because the tooling already emits them in the *output* so they don't
need to be in the source, yes?

As long as the tooling gets it *right* and uses them where it should,
that seems sane enough.

However, it *does* break 'grep', because if I cut/paste a snippet from
the documentation and try to grep for it, it'll no longer match.

Consistency is good, but perhaps we should actually be consistent the
other way round and always use the left/right versions in the source
*instead* of relying on the tooling, to make searches work better?
You claimed to care about that, right?

> The remaining cases are future work, outside the scope of this v2:
>
> 6. Hyphen/Dashes and ellipsis
>
> - U+2212 ('−'): MINUS SIGN
> - U+00ad ('­'): SOFT HYPHEN
> - U+2010 ('‐'): HYPHEN
>
> Those three are used on places where a normal ASCII hyphen/minus
> should be used instead. There are even a couple of C files which
> use them instead of '-' on comments.
+AD4
+AD4 IMO are fixes/cleanups from conversions and bad cut-and-paste.

That seems to make sense.

+AD4 - U+-2013 ('+IBM'): EN DASH
+AD4 - U+-2014 ('+IBQ'): EM DASH
+AD4 - U+-2026 ('+ICY'): HORIZONTAL ELLIPSIS
+AD4
+AD4 Those are auto-replaced by Sphinx from +ACI---+ACI, +ACI----+ACI and +ACI...+ACI,
+AD4 respectively.
+AD4
+AD4 I guess those are a matter of personal preference about
+AD4 weather using ASCII or UTF-8.
+AD4
+AD4 My personal preference (and Ted seems to have a similar
+AD4 opinion) is to let Sphinx do the conversion.
+AD4
+AD4 For those, I intend to post a separate series, to be
+AD4 reviewed patch per patch, as this is really a matter
+AD4 of personal taste. Hardly we'll reach a consensus here.
+AD4

Again using the trigraph-like '--' and '...' instead of just using the
plain text '+IBQ' and '+ICY' breaks searching, because what's in the output
doesn't match the input. Again consistency is good, but perhaps we
should standardise on just putting these in their plain text form
instead of the trigraphs?

+AD4 7. math symbols:
+AD4
+AD4 - U+-00d7 ('+ANc'): MULTIPLICATION SIGN
+AD4
+AD4 This one is used mostly do describe video resolutions, but this is
+AD4 on a smaller changeset than the ones that use +ACI-x+ACI letter.

I think standardising on +ANc for video resolutions in documentation would
make it look better and be easier to read.

+AD4
+AD4 - U+-2217 ('+Ihc'): ASTERISK OPERATOR
+AD4
+AD4 This is used only here:
+AD4 Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2+AF4-21 +Ihc 2+AF4-27 +AD0 2+AF4-48bytes or 256TiB.
+AD4
+AD4 Probably added by some conversion tool. IMO, this one should
+AD4 also be replaced by an ASCII asterisk.
+AD4
+AD4 I guess I'll post a patch for the ASTERISK OPERATOR.

That makes sense.



2021-05-15 12:46:45

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

On Sat, 15 May 2021 10:24:28 +0100,
David Woodhouse <[email protected]> wrote:

> On Sat, 2021-05-15 at 10:22 +0200, Mauro Carvalho Chehab wrote:
> > > > Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
> > > > tested it for *years*, as I didn't see any reason why I would
> > > > need to type UTF-8 characters by numbers until we started
> > > > this thread.
> > >
> > > Please provide the bug number for this; I'd like to track it.
> >
> > Just opened a BZ and added you as c/c.
>
> Thanks.
>
> > Let's take one step back, in order to return to the intent of this
> > UTF-8 discussion, as the discussions here are not centered on the
> > patches, but instead on what to do and why.
> >
> > -
> >
> > This discussion started originally at linux-doc ML.
> >
> > While discussing an issue where a machine's locale was not set
> > to UTF-8 on a build VM,
>
> Stop. Stop *right* there before you go any further.
>
> The machine's locale should have *nothing* to do with anything.
>
> When you view this email, it comes with a Content-Type: header which
> explicitly tells you the character set that the message is encoded in,
> which I think I've set to UTF-7.
>
> When showing you the mail, your system has to interpret the bytes of
> the content using *that* character set encoding. Anything else is just
> fundamentally broken. Your system locale has *nothing* to do with it.
>
> If your local system is running EBCDIC that doesn't *matter*.
>
> Now, the character set encoding of the kernel source and documentation
> text files is UTF-8. It isn't EBCDIC, it isn't ISO8859-15 or any of the
> legacy crap. It isn't system locale either, unless your system locale
> *happens* to be UTF-8.
>
> UTF-8 *happens* to be compatible with ASCII for the limited subset of
> characters which ASCII contains, sure — just as *many*, but not all, of
> the legacy 8-bit character sets are also a superset of ASCII's 7 bits.
>
> But if the docs contain *any* characters which aren't ASCII, and you
> build them with a broken build system which assumes ASCII, you are
> going to produce wrong output. There is *no* substitute for fixing the
> *actual* bug which started all this, and ensuring your build system (or
> whatever) uses the *actual* encoding of the text files it's processing,
> instead of making stupid and bogus assumptions based on a system
> default.
>
> You concede keeping U+00a9 © COPYRIGHT SIGN. And that's encoded in UTF-
> 8 as two bytes 0xC2 0xA9. If some broken build system *assumes* those
> bytes are ISO8859-15 it'll take them to mean two separate characters
>
> U+00C2 Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
> U+00A9 © COPYRIGHT SIGN
>
> Your broken build system that started all this is never going to be
> *anything* other than broken. You can only paper over the cracks and
> make it slightly less likely that people will notice in the common
> case, perhaps? That's all you do by *reducing* the use of non-ASCII,
> unless you're going to drag us all the way back to the 1980s and
> strictly limit us to pure ASCII, using the equivalent of trigraphs for
> *anything* outside the 0-127 character ranges.
>
> And even if you did that, systems which use EBCDIC as their local
> encoding would *still* be broken, if they have the same bug you started
> from. Because EBCDIC isn't compatible with ASCII *even* for the first 7
> bits.
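(The mis-decoding described above is easy to verify. A minimal sketch, purely illustrative, showing how the two UTF-8 bytes of the copyright sign become two characters when read as ISO8859-15:)

```python
# The COPYRIGHT SIGN U+00A9 is encoded as two bytes in UTF-8.
raw = "\u00a9".encode("utf-8")
print(raw.hex())                 # c2a9

# A broken build system assuming a legacy 8-bit charset sees two
# separate characters instead of one.
print(raw.decode("iso8859-15"))  # Â©
print(raw.decode("utf-8"))       # ©
```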

Now, you're making a lot of wrong assumptions here ;-)

1. I didn't report the bug. Another person reported it at linux-doc;
2. I fully agree with you that the build system should work fine
whatever locale the machine has;
3. The charset Sphinx supports for its ReST input and for its output is UTF-8.

Despite that, it seems that there are some issues in the build
tool set, at least under certain circumstances. One of the hypotheses
mentioned there is that the Sphinx logger crashes when it
tries to print a UTF-8 message when the machine's locale is not UTF-8.

That said, I tried forcing a non-UTF-8 locale on some tests I did to try
to reproduce it, but the build went fine.

So, I was not able to reproduce the issue.

This series doesn't address the issue. It is just a side effect of the
discussions, where, while trying to understand the bug, we noticed
several UTF-8 characters introduced during the conversion that weren't
the original author's intent.

So, with regards to the original bug report, if I find a way to
reproduce it and to address it, I'll post a separate series.

If you want to discuss this issue further, let's discuss it not here,
but instead at the linux-doc thread:

https://lore.kernel.org/linux-doc/[email protected]/

>
>
> > we discovered that some converted docs ended
> > with BOM characters. Those specific changes were introduced by some
> > of my convert patches, probably converted via pandoc.
> >
> > So, I went ahead in order to check what other possible weird things
> > were introduced by the conversion, where several scripts and tools
> > were used on files that had already a different markup.
> >
> > I actually checked the current UTF-8 issues, and asked people at
> > linux-doc to comment what of those are valid usecases, and what
> > should be replaced by plain ASCII.
>
> No, these aren't "UTF-8 issues". Those are *conversion* issues, and
> would still be there if the output of the conversion had been UTF-7,
> UCS-16, etc. Or *even* if the output of the conversion had been
> trigraph-like stuff like '--' for emdash. It's *nothing* to do with the
> encoding that we happen to be using.

Yes. That's what I said.

>
> Fixing the conversion issues makes a lot of sense. Try to do it without
> making *any* mention of UTF-8 at all.
>
> > In summary, based on the discussions we have so far, I suspect that
> > there's not much to be discussed for the above cases.
> >
> > So, I'll post a v3 of this series, changing only:
> >
> > - U+00a0 (' '): NO-BREAK SPACE
> > - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
>
> Ack, as long as those make *no* mention of UTF-8. Except perhaps to
> note that BOM is redundant because UTF-8 doesn't have a byteorder.

I need to say which characters are replaced, as otherwise the patch
wouldn't make much sense to reviewers: both U+00a0 and regular
whitespace are displayed the same way, and the BOM is invisible.
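(The invisibility problem can be shown directly. A small illustrative sketch that names the characters a reviewer cannot see:)

```python
import unicodedata

# All three render as "blank" (or as nothing at all) in most fonts,
# which is why the patch text has to spell out the code points.
for ch in ("\u0020", "\u00a0", "\ufeff"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```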

>
> > ---
> >
> > Now, this specific patch series also addresses this extra case:
> >
> > 5. curly commas:
> >
> > - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
> > - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
> > - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
> > - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
> >
> > IMO, those should be replaced by ASCII commas: ' and ".
> >
> > The rationale is simple:
> >
> > - most were introduced during the conversion from Docbook,
> > markdown and LaTex;
> > - they don't add any extra value, as using "foo" or “foo” means
> > the same thing;
> > - Sphinx already uses "fancy" commas at the output.
> >
> > I guess I will put this on a separate series, as this is not a bug
> > fix, but just a cleanup from the conversion work.
> >
> > I'll re-post those cleanups on a separate series, for patch per patch
> > review.
>
> Makes sense.
>
> The left/right quotation marks exist to make human-readable text much
> easier to read, but the key point here is that they are redundant
> because the tooling already emits them in the *output* so they don't
> need to be in the source, yes?

Yes.

> As long as the tooling gets it *right* and uses them where it should,
> that seems sane enough.
>
> However, it *does* break 'grep', because if I cut/paste a snippet from
> the documentation and try to grep for it, it'll no longer match.

>
> Consistency is good, but perhaps we should actually be consistent the
> other way round and always use the left/right versions in the source
> *instead* of relying on the tooling, to make searches work better?
> You claimed to care about that, right?

That's indeed a good point. It would be interesting to have more
opinions on that matter.

There are a couple of things to consider:

1. It is (usually) trivial to discover which document produced a
certain page of the documentation.

For instance, if you want to know where the text in this
page came from, or to grep for text from it:

https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

You can click the "View page source" button on the first line.
It will show the .rst file used to produce it:

https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v2.rst.txt

2. If all you want is to search for some text inside the docs,
you can use the "Search docs" box, which is part of the
Read the Docs theme.

3. The kernel has several extensions for Sphinx, in order to make life
easier for kernel developers:

Documentation/sphinx/automarkup.py
Documentation/sphinx/cdomain.py
Documentation/sphinx/kernel_abi.py
Documentation/sphinx/kernel_feat.py
Documentation/sphinx/kernel_include.py
Documentation/sphinx/kerneldoc.py
Documentation/sphinx/kernellog.py
Documentation/sphinx/kfigure.py
Documentation/sphinx/load_config.py
Documentation/sphinx/maintainers_include.py
Documentation/sphinx/rstFlatTable.py

Those (in particular automarkup and kerneldoc) will also dynamically
change things during ReST conversion, which may cause grep to not work.

4. Some PDF tools like evince will match curly quotes if you
type an ASCII quote in their search boxes.

5. Some developers prefer to only deal with the files inside the
kernel tree. Those are very unlikely to grep using curly quotes.

My opinion on that matter is that we should make life easier for
developers to grep on text files, as the ones using the web interface
are already served by the search box in HTML format or by tools like
evince.

So, my vote here is to keep quotes as plain ASCII.
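(The grep mismatch being discussed is easy to demonstrate. A minimal sketch; the sample string is made up:)

```python
# Text as Sphinx would emit it, with curly quotes in the *output*.
rendered = "the \u201ccgroup\u201d controller"

# A developer cut/pastes from the source, which uses ASCII quotes.
print('"cgroup"' in rendered)            # False: plain quotes don't match
print("\u201ccgroup\u201d" in rendered)  # True: only the curly form matches
```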

>
> > The remaining cases are future work, outside the scope of this v2:
> >
> > 6. Hyphen/Dashes and ellipsis
> >
> > - U+2212 ('−'): MINUS SIGN
> > - U+00ad ('­'): SOFT HYPHEN
> > - U+2010 ('‐'): HYPHEN
> >
> > Those three are used in places where a normal ASCII hyphen/minus
> > should be used instead. There are even a couple of C files which
> > use them instead of '-' in comments.
> >
> > IMO, those are fixes/cleanups from conversions and bad cut-and-paste.
>
> That seems to make sense.
>
> > - U+2013 ('–'): EN DASH
> > - U+2014 ('—'): EM DASH
> > - U+2026 ('…'): HORIZONTAL ELLIPSIS
> >
> > Those are auto-replaced by Sphinx from "--", "---" and "...",
> > respectively.
> >
> > I guess those are a matter of personal preference about
> > whether to use ASCII or UTF-8.
> >
> > My personal preference (and Ted seems to have a similar
> > opinion) is to let Sphinx do the conversion.
> >
> > For those, I intend to post a separate series, to be
> > reviewed patch per patch, as this is really a matter
> > of personal taste. We'll hardly reach a consensus here.
> >
>
> Again using the trigraph-like '--' and '...' instead of just using the
> plain text '—' and '…' breaks searching, because what's in the output
> doesn't match the input. Again consistency is good, but perhaps we
> should standardise on just putting these in their plain text form
> instead of the trigraphs?

Good point.

While I don't have any strong preferences here, there's something that
annoys me with regards to EM/EN DASH:

With the monospaced fonts I'm using here - both in my e-mailer and
on my terminals - EM and EN DASH are displayed *exactly*
the same.
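(Even when a font renders them identically, the code points still tell the dashes apart. A small illustrative sketch:)

```python
import unicodedata

# HYPHEN-MINUS, EN DASH and EM DASH may look alike in a terminal
# font, but they are distinct characters.
for ch in ("-", "\u2013", "\u2014"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```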

>
> > 7. math symbols:
> >
> > - U+00d7 ('×'): MULTIPLICATION SIGN
> >
> > This one is used mostly to describe video resolutions, but this is
> > a smaller changeset than the ones that use the "x" letter.
>
> I think standardising on × for video resolutions in documentation would
> make it look better and be easier to read.
>
> >
> > - U+2217 ('∗'): ASTERISK OPERATOR
> >
> > This is used only here:
> > Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
> >
> > Probably added by some conversion tool. IMO, this one should
> > also be replaced by an ASCII asterisk.
> >
> > I guess I'll post a patch for the ASTERISK OPERATOR.
>
> That makes sense.
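(If resolutions were standardised on ×, the existing "1920x1080"-style spellings could be converted mechanically. A hedged sketch; the sample strings are invented, and real documents would need manual review for false positives:)

```python
import re

# Replace an 'x' only when it sits between two digits, as in
# video-resolution notation; other uses of the letter are left alone.
def use_multiplication_sign(text: str) -> str:
    return re.sub(r"(?<=\d)x(?=\d)", "\u00d7", text)

print(use_multiplication_sign("supports 640x480 and 1920x1080"))
print(use_multiplication_sign("the x axis"))  # unchanged
```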



Thanks,
Mauro

2021-05-15 12:56:50

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

On Sat, 2021-05-15 at 13:23 +0200, Mauro Carvalho Chehab wrote:
> On Sat, 15 May 2021 10:24:28 +0100,
> David Woodhouse <[email protected]> wrote:
> > > Let's take one step back, in order to return to the intent of this
> > > UTF-8 discussion, as the discussions here are not centered on the
> > > patches, but instead on what to do and why.
> > >
> > > This discussion started originally at linux-doc ML.
> > >
> > > While discussing an issue where a machine's locale was not set
> > > to UTF-8 on a build VM,
> >
> > Stop. Stop *right* there before you go any further.
> >
> > The machine's locale should have *nothing* to do with anything.
>
> Now, you're making a lot of wrong assumptions here ;-)
>
> 1. I didn't report the bug. Another person reported it at linux-doc;
> 2. I fully agree with you that the build system should work fine
> whatever locale the machine has;
> 3. The charset Sphinx supports for its ReST input and for its output is UTF-8.

OK, fine. So that's an unrelated issue really, and just happened to be
what historically triggered the discussion. Let's set it aside.

> > > I actually checked the current UTF-8 issues …
> >
> > No, these aren't "UTF-8 issues". Those are *conversion* issues, and
> > … *nothing* to do with the encoding that we happen to be using.
>
> Yes. That's what I said.

Er… I'm fairly sure you *did* call them "UTF-8 issues". Whatever.




> >
> > Fixing the conversion issues makes a lot of sense. Try to do it without
> > making *any* mention of UTF-8 at all.
> >
> > > In summary, based on the discussions we have so far, I suspect that
> > > there's not much to be discussed for the above cases.
> > >
> > > So, I'll post a v3 of this series, changing only:
> > >
> > > - U+00a0 (' '): NO-BREAK SPACE
> > > - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
> >
> > Ack, as long as those make *no* mention of UTF-8. Except perhaps to
> > note that BOM is redundant because UTF-8 doesn't have a byteorder.
>
> I need to tell what UTF-8 codes are replaced, as otherwise the patch
> wouldn't make much sense to reviewers, as both U+00a0 and whitespaces
> are displayed the same way, and BOM is invisible.
>

No. Again, this is *nothing* to do with UTF-8. The encoding we choose
to map between bytes in the file and characters is *utterly* irrelevant
here. If we were using UTF-7, UTF-16, or even (in the case of non-
breaking space) one of the legacy 8-bit charsets that includes it like
ISO8859-1, the issue would be precisely the same.

It's about the *character* U+00A0 NO-BREAK SPACE; nothing to do with
UTF-8 at all. Don't mention UTF-8. It's *irrelevant* and just shows
that you can't actually be bothered to stop and do any critical thinking
about the matter at all.

As I said, the only time that it makes sense to mention UTF-8 in this
context is when talking about *why* the BOM is not needed. And even
then, you could say "because we *aren't* using an encoding where
endianness matters, such as UTF-16", instead of actually mentioning
UTF-8. Try it ☺
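(The byteorder point can be made concrete. A small illustrative sketch:)

```python
# U+03A9 GREEK CAPITAL LETTER OMEGA under byte-order-sensitive
# encodings, versus UTF-8, which has exactly one byte sequence.
print("\u03a9".encode("utf-16-be").hex())  # 03a9
print("\u03a9".encode("utf-16-le").hex())  # a903  (same character, bytes swapped)
print("\u03a9".encode("utf-8").hex())      # cea9  (no ambiguity, so no BOM needed)
```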

> >
> > > ---
> > >
> > > Now, this specific patch series also addresses this extra case:
> > >
> > > 5. curly commas:
> > >
> > > - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
> > > - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
> > > - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
> > > - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
> > >
> > > IMO, those should be replaced by ASCII commas: ' and ".
> > >
> > > The rationale is simple:
> > >
> > > - most were introduced during the conversion from Docbook,
> > > markdown and LaTex;
> > > - they don't add any extra value, as using "foo" or “foo” means
> > > the same thing;
> > > - Sphinx already uses "fancy" commas at the output.
> > >
> > > I guess I will put this on a separate series, as this is not a bug
> > > fix, but just a cleanup from the conversion work.
> > >
> > > I'll re-post those cleanups on a separate series, for patch per patch
> > > review.
> >
> > Makes sense.
> >
> > The left/right quotation marks exist to make human-readable text much
> > easier to read, but the key point here is that they are redundant
> > because the tooling already emits them in the *output* so they don't
> > need to be in the source, yes?
>
> Yes.
>
> > As long as the tooling gets it *right* and uses them where it should,
> > that seems sane enough.
> >
> > However, it *does* break 'grep', because if I cut/paste a snippet from
> > the documentation and try to grep for it, it'll no longer match.
> >
> > Consistency is good, but perhaps we should actually be consistent the
> > other way round and always use the left/right versions in the source
> > *instead* of relying on the tooling, to make searches work better?
> > You claimed to care about that, right?
>
> That's indeed a good point. It would be interesting to have more
> opinions with that matter.
>
> There are a couple of things to consider:
>
> 1. It is (usually) trivial to discover what document produced a
> certain page at the documentation.
>
> For instance, if you want to know where the text under this
> file came from, or to grep a text from it:
>
> https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
>
> You can click at the "View page source" button at the first line.
> It will show the .rst file used to produce it:
>
> https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v2.rst.txt
>
> 2. If all you want is to search for a text inside the docs,
> you can click at the "Search docs" box, which is part of the
> Read the Docs theme.
>
> 3. Kernel has several extensions for Sphinx, in order to make life
> easier for Kernel developers:
>
> Documentation/sphinx/automarkup.py
> Documentation/sphinx/cdomain.py
> Documentation/sphinx/kernel_abi.py
> Documentation/sphinx/kernel_feat.py
> Documentation/sphinx/kernel_include.py
> Documentation/sphinx/kerneldoc.py
> Documentation/sphinx/kernellog.py
> Documentation/sphinx/kfigure.py
> Documentation/sphinx/load_config.py
> Documentation/sphinx/maintainers_include.py
> Documentation/sphinx/rstFlatTable.py
>
> Those (in particular automarkup and kerneldoc) will also dynamically
> change things during ReST conversion, which may cause grep to not work.
>
> 4. Some PDF tools like evince will match curly quotes if you
> type an ASCII quote in their search boxes.
>
> 5. Some developers prefer to only deal with the files inside the
> kernel tree. Those are very unlikely to grep using curly quotes.
>
> My opinion on that matter is that we should make life easier for
> developers to grep on text files, as the ones using the web interface
> are already served by the search box in HTML format or by tools like
> evince.
>
> So, my vote here is to keep quotes as plain ASCII.

OK, but all your reasoning is about the *character* used, not the
encoding. So try to do it without mentioning ASCII, and especially
without mentioning UTF-8.

Your point is that the *character* is the one easily reachable on
standard keyboard layouts, and the one which people are most likely to
enter manually. It has *nothing* to do with charset encodings, so don't
conflate it with talking about charset encodings.

>
> >
> > > The remaining cases are future work, outside the scope of this v2:
> > >
> > > 6. Hyphen/Dashes and ellipsis
> > >
> > > - U+2212 ('−'): MINUS SIGN
> > > - U+00ad ('­'): SOFT HYPHEN
> > > - U+2010 ('‐'): HYPHEN
> > >
> > > Those three are used in places where a normal ASCII hyphen/minus
> > > should be used instead. There are even a couple of C files which
> > > use them instead of '-' in comments.
> > >
> > > IMO, those are fixes/cleanups from conversions and bad cut-and-paste.
> >
> > That seems to make sense.
> >
> > > - U+2013 ('–'): EN DASH
> > > - U+2014 ('—'): EM DASH
> > > - U+2026 ('…'): HORIZONTAL ELLIPSIS
> > >
> > > Those are auto-replaced by Sphinx from "--", "---" and "...",
> > > respectively.
> > >
> > > I guess those are a matter of personal preference about
> > > whether to use ASCII or UTF-8.
> > >
> > > My personal preference (and Ted seems to have a similar
> > > opinion) is to let Sphinx do the conversion.
> > >
> > > For those, I intend to post a separate series, to be
> > > reviewed patch per patch, as this is really a matter
> > > of personal taste. We'll hardly reach a consensus here.
> > >
> >
> > Again using the trigraph-like '--' and '...' instead of just using the
> > plain text '—' and '…' breaks searching, because what's in the output
> > doesn't match the input. Again consistency is good, but perhaps we
> > should standardise on just putting these in their plain text form
> > instead of the trigraphs?
>
> Good point.
>
> While I don't have any strong preferences here, there's something that
> annoys me with regards to EM/EN DASH:
>
> With the monospaced fonts I'm using here - both in my e-mailer and
> on my terminals - EM and EN DASH are displayed *exactly*
> the same.

Interesting. They definitely show differently in my terminal, and in
the monospaced font in email.

