2006-02-17 23:34:56

by Adam Tlałka

[permalink] [raw]
Subject: [PATCH]console:UTF-8 mode compatibility fixes


This patch applies to 2.6.15.3 kernel sources to drivers/char/vt.c file.
It should work with other versions too.

Changed console behaviour so in UTF-8 mode vt100 alternate character
sequences work as described in terminfo/termcap linux terminal definition.
Programs can use vt100 control seqences - smacs, rmacs and acsc characters
in UTF-8 mode in the same way as in normal mode so one definition is always
valid - current behaviour make these seqences not working in UTF-8 mode.

Added reporting malformed UTF-8 seqences as replacement glyphs.
I think that terminal should always display something rather then ignoring
these kind of data as it does now. Also it sticks to Unicode standards
saying that every wrong byte should be reported. It is more human readable
too in case of Latin subsets including ASCII chars.

Signed-off-by: Adam Tla/lka <[email protected]>

---

--- drivers/char/vt_orig.c 2006-02-13 11:33:54.000000000 +0100
+++ drivers/char/vt.c 2006-02-17 23:05:50.000000000 +0100
@@ -63,6 +63,12 @@
*
* Removed console_lock, enabled interrupts across all console operations
* 13 March 2001, Andrew Morton
+ *
+ * Fixed UTF-8 mode so alternate charset modes always work without need
+ * of different linux terminal definition for normal and UTF-8 modes
+ * preserving backward US-ASCII and VT100 semigraphics compatibility,
+ * malformed UTF sequences represented as sequences of replacement glyphs
+ * by Adam Tla/lka <[email protected]>, Feb 2006
*/

#include <linux/module.h>
@@ -1991,17 +1997,23 @@
/* Do no translation at all in control states */
if (vc->vc_state != ESnormal) {
tc = c;
- } else if (vc->vc_utf) {
+ } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
/* Combine UTF-8 into Unicode */
- /* Incomplete characters silently ignored */
+ /* Malformed sequence represented as replacement glyphs */
+rescan_last_byte:
if(c > 0x7f) {
- if (vc->vc_utf_count > 0 && (c & 0xc0) == 0x80) {
- vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
- vc->vc_utf_count--;
- if (vc->vc_utf_count == 0)
- tc = c = vc->vc_utf_char;
- else continue;
+ if (vc->vc_utf_count) {
+ if ((c & 0xc0) == 0x80) {
+ vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
+ if (--vc->vc_utf_count) {
+ vc->vc_npar++;
+ continue;
+ }
+ tc = c = vc->vc_utf_char;
+ } else
+ goto insert_replacement_glyph;
} else {
+ vc->vc_npar = 0;
if ((c & 0xe0) == 0xc0) {
vc->vc_utf_count = 1;
vc->vc_utf_char = (c & 0x1f);
@@ -2018,12 +2030,13 @@
vc->vc_utf_count = 5;
vc->vc_utf_char = (c & 0x01);
} else
- vc->vc_utf_count = 0;
+ goto insert_replacement_glyph;
continue;
}
} else {
+ if (vc->vc_utf_count)
+ goto insert_replacement_glyph;
tc = c;
- vc->vc_utf_count = 0;
}
} else { /* no utf */
tc = vc->vc_translate[vc->vc_toggle_meta ? (c | 0x80) : c];
@@ -2040,8 +2053,8 @@
* direct-to-font zone in UTF-8 mode.
*/
ok = tc && (c >= 32 ||
- (!vc->vc_utf && !(((vc->vc_disp_ctrl ? CTRL_ALWAYS
- : CTRL_ACTION) >> c) & 1)))
+ !(vc->vc_disp_ctrl ? (CTRL_ALWAYS >> c) & 1 :
+ vc->vc_utf || ((CTRL_ACTION >> c) & 1)))
&& (c != 127 || vc->vc_disp_ctrl)
&& (c != 128+27);

@@ -2051,6 +2064,7 @@
if ( tc == -4 ) {
/* If we got -4 (not found) then see if we have
defined a replacement character (U+FFFD) */
+insert_replacement_glyph:
tc = conv_uni_to_pc(vc, 0xfffd);

/* One reason for the -4 can be that we just
@@ -2062,9 +2076,19 @@
/* Bad hash table -- hope for the best */
tc = c;
}
- if (tc & ~charmask)
+ if (tc & ~charmask) {
+ /* no replacement glyph */
+ if (vc->vc_utf_count) {
+ vc->vc_utf_count = 0;
+ if (vc->vc_npar) {
+ c = orig;
+ goto rescan_last_byte;
+ }
+ }
continue; /* Conversion failed */
+ }

+repeat_replacement_glyph:
if (vc->vc_need_wrap || vc->vc_decim)
FLUSH
if (vc->vc_need_wrap) {
@@ -2088,6 +2112,15 @@
vc->vc_x++;
draw_to = (vc->vc_pos += 2);
}
+ if (vc->vc_utf_count) {
+ if (vc->vc_npar) {
+ vc->vc_npar--;
+ goto repeat_replacement_glyph;
+ }
+ vc->vc_utf_count = 0;
+ c = orig;
+ goto rescan_last_byte;
+ }
continue;
}
FLUSH


2006-02-18 11:01:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

Adam Tla/lka <[email protected]> wrote:
>
>
> This patch applies to 2.6.15.3 kernel sources to drivers/char/vt.c file.
> It should work with other versions too.
>
> Changed console behaviour so in UTF-8 mode vt100 alternate character
> sequences work as described in terminfo/termcap linux terminal definition.
> Programs can use vt100 control seqences - smacs, rmacs and acsc characters
> in UTF-8 mode in the same way as in normal mode so one definition is always
> valid - current behaviour make these seqences not working in UTF-8 mode.
>
> Added reporting malformed UTF-8 seqences as replacement glyphs.
> I think that terminal should always display something rather then ignoring
> these kind of data as it does now. Also it sticks to Unicode standards
> saying that every wrong byte should be reported. It is more human readable
> too in case of Latin subsets including ASCII chars.
>
> ...
>
> - } else if (vc->vc_utf) {
> + } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
> /* Combine UTF-8 into Unicode */
> - /* Incomplete characters silently ignored */
> + /* Malformed sequence represented as replacement glyphs */
> +rescan_last_byte:
> if(c > 0x7f) {
>
> ...
>
> + if (vc->vc_npar) {
> + c = orig;
> + goto rescan_last_byte;
> + }
>
> ...
>
> + }
> + vc->vc_utf_count = 0;
> + c = orig;
> + goto rescan_last_byte;
> + }
> continue;
> }

I spent some time trying to work out why this cannot cause an infinite loop
and gave up. Can you explain?

2006-02-18 14:17:38

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

[sorry for repost, the first attempt got blocked due to html attachment,
now I gzipped it to circumvent the filter]

Adam Tla/lka wrote:
> This patch applies to 2.6.15.3 kernel sources to drivers/char/vt.c file.
> It should work with other versions too.
>
> Changed console behaviour so in UTF-8 mode vt100 alternate character
> sequences work as described in terminfo/termcap linux terminal definition.
> Programs can use vt100 control seqences - smacs, rmacs and acsc characters
> in UTF-8 mode in the same way as in normal mode so one definition is always
> valid - current behaviour make these seqences not working in UTF-8 mode.

Doesn't work here with linux-2.6.16-rc3-mm1, ncurses-5.5. BTW has this
been discussed with Thomas Dickey (ncurses maintainer)?

> Added reporting malformed UTF-8 seqences as replacement glyphs.

Works.

> I think that terminal should always display something rather then ignoring
> these kind of data as it does now. Also it sticks to Unicode standards
> saying that every wrong byte should be reported. It is more human readable
> too in case of Latin subsets including ASCII chars.

Another feature request / bug report (spotted while viewing in Lynx a
page containing English text and a few Chinese characters, artificial
testcase attached):

If ncurses attempt to add some Chinese character to the Linux text
screen, Linux (correctly) prints this replacement character and advances
the cursor by one position. Ncurses think that the cursor has moved two
positions forward. The effect is that when you view the testcase in Lynx
(compiled --with-screen=ncursesw) on Linux console and press PageDown,
the fourth line contains "Thek" instead of "The" in the end.

This disagreement has to be solved somehow.

UTF-8 input issues (fixed by a patch originally from
http://chris.heathens.co.nz/linux/utf8.html) are also outstanding.

> Signed-off-by: Adam Tla/lka <[email protected]>

Not adding my own signed-off-by line here because the acs part doesn't work.

> --- drivers/char/vt_orig.c 2006-02-13 11:33:54.000000000 +0100
> +++ drivers/char/vt.c 2006-02-17 23:05:50.000000000 +0100
> @@ -63,6 +63,12 @@
> *
> * Removed console_lock, enabled interrupts across all console operations
> * 13 March 2001, Andrew Morton
> + *
> + * Fixed UTF-8 mode so alternate charset modes always work without need
> + * of different linux terminal definition for normal and UTF-8 modes
> + * preserving backward US-ASCII and VT100 semigraphics compatibility,
> + * malformed UTF sequences represented as sequences of replacement glyphs
> + * by Adam Tla/lka <[email protected]>, Feb 2006
> */
>
> #include <linux/module.h>
> @@ -1991,17 +1997,23 @@
> /* Do no translation at all in control states */
> if (vc->vc_state != ESnormal) {
> tc = c;
> - } else if (vc->vc_utf) {
> + } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
> /* Combine UTF-8 into Unicode */
> - /* Incomplete characters silently ignored */
> + /* Malformed sequence represented as replacement glyphs */
> +rescan_last_byte:
> if(c > 0x7f) {
> - if (vc->vc_utf_count > 0 && (c & 0xc0) == 0x80) {
> - vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
> - vc->vc_utf_count--;
> - if (vc->vc_utf_count == 0)
> - tc = c = vc->vc_utf_char;
> - else continue;
> + if (vc->vc_utf_count) {
> + if ((c & 0xc0) == 0x80) {
> + vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
> + if (--vc->vc_utf_count) {
> + vc->vc_npar++;
> + continue;
> + }
> + tc = c = vc->vc_utf_char;
> + } else
> + goto insert_replacement_glyph;
> } else {
> + vc->vc_npar = 0;
> if ((c & 0xe0) == 0xc0) {
> vc->vc_utf_count = 1;
> vc->vc_utf_char = (c & 0x1f);
> @@ -2018,12 +2030,13 @@
> vc->vc_utf_count = 5;
> vc->vc_utf_char = (c & 0x01);
> } else
> - vc->vc_utf_count = 0;
> + goto insert_replacement_glyph;
> continue;
> }
> } else {
> + if (vc->vc_utf_count)
> + goto insert_replacement_glyph;
> tc = c;
> - vc->vc_utf_count = 0;
> }
> } else { /* no utf */
> tc = vc->vc_translate[vc->vc_toggle_meta ? (c | 0x80) : c];
> @@ -2040,8 +2053,8 @@
> * direct-to-font zone in UTF-8 mode.
> */
> ok = tc && (c >= 32 ||
> - (!vc->vc_utf && !(((vc->vc_disp_ctrl ? CTRL_ALWAYS
> - : CTRL_ACTION) >> c) & 1)))
> + !(vc->vc_disp_ctrl ? (CTRL_ALWAYS >> c) & 1 :
> + vc->vc_utf || ((CTRL_ACTION >> c) & 1)))
> && (c != 127 || vc->vc_disp_ctrl)
> && (c != 128+27);
>
> @@ -2051,6 +2064,7 @@
> if ( tc == -4 ) {
> /* If we got -4 (not found) then see if we have
> defined a replacement character (U+FFFD) */
> +insert_replacement_glyph:
> tc = conv_uni_to_pc(vc, 0xfffd);
>
> /* One reason for the -4 can be that we just
> @@ -2062,9 +2076,19 @@
> /* Bad hash table -- hope for the best */
> tc = c;
> }
> - if (tc & ~charmask)
> + if (tc & ~charmask) {
> + /* no replacement glyph */
> + if (vc->vc_utf_count) {
> + vc->vc_utf_count = 0;
> + if (vc->vc_npar) {
> + c = orig;
> + goto rescan_last_byte;
> + }
> + }
> continue; /* Conversion failed */
> + }
>
> +repeat_replacement_glyph:
> if (vc->vc_need_wrap || vc->vc_decim)
> FLUSH
> if (vc->vc_need_wrap) {
> @@ -2088,6 +2112,15 @@
> vc->vc_x++;
> draw_to = (vc->vc_pos += 2);
> }
> + if (vc->vc_utf_count) {
> + if (vc->vc_npar) {
> + vc->vc_npar--;
> + goto repeat_replacement_glyph;
> + }
> + vc->vc_utf_count = 0;
> + c = orig;
> + goto rescan_last_byte;
> + }
> continue;
> }
> FLUSH

--
Alexander E. Patrakov


Attachments:
test.html.gz (238.00 B)

2006-02-18 14:38:12

by Adam Tlałka

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

Użytkownik Alexander E. Patrakov napisał:
> Adam Tla/lka wrote:
>
>> This patch applies to 2.6.15.3 kernel sources to drivers/char/vt.c file.
>> It should work with other versions too.
>>
>> Changed console behaviour so in UTF-8 mode vt100 alternate character
>> sequences work as described in terminfo/termcap linux terminal
>> definition.
>> Programs can use vt100 control seqences - smacs, rmacs and acsc
>> characters
>> in UTF-8 mode in the same way as in normal mode so one definition is
>> always
>> valid - current behaviour make these seqences not working in UTF-8 mode.
>
>
> Doesn't work here with linux-2.6.16-rc3-mm1, ncurses-5.5. BTW has this
> been discussed with Thomas Dickey (ncurses maintainer)?
>

Hmm, I don't know how current ncurses treat terminfo definition in UTF-8
mode and what linux terminal definition you are using in you test
system. There are different definitions floating around in different
distributions which are slightly different especially as we talk about
acsc chars and enacs, smacs and rmacs sequences. Which is also wrong
approach so there should be one proper definition of the linux console.
Maybe kernel developers should prepare some most compatible and
acceptable one.
I can post the one I am using today:

# Reconstructed via infocmp from file: /etc/terminfo/l/linux
linux|linux console,
am, bce, ccc, eo, mir, msgr, xenl, xon,
colors#8, it#8, ncv#18, pairs#64,

acsc=++\,\,--..00__``aaffgbhhjjkkllmmnnooqqssttuuvvwwxxyyzz{{||}c~~,
bel=^G, blink=\E[5m, bold=\E[1m, civis=\E[?25l\E[?1c,
clear=\E[H\E[J, cnorm=\E[?25h\E[?0c, cr=^M,
csr=\E[%i%p1%d;%p2%dr, cub1=^H, cud1=^J, cuf1=\E[C,
cup=\E[%i%p1%d;%p2%dH, cuu1=\E[A, cvvis=\E[?25h\E[?8c,
dch=\E[%p1%dP, dch1=\E[P, dim=\E[2m, dl=\E[%p1%dM,
dl1=\E[M, ech=\E[%p1%dX, ed=\E[J, el=\E[K, el1=\E[1K,
enacs=\E)0, flash=\E[?5h\E[?5l$<200/>, home=\E[H,
hpa=\E[%i%p1%dG, ht=^I, hts=\EH, il=\E[%p1%dL, il1=\E[L,
ind=^J,

initc=\E]P%p1%x%p2%{256}%*%{1000}%/%02x%p3%{256}%*%{1000}%/%02x%p4%{256}%*%{1000}%/%02x,
invis=\E[8m, kb2=\E[G, kbs=\177, kcbt=\E[Z, kcub1=\E[D,
kcud1=\E[B, kcuf1=\E[C, kcuu1=\E[A, kdch1=\E[3~,
kend=\E[4~, kf1=\E[[A, kf10=\E[21~, kf11=\E[23~,
kf12=\E[24~, kf13=\E[25~, kf14=\E[26~, kf15=\E[28~,
kf16=\E[29~, kf17=\E[31~, kf18=\E[32~, kf19=\E[33~,
kf2=\E[[B, kf20=\E[34~, kf3=\E[[C, kf4=\E[[D, kf5=\E[[E,
kf6=\E[17~, kf7=\E[18~, kf8=\E[19~, kf9=\E[20~,
khome=\E[1~, kich1=\E[2~, kmous=\E[M, knp=\E[6~, kpp=\E[5~,
kspd=^Z, nel=^M^J, oc=\E]R, op=\E[39;49m, rc=\E8, rev=\E[7m,
ri=\EM, rmacs=^O, rmam=\E[?7l, rmir=\E[4l, rmpch=\E[10m,
rmso=\E[27m, rmul=\E[24m, rs1=\Ec\E]R, sc=\E7,
setab=\E[4%p1%dm, setaf=\E[3%p1%dm,

sgr=\E[0;10%?%p1%t;7%;%?%p2%t;4%;%?%p3%t;7%;%?%p4%t;5%;%?%p5%t;2%;%?%p6%t;1%;%?%p7%t;8%;m%?%p9%t\016%e\017%;,
sgr0=\E[0;10m, smacs=^N, smam=\E[?7h, smir=\E[4h,
smpch=\E[11m, smso=\E[7m, smul=\E[4m, tbc=\E[3g,
u6=\E[%i%d;%dR, u7=\E[6n, u8=\E[?6c, u9=\E[c,
vpa=\E[%i%p1%dd,

If using this definition and my patch applied acsc mode is not working
then there is something wrong in curses code IMHO.
I am using Ubuntu distro and with this patch and this definition
aptitude properly displays semigraphics in UTF-8 mode and Midnight
Commander also works almost correctly (almost because it is not properly
ported - it displays help using iso-8859-2 help file without conversion
to UTF-8 so I can see replacement glyphs as it should be but in options
and menu it uses UTF-8 but incorrectly counts bytes instedad of glyphs
so screen is not looking properly if there is a multibyte UTF-8 seqence
used in a line; anyway it's MC bad code not terminal fault).

Regards
--
Adam Tlałka mailto:[email protected] ^v^ ^v^ ^v^
System & Network Administration Group - - - ~~~~~~
Computer Center, Gdańsk University of Technology, Poland
PGP public key: finger [email protected]

2006-02-18 16:02:16

by Adam Tlałka

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

Użytkownik Andrew Morton napisał:

> Adam Tla/lka <[email protected]> wrote:
>
>>
>>This patch applies to 2.6.15.3 kernel sources to drivers/char/vt.c file.
>>It should work with other versions too.
>>
>>Changed console behaviour so in UTF-8 mode vt100 alternate character
>>sequences work as described in terminfo/termcap linux terminal definition.
>>Programs can use vt100 control seqences - smacs, rmacs and acsc characters
>>in UTF-8 mode in the same way as in normal mode so one definition is always
>>valid - current behaviour make these seqences not working in UTF-8 mode.
>>
>>Added reporting malformed UTF-8 seqences as replacement glyphs.
>>I think that terminal should always display something rather then ignoring
>>these kind of data as it does now. Also it sticks to Unicode standards
>>saying that every wrong byte should be reported. It is more human readable
>>too in case of Latin subsets including ASCII chars.
>>
>>...
>>
>>- } else if (vc->vc_utf) {
>>+ } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
>> /* Combine UTF-8 into Unicode */
>>- /* Incomplete characters silently ignored */
>>+ /* Malformed sequence represented as replacement glyphs */
>>+rescan_last_byte:
>> if(c > 0x7f) {
>>
>>...
>>
>>+ if (vc->vc_npar) {
>>+ c = orig;
>>+ goto rescan_last_byte;
>>+ }
>>
>>...
>>
>>+ }
>>+ vc->vc_utf_count = 0;
>>+ c = orig;
>>+ goto rescan_last_byte;
>>+ }
>> continue;
>> }
>
>
> I spent some time trying to work out why this cannot cause an infinite loop
> and gave up. Can you explain?

1. this code is executed only if vc_utf_count != 0
which means uncompleted UTF-8 sequence, because in case of proper UTF-8
sequence or normal mode vc_utf_count == 0 in these places of code.

2. vc_npar is not used while completing UTF-seqence so I used it as a
counter of scanned sequence continuation bytes, it is set to 0 if begin
of UTF-8 seqence is detected and vc_utf_count set to number of
continuation bytes

3. when you can't display replacement glyph bad sequence is ignored as
previously so vc_utf_count and vc_npar is zeroed in case of malformed
UTF-8 seqence and there is no loop - anyway replacement glyph
should always be defined IMHO or I must change this code because
it seems not to be correct to use c as tc as a last resort because in
this case c means byte value which malformed scanned seqence so it is
not valuable for us. Maybe the better way is to use "?" char as a last
resort instead of c value. What do you think?
Maybe I should remember all bytes of the UTF-sequence to use their
values as a last resort char in case of malformed sequence and 0xfffd
not defined?

Regards
--
Adam Tlałka mailto:[email protected] ^v^ ^v^ ^v^
System & Network Administration Group - - - ~~~~~~
Computer Center, Gdańsk University of Technology, Poland
PGP public key: finger [email protected]

2006-02-18 22:36:34

by Adam Tlałka

[permalink] [raw]
Subject: [PATCH]console:UTF-8 mode compatibility fixes


This patch applies to 2.6.15.3 kernel sources to drivers/char/vt.c file.
It should work with other versions too.

Changed console behaviour so in UTF-8 mode vt100 alternate character
sequences work as described in terminfo/termcap linux terminal definition.
Programs can use vt100 control seqences - smacs, rmacs and acsc characters
in UTF-8 mode in the same way as in normal mode so one definition is always
valid - current behaviour make these seqences not working in UTF-8 mode.

Added reporting malformed UTF-8 seqences as replacement glyphs.
I think that terminal should always display something rather then ignoring
these kind of data as it does now. Also it sticks to Unicode standards
saying that every wrong byte should be reported. It is more human readable
too in case of Latin subsets including ASCII chars.

It's the second version with original codes used in case when there is no replacement
glyph defined. So all scanned bytes are remembered. Anyway it could be used
in the future if we implement screen buffer as UCS-2 plus attributes
so copying from it in UTF-8 mode will always work properly.

Signed-off-by: Adam Tla/lka <[email protected]>

---

--- drivers/char/vt_orig.c 2006-02-13 11:33:54.000000000 +0100
+++ drivers/char/vt.c 2006-02-18 23:18:50.000000000 +0100
@@ -63,6 +63,13 @@
*
* Removed console_lock, enabled interrupts across all console operations
* 13 March 2001, Andrew Morton
+ *
+ * Fixed UTF-8 mode so alternate charset modes always work without need
+ * of different linux terminal definition for normal and UTF-8 modes
+ * preserving backward US-ASCII and VT100 semigraphics compatibility,
+ * malformed UTF sequences represented as sequences of replacement glyphs
+ * or original codes if replacement glyph is undefined
+ * by Adam Tla/lka <[email protected]>, Feb 2006
*/

#include <linux/module.h>
@@ -1991,17 +1998,26 @@ static int do_con_write(struct tty_struc
/* Do no translation at all in control states */
if (vc->vc_state != ESnormal) {
tc = c;
- } else if (vc->vc_utf) {
+ } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
/* Combine UTF-8 into Unicode */
- /* Incomplete characters silently ignored */
+ /* Malformed sequence represented as replacement glyphs */
+rescan_last_byte:
if(c > 0x7f) {
- if (vc->vc_utf_count > 0 && (c & 0xc0) == 0x80) {
- vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
- vc->vc_utf_count--;
- if (vc->vc_utf_count == 0)
- tc = c = vc->vc_utf_char;
- else continue;
+ if (vc->vc_utf_count) {
+ if ((c & 0xc0) == 0x80) {
+ vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
+ if (--vc->vc_utf_count) {
+ vc->vc_par[vc->vc_utf_count] = c;
+ vc->vc_npar++;
+ continue;
+ }
+ tc = c = vc->vc_utf_char;
+ } else {
+ c = vc->vc_par[vc->vc_utf_count + vc->vc_npar];
+ goto insert_replacement_glyph;
+ }
} else {
+ vc->vc_npar = 0;
if ((c & 0xe0) == 0xc0) {
vc->vc_utf_count = 1;
vc->vc_utf_char = (c & 0x1f);
@@ -2018,12 +2034,16 @@ static int do_con_write(struct tty_struc
vc->vc_utf_count = 5;
vc->vc_utf_char = (c & 0x01);
} else
- vc->vc_utf_count = 0;
+ goto insert_replacement_glyph;
+ vc->vc_par[vc->vc_utf_count] = c;
continue;
}
} else {
+ if (vc->vc_utf_count) {
+ c = vc->vc_par[vc->vc_utf_count + vc->vc_npar];
+ goto insert_replacement_glyph;
+ }
tc = c;
- vc->vc_utf_count = 0;
}
} else { /* no utf */
tc = vc->vc_translate[vc->vc_toggle_meta ? (c | 0x80) : c];
@@ -2040,8 +2060,8 @@ static int do_con_write(struct tty_struc
* direct-to-font zone in UTF-8 mode.
*/
ok = tc && (c >= 32 ||
- (!vc->vc_utf && !(((vc->vc_disp_ctrl ? CTRL_ALWAYS
- : CTRL_ACTION) >> c) & 1)))
+ !(vc->vc_disp_ctrl ? (CTRL_ALWAYS >> c) & 1 :
+ vc->vc_utf || ((CTRL_ACTION >> c) & 1)))
&& (c != 127 || vc->vc_disp_ctrl)
&& (c != 128+27);

@@ -2051,6 +2071,7 @@ static int do_con_write(struct tty_struc
if ( tc == -4 ) {
/* If we got -4 (not found) then see if we have
defined a replacement character (U+FFFD) */
+insert_replacement_glyph:
tc = conv_uni_to_pc(vc, 0xfffd);

/* One reason for the -4 can be that we just
@@ -2063,7 +2084,7 @@ static int do_con_write(struct tty_struc
tc = c;
}
if (tc & ~charmask)
- continue; /* Conversion failed */
+ goto check_malformed_sequence;

if (vc->vc_need_wrap || vc->vc_decim)
FLUSH
@@ -2088,6 +2109,16 @@ static int do_con_write(struct tty_struc
vc->vc_x++;
draw_to = (vc->vc_pos += 2);
}
+check_malformed_sequence:
+ if (vc->vc_utf_count) {
+ if (vc->vc_npar) {
+ c = vc->vc_par[vc->vc_utf_count + --vc->vc_npar];
+ goto insert_replacement_glyph;
+ }
+ vc->vc_utf_count = 0;
+ c = orig;
+ goto rescan_last_byte;
+ }
continue;
}
FLUSH

2006-02-19 01:44:28

by Thomas Dickey

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

On Sat, 18 Feb 2006, Adam Tla?~Bka wrote:

> Hmm, I don't know how current ncurses treat terminfo definition in UTF-8 mode
> and what linux terminal definition you are using in you test system. There
> are different definitions floating around in different distributions which

Since we're talking about ncurses, there are not "different definitions
floating around". Different distributions may of course have their
favorite tweaks, but ncurses definition for Linux console doesn't change
that much over the past ten years.

> are slightly different especially as we talk about acsc chars and enacs,
> smacs and rmacs sequences. Which is also wrong approach so there should be
> one proper definition of the linux console. Maybe kernel developers should
> prepare some most compatible and acceptable one.
> I can post the one I am using today:

...of course, this one isn't like any of the variations I've seen.
But you knew that. (Aside from the acs/enacs/rmacs/smacs changes,
I'm curious what happened to ich/ich1).

--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net

2006-02-19 01:54:00

by Thomas Dickey

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

On Sat, 18 Feb 2006, Alexander E. Patrakov wrote:

> [sorry for repost, the first attempt got blocked due to html attachment, now
> I gzipped it to circumvent the filter]
>
> Adam Tla/lka wrote:
>> This patch applies to 2.6.15.3 kernel sources to drivers/char/vt.c file.
>> It should work with other versions too.
>>
>> Changed console behaviour so in UTF-8 mode vt100 alternate character
>> sequences work as described in terminfo/termcap linux terminal definition.
>> Programs can use vt100 control seqences - smacs, rmacs and acsc characters
>> in UTF-8 mode in the same way as in normal mode so one definition is always
>> valid - current behaviour make these seqences not working in UTF-8 mode.

I expect some discussion from the people who _vehemently_ refused to allow
the Linux console to have anything that resembled a mode.

More to the point: since it's been in this form for several years, it
doesn't do much good to developers, because there are already workarounds
in ncurses to accommodate this, and even if you fixed it today, it would
be needed in ncurses for a few more years.

For example (man ncurses):

NCURSES_NO_UTF8_ACS
During initialization, the ncurses library checks for special
cases where VT100 line-drawing (and the corresponding alternate
character set capabilities) described in the terminfo are known to
be missing. Specifically, when running in a UTF-8 locale, the
Linux console emulator and the GNU screen program ignore these.
Ncurses checks the TERM environment variable for these. For other
special cases, you should set this environment variable. Doing
this tells ncurses to use Unicode values which correspond to the
VT100 line-drawing glyphs. That works for the special cases
cited, and is likely to work for terminal emulators.

When setting this variable, you should set it to a nonzero value.
Setting it to zero (or to a nonnumber) disables the special check
for Linux and screen.

is the most recent refinement (from a year ago) to a workaround which
first appeared in ncurses 5.4 (originally from December 2002).

> Doesn't work here with linux-2.6.16-rc3-mm1, ncurses-5.5. BTW has this
> been discussed with Thomas Dickey (ncurses maintainer)?

no (my seeing it via google doesn't count).

> Another feature request / bug report (spotted while viewing in Lynx a
> page containing English text and a few Chinese characters, artificial
> testcase attached):
>
> If ncurses attempt to add some Chinese character to the Linux text
> screen, Linux (correctly) prints this replacement character and advances
> the cursor by one position. Ncurses think that the cursor has moved two
> positions forward. The effect is that when you view the testcase in Lynx
> (compiled --with-screen=ncursesw) on Linux console and press PageDown,
> the fourth line contains "Thek" instead of "The" in the end.
>
> This disagreement has to be solved somehow.

yes. ncurses has no better information for this than the result from
wcwidth(). Shall we add another kludge to accommodate Linux console?
(Are there other terminal emulators with this specific problem?)

--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net

2006-02-19 04:23:36

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

Adam Tlałka wrote:

> Maybe I should remember all bytes of the UTF-sequence to use their
> values as a last resort char in case of malformed sequence and 0xfffd
> not defined?

Please don't do that. Display question marks instead in the case when
0xfffd is not defined.

--
Alexander E. Patrakov

2006-02-19 04:32:47

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

Thomas Dickey wrote:

> On Sat, 18 Feb 2006, Alexander E. Patrakov wrote:
>
>> If ncurses attempt to add some Chinese character to the Linux text
>> screen, Linux (correctly) prints this replacement character and advances
>> the cursor by one position. Ncurses think that the cursor has moved two
>> positions forward. The effect is that when you view the testcase in Lynx
>> (compiled --with-screen=ncursesw) on Linux console and press PageDown,
>> the fourth line contains "Thek" instead of "The" in the end.
>>
>> This disagreement has to be solved somehow.
>
>
> yes. ncurses has no better information for this than the result from
> wcwidth(). Shall we add another kludge to accommodate Linux console?

Maybe yes, since putting wcwidth() into the kernel is a bigger kludge,
because linux kernel will never draw CJK, and because after glibc
update, kernel and glibc might disagree upon wcwidth of some characters.

So on linux console, ncurses should draw two 0xfffd characters when a
character with wcwidth > 1 is requested.

> (Are there other terminal emulators with this specific problem?)

Not sure. I will test putty a bit later. Anyway, an environment variable
for this kludge may be a good idea.

--
Alexander E. Patrakov

2006-02-19 05:41:47

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

I wrote:

> Adam Tla/lka wrote:

>> Changed console behaviour so in UTF-8 mode vt100 alternate character
>> sequences work as described in terminfo/termcap linux terminal
>> definition.
>> Programs can use vt100 control seqences - smacs, rmacs and acsc
>> characters
>> in UTF-8 mode in the same way as in normal mode so one definition is
>> always
>> valid - current behaviour make these seqences not working in UTF-8 mode.
>
>
> Doesn't work here with linux-2.6.16-rc3-mm1, ncurses-5.5.

Sorry, that my non-true statement was due to the less-than-perfect
description of the patch. After patching, this produces a horizontal line:

echo -e '\x0eqqqq\x0f'

So please correct the description and the first (comment) hunk of the
patch, so that it doesn't mention "smacs" and similar words with meaning
that may vary, and so that it mentions the exact control codes.

--
Alexander E. Patrakov

2006-02-19 10:16:28

by Adam Tlałka

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

On Sun, Feb 19, 2006 at 10:42:38AM +0500, Alexander E. Patrakov wrote:
> Sorry, that my non-true statement was due to the less-than-perfect
> description of the patch. After patching, this produces a horizontal line:
>
> echo -e '\x0eqqqq\x0f'
>
> So please correct the description and the first (comment) hunk of the
> patch, so that it doesn't mention "smacs" and similar words with meaning
> that may vary, and so that it mentions the exact control codes.
>

Ok, you are right.
Here is the corrected description and the new patch:

Fixed UTF-8 mode so alternate charset modes always work according
to control sequences interpreted in do_con_trol function:
smacs = '\x0e', rmacs = '\x0f' if vt100 translation map for alternate
charset is active which means enacs = '\e)0' which preserves
backward US-ASCII and VT100 semigraphics compatibility,
malformed UTF sequences represented as sequences of replacement glyphs
or original codes if replacement glyph is undefined
Signed-off-by: Adam Tla/lka <[email protected]>

--- drivers/char/vt_orig.c 2006-02-13 11:33:54.000000000 +0100
+++ drivers/char/vt.c 2006-02-19 10:59:27.000000000 +0100
@@ -63,6 +63,15 @@
*
* Removed console_lock, enabled interrupts across all console operations
* 13 March 2001, Andrew Morton
+ *
+ * Fixed UTF-8 mode so alternate charset modes always work according
+ * to control sequences interpreted in do_con_trol function:
+ * smacs = '\x0e', rmacs = '\x0f' if vt100 translation map for alternate
+ * charset is active which means enacs = '\e)0'
+ * preserving backward US-ASCII and VT100 semigraphics compatibility,
+ * malformed UTF sequences represented as sequences of replacement glyphs
+ * or original codes if replacement glyph is undefined
+ * by Adam Tla/lka <[email protected]>, Feb 2006
*/

#include <linux/module.h>
@@ -1991,17 +2000,26 @@ static int do_con_write(struct tty_struc
/* Do no translation at all in control states */
if (vc->vc_state != ESnormal) {
tc = c;
- } else if (vc->vc_utf) {
+ } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
/* Combine UTF-8 into Unicode */
- /* Incomplete characters silently ignored */
+ /* Malformed sequence represented as replacement glyphs */
+rescan_last_byte:
if(c > 0x7f) {
- if (vc->vc_utf_count > 0 && (c & 0xc0) == 0x80) {
- vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
- vc->vc_utf_count--;
- if (vc->vc_utf_count == 0)
- tc = c = vc->vc_utf_char;
- else continue;
+ if (vc->vc_utf_count) {
+ if ((c & 0xc0) == 0x80) {
+ vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
+ if (--vc->vc_utf_count) {
+ vc->vc_par[vc->vc_utf_count] = c;
+ vc->vc_npar++;
+ continue;
+ }
+ tc = c = vc->vc_utf_char;
+ } else {
+ c = vc->vc_par[vc->vc_utf_count + vc->vc_npar];
+ goto insert_replacement_glyph;
+ }
} else {
+ vc->vc_npar = 0;
if ((c & 0xe0) == 0xc0) {
vc->vc_utf_count = 1;
vc->vc_utf_char = (c & 0x1f);
@@ -2018,12 +2036,16 @@ static int do_con_write(struct tty_struc
vc->vc_utf_count = 5;
vc->vc_utf_char = (c & 0x01);
} else
- vc->vc_utf_count = 0;
+ goto insert_replacement_glyph;
+ vc->vc_par[vc->vc_utf_count] = c;
continue;
}
} else {
+ if (vc->vc_utf_count) {
+ c = vc->vc_par[vc->vc_utf_count + vc->vc_npar];
+ goto insert_replacement_glyph;
+ }
tc = c;
- vc->vc_utf_count = 0;
}
} else { /* no utf */
tc = vc->vc_translate[vc->vc_toggle_meta ? (c | 0x80) : c];
@@ -2040,8 +2062,8 @@ static int do_con_write(struct tty_struc
* direct-to-font zone in UTF-8 mode.
*/
ok = tc && (c >= 32 ||
- (!vc->vc_utf && !(((vc->vc_disp_ctrl ? CTRL_ALWAYS
- : CTRL_ACTION) >> c) & 1)))
+ !(vc->vc_disp_ctrl ? (CTRL_ALWAYS >> c) & 1 :
+ vc->vc_utf || ((CTRL_ACTION >> c) & 1)))
&& (c != 127 || vc->vc_disp_ctrl)
&& (c != 128+27);

@@ -2051,6 +2073,7 @@ static int do_con_write(struct tty_struc
if ( tc == -4 ) {
/* If we got -4 (not found) then see if we have
defined a replacement character (U+FFFD) */
+insert_replacement_glyph:
tc = conv_uni_to_pc(vc, 0xfffd);

/* One reason for the -4 can be that we just
@@ -2063,7 +2086,7 @@ static int do_con_write(struct tty_struc
tc = c;
}
if (tc & ~charmask)
- continue; /* Conversion failed */
+ goto check_malformed_sequence;

if (vc->vc_need_wrap || vc->vc_decim)
FLUSH
@@ -2088,6 +2111,16 @@ static int do_con_write(struct tty_struc
vc->vc_x++;
draw_to = (vc->vc_pos += 2);
}
+check_malformed_sequence:
+ if (vc->vc_utf_count) {
+ if (vc->vc_npar) {
+ c = vc->vc_par[vc->vc_utf_count + --vc->vc_npar];
+ goto insert_replacement_glyph;
+ }
+ vc->vc_utf_count = 0;
+ c = orig;
+ goto rescan_last_byte;
+ }
continue;
}
FLUSH

2006-02-19 10:46:45

by Adam Tlałka

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

On Sat, Feb 18, 2006 at 08:43:56PM -0500, Thomas Dickey wrote:
> On Sat, 18 Feb 2006, Adam Tla?~Bka wrote:
> >one proper definition of the linux console. Maybe kernel developers should
> >prepare some most compatible and acceptable one.
> >I can post the one I am using today:
>
> ...of course, this one isn't like any of the variations I've seen.
> But you knew that. (Aside from the acs/enacs/rmacs/smacs changes,
> I'm curious what happened to ich/ich1).

They were removed because of possible incompatibility warning.
Maybe they should stay - I posted only currently used by me linux terminal
definition. As I said before there should be the official linux terminal
definition and description included with kernel sources because vt.c
in sources defines console behaviour.

Anyway acs, enacs, smacs and rmacs sequences are defined here
to stick to the controls interpretation used in vt.c
do_con_trol function and to get desired behaviour.

Regards
--
Adam Tla?ka mailto:[email protected] ^v^ ^v^ ^v^
System & Network Administration Group ~~~~~~
Computer Center, Gda?sk University of Technology, Poland
PGP public key: finger [email protected]

2006-02-19 11:48:52

by Adam Tlałka

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

On Sat, Feb 18, 2006 at 08:53:38PM -0500, Thomas Dickey wrote:
> More to the point: since it's been in this form for several years, it
> doesn't do much good to developers, because there are already workarounds
> in ncurses to accommodate this, and even if you fixed it today, it would
> be needed in ncurses for a few more years.
>
> For example (man ncurses):
>
> NCURSES_NO_UTF8_ACS
> ...
> for Linux and screen.
>
> is the most recent refinement (from a year ago) to a workaround which
> first appeared in ncurses 5.4 (originally from December 2002).

OK but my fix is for all not only curses programs. Anyway from my point of view
programs should be written in a way so they are easy to use and work correctly
without user special intervention and knowledge. So ncurses hack is not
a desired way IMHO. Generally saying products should be designed with customers
comfort and not developers/producers comfort in mind. But this is just a wish
of course in current world and off topic.

> >This disagreement has to be solved somehow.
>
> yes. ncurses has no better information for this than the result from
> wcwidth(). Shall we add another kludge to accommodate Linux console?
> (Are there other terminal emulators with this specific problem?)

If ncurses know that this is a two-columns wide character so for compatibility
and correct handling reasons kernel console driver should know it too
and send two replacement glyphs or better one replacement glyph and one space
or there should be n column replacement glyph implemented
(maybe as a replacement glyph and n-1 additional spaces) as for example putty
xterm-color terminal emulator does.

For correct selecting and pasting from console in UTF-8 mode without
data corruption in case of non displayed glyphs and malformed sequences
console screen contents should be memorized as UCS-2 wide chars plus additional
attributes and color information.

Also there should be some specification - maybe RFC - how correctly handle
UTF-8 malformed seqences so no information is corrupt during
displaying/cut/paste/edit operations. If user edits then it should be changed
but if there is no action there should be no change.
So automatic recoding or not displaying something leads to state when I could
have inproper file name and just can't correct it from command line because
I can't input inproper sequence from keyboard - just like in MS Windows -
or I must use some additional software or graphical tool.

Additionally a replacement glyph means that a valid glyph can't be displayed
by a device which doesn't mean that the particular UTF-8 sequence is incorrect.
Of course an inproper byte can't be displayed too because we don't know
what it really means. Result is the same but reasons are different.
So showed on vt screen results hide real reasons which is bad IMHO.
But only the replacement glyph is defined in the standard and there is no bad
byte indicator which is really needed ;-(.

Regards
--
Adam Tla?ka mailto:[email protected] ^v^ ^v^ ^v^
System & Network Administration Group ~~~~~~
Computer Center, Gda?sk University of Technology, Poland
PGP public key: finger [email protected]

2006-02-19 12:46:35

by Adam Tlałka

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

On Sun, Feb 19, 2006 at 09:24:26AM +0500, Alexander E. Patrakov wrote:
> Adam Tlałka wrote:
>
> >Maybe I should remember all bytes of the UTF-sequence to use their
> >values as a last resort char in case of malformed sequence and 0xfffd
> >not defined?
>
> Please don't do that. Display question marks instead in the case when
> 0xfffd is not defined.

But this is the normal console behaviour in case of non existing replacement
glyph for some char. I just not changed it. Anyway if you have no '?' glyph
at '?' char code position you get some different char on the screen.
Replacement glyph should always be defined as I said before,
so this part of code could be easilly removed but I do not really know
if this assumption is true now.

Regards
--
Adam Tla?ka mailto:[email protected] ^v^ ^v^ ^v^
System & Network Administration Group ~~~~~~
Computer Center, Gda?sk University of Technology, Poland
PGP public key: finger [email protected]

2006-02-19 16:18:05

by Adam Tlałka

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

On Sun, Feb 19, 2006 at 09:24:26AM +0500, Alexander E. Patrakov wrote:
> Adam Tlałka wrote:
>
> >Maybe I should remember all bytes of the UTF-sequence to use their
> >values as a last resort char in case of malformed sequence and 0xfffd
> >not defined?
>
> Please don't do that. Display question marks instead in the case when
> 0xfffd is not defined.

Look at the original code. If conv_uni_to_pc fails and there is no replacement
char (after a clear_unimap for example) and we using US-ASCII we rather
should see something then sequences of '?' chars.
Maybe I could change this to:

if (tc == -4) {
if (c < 128)
tc = c;
else
tc = '?';
}

What about that?

Remembering of original bytes is needed if we could then remember
them in a way so paste from screen gives us the same sequence as it was
in input. With current console design it is impossible is case
of correct UTF-8 sequences containing undisplayable glyphs or malformed
sequences. So I can remove that part of patch and it will wait until
this functionality will be implemented - not so easy but can
be done IMHO and worth it to obtain properly working selection
and copying in UTF-8 mode.

Regards
--
Adam Tla?ka mailto:[email protected] ^v^ ^v^ ^v^
System & Network Administration Group ~~~~~~
Computer Center, Gda?sk University of Technology, Poland
PGP public key: finger [email protected]

2006-02-19 17:06:40

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

Adam Tla/lka wrote:
> On Sun, Feb 19, 2006 at 09:24:26AM +0500, Alexander E. Patrakov wrote:
>
>> Adam Tlałka wrote:
>>
>>
>>> Maybe I should remember all bytes of the UTF-sequence to use their
>>> values as a last resort char in case of malformed sequence and 0xfffd
>>> not defined?
>>>
>> Please don't do that. Display question marks instead in the case when
>> 0xfffd is not defined.
>>
>
> Look at the original code. If conv_uni_to_pc fails and there is no replacement
> char (after a clear_unimap for example) and we using US-ASCII we rather
> should see something then sequences of '?' chars.
> Maybe I could change this to:
>
> if (tc == -4) {
> if (c < 128)
> tc = c;
> else
> tc = '?';
> }
>
> What about that?
>
I'd let someone else judge, but that is clearly a broken case that just
has to be declared broken. <joke>could you please also adapt to a font
that has all glyphs looking as smileys?</joke> But it's only three extra
lines of code, so let's accept that "c<128" check.
> Remembering of original bytes is needed if we could then remember
> them in a way so paste from screen gives us the same sequence as it was
> in input.
This doesn't match the behaviour of X.
> With current console design it is impossible is case
> of correct UTF-8 sequences containing undisplayable glyphs or malformed
> sequences.
I agree that, in some cases, it makes sense to copy and paste
undisplayable glyphs. However, IMHO, this should not be allowed for
malformed sequences.

--
Alexander E. Patrakov

2006-02-19 23:20:48

by Adam Tlałka

[permalink] [raw]
Subject: [PATCH]console:UTF-8 mode compatibility fixes - new version


Corrected and optimized version of the first patch - continuation
bytes and the first byte of the scanned UTF-8 sequence are not memorized because
with current console design this information is unsable.

Description:

Fixed UTF-8 mode so alternate charset modes always work according
to control sequences interpreted in do_con_trol function:
smacs = '\x0e', rmacs = '\x0f' if vt100 translation map for alternate
charset is active which means enacs = '\e)0'
preserving backward US-ASCII and VT100 semigraphics compatibility.

Malformed UTF sequences are represented as sequences of replacement glyphs,
original codes or '?' as a last resort if replacement glyph is undefined
which is a bad console state but something should be displayed
if UCS-2 code represents displayable char.

Signed-off-by: Adam Tla/lka <[email protected]>


--- drivers/char/vt_orig.c 2006-02-13 11:33:54.000000000 +0100
+++ drivers/char/vt.c 2006-02-19 23:50:00.000000000 +0100
@@ -63,6 +63,15 @@
*
* Removed console_lock, enabled interrupts across all console operations
* 13 March 2001, Andrew Morton
+ *
+ * Fixed UTF-8 mode so alternate charset modes always work according
+ * to control sequences interpreted in do_con_trol function:
+ * smacs = '\x0e', rmacs = '\x0f' if vt100 translation map for alternate
+ * charset is active which means enacs = '\e)0'
+ * preserving backward US-ASCII and VT100 semigraphics compatibility,
+ * malformed UTF sequences represented as sequences of replacement glyphs,
+ * original codes or '?' as a last resort if replacement glyph is undefined
+ * by Adam Tla/lka <[email protected]>, Feb 2006
*/

#include <linux/module.h>
@@ -1991,17 +2000,23 @@ static int do_con_write(struct tty_struc
/* Do no translation at all in control states */
if (vc->vc_state != ESnormal) {
tc = c;
- } else if (vc->vc_utf) {
+ } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
/* Combine UTF-8 into Unicode */
- /* Incomplete characters silently ignored */
+ /* Malformed sequence represented as replacement glyphs */
+rescan_last_byte:
if(c > 0x7f) {
- if (vc->vc_utf_count > 0 && (c & 0xc0) == 0x80) {
- vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
- vc->vc_utf_count--;
- if (vc->vc_utf_count == 0)
- tc = c = vc->vc_utf_char;
- else continue;
+ if (vc->vc_utf_count) {
+ if ((c & 0xc0) == 0x80) {
+ vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
+ if (--vc->vc_utf_count) {
+ vc->vc_npar++;
+ continue;
+ }
+ tc = c = vc->vc_utf_char;
+ } else
+ goto insert_replacement_glyph;
} else {
+ vc->vc_npar = 0;
if ((c & 0xe0) == 0xc0) {
vc->vc_utf_count = 1;
vc->vc_utf_char = (c & 0x1f);
@@ -2018,12 +2033,13 @@ static int do_con_write(struct tty_struc
vc->vc_utf_count = 5;
vc->vc_utf_char = (c & 0x01);
} else
- vc->vc_utf_count = 0;
+ goto insert_replacement_glyph;
continue;
}
} else {
+ if (vc->vc_utf_count)
+ goto insert_replacement_glyph;
tc = c;
- vc->vc_utf_count = 0;
}
} else { /* no utf */
tc = vc->vc_translate[vc->vc_toggle_meta ? (c | 0x80) : c];
@@ -2040,31 +2056,41 @@ static int do_con_write(struct tty_struc
* direct-to-font zone in UTF-8 mode.
*/
ok = tc && (c >= 32 ||
- (!vc->vc_utf && !(((vc->vc_disp_ctrl ? CTRL_ALWAYS
- : CTRL_ACTION) >> c) & 1)))
+ !(vc->vc_disp_ctrl ? (CTRL_ALWAYS >> c) & 1 :
+ vc->vc_utf || ((CTRL_ACTION >> c) & 1)))
&& (c != 127 || vc->vc_disp_ctrl)
&& (c != 128+27);

if (vc->vc_state == ESnormal && ok) {
/* Now try to find out how to display it */
tc = conv_uni_to_pc(vc, tc);
- if ( tc == -4 ) {
+ if ( tc & ~charmask ) {
+ if ( tc == -4 ) {
/* If we got -4 (not found) then see if we have
defined a replacement character (U+FFFD) */
- tc = conv_uni_to_pc(vc, 0xfffd);
+insert_replacement_glyph:
+ tc = conv_uni_to_pc(vc, 0xfffd);

/* One reason for the -4 can be that we just
did a clear_unimap();
try at least to show something. */
- if (tc == -4)
- tc = c;
- } else if ( tc == -3 ) {
+ if (tc & ~charmask) {
+ if (c & ~charmask)
+ tc = '?';
+ else
+ tc = c;
+ }
+ } else if ( tc == -3 ) {
/* Bad hash table -- hope for the best */
- tc = c;
+ if (c & ~charmask)
+ tc = '?';
+ else
+ tc = c;
+ } else
+ continue; /* Conversion failed */
}
- if (tc & ~charmask)
- continue; /* Conversion failed */

+repeat_replacement_glyph:
if (vc->vc_need_wrap || vc->vc_decim)
FLUSH
if (vc->vc_need_wrap) {
@@ -2088,6 +2114,15 @@ static int do_con_write(struct tty_struc
vc->vc_x++;
draw_to = (vc->vc_pos += 2);
}
+ if (vc->vc_utf_count) {
+ if (vc->vc_npar) {
+ vc->vc_npar--;
+ goto repeat_replacement_glyph;
+ }
+ vc->vc_utf_count = 0;
+ c = orig;
+ goto rescan_last_byte;
+ }
continue;
}
FLUSH

2006-02-20 01:21:30

by Thomas Dickey

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

On Sun, 19 Feb 2006, Adam Tla/lka wrote:

> On Sat, Feb 18, 2006 at 08:53:38PM -0500, Thomas Dickey wrote: OK but my
> fix is for all not only curses programs. Anyway from my point of view
> programs should be written in a way so they are easy to use and work
> correctly without user special intervention and knowledge. So ncurses
> hack is not

of course (which is why I have xterm doing the same thing).

> a desired way IMHO. Generally saying products should be designed with
> customers comfort and not developers/producers comfort in mind. But this
> is just a wish of course in current world and off topic.

;-)

--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net

2006-02-20 08:15:58

by Adam Tlałka

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes - new version #1

Hope the last version applaying to current console design.
What do you think?

Description:

Fixed UTF-8 mode so alternate charset modes always work according
to control sequences interpreted in do_con_trol function:
smacs = '\x0e', rmacs = '\x0f' if vt100 translation map for alternate
charset is active which means enacs = '\e)0'
preserving backward US-ASCII and VT100 semigraphics compatibility.

Malformed UTF sequences are represented as sequences of replacement glyphs,
original codes or '?' as a last resort if replacement glyph is undefined
and UCS-2 code does not fit in 0-127 code range.

Signed-off-by: Adam Tla/lka <[email protected]>

--- drivers/char/vt_orig.c 2006-02-13 11:33:54.000000000 +0100
+++ drivers/char/vt.c 2006-02-20 08:53:59.000000000 +0100
@@ -63,6 +63,16 @@
*
* Removed console_lock, enabled interrupts across all console operations
* 13 March 2001, Andrew Morton
+ *
+ * Fixed UTF-8 mode so alternate charset modes always work according
+ * to control sequences interpreted in do_con_trol function:
+ * smacs = '\x0e', rmacs = '\x0f' if vt100 translation map for alternate
+ * charset is active which means enacs = '\e)0'
+ * preserving backward US-ASCII and VT100 semigraphics compatibility,
+ * malformed UTF sequences represented as sequences of replacement glyphs,
+ * original codes or '?' as a last resort if replacement glyph is undefined
+ * and UCS-2 code does not fit in 0-127 code range
+ * by Adam Tla/lka <[email protected]>, Feb 2006
*/

#include <linux/module.h>
@@ -1991,17 +2001,23 @@ static int do_con_write(struct tty_struc
/* Do no translation at all in control states */
if (vc->vc_state != ESnormal) {
tc = c;
- } else if (vc->vc_utf) {
+ } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
/* Combine UTF-8 into Unicode */
- /* Incomplete characters silently ignored */
+ /* Malformed sequence represented as replacement glyphs */
+rescan_last_byte:
if(c > 0x7f) {
- if (vc->vc_utf_count > 0 && (c & 0xc0) == 0x80) {
- vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
- vc->vc_utf_count--;
- if (vc->vc_utf_count == 0)
- tc = c = vc->vc_utf_char;
- else continue;
+ if (vc->vc_utf_count) {
+ if ((c & 0xc0) == 0x80) {
+ vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
+ if (--vc->vc_utf_count) {
+ vc->vc_npar++;
+ continue;
+ }
+ tc = c = vc->vc_utf_char;
+ } else
+ goto insert_replacement_glyph;
} else {
+ vc->vc_npar = 0;
if ((c & 0xe0) == 0xc0) {
vc->vc_utf_count = 1;
vc->vc_utf_char = (c & 0x1f);
@@ -2018,12 +2034,13 @@ static int do_con_write(struct tty_struc
vc->vc_utf_count = 5;
vc->vc_utf_char = (c & 0x01);
} else
- vc->vc_utf_count = 0;
+ goto insert_replacement_glyph;
continue;
}
} else {
+ if (vc->vc_utf_count)
+ goto insert_replacement_glyph;
tc = c;
- vc->vc_utf_count = 0;
}
} else { /* no utf */
tc = vc->vc_translate[vc->vc_toggle_meta ? (c | 0x80) : c];
@@ -2040,31 +2057,41 @@ static int do_con_write(struct tty_struc
* direct-to-font zone in UTF-8 mode.
*/
ok = tc && (c >= 32 ||
- (!vc->vc_utf && !(((vc->vc_disp_ctrl ? CTRL_ALWAYS
- : CTRL_ACTION) >> c) & 1)))
+ !(vc->vc_disp_ctrl ? (CTRL_ALWAYS >> c) & 1 :
+ vc->vc_utf || ((CTRL_ACTION >> c) & 1)))
&& (c != 127 || vc->vc_disp_ctrl)
&& (c != 128+27);

if (vc->vc_state == ESnormal && ok) {
/* Now try to find out how to display it */
tc = conv_uni_to_pc(vc, tc);
- if ( tc == -4 ) {
+ if ( tc & ~charmask ) {
+ if ( tc == -4 ) {
/* If we got -4 (not found) then see if we have
defined a replacement character (U+FFFD) */
- tc = conv_uni_to_pc(vc, 0xfffd);
+insert_replacement_glyph:
+ tc = conv_uni_to_pc(vc, 0xfffd);

/* One reason for the -4 can be that we just
did a clear_unimap();
try at least to show something. */
- if (tc == -4)
- tc = c;
- } else if ( tc == -3 ) {
+ if (tc & ~charmask) {
+ if ( c & ~0x7f )
+ tc = '?';
+ else
+ tc = c;
+ }
+ } else if ( tc == -3 ) {
/* Bad hash table -- hope for the best */
- tc = c;
+ if ( c & ~0x7f )
+ tc = '?';
+ else
+ tc = c;
+ } else
+ continue; /* Conversion failed */
}
- if (tc & ~charmask)
- continue; /* Conversion failed */

+repeat_replacement_glyph:
if (vc->vc_need_wrap || vc->vc_decim)
FLUSH
if (vc->vc_need_wrap) {
@@ -2088,6 +2115,15 @@ static int do_con_write(struct tty_struc
vc->vc_x++;
draw_to = (vc->vc_pos += 2);
}
+ if (vc->vc_utf_count) {
+ if (vc->vc_npar) {
+ vc->vc_npar--;
+ goto repeat_replacement_glyph;
+ }
+ vc->vc_utf_count = 0;
+ c = orig;
+ goto rescan_last_byte;
+ }
continue;
}
FLUSH

2006-03-07 15:06:13

by Adam Tlałka

[permalink] [raw]
Subject: Re: [PATCH]console:UTF-8 mode compatibility fixes

Thomas Dickey wrote:

> On Sun, 19 Feb 2006, Adam Tla/lka wrote:
>
>> OK but my fix is for all not only curses programs. Anyway from my
>> point of view programs should be written in a way so they are easy to
>> use and work correctly without user special intervention and
>> knowledge. So ncurses hack is not
>
>
> of course (which is why I have xterm doing the same thing).

Yes, I tested VT100 graphics in UTF8-mode with various terminal
emulator programs - xterm, gnome-terminal, kconsole, mlterm, rxvt - and
they all work the same way accepting ^N as smacs and VT100 graphics
chars. So this is the reason why Linux console should not break this
compatibility in UTF8 mode too.

There are some programs which use ascs even if they can do it by UTF8
sequences - w3m for example. Without supporting
acsc in UTF8 mode using Linux console and w3m gives not so good visual
results. Second reason ;-).
There are other programs of course - aptitude for example - working the
same way as w3m.

Using smacs=\e[11m which means smacs == smpch is some kind of a hack and
not the proper solution. Terminal description should describe sequences
interpreted by a device and in case of smacs == smpch sequences like ^N
and ^O do not exist in terminal description but they are interpreted as
before. So the description is not complete. The acsc string should be
set so it sticks to VT100 map in Linux console sources too:

acsc=++\,\,--..00``aaffgghhiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~

now adding proper USC2 to font mapping to used console font makes it
working correctly in both normal and UTF8 modes. Other solutions like
modifying acsc to obtain some different chars for nonexisting glyphs are
only partially correct because we have only half functioning terminal -
in normal mode we see some chars but in UTF8 mode there are only
replacemnt glyphs. So only the compatible definition and then correction
to console fonts included UCS2 to glyph maps solves the problem IMHO.
Also if we have different definitions on different systems and VT100
compatibility is lost we have problems too.

After some discussion I've sended final patch

2006-02-27 13:06 [PATCH]console compatibility fixes to UTF-8 mode

and now I am waiting for some reaction.
I am using this patch with my work machines now.

Regards

--
Adam Tlałka mailto:[email protected] ^v^ ^v^ ^v^
System & Network Administration Group - - - ~~~~~~
Computer Center, Gdańsk University of Technology, Poland
PGP public key: finger [email protected]