[PATCH] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode

* [PATCH] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
@ 2023-12-12  7:40 Roman Zilka
  2023-12-12  8:24 ` Greg KH
  2023-12-12  9:20 ` Jiri Slaby
  0 siblings, 2 replies; 10+ messages in thread
From: Roman Zilka @ 2023-12-12  7:40 UTC
  To: gregkh, jirislaby; +Cc: linux-serial

[-- Attachment #1: Type: text/plain, Size: 993 bytes --]

vc_translate_unicode(), vc_sanitize_unicode():
1. Limit codepoint space to 0x10FFFF. The old algorithm followed an ancient
   version of Unicode.
2. Corrected vc_translate_unicode() doc (@rescan).
3. "Noncharacters", such as U+FFFE and U+FFFF, are no longer invalid in
   Unicode, so accept them. Another option was to complete the set of
   noncharacters (it used to be just those two, now there are more) and
   preserve the substitution. This is what Unicode suggests, but does not
   require (v15.1, chap. 23.7). However, most codepoints are !iswprint(),
   so substituting just the noncharacters seemed futile. Also, I've never
   seen noncharacters treated in a special way.
4. Moved what remained of vc_sanitize_unicode() into vc_translate_unicode().

Signed-off-by: Roman Žilka <roman.zilka@gmail.com>
---
 drivers/tty/vt/vt.c | 36 +++++++-----------------------------
 1 file changed, 7 insertions(+), 29 deletions(-)

base-commit: a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
-- 
2.41.0

[-- Attachment #2: 0001-tty-vt-UTF-8-parsing-update-according-to-RFC-3629-mo.patch.xz --]
[-- Type: application/x-xz, Size: 1732 bytes --]


* Re: [PATCH] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
  2023-12-12  7:40 [PATCH] " Roman Zilka
@ 2023-12-12  8:24 ` Greg KH
  2023-12-12  9:20 ` Jiri Slaby
  1 sibling, 0 replies; 10+ messages in thread
From: Greg KH @ 2023-12-12  8:24 UTC
  To: Roman Zilka; +Cc: jirislaby, linux-serial

On Tue, Dec 12, 2023 at 08:40:42AM +0100, Roman Zilka wrote:
> vc_translate_unicode(), vc_sanitize_unicode():
> 1. Limit codepoint space to 0x10FFFF. The old algorithm followed an ancient
>    version of Unicode.
> 2. Corrected vc_translate_unicode() doc (@rescan).
> 3. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in Unicode -
>    - accept them. Another option was to complete the set of noncharacters (used
>    to be those two, now there's more) and preserve the substitution. This is
>    indeed what Unicode suggests (v15.1, chap. 23.7) (not requires), but most
>    codepoints are !iswprint(), so substituting just the noncharacters seemed
>    futile. Also, I've never seen noncharacters treated in a special way.
> 4. Moved what remained of vc_sanitize_unicode() into vc_translate_unicode().
> 
> Signed-off-by: Roman Žilka <roman.zilka@gmail.com>
> ---
>  drivers/tty/vt/vt.c | 36 +++++++-----------------------------
>  1 file changed, 7 insertions(+), 29 deletions(-)
> 
> base-commit: a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
> -- 
> 2.41.0


Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- Your patch was attached, please place it inline so that it can be
  applied directly from the email message itself.

- Your patch did many different things all at once, making it difficult
  to review.  All Linux kernel patches need to only do one thing at a
  time.  If you need to do multiple things (such as clean up all coding
  style issues in a file/driver), do it in a sequence of patches, each
  one doing only one thing.  This will make it easier to review the
  patches to ensure that they are correct, and to help alleviate any
  merge issues that larger patches can cause.


If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot


* Re: [PATCH] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
  2023-12-12  7:40 [PATCH] " Roman Zilka
  2023-12-12  8:24 ` Greg KH
@ 2023-12-12  9:20 ` Jiri Slaby
  1 sibling, 0 replies; 10+ messages in thread
From: Jiri Slaby @ 2023-12-12  9:20 UTC
  To: Roman Zilka, gregkh; +Cc: linux-serial

On 12. 12. 23, 8:40, Roman Zilka wrote:
> vc_translate_unicode(), vc_sanitize_unicode():
> 1. Limit codepoint space to 0x10FFFF. The old algorithm followed an ancient
>     version of Unicode.
> 2. Corrected vc_translate_unicode() doc (@rescan).
> 3. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in Unicode -
>     - accept them. Another option was to complete the set of noncharacters (used
>     to be those two, now there's more) and preserve the substitution. This is
>     indeed what Unicode suggests (v15.1, chap. 23.7) (not requires), but most
>     codepoints are !iswprint(), so substituting just the noncharacters seemed
>     futile. Also, I've never seen noncharacters treated in a special way.
> 4. Moved what remained of vc_sanitize_unicode() into vc_translate_unicode().

Whatever the patch contains (a _packed_ attachment, really?), you should
spell out the "Why" part in here.

thanks,
-- 
js



* [PATCH] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
@ 2023-12-12 15:13 Roman Žilka
  2023-12-12 15:36 ` Greg KH
  0 siblings, 1 reply; 10+ messages in thread
From: Roman Žilka @ 2023-12-12 15:13 UTC
  To: gregkh, jirislaby; +Cc: linux-serial, roman.zilka

vc_translate_unicode() and vc_sanitize_unicode() parse input to the
UTF-8-enabled console, marking invalid byte sequences and producing Unicode
codepoints. The current algorithm follows ancient Unicode and may accept invalid
byte sequences, pass on non-existent codepoints and reject valid sequences.

The patch restores the functions' compliance with modern Unicode (v15.1 + many
previous versions) as well as RFC 3629.
1. Codepoint space is limited to 0x10FFFF.
2. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in Unicode and
   will be accepted. Another option was to complete the set of noncharacters
   (used to be just those two, now there's more) and preserve the rejection
   step. This is indeed what Unicode suggests (v15.1, chap. 23.7) (not
   requires), but most codepoints are !iswprint(), so selecting just the
   noncharacters seemed arbitrary and futile (and unnecessary).

On the side:
3. What remained of vc_sanitize_unicode() is in vc_translate_unicode().
4. Corrected vc_translate_unicode() doc (@rescan).

This is not a security patch. I'm not aware of any present security implications
of the old code.

Signed-off-by: Roman Žilka <roman.zilka@gmail.com>
---
 drivers/tty/vt/vt.c | 36 +++++++-----------------------------
 1 file changed, 7 insertions(+), 29 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 156efda7c80d..215e162ec8af 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2587,23 +2587,11 @@ static inline int vc_translate_ascii(const struct vc_data *vc, int c)
 }
 
 
-/**
- * vc_sanitize_unicode - Replace invalid Unicode code points with U+FFFD
- * @c: the received character, or U+FFFD for invalid sequences.
- */
-static inline int vc_sanitize_unicode(const int c)
-{
-	if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff)
-		return 0xfffd;
-
-	return c;
-}
-
 /**
  * vc_translate_unicode - Combine UTF-8 into Unicode in @vc_utf_char
  * @vc: virtual console
- * @c: character to translate
- * @rescan: we return true if we need more (continuation) data
+ * @c: UTF-8 byte to translate
+ * @rescan: true => @c wasn't translated here and needs to be re-processed
  *
  * @vc_utf_char is the being-constructed unicode character.
  * @vc_utf_count is the number of continuation bytes still expected to arrive.
@@ -2611,10 +2599,7 @@ static inline int vc_sanitize_unicode(const int c)
  */
 static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 {
-	static const u32 utf8_length_changes[] = {
-		0x0000007f, 0x000007ff, 0x0000ffff,
-		0x001fffff, 0x03ffffff, 0x7fffffff
-	};
+	static const u32 utf8_length_changes[] = {0x7f, 0x7ff, 0xffff, 0x10ffff};
 
 	/* Continuation byte received */
 	if ((c & 0xc0) == 0x80) {
@@ -2629,12 +2614,12 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 
 		/* Got a whole character */
 		c = vc->vc_utf_char;
-		/* Reject overlong sequences */
+		/* Reject overlong sequences and surrogates */
 		if (c <= utf8_length_changes[vc->vc_npar - 1] ||
-				c > utf8_length_changes[vc->vc_npar])
+				c > utf8_length_changes[vc->vc_npar] ||
+				(c & 0xfff800) == 0x00d800)
 			return 0xfffd;
-
-		return vc_sanitize_unicode(c);
+		return c;
 	}
 
 	/* Single ASCII byte or first byte of a sequence received */
@@ -2660,14 +2645,7 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 	} else if ((c & 0xf8) == 0xf0) {
 		vc->vc_utf_count = 3;
 		vc->vc_utf_char = (c & 0x07);
-	} else if ((c & 0xfc) == 0xf8) {
-		vc->vc_utf_count = 4;
-		vc->vc_utf_char = (c & 0x03);
-	} else if ((c & 0xfe) == 0xfc) {
-		vc->vc_utf_count = 5;
-		vc->vc_utf_char = (c & 0x01);
 	} else {
-		/* 254 and 255 are invalid */
 		return 0xfffd;
 	}
 

base-commit: a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
-- 
2.41.0



* Re: [PATCH] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
  2023-12-12 15:13 [PATCH] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode Roman Žilka
@ 2023-12-12 15:36 ` Greg KH
  2023-12-12 16:23   ` [PATCH v2] " Roman Žilka
  0 siblings, 1 reply; 10+ messages in thread
From: Greg KH @ 2023-12-12 15:36 UTC
  To: Roman Žilka; +Cc: jirislaby, linux-serial

On Tue, Dec 12, 2023 at 04:13:20PM +0100, Roman Žilka wrote:
> vc_translate_unicode() and vc_sanitize_unicode() parse input to the
> UTF-8-enabled console, marking invalid byte sequences and producing Unicode
> codepoints. The current algorithm follows ancient Unicode and may accept invalid
> byte sequences, pass on non-existent codepoints and reject valid sequences.
> 
> The patch restores the functions' compliance with modern Unicode (v15.1 + many
> previous versions) as well as RFC 3629.
> 1. Codepoint space is limited to 0x10FFFF.
> 2. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in Unicode and
>    will be accepted. Another option was to complete the set of noncharacters
>    (used to be just those two, now there's more) and preserve the rejection
>    step. This is indeed what Unicode suggests (v15.1, chap. 23.7) (not
>    requires), but most codepoints are !iswprint(), so selecting just the
>    noncharacters seemed arbitrary and futile (and unnecessary).
> 
> On the side:
> 3. What remained of vc_sanitize_unicode() is in vc_translate_unicode().
> 4. Corrected vc_translate_unicode() doc (@rescan).
> 
> This is not a security patch. I'm not aware of any present security implications
> of the old code.
> 
> Signed-off-by: Roman Žilka <roman.zilka@gmail.com>
> ---
>  drivers/tty/vt/vt.c | 36 +++++++-----------------------------
>  1 file changed, 7 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
> index 156efda7c80d..215e162ec8af 100644
> --- a/drivers/tty/vt/vt.c
> +++ b/drivers/tty/vt/vt.c
> @@ -2587,23 +2587,11 @@ static inline int vc_translate_ascii(const struct vc_data *vc, int c)
>  }
>  
>  
> -/**
> - * vc_sanitize_unicode - Replace invalid Unicode code points with U+FFFD
> - * @c: the received character, or U+FFFD for invalid sequences.
> - */
> -static inline int vc_sanitize_unicode(const int c)
> -{
> -	if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff)
> -		return 0xfffd;
> -
> -	return c;
> -}
> -
>  /**
>   * vc_translate_unicode - Combine UTF-8 into Unicode in @vc_utf_char
>   * @vc: virtual console
> - * @c: character to translate
> - * @rescan: we return true if we need more (continuation) data
> + * @c: UTF-8 byte to translate
> + * @rescan: true => @c wasn't translated here and needs to be re-processed
>   *
>   * @vc_utf_char is the being-constructed unicode character.
>   * @vc_utf_count is the number of continuation bytes still expected to arrive.
> @@ -2611,10 +2599,7 @@ static inline int vc_sanitize_unicode(const int c)
>   */
>  static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
>  {
> -	static const u32 utf8_length_changes[] = {
> -		0x0000007f, 0x000007ff, 0x0000ffff,
> -		0x001fffff, 0x03ffffff, 0x7fffffff
> -	};
> +	static const u32 utf8_length_changes[] = {0x7f, 0x7ff, 0xffff, 0x10ffff};
>  
>  	/* Continuation byte received */
>  	if ((c & 0xc0) == 0x80) {
> @@ -2629,12 +2614,12 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
>  
>  		/* Got a whole character */
>  		c = vc->vc_utf_char;
> -		/* Reject overlong sequences */
> +		/* Reject overlong sequences and surrogates */
>  		if (c <= utf8_length_changes[vc->vc_npar - 1] ||
> -				c > utf8_length_changes[vc->vc_npar])
> +				c > utf8_length_changes[vc->vc_npar] ||
> +				(c & 0xfff800) == 0x00d800)
>  			return 0xfffd;
> -
> -		return vc_sanitize_unicode(c);
> +		return c;
>  	}
>  
>  	/* Single ASCII byte or first byte of a sequence received */
> @@ -2660,14 +2645,7 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
>  	} else if ((c & 0xf8) == 0xf0) {
>  		vc->vc_utf_count = 3;
>  		vc->vc_utf_char = (c & 0x07);
> -	} else if ((c & 0xfc) == 0xf8) {
> -		vc->vc_utf_count = 4;
> -		vc->vc_utf_char = (c & 0x03);
> -	} else if ((c & 0xfe) == 0xfc) {
> -		vc->vc_utf_count = 5;
> -		vc->vc_utf_char = (c & 0x01);
>  	} else {
> -		/* 254 and 255 are invalid */
>  		return 0xfffd;
>  	}
>  
> 
> base-commit: a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
> -- 
> 2.41.0
> 
> 

Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- Your patch did many different things all at once, making it difficult
  to review.  All Linux kernel patches need to only do one thing at a
  time.  If you need to do multiple things (such as clean up all coding
  style issues in a file/driver), do it in a sequence of patches, each
  one doing only one thing.  This will make it easier to review the
  patches to ensure that they are correct, and to help alleviate any
  merge issues that larger patches can cause.

- This looks like a new version of a previously submitted patch, but you
  did not list below the --- line any changes from the previous version.
  Please read the section entitled "The canonical patch format" in the
  kernel file, Documentation/process/submitting-patches.rst for what
  needs to be done here to properly describe this.

If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot


* [PATCH v2] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
  2023-12-12 15:36 ` Greg KH
@ 2023-12-12 16:23   ` Roman Žilka
  2023-12-12 20:26     ` [PATCH v3] " Roman Žilka
  0 siblings, 1 reply; 10+ messages in thread
From: Roman Žilka @ 2023-12-12 16:23 UTC
  To: Greg KH, jirislaby; +Cc: linux-serial, roman.zilka

From: Roman Žilka <roman.zilka@gmail.com>

vc_translate_unicode() and vc_sanitize_unicode() parse input to the
UTF-8-enabled console, marking invalid byte sequences and producing Unicode
codepoints. The current algorithm follows ancient Unicode and may accept
invalid byte sequences, pass on non-existent codepoints and reject valid
sequences.

The patch restores the functions' compliance with modern Unicode (v15.1 and
many previous versions) as well as RFC 3629.
1. Codepoint space is limited to 0x10FFFF.
2. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in
   Unicode and will be accepted. Another option was to complete the set of
   noncharacters (used to be just those two, now there's more) and preserve
   the rejection step. This is indeed what Unicode suggests (v15.1, chap.
   23.7) (not requires), but most codepoints are !iswprint(), so selecting
   just the noncharacters seemed arbitrary and futile (and unnecessary).

On the side:
3. What remained of vc_sanitize_unicode() is in vc_translate_unicode().
4. Corrected vc_translate_unicode() doc (@rescan).

This is not a security patch. I'm not aware of any present security
implications of the old code.

Signed-off-by: Roman Žilka <roman.zilka@gmail.com>
---
v2: A more elaborate commit msg, e-mail formatting corrections.

 drivers/tty/vt/vt.c | 36 +++++++-----------------------------
 1 file changed, 7 insertions(+), 29 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 156efda7c80d..215e162ec8af 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2587,23 +2587,11 @@ static inline int vc_translate_ascii(const struct vc_data *vc, int c)
 }
 
 
-/**
- * vc_sanitize_unicode - Replace invalid Unicode code points with U+FFFD
- * @c: the received character, or U+FFFD for invalid sequences.
- */
-static inline int vc_sanitize_unicode(const int c)
-{
-	if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff)
-		return 0xfffd;
-
-	return c;
-}
-
 /**
  * vc_translate_unicode - Combine UTF-8 into Unicode in @vc_utf_char
  * @vc: virtual console
- * @c: character to translate
- * @rescan: we return true if we need more (continuation) data
+ * @c: UTF-8 byte to translate
+ * @rescan: true => @c wasn't translated here and needs to be re-processed
  *
  * @vc_utf_char is the being-constructed unicode character.
  * @vc_utf_count is the number of continuation bytes still expected to arrive.
@@ -2611,10 +2599,7 @@ static inline int vc_sanitize_unicode(const int c)
  */
 static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 {
-	static const u32 utf8_length_changes[] = {
-		0x0000007f, 0x000007ff, 0x0000ffff,
-		0x001fffff, 0x03ffffff, 0x7fffffff
-	};
+	static const u32 utf8_length_changes[] = {0x7f, 0x7ff, 0xffff, 0x10ffff};
 
 	/* Continuation byte received */
 	if ((c & 0xc0) == 0x80) {
@@ -2629,12 +2614,12 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 
 		/* Got a whole character */
 		c = vc->vc_utf_char;
-		/* Reject overlong sequences */
+		/* Reject overlong sequences and surrogates */
 		if (c <= utf8_length_changes[vc->vc_npar - 1] ||
-				c > utf8_length_changes[vc->vc_npar])
+				c > utf8_length_changes[vc->vc_npar] ||
+				(c & 0xfff800) == 0x00d800)
 			return 0xfffd;
-
-		return vc_sanitize_unicode(c);
+		return c;
 	}
 
 	/* Single ASCII byte or first byte of a sequence received */
@@ -2660,14 +2645,7 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 	} else if ((c & 0xf8) == 0xf0) {
 		vc->vc_utf_count = 3;
 		vc->vc_utf_char = (c & 0x07);
-	} else if ((c & 0xfc) == 0xf8) {
-		vc->vc_utf_count = 4;
-		vc->vc_utf_char = (c & 0x03);
-	} else if ((c & 0xfe) == 0xfc) {
-		vc->vc_utf_count = 5;
-		vc->vc_utf_char = (c & 0x01);
 	} else {
-		/* 254 and 255 are invalid */
 		return 0xfffd;
 	}
 

base-commit: a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
-- 
2.41.0



* [PATCH v3] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
  2023-12-12 16:23   ` [PATCH v2] " Roman Žilka
@ 2023-12-12 20:26     ` Roman Žilka
  2024-01-04 15:28       ` Greg KH
  0 siblings, 1 reply; 10+ messages in thread
From: Roman Žilka @ 2023-12-12 20:26 UTC
  To: Greg KH, jirislaby; +Cc: linux-serial, roman.zilka

vc_translate_unicode() and vc_sanitize_unicode() parse input to the
UTF-8-enabled console, marking invalid byte sequences and producing Unicode
codepoints. The current algorithm follows ancient Unicode and may accept
invalid byte sequences, pass on non-existent codepoints and reject valid
sequences.

The patch restores the functions' compliance with modern Unicode (v15.1 [1]
+ many previous versions) as well as RFC 3629 [2].
1. Codepoint space is limited to 0x10FFFF.
2. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in
   Unicode and will be accepted. Another option was to complete the set of
   noncharacters (used to be just those two, now there's more) and preserve
   the rejection step. This is indeed what Unicode suggests (v15.1, chap.
   23.7) (not requires), but most codepoints are !iswprint(), so selecting
   just the noncharacters seemed arbitrary and futile (and unnecessary).

On the side:
3. Corrected/improved the doc of the two functions (esp. @rescan).

This is not a security patch. I'm not aware of any present security
implications of the old code.

[1] https://www.unicode.org/versions/Unicode15.1.0
[2] https://datatracker.ietf.org/doc/html/rfc3629

Signed-off-by: Roman Žilka <roman.zilka@gmail.com>
---

v2: A more elaborate commit msg, e-mail formatting corrections.
v3: Shortened patch as requested. The gist of it is unchanged. Added links
    to commit msg. Changed base to current tty-next.

 drivers/tty/vt/vt.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 156efda7c80d..373f94f55ff2 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2589,11 +2589,11 @@ static inline int vc_translate_ascii(const struct vc_data *vc, int c)
 
 /**
  * vc_sanitize_unicode - Replace invalid Unicode code points with U+FFFD
- * @c: the received character, or U+FFFD for invalid sequences.
+ * @c: the received code point
  */
 static inline int vc_sanitize_unicode(const int c)
 {
-	if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff)
+	if (c >= 0xd800 && c <= 0xdfff)
 		return 0xfffd;
 
 	return c;
@@ -2602,8 +2602,8 @@ static inline int vc_sanitize_unicode(const int c)
 /**
  * vc_translate_unicode - Combine UTF-8 into Unicode in @vc_utf_char
  * @vc: virtual console
- * @c: character to translate
- * @rescan: we return true if we need more (continuation) data
+ * @c: UTF-8 byte to translate
+ * @rescan: true => @c wasn't translated here and needs to be re-processed
  *
  * @vc_utf_char is the being-constructed unicode character.
  * @vc_utf_count is the number of continuation bytes still expected to arrive.
@@ -2611,10 +2611,7 @@ static inline int vc_sanitize_unicode(const int c)
  */
 static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 {
-	static const u32 utf8_length_changes[] = {
-		0x0000007f, 0x000007ff, 0x0000ffff,
-		0x001fffff, 0x03ffffff, 0x7fffffff
-	};
+	static const u32 utf8_length_changes[] = {0x7f, 0x7ff, 0xffff, 0x10ffff};
 
 	/* Continuation byte received */
 	if ((c & 0xc0) == 0x80) {
@@ -2660,14 +2657,7 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 	} else if ((c & 0xf8) == 0xf0) {
 		vc->vc_utf_count = 3;
 		vc->vc_utf_char = (c & 0x07);
-	} else if ((c & 0xfc) == 0xf8) {
-		vc->vc_utf_count = 4;
-		vc->vc_utf_char = (c & 0x03);
-	} else if ((c & 0xfe) == 0xfc) {
-		vc->vc_utf_count = 5;
-		vc->vc_utf_char = (c & 0x01);
 	} else {
-		/* 254 and 255 are invalid */
 		return 0xfffd;
 	}
 

base-commit: e045e18dbf3eaac32cdeb2799a5ec84fa694636c
-- 
2.41.0



* Re: [PATCH v3] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
  2023-12-12 20:26     ` [PATCH v3] " Roman Žilka
@ 2024-01-04 15:28       ` Greg KH
  2024-01-09 10:28         ` Roman Žilka
  2024-01-09 10:43         ` [PATCH v4] " Roman Žilka
  0 siblings, 2 replies; 10+ messages in thread
From: Greg KH @ 2024-01-04 15:28 UTC
  To: Roman Žilka; +Cc: jirislaby, linux-serial

On Tue, Dec 12, 2023 at 09:26:53PM +0100, Roman Žilka wrote:
> vc_translate_unicode() and vc_sanitize_unicode() parse input to the
> UTF-8-enabled console, marking invalid byte sequences and producing Unicode
> codepoints. The current algorithm follows ancient Unicode and may accept
> invalid byte sequences, pass on non-existent codepoints and reject valid
> sequences.
> 
> The patch restores the functions' compliance with modern Unicode (v15.1 [1]
> + many previous versions) as well as RFC 3629 [2].
> 1. Codepoint space is limited to 0x10FFFF.

Wait, why?  And shouldn't this be an individual patch on its own?  What
is wrong with the checking we currently have?

> 2. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in
>    Unicode and will be accepted.

Accepted when?

> Another option was to complete the set of
>    noncharacters (used to be just those two, now there's more) and preserve
>    the rejection step. This is indeed what Unicode suggests (v15.1, chap.
>    23.7) (not requires), but most codepoints are !iswprint(), so selecting
>    just the noncharacters seemed arbitrary and futile (and unnecessary).

What is this change going to break with existing systems that were
thinking these were invalid characters?

> On the side:
> 3. Corrected/improved the doc of the two functions (esp. @rescan).

Again, a separate commit.  When you have to list the changes out, that
is a huge hint it needs to be broken up into smaller pieces.

thanks,

greg k-h


* Re: [PATCH v3] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
  2024-01-04 15:28       ` Greg KH
@ 2024-01-09 10:28         ` Roman Žilka
  2024-01-09 10:43         ` [PATCH v4] " Roman Žilka
  1 sibling, 0 replies; 10+ messages in thread
From: Roman Žilka @ 2024-01-09 10:28 UTC
  To: Greg KH; +Cc: jirislaby, linux-serial, roman.zilka

On 1/4/24 4:28 PM, Greg KH wrote:
> On Tue, Dec 12, 2023 at 09:26:53PM +0100, Roman Žilka wrote:
>> vc_translate_unicode() and vc_sanitize_unicode() parse input to the
>> UTF-8-enabled console, marking invalid byte sequences and producing Unicode
>> codepoints. The current algorithm follows ancient Unicode and may accept
>> invalid byte sequences, pass on non-existent codepoints and reject valid
>> sequences.
>>
>> The patch restores the functions' compliance with modern Unicode (v15.1 [1]
>> + many previous versions) as well as RFC 3629 [2].
>> 1. Codepoint space is limited to 0x10FFFF.
> 
> Wait, why?  And shouldn't this be an individual patch on its own?  What
> is wrong with the checking we currently have?

This is the main point of this patch. The codepoint space got shortened in Unicode at some point between v3.0 (1999) and v4.0 (2003). The reason why is expressed by the first sentence in the commit msg. The affected functions validate input coming into the subsystem from the user, which makes it a red flag that they do not do so correctly (i.e., according to a generally accepted standard). As they stand, these functions are a potential source of compatibility and security issues. They may not be a bomb, but they may be a time bomb.
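
To illustrate what the shortened codepoint space means at the byte level, here is a minimal user-space sketch, not the vt.c code. The helper name utf8_seq_len() is hypothetical. It assumes the RFC 3629 rule that the octets 0xC0, 0xC1 and 0xF5..0xFF never appear in valid UTF-8, so it is stricter than the patch, which still accepts 0xF5..0xF7 as lead bytes and rejects the result after decoding.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of vt.c): length in bytes of the UTF-8
 * sequence that starts with lead byte b, or 0 if b cannot start a
 * sequence under RFC 3629.
 */
static int utf8_seq_len(uint8_t b)
{
	if (b < 0x80)
		return 1;	/* ASCII */
	if (b >= 0xc2 && b <= 0xdf)
		return 2;	/* U+0080..U+07FF */
	if (b >= 0xe0 && b <= 0xef)
		return 3;	/* U+0800..U+FFFF */
	if (b >= 0xf0 && b <= 0xf4)
		return 4;	/* U+10000..U+10FFFF */
	/*
	 * 0x80..0xbf are continuation bytes, 0xc0 and 0xc1 could only
	 * start overlong forms, and 0xf5..0xff would encode values above
	 * U+10FFFF or belong to the old 5- and 6-byte forms.
	 */
	return 0;
}
```

Note that a 4-byte sequence starting with 0xF4 can still decode to a value above U+10FFFF, so a range check on the decoded codepoint is needed in addition to this lead-byte test.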

Note how very old the old parsing algorithm is. I did a quick grep of the kernel source for tell-tale signs of UTF-8 parsing to see if there's any other place where the old algorithm is still being used. I found none, but I did find these, which apply the 0x10ffff limit (I didn't check the "noncharacters" handling):

fs/unicode/mkutf8data.c
fs/unicode/utf8-norm.c
fs/udf/unicode.c
fs/nls/nls_base.c (has many users outside fs/)
drivers/tty/vt/keyboard.c

I didn't check, but I have no doubt that Perl implements Unicode correctly as well.
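
For reference, the checks discussed here can be condensed into one stand-alone predicate. This is only a sketch under my own naming (utf8_max_for_len and utf8_decoded_ok() are hypothetical and do not exist in the kernel). It combines the three rejections in question: overlong forms, values above U+10FFFF and UTF-16 surrogates.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest codepoint encodable in 1, 2, 3 and 4 bytes under RFC 3629. */
static const uint32_t utf8_max_for_len[] = { 0x7f, 0x7ff, 0xffff, 0x10ffff };

/*
 * Hypothetical helper: is codepoint cp, decoded from a sequence of
 * nbytes bytes, acceptable?
 */
static bool utf8_decoded_ok(uint32_t cp, int nbytes)
{
	if (nbytes < 1 || nbytes > 4)
		return false;	/* 5- and 6-byte forms no longer exist */
	if (cp > utf8_max_for_len[nbytes - 1])
		return false;	/* includes anything above U+10FFFF */
	if (nbytes > 1 && cp <= utf8_max_for_len[nbytes - 2])
		return false;	/* overlong: a shorter form exists */
	if (cp >= 0xd800 && cp <= 0xdfff)
		return false;	/* UTF-16 surrogates */
	return true;
}
```

With this predicate, U+FFFF decoded from three bytes is accepted, while 0x110000 decoded from four bytes is rejected.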

>> 2. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in
>>    Unicode and will be accepted.
> 
> Accepted when?

Currently, the two affected functions mark these codepoints as invalid by substituting them with the placeholder U+FFFD. After the patch, U+FFFE and U+FFFF are treated as ordinary valid codepoints.

Let me point out that I've never seen a UTF-8 validator where "noncharacters" were treated in a special way. Of course, there are only so many validator implementations that I have seen. Checking for iswprint()-ability is common, but that's something very different. IMHO, the validator in vt.c is not the place to pay any special regard to "noncharacters".
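
For comparison, this is roughly what the alternative would look like, i.e. completing the set of noncharacters instead of dropping the check. It is a sketch only and the name is_noncharacter() is hypothetical. It relies on the Unicode definition of the 66 noncharacters: U+FDD0..U+FDEF plus the last two codepoints of each of the 17 planes.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if cp is one of the 66 Unicode
 * noncharacters.
 */
static bool is_noncharacter(uint32_t cp)
{
	if (cp > 0x10ffff)
		return false;	/* outside the codepoint space */
	if (cp >= 0xfdd0 && cp <= 0xfdef)
		return true;	/* the contiguous block of 32 */
	/* U+xxFFFE and U+xxFFFF in every plane, 34 in total */
	return (cp & 0xfffe) == 0xfffe;
}
```

The old vc_sanitize_unicode() covered only two of these 66 values, U+FFFE and U+FFFF.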

>> Another option was to complete the set of
>>    noncharacters (used to be just those two, now there's more) and preserve
>>    the rejection step. This is indeed what Unicode suggests (v15.1, chap.
>>    23.7) (not requires), but most codepoints are !iswprint(), so selecting
>>    just the noncharacters seemed arbitrary and futile (and unnecessary).
> 
> What is this change going to break with existing systems that were
> thinking these were invalid characters?

This is mostly answered above. I don't work with the kernel in a developer capacity. I found this parsing error by accident while researching some CONFIG options. I'm not qualified to say that this patch won't break anything and it would take me an inordinate amount of time to verify that to a reasonable degree. I don't write code for kbd or other console-related userspace tools either. I ran the patched kernel for a while, played around with fonts and various TUI utilities. I found no issues. The red flag, which I talked about earlier, was _the_ reason I submitted my patch.

>> On the side:
>> 3. Corrected/improved the doc of the two functions (esp. @rescan).
> 
> Again, a separate commit.  When you have to list the changes out, that
> is a huge hint it needs to be broken up into smaller pieces.

Ok, patch v4 coming up with this removed and I'll take care of it in a subsequent submission. That'll be one truly trivial commit, though.

-rz


* [PATCH v4] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
  2024-01-04 15:28       ` Greg KH
  2024-01-09 10:28         ` Roman Žilka
@ 2024-01-09 10:43         ` Roman Žilka
  1 sibling, 0 replies; 10+ messages in thread
From: Roman Žilka @ 2024-01-09 10:43 UTC
  To: Greg KH; +Cc: jirislaby, linux-serial, roman.zilka

vc_translate_unicode() and vc_sanitize_unicode() parse input to the
UTF-8-enabled console, marking invalid byte sequences and producing Unicode
codepoints. The current algorithm follows ancient Unicode and may accept
invalid byte sequences, pass on non-existent codepoints and reject valid
sequences.

The patch restores the functions' compliance with modern Unicode (v15.1 [1]
+ many previous versions) as well as RFC 3629 [2].
1. Codepoint space is limited to 0x10FFFF.
2. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in
   Unicode and will be accepted. Another option was to complete the set of
   noncharacters (used to be just those two, now there's more) and preserve
   the rejection step. This is indeed what Unicode suggests ([1] chap.
   23.7) (not requires), but most codepoints are !iswprint(), so selecting
   just the noncharacters seemed arbitrary and futile (and unnecessary).

This is not a security patch. I'm not aware of any present security
implications of the old code.

[1] https://www.unicode.org/versions/Unicode15.1.0
[2] https://datatracker.ietf.org/doc/html/rfc3629

Signed-off-by: Roman Žilka <roman.zilka@gmail.com>
---

v2: A more elaborate commit msg, e-mail formatting corrections.
v3: Shortened patch as requested. The gist of it is unchanged. Added links
    to commit msg. Changed base to current tty-next.
v4: Removed func doc correction as requested. Updated base to current
    tty-next.

 drivers/tty/vt/vt.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 156efda7c80d..35c2ab8c5280 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2593,7 +2593,7 @@ static inline int vc_translate_ascii(const struct vc_data *vc, int c)
  */
 static inline int vc_sanitize_unicode(const int c)
 {
-	if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff)
+	if (c >= 0xd800 && c <= 0xdfff)
 		return 0xfffd;
 
 	return c;
@@ -2611,10 +2611,7 @@ static inline int vc_sanitize_unicode(const int c)
  */
 static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 {
-	static const u32 utf8_length_changes[] = {
-		0x0000007f, 0x000007ff, 0x0000ffff,
-		0x001fffff, 0x03ffffff, 0x7fffffff
-	};
+	static const u32 utf8_length_changes[] = {0x7f, 0x7ff, 0xffff, 0x10ffff};
 
 	/* Continuation byte received */
 	if ((c & 0xc0) == 0x80) {
@@ -2660,14 +2657,7 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 	} else if ((c & 0xf8) == 0xf0) {
 		vc->vc_utf_count = 3;
 		vc->vc_utf_char = (c & 0x07);
-	} else if ((c & 0xfc) == 0xf8) {
-		vc->vc_utf_count = 4;
-		vc->vc_utf_char = (c & 0x03);
-	} else if ((c & 0xfe) == 0xfc) {
-		vc->vc_utf_count = 5;
-		vc->vc_utf_char = (c & 0x01);
 	} else {
-		/* 254 and 255 are invalid */
 		return 0xfffd;
 	}
 

base-commit: 0c84bea0cabc4e2b98a3de88eeb4ff798931f056
-- 
2.41.0

