Dash Archive mirror
 help / color / mirror / Atom feed
* alias confusion due to internal word representation
@ 2023-01-11  2:01 Harald van Dijk
  2023-01-12 15:56 ` Michael Greenberg
  2024-04-12  9:51 ` [PATCH] alias: Disallow non-CWORD characters Herbert Xu
  0 siblings, 2 replies; 5+ messages in thread
From: Harald van Dijk @ 2023-01-11  2:01 UTC (permalink / raw
  To: DASH shell mailing list

Hi,

Please consider

   alias $(printf "\204")="exit 2"
   $(:)
   echo ok

This is a perfectly valid shell script. The alias command is permitted 
to either succeed or fail, and if it succeeds, it defines an alias whose 
name is the single byte '\204'. No command by that name is ever 
executed, so this script is required to then print 'ok' and exit 
successfully.

This is not what happens. Internally, '\204' is the value of CTLBACKQ, 
and the word $(:) gets translated to an internal representation of just 
that -- coupled with a pointer to the parsed command. Since the word 
$(:) contains zero quote characters, it is subjected to alias expansion, 
and picks up this alias definition that makes the command expand to
exit 2.

Consider also:

   alias $(printf "\201")="echo ok" $(printf "\201\201")="echo bad" &&
   eval $(printf "\201")

This should either print "ok", or reject the aliases. Instead, it prints 
"bad". This happens because '\201' is the internal representation of 
CTLESC, and a literal byte of that value is represented by escaping it 
with CTLESC. Therefore, it triggers the expansion of the \201\201 alias.

Supporting alias names containing non-ASCII characters, while not 
required by POSIX, seems desirable, and almost all other shells (mksh 
being the exception) do appear to support this. I am not yet seeing a 
good way of solving this.

Cheers,
Harald van Dijk

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: alias confusion due to internal word representation
  2023-01-11  2:01 alias confusion due to internal word representation Harald van Dijk
@ 2023-01-12 15:56 ` Michael Greenberg
  2023-01-13 13:41   ` Harald van Dijk
  2024-04-12  9:51 ` [PATCH] alias: Disallow non-CWORD characters Herbert Xu
  1 sibling, 1 reply; 5+ messages in thread
From: Michael Greenberg @ 2023-01-12 15:56 UTC (permalink / raw
  To: Harald van Dijk, DASH shell mailing list

I've seen this issue in interop with modernish, where dash doesn't
properly support multibyte character encodings. It would be great to
have proper support for more interesting character sets.

It's worth looking at how other shells implement this, but a natural
choice is to choose a more sophisticated representation of string, where
control characters are represented out-of-band rather than by using
'unused' values (well, unused by ASCII, but not by
others). Unfortunately, this rope-like representation is not very
convenient to work with.

Cheers,
Michael

On 2023-01-11 at 02:01:03 AM, Harald van Dijk wrote:

> Hi,
>
> Please consider
>
>    alias $(printf "\204")="exit 2"
>    $(:)
>    echo ok
>
> This is a perfectly valid shell script. The alias command is permitted 
> to either succeed or fail, and if it succeeds, it defines an alias whose 
> name is the single byte '\204'. No command by that name is ever 
> executed, so this script is required to then print 'ok' and exit 
> successfully.
>
> This is not what happens. Internally, '\204' is the value of CTLBACKQ, 
> and the word $(:) gets translated to an internal representation of just 
> that -- coupled with a pointer to the parsed command. Since the word 
> $(:) contains zero quote characters, it is subjected to alias expansion, 
> and picks up this alias definition that makes the command expand to
> exit 2.
>
> Consider also:
>
>    alias $(printf "\201")="echo ok" $(printf "\201\201")="echo bad" &&
>    eval $(printf "\201")
>
> This should either print "ok", or reject the aliases. Instead, it prints 
> "bad". This happens because '\201' is the internal representation of 
> CTLESC, and a literal byte of that value is represented by escaping it 
> with CTLESC. Therefore, it triggers the expansion of the \201\201 alias.
>
> Supporting alias names containing non-ASCII characters, while not 
> required by POSIX, seems desirable, and almost all other shells (mksh 
> being the exception) do appear to support this. I am not yet seeing a 
> good way of solving this.
>
> Cheers,
> Harald van Dijk

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: alias confusion due to internal word representation
  2023-01-12 15:56 ` Michael Greenberg
@ 2023-01-13 13:41   ` Harald van Dijk
  2024-04-15  8:47     ` [PATCH] parser: Extend coverage of CHKEOFMARK Herbert Xu
  0 siblings, 1 reply; 5+ messages in thread
From: Harald van Dijk @ 2023-01-13 13:41 UTC (permalink / raw
  To: Michael Greenberg, DASH shell mailing list

On 12/01/2023 15:56, Michael Greenberg wrote:
> I've seen this issue in interop with modernish, where dash doesn't
> properly support multibyte character encodings. It would be great to
> have proper support for more interesting character sets.

Agreed, but personally, I see that as a separate issue. Non-ASCII bytes 
in alias names are potentially useful even without multibyte character 
support, and multibyte character support is definitely useful regardless 
of anything that aliases do.

> It's worth looking at how other shells implement this, but a natural
> choice is to choose a more sophisticated representation of string, where
> control characters are represented out-of-band rather than by using
> 'unused' values (well, unused by ASCII, but not by
> others). Unfortunately, this rope-like representation is not very
> convenient to work with.

There are a few shells that store words in the exact form that they 
appear in the script. This has interesting consequences: it shows the 
same effect, but in an arguably less problematic form. In bosh:

   $ alias '\a=echo what'
   $ a
   a: not found
   $ \a
   what

Other shells that take this approach restrict alias names so that shell 
special characters, including '\', cannot appear, but other characters can.

Although this representation is not without its problems, it would 
handle this transparently, and has the arguable benefit of automatically 
handling things like

   $ cat <<`this is problematic`
   hello world
   `this is problematic`

as well. This has come up on the list before; this works in bash, ksh, 
yash, and zsh, and POSIX places no restrictions on what words can be 
used as heredoc delimiters, so I think shells are required to accept 
this and it is technically a bug that dash doesn't (as well as several 
other shells), even if no one would ever make use of it.

It would require massive changes and is almost certainly not appropriate 
for dash.

But we can take inspiration from it and think of a more limited fix, one 
that works to handle the alias issue, but nothing else: it should be 
possible to change the representation of CTLESC and other control 
characters. If their representation is changed to that of characters 
that cannot appear in alias names without quoting anyway, e.g. by 
changing the value of CTLESC to \ and similarly for other control 
characters, the second problem is avoided. By changing the parser to set 
quoteflag when processing command substitutions etc., the first problem 
is avoided. (This may involve renaming quoteflag, since it would no 
longer reflect simply whether any part of the word was quoted. Or a new 
variable may be added to track this.) This should be enough to make it work.

Cheers,
Harald van Dijk

> Cheers,
> Michael
> 
> On 2023-01-11 at 02:01:03 AM, Harald van Dijk wrote:
> 
>> Hi,
>>
>> Please consider
>>
>>     alias $(printf "\204")="exit 2"
>>     $(:)
>>     echo ok
>>
>> This is a perfectly valid shell script. The alias command is permitted
>> to either succeed or fail, and if it succeeds, it defines an alias whose
>> name is the single byte '\204'. No command by that name is ever
>> executed, so this script is required to then print 'ok' and exit
>> successfully.
>>
>> This is not what happens. Internally, '\204' is the value of CTLBACKQ,
>> and the word $(:) gets translated to an internal representation of just
>> that -- coupled with a pointer to the parsed command. Since the word
>> $(:) contains zero quote characters, it is subjected to alias expansion,
>> and picks up this alias definition that makes the command expand to
>> exit 2.
>>
>> Consider also:
>>
>>     alias $(printf "\201")="echo ok" $(printf "\201\201")="echo bad" &&
>>     eval $(printf "\201")
>>
>> This should either print "ok", or reject the aliases. Instead, it prints
>> "bad". This happens because '\201' is the internal representation of
>> CTLESC, and a literal byte of that value is represented by escaping it
>> with CTLESC. Therefore, it triggers the expansion of the \201\201 alias.
>>
>> Supporting alias names containing non-ASCII characters, while not
>> required by POSIX, seems desirable, and almost all other shells (mksh
>> being the exception) do appear to support this. I am not yet seeing a
>> good way of solving this.
>>
>> Cheers,
>> Harald van Dijk
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] alias: Disallow non-CWORD characters
  2023-01-11  2:01 alias confusion due to internal word representation Harald van Dijk
  2023-01-12 15:56 ` Michael Greenberg
@ 2024-04-12  9:51 ` Herbert Xu
  1 sibling, 0 replies; 5+ messages in thread
From: Herbert Xu @ 2024-04-12  9:51 UTC (permalink / raw
  To: Harald van Dijk; +Cc: DASH shell mailing list

On Wed, Jan 11, 2023 at 02:01:03AM +0000, Harald van Dijk wrote:
>
> Supporting alias names containing non-ASCII characters, while not required
> by POSIX, seems desirable, and almost all other shells (mksh being the
> exception) do appear to support this. I am not yet seeing a good way of
> solving this.

I looked at supporting this properly by either adding or removing
quotes but the complexity is just not worth it.  Let's just ban
these characters.

---8<---
Alias names containing control characters may match words from
the parser that shouldn't be matched.  Disallow such characters
from appearing in an alias name.

Reported-by: Harald van Dijk <harald@gigawatt.nl>

diff --git a/src/alias.c b/src/alias.c
index cee07e9..3cd3413 100644
--- a/src/alias.c
+++ b/src/alias.c
@@ -41,6 +41,7 @@
 #include "mystring.h"
 #include "alias.h"
 #include "options.h"	/* XXX for argptr (should remove?) */
+#include "syntax.h"
 
 #define ATABSIZE 39
 
@@ -55,6 +56,11 @@ void
 setalias(const char *name, const char *val)
 {
 	struct alias *ap, **app;
+	const char *p;
+
+	for (p = name; *p; p++)
+		if (BASESYNTAX[(signed char)*p] != CWORD)
+			sh_error("Invalid alias name: %s", name);
 
 	app = __lookupalias(name);
 	ap = *app;
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH] parser: Extend coverage of CHKEOFMARK
  2023-01-13 13:41   ` Harald van Dijk
@ 2024-04-15  8:47     ` Herbert Xu
  0 siblings, 0 replies; 5+ messages in thread
From: Herbert Xu @ 2024-04-15  8:47 UTC (permalink / raw
  To: Harald van Dijk; +Cc: Michael Greenberg, DASH shell mailing list

On Fri, Jan 13, 2023 at 01:41:44PM +0000, Harald van Dijk wrote:
>
> Although this representation is not without its problems, it would handle
> this transparently, and has the arguable benefit of automatically handling
> things like
> 
>   $ cat <<`this is problematic`
>   hello world
>   `this is problematic`
> 
> as well. This has come up on the list before; this works in bash, ksh, yash,
> and zsh, and POSIX places no restrictions on what words can be used as
> heredoc delimiters, so I think shells are required to accept this and it is
> technically a bug that dash doesn't (as well as several other shells), even
> if no one would ever make use of it.

Alright.  I've bitten the bullet and extended the existing CHKEOFMARK
to cover all the cases.

---8<---
Extend the coverage of CHKEOFMARK to cover parameter expansion,
arithmetic expansion, and command substitution.

For command substitution, use the reconstruction from commandtext
as the here-document marker.

Reported-by: Harald van Dijk <harald@gigawatt.nl>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/src/jobs.c b/src/jobs.c
index 2a2fe22..ac22ae5 100644
--- a/src/jobs.c
+++ b/src/jobs.c
@@ -1242,13 +1242,19 @@ commandtext(union node *n)
 {
 	char *name;
 
-	STARTSTACKSTR(cmdnextc);
-	cmdtxt(n);
+	STARTSTACKSTR(name);
+	commandtextcont(n, name);
 	name = stackblock();
 	TRACE(("commandtext: name %p, end %p\n", name, cmdnextc));
 	return savestr(name);
 }
 
+char *commandtextcont(union node *n, char *next)
+{
+	cmdnextc = next;
+	cmdtxt(n);
+	return cmdnextc;
+}
 
 STATIC void
 cmdtxt(union node *n)
diff --git a/src/jobs.h b/src/jobs.h
index 2832d64..a58d2a2 100644
--- a/src/jobs.h
+++ b/src/jobs.h
@@ -107,6 +107,7 @@ int forkshell(struct job *, union node *, int);
 struct job *vforkexec(union node *n, char **argv, const char *path, int idx);
 int waitforjob(struct job *);
 int stoppedjobs(void);
+char *commandtextcont(union node *n, char *next);
 
 #if ! JOBS
 #define setjobctl(on) ((void)(on))	/* do nothing */
diff --git a/src/parser.c b/src/parser.c
index 299c260..e3168de 100644
--- a/src/parser.c
+++ b/src/parser.c
@@ -46,6 +46,7 @@
 #include "syntax.h"
 #include "options.h"
 #include "input.h"
+#include "jobs.h"
 #include "output.h"
 #include "var.h"
 #include "error.h"
@@ -628,9 +629,10 @@ parsefname(void)
 	union node *n = redirnode;
 
 	if (n->type == NHERE)
-		checkkwd = CHKEOFMARK;
+		checkkwd |= CHKEOFMARK;
 	if (readtoken() != TWORD)
 		synexpect(-1);
+	checkkwd &= ~CHKEOFMARK;
 	if (n->type == NHERE) {
 		struct heredoc *here = heredoc;
 		struct heredoc *p;
@@ -901,6 +903,7 @@ readtoken1(int firstc, char const *syntax, char *eofmark, int striptabs)
 	/* syntax stack */
 	struct synstack synbase = { .syntax = syntax };
 	struct synstack *synstack = &synbase;
+	int chkeofmark = checkkwd & CHKEOFMARK;
 
 	if (syntax == DQSYNTAX)
 		synstack->dblquote = 1;
@@ -1010,39 +1013,35 @@ toggledq:
 						synstack_pop(&synstack);
 					else if (synstack->dqvarnest > 0)
 						synstack->dqvarnest--;
-					USTPUTC(CTLENDVAR, out);
-				} else {
-					USTPUTC(c, out);
+					if (!chkeofmark)
+						c = CTLENDVAR;
 				}
+				USTPUTC(c, out);
 				break;
 			case CLP:	/* '(' in arithmetic */
 				synstack->parenlevel++;
 				USTPUTC(c, out);
 				break;
 			case CRP:	/* ')' in arithmetic */
-				if (synstack->parenlevel > 0) {
-					USTPUTC(c, out);
+				if (synstack->parenlevel > 0)
 					--synstack->parenlevel;
+				else if (pgetc_eatbnl() == ')') {
+					synstack_pop(&synstack);
+					if (chkeofmark)
+						USTPUTC(c, out);
+					else
+						c = CTLENDARI;
 				} else {
-					if (pgetc_eatbnl() == ')') {
-						USTPUTC(CTLENDARI, out);
-						synstack_pop(&synstack);
-					} else {
-						/*
-						 * unbalanced parens
-						 *  (don't 2nd guess - no error)
-						 */
-						pungetc();
-						USTPUTC(')', out);
-					}
+					/*
+					 * unbalanced parens
+					 *  (don't 2nd guess - no error)
+					 */
+					pungetc();
 				}
+				USTPUTC(c, out);
 				break;
 			case CBQUOTE:	/* '`' */
-				if (checkkwd & CHKEOFMARK) {
-					USTPUTC('`', out);
-					break;
-				}
-
+				USTPUTC('`', out);
 				PARSEBACKQOLD();
 				break;
 			case CEOF:
@@ -1218,13 +1217,16 @@ parsesub: {
 	static const char types[] = "}-+?=";
 
 	c = pgetc_eatbnl();
-	if (
-		(checkkwd & CHKEOFMARK) ||
-		(c != '(' && c != '{' && !is_name(c) && !is_special(c))
-	) {
+	if (c != '(' && c != '{' && !is_name(c) && !is_special(c)) {
 		USTPUTC('$', out);
 		pungetc();
-	} else if (c == '(') {	/* $(command) or $((arith)) */
+		goto parsesub_return;
+	}
+
+	USTPUTC('$', out);
+
+	if (c == '(') {		/* $(command) or $((arith)) */
+		USTPUTC(c, out);
 		if (pgetc_eatbnl() == '(') {
 			PARSEARITH();
 		} else {
@@ -1234,11 +1236,15 @@ parsesub: {
 	} else {
 		const char *newsyn = synstack->syntax;
 
-		USTPUTC(CTLVAR, out);
 		typeloc = out - (char *)stackblock();
-		STADJUST(1, out);
+		if (!chkeofmark) {
+			out[-1] = CTLVAR;
+			STADJUST(1, out);
+		}
 		subtype = VSNORMAL;
 		if (likely(c == '{')) {
+			if (chkeofmark)
+				USTPUTC('{', out);
 			c = pgetc_eatbnl();
 			subtype = 0;
 		}
@@ -1262,8 +1268,11 @@ varname:
 			if (!subtype && cc == '#') {
 				subtype = VSLENGTH;
 
-				if (c == '_' || isalnum(c))
+				if (c == '_' || isalnum(c)) {
+					if (chkeofmark)
+						USTPUTC('#', out);
 					goto varname;
+				}
 
 				cc = c;
 				c = pgetc_eatbnl();
@@ -1272,7 +1281,8 @@ varname:
 					subtype = 0;
 					c = cc;
 					cc = '#';
-				}
+				} else if (chkeofmark)
+					USTPUTC('#', out);
 			}
 
 			if (!is_special(cc)) {
@@ -1288,10 +1298,15 @@ varname:
 		if (subtype == 0) {
 			int cc = c;
 
+			if (chkeofmark)
+				STPUTC(c, out);
+
 			switch (c) {
 			case ':':
 				subtype = VSNUL;
 				c = pgetc_eatbnl();
+				if (chkeofmark)
+					STPUTC(c, out);
 				/*FALLTHROUGH*/
 			default:
 				p = strchr(types, c);
@@ -1304,9 +1319,11 @@ varname:
 				subtype = c == '#' ? VSTRIMLEFT :
 						     VSTRIMRIGHT;
 				c = pgetc_eatbnl();
-				if (c == cc)
+				if (c == cc) {
+					if (chkeofmark)
+						STPUTC(c, out);
 					subtype++;
-				else
+				} else
 					pungetc();
 
 				newsyn = BASESYNTAX;
@@ -1333,13 +1350,15 @@ badsub:
 			synstack->dblquote = newsyn != BASESYNTAX;
 		}
 
-		*((char *)stackblock() + typeloc) = subtype | VSBIT;
 		if (subtype != VSNORMAL) {
 			synstack->varnest++;
 			if (synstack->dblquote)
 				synstack->dqvarnest++;
 		}
-		STPUTC('=', out);
+		if (!chkeofmark) {
+			*((char *)stackblock() + typeloc) = subtype | VSBIT;
+			STPUTC('=', out);
+		}
 	}
 	goto parsesub_return;
 }
@@ -1353,14 +1372,19 @@ badsub:
  */
 
 parsebackq: {
-	struct nodelist **nlpp;
-	union node *n;
-	char *str;
-	size_t savelen;
-	struct heredoc *saveheredoclist;
 	int uninitialized_var(saveprompt);
+	struct heredoc *saveheredoclist;
+	struct nodelist **nlpp;
+	size_t psavelen;
+	size_t savelen;
+	union node *n;
+	char *pstr;
+	char *str;
 
-	USTPUTC(CTLBACKQ, out);
+	if (!chkeofmark) {
+		STADJUST(oldstyle - 1, out);
+		out[-1] = CTLBACKQ;
+	}
 	str = stackblock();
 	savelen = out - (char *)stackblock();
 	grabstackblock(savelen);
@@ -1370,9 +1394,6 @@ parsebackq: {
                    reread it as input, interpreting it normally.  */
                 char *pout;
                 int pc;
-                size_t psavelen;
-                char *pstr;
-
 
                 STARTSTACKSTR(pout);
 		for (;;) {
@@ -1405,10 +1426,8 @@ parsebackq: {
 done:
                 STPUTC('\0', pout);
                 psavelen = pout - (char *)stackblock();
-                if (psavelen > 0) {
-			pstr = grabstackstr(pout);
-			setinputstring(pstr);
-                }
+		pstr = grabstackstr(pout);
+		setinputstring(pstr);
         }
 	nlpp = &bqlist;
 	while (*nlpp)
@@ -1440,14 +1459,26 @@ done:
 	(*nlpp)->n = n;
 	/* Start reading from old file again. */
 	popfile();
-	/* Ignore any pushed back tokens left from the backquote parsing. */
-	if (oldstyle)
-		tokpushback = 0;
+
 	out = stnputs(str, savelen, stackblock());
-	if (oldstyle)
+
+	if (oldstyle) {
+		/* Ignore any pushed back tokens left from the backquote
+		 * parsing.
+		 */
+		tokpushback = 0;
+		if (chkeofmark) {
+			pstr[psavelen - 1] = '`';
+			out = stnputs(pstr, psavelen, out);
+		}
 		goto parsebackq_oldreturn;
-	else
+	} else {
+		if (chkeofmark) {
+			out = commandtextcont(n, out);
+			USTPUTC(')', out);
+		}
 		goto parsebackq_newreturn;
+	}
 }
 
 /*
@@ -1459,7 +1490,12 @@ parsearith: {
 		      synstack->prev ?: alloca(sizeof(*synstack)),
 		      ARISYNTAX);
 	synstack->dblquote = 1;
-	USTPUTC(CTLARI, out);
+	if (chkeofmark)
+		USTPUTC(c, out);
+	else {
+		STADJUST(-1, out);
+		out[-1] = CTLARI;
+	}
 	goto parsearith_return;
 }
 
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-04-15  8:47 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-01-11  2:01 alias confusion due to internal word representation Harald van Dijk
2023-01-12 15:56 ` Michael Greenberg
2023-01-13 13:41   ` Harald van Dijk
2024-04-15  8:47     ` [PATCH] parser: Extend coverage of CHKEOFMARK Herbert Xu
2024-04-12  9:51 ` [PATCH] alias: Disallow non-CWORD characters Herbert Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).