* alias confusion due to internal word representation
@ 2023-01-11 2:01 Harald van Dijk
2023-01-12 15:56 ` Michael Greenberg
2024-04-12 9:51 ` [PATCH] alias: Disallow non-CWORD characters Herbert Xu
0 siblings, 2 replies; 5+ messages in thread
From: Harald van Dijk @ 2023-01-11 2:01 UTC (permalink / raw
To: DASH shell mailing list
Hi,
Please consider
alias $(printf "\204")="exit 2"
$(:)
echo ok
This is a perfectly valid shell script. The alias command is permitted
to either succeed or fail, and if it succeeds, it defines an alias whose
name is the single byte '\204'. No command by that name is ever
executed, so this script is required to then print 'ok' and exit
successfully.
This is not what happens. Internally, '\204' is the value of CTLBACKQ,
and the word $(:) gets translated to an internal representation of just
that -- coupled with a pointer to the parsed command. Since the word
$(:) contains zero quote characters, it is subjected to alias expansion,
and picks up this alias definition that makes the command expand to
exit 2.
Consider also:
alias $(printf "\201")="echo ok" $(printf "\201\201")="echo bad" &&
eval $(printf "\201")
This should either print "ok", or reject the aliases. Instead, it prints
"bad". This happens because '\201' is the internal representation of
CTLESC, and a literal byte of that value is represented by escaping it
with CTLESC. Therefore, it triggers the expansion of the \201\201 alias.
Supporting alias names containing non-ASCII characters, while not
required by POSIX, seems desirable, and almost all other shells (mksh
being the exception) do appear to support this. I am not yet seeing a
good way of solving this.
Cheers,
Harald van Dijk
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: alias confusion due to internal word representation
2023-01-11 2:01 alias confusion due to internal word representation Harald van Dijk
@ 2023-01-12 15:56 ` Michael Greenberg
2023-01-13 13:41 ` Harald van Dijk
2024-04-12 9:51 ` [PATCH] alias: Disallow non-CWORD characters Herbert Xu
1 sibling, 1 reply; 5+ messages in thread
From: Michael Greenberg @ 2023-01-12 15:56 UTC (permalink / raw
To: Harald van Dijk, DASH shell mailing list
I've seen this issue in interop with modernish, where dash doesn't
properly support multibyte character encodings. It would be great to
have proper support for more interesting character sets.
It's worth looking at how other shells implement this, but a natural
choice is to choose a more sophisticated representation of string, where
control characters are represented out-of-band rather than by using
'unused' values (well, unused by ASCII, but not by
others). Unfortunately, this rope-like representation is not very
convenient to work with.
Cheers,
Michael
On 2023-01-11 at 02:01:03 AM, Harald van Dijk wrote:
> Hi,
>
> Please consider
>
> alias $(printf "\204")="exit 2"
> $(:)
> echo ok
>
> This is a perfectly valid shell script. The alias command is permitted
> to either succeed or fail, and if it succeeds, it defines an alias whose
> name is the single byte '\204'. No command by that name is ever
> executed, so this script is required to then print 'ok' and exit
> successfully.
>
> This is not what happens. Internally, '\204' is the value of CTLBACKQ,
> and the word $(:) gets translated to an internal representation of just
> that -- coupled with a pointer to the parsed command. Since the word
> $(:) contains zero quote characters, it is subjected to alias expansion,
> and picks up this alias definition that makes the command expand to
> exit 2.
>
> Consider also:
>
> alias $(printf "\201")="echo ok" $(printf "\201\201")="echo bad" &&
> eval $(printf "\201")
>
> This should either print "ok", or reject the aliases. Instead, it prints
> "bad". This happens because '\201' is the internal representation of
> CTLESC, and a literal byte of that value is represented by escaping it
> with CTLESC. Therefore, it triggers the expansion of the \201\201 alias.
>
> Supporting alias names containing non-ASCII characters, while not
> required by POSIX, seems desirable, and almost all other shells (mksh
> being the exception) do appear to support this. I am not yet seeing a
> good way of solving this.
>
> Cheers,
> Harald van Dijk
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: alias confusion due to internal word representation
2023-01-12 15:56 ` Michael Greenberg
@ 2023-01-13 13:41 ` Harald van Dijk
2024-04-15 8:47 ` [PATCH] parser: Extend coverage of CHKEOFMARK Herbert Xu
0 siblings, 1 reply; 5+ messages in thread
From: Harald van Dijk @ 2023-01-13 13:41 UTC (permalink / raw
To: Michael Greenberg, DASH shell mailing list
On 12/01/2023 15:56, Michael Greenberg wrote:
> I've seen this issue in interop with modernish, where dash doesn't
> properly support multibyte character encodings. It would be great to
> have proper support for more interesting character sets.
Agreed, but personally, I see that as a separate issue. Non-ASCII bytes
in alias names are potentially useful even without multibyte character
support, and multibyte character support is definitely useful regardless
of anything that aliases do.
> It's worth looking at how other shells implement this, but a natural
> choice is to choose a more sophisticated representation of string, where
> control characters are represented out-of-band rather than by using
> 'unused' values (well, unused by ASCII, but not by
> others). Unfortunately, this rope-like representation is not very
> convenient to work with.
There are a few shells that store words in the exact form that they
appear in the script. This has interesting consequences: it shows the
same effect, but in an arguably less problematic form. In bosh:
$ alias '\a=echo what'
$ a
a: not found
$ \a
what
Other shells that take this approach restrict alias names so that shell
special characters, including '\', cannot appear, but other characters can.
Although this representation is not without its problems, it would
handle this transparently, and has the arguable benefit of automatically
handling things like
$ cat <<`this is problematic`
hello world
`this is problematic`
as well. This has come up on the list before; this works in bash, ksh,
yash, and zsh, and POSIX places no restrictions on what words can be
used as heredoc delimiters, so I think shells are required to accept
this and it is technically a bug that dash doesn't (as well as several
other shells), even if no one would ever make use of it.
It would require massive changes and is almost certainly not appropriate
for dash.
But we can take inspiration from it and think of a more limited fix, one
that works to handle the alias issue, but nothing else: it should be
possible to change the representation of CTLESC and other control
characters. If their representation is changed to that of characters
that cannot appear in alias names without quoting anyway, e.g. by
changing the value of CTLESC to \ and similarly for other control
characters, the second problem is avoided. By changing the parser to set
quoteflag when processing command substitutions etc., the first problem
is avoided. (This may involve renaming quoteflag, since it would no
longer reflect simply whether any part of the word was quoted. Or a new
variable may be added to track this.) This should be enough to make it work.
Cheers,
Harald van Dijk
> Cheers,
> Michael
>
> On 2023-01-11 at 02:01:03 AM, Harald van Dijk wrote:
>
>> Hi,
>>
>> Please consider
>>
>> alias $(printf "\204")="exit 2"
>> $(:)
>> echo ok
>>
>> This is a perfectly valid shell script. The alias command is permitted
>> to either succeed or fail, and if it succeeds, it defines an alias whose
>> name is the single byte '\204'. No command by that name is ever
>> executed, so this script is required to then print 'ok' and exit
>> successfully.
>>
>> This is not what happens. Internally, '\204' is the value of CTLBACKQ,
>> and the word $(:) gets translated to an internal representation of just
>> that -- coupled with a pointer to the parsed command. Since the word
>> $(:) contains zero quote characters, it is subjected to alias expansion,
>> and picks up this alias definition that makes the command expand to
>> exit 2.
>>
>> Consider also:
>>
>> alias $(printf "\201")="echo ok" $(printf "\201\201")="echo bad" &&
>> eval $(printf "\201")
>>
>> This should either print "ok", or reject the aliases. Instead, it prints
>> "bad". This happens because '\201' is the internal representation of
>> CTLESC, and a literal byte of that value is represented by escaping it
>> with CTLESC. Therefore, it triggers the expansion of the \201\201 alias.
>>
>> Supporting alias names containing non-ASCII characters, while not
>> required by POSIX, seems desirable, and almost all other shells (mksh
>> being the exception) do appear to support this. I am not yet seeing a
>> good way of solving this.
>>
>> Cheers,
>> Harald van Dijk
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH] alias: Disallow non-CWORD characters
2023-01-11 2:01 alias confusion due to internal word representation Harald van Dijk
2023-01-12 15:56 ` Michael Greenberg
@ 2024-04-12 9:51 ` Herbert Xu
1 sibling, 0 replies; 5+ messages in thread
From: Herbert Xu @ 2024-04-12 9:51 UTC (permalink / raw
To: Harald van Dijk; +Cc: DASH shell mailing list
On Wed, Jan 11, 2023 at 02:01:03AM +0000, Harald van Dijk wrote:
>
> Supporting alias names containing non-ASCII characters, while not required
> by POSIX, seems desirable, and almost all other shells (mksh being the
> exception) do appear to support this. I am not yet seeing a good way of
> solving this.
I looked at supporting this properly by either adding or removing
quotes but the complexity is just not worth it. Let's just ban
these characters.
---8<---
Alias names containing control characters may match words from
the parser that shouldn't be matched. Disallow such characters
from appearing in an alias name.
Reported-by: Harald van Dijk <harald@gigawatt.nl>
diff --git a/src/alias.c b/src/alias.c
index cee07e9..3cd3413 100644
--- a/src/alias.c
+++ b/src/alias.c
@@ -41,6 +41,7 @@
#include "mystring.h"
#include "alias.h"
#include "options.h" /* XXX for argptr (should remove?) */
+#include "syntax.h"
#define ATABSIZE 39
@@ -55,6 +56,11 @@ void
setalias(const char *name, const char *val)
{
struct alias *ap, **app;
+ const char *p;
+
+ for (p = name; *p; p++)
+ if (BASESYNTAX[(signed char)*p] != CWORD)
+ sh_error("Invalid alias name: %s", name);
app = __lookupalias(name);
ap = *app;
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH] parser: Extend coverage of CHKEOFMARK
2023-01-13 13:41 ` Harald van Dijk
@ 2024-04-15 8:47 ` Herbert Xu
0 siblings, 0 replies; 5+ messages in thread
From: Herbert Xu @ 2024-04-15 8:47 UTC (permalink / raw
To: Harald van Dijk; +Cc: Michael Greenberg, DASH shell mailing list
On Fri, Jan 13, 2023 at 01:41:44PM +0000, Harald van Dijk wrote:
>
> Although this representation is not without its problems, it would handle
> this transparently, and has the arguable benefit of automatically handling
> things like
>
> $ cat <<`this is problematic`
> hello world
> `this is problematic`
>
> as well. This has come up on the list before; this works in bash, ksh, yash,
> and zsh, and POSIX places no restrictions on what words can be used as
> heredoc delimiters, so I think shells are required to accept this and it is
> technically a bug that dash doesn't (as well as several other shells), even
> if no one would ever make use of it.
Alright. I've bitten the bullet and extended the existing CHKEOFMARK
to cover all the cases.
---8<---
Extend the coverage of CHKEOFMARK to cover parameter expansion,
arithmetic expansion, and command substitution.
For command substitution, use the reconstruction from commandtext
as the here-document marker.
Reported-by: Harald van Dijk <harald@gigawatt.nl>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/src/jobs.c b/src/jobs.c
index 2a2fe22..ac22ae5 100644
--- a/src/jobs.c
+++ b/src/jobs.c
@@ -1242,13 +1242,19 @@ commandtext(union node *n)
{
char *name;
- STARTSTACKSTR(cmdnextc);
- cmdtxt(n);
+ STARTSTACKSTR(name);
+ commandtextcont(n, name);
name = stackblock();
TRACE(("commandtext: name %p, end %p\n", name, cmdnextc));
return savestr(name);
}
+char *commandtextcont(union node *n, char *next)
+{
+ cmdnextc = next;
+ cmdtxt(n);
+ return cmdnextc;
+}
STATIC void
cmdtxt(union node *n)
diff --git a/src/jobs.h b/src/jobs.h
index 2832d64..a58d2a2 100644
--- a/src/jobs.h
+++ b/src/jobs.h
@@ -107,6 +107,7 @@ int forkshell(struct job *, union node *, int);
struct job *vforkexec(union node *n, char **argv, const char *path, int idx);
int waitforjob(struct job *);
int stoppedjobs(void);
+char *commandtextcont(union node *n, char *next);
#if ! JOBS
#define setjobctl(on) ((void)(on)) /* do nothing */
diff --git a/src/parser.c b/src/parser.c
index 299c260..e3168de 100644
--- a/src/parser.c
+++ b/src/parser.c
@@ -46,6 +46,7 @@
#include "syntax.h"
#include "options.h"
#include "input.h"
+#include "jobs.h"
#include "output.h"
#include "var.h"
#include "error.h"
@@ -628,9 +629,10 @@ parsefname(void)
union node *n = redirnode;
if (n->type == NHERE)
- checkkwd = CHKEOFMARK;
+ checkkwd |= CHKEOFMARK;
if (readtoken() != TWORD)
synexpect(-1);
+ checkkwd &= ~CHKEOFMARK;
if (n->type == NHERE) {
struct heredoc *here = heredoc;
struct heredoc *p;
@@ -901,6 +903,7 @@ readtoken1(int firstc, char const *syntax, char *eofmark, int striptabs)
/* syntax stack */
struct synstack synbase = { .syntax = syntax };
struct synstack *synstack = &synbase;
+ int chkeofmark = checkkwd & CHKEOFMARK;
if (syntax == DQSYNTAX)
synstack->dblquote = 1;
@@ -1010,39 +1013,35 @@ toggledq:
synstack_pop(&synstack);
else if (synstack->dqvarnest > 0)
synstack->dqvarnest--;
- USTPUTC(CTLENDVAR, out);
- } else {
- USTPUTC(c, out);
+ if (!chkeofmark)
+ c = CTLENDVAR;
}
+ USTPUTC(c, out);
break;
case CLP: /* '(' in arithmetic */
synstack->parenlevel++;
USTPUTC(c, out);
break;
case CRP: /* ')' in arithmetic */
- if (synstack->parenlevel > 0) {
- USTPUTC(c, out);
+ if (synstack->parenlevel > 0)
--synstack->parenlevel;
+ else if (pgetc_eatbnl() == ')') {
+ synstack_pop(&synstack);
+ if (chkeofmark)
+ USTPUTC(c, out);
+ else
+ c = CTLENDARI;
} else {
- if (pgetc_eatbnl() == ')') {
- USTPUTC(CTLENDARI, out);
- synstack_pop(&synstack);
- } else {
- /*
- * unbalanced parens
- * (don't 2nd guess - no error)
- */
- pungetc();
- USTPUTC(')', out);
- }
+ /*
+ * unbalanced parens
+ * (don't 2nd guess - no error)
+ */
+ pungetc();
}
+ USTPUTC(c, out);
break;
case CBQUOTE: /* '`' */
- if (checkkwd & CHKEOFMARK) {
- USTPUTC('`', out);
- break;
- }
-
+ USTPUTC('`', out);
PARSEBACKQOLD();
break;
case CEOF:
@@ -1218,13 +1217,16 @@ parsesub: {
static const char types[] = "}-+?=";
c = pgetc_eatbnl();
- if (
- (checkkwd & CHKEOFMARK) ||
- (c != '(' && c != '{' && !is_name(c) && !is_special(c))
- ) {
+ if (c != '(' && c != '{' && !is_name(c) && !is_special(c)) {
USTPUTC('$', out);
pungetc();
- } else if (c == '(') { /* $(command) or $((arith)) */
+ goto parsesub_return;
+ }
+
+ USTPUTC('$', out);
+
+ if (c == '(') { /* $(command) or $((arith)) */
+ USTPUTC(c, out);
if (pgetc_eatbnl() == '(') {
PARSEARITH();
} else {
@@ -1234,11 +1236,15 @@ parsesub: {
} else {
const char *newsyn = synstack->syntax;
- USTPUTC(CTLVAR, out);
typeloc = out - (char *)stackblock();
- STADJUST(1, out);
+ if (!chkeofmark) {
+ out[-1] = CTLVAR;
+ STADJUST(1, out);
+ }
subtype = VSNORMAL;
if (likely(c == '{')) {
+ if (chkeofmark)
+ USTPUTC('{', out);
c = pgetc_eatbnl();
subtype = 0;
}
@@ -1262,8 +1268,11 @@ varname:
if (!subtype && cc == '#') {
subtype = VSLENGTH;
- if (c == '_' || isalnum(c))
+ if (c == '_' || isalnum(c)) {
+ if (chkeofmark)
+ USTPUTC('#', out);
goto varname;
+ }
cc = c;
c = pgetc_eatbnl();
@@ -1272,7 +1281,8 @@ varname:
subtype = 0;
c = cc;
cc = '#';
- }
+ } else if (chkeofmark)
+ USTPUTC('#', out);
}
if (!is_special(cc)) {
@@ -1288,10 +1298,15 @@ varname:
if (subtype == 0) {
int cc = c;
+ if (chkeofmark)
+ STPUTC(c, out);
+
switch (c) {
case ':':
subtype = VSNUL;
c = pgetc_eatbnl();
+ if (chkeofmark)
+ STPUTC(c, out);
/*FALLTHROUGH*/
default:
p = strchr(types, c);
@@ -1304,9 +1319,11 @@ varname:
subtype = c == '#' ? VSTRIMLEFT :
VSTRIMRIGHT;
c = pgetc_eatbnl();
- if (c == cc)
+ if (c == cc) {
+ if (chkeofmark)
+ STPUTC(c, out);
subtype++;
- else
+ } else
pungetc();
newsyn = BASESYNTAX;
@@ -1333,13 +1350,15 @@ badsub:
synstack->dblquote = newsyn != BASESYNTAX;
}
- *((char *)stackblock() + typeloc) = subtype | VSBIT;
if (subtype != VSNORMAL) {
synstack->varnest++;
if (synstack->dblquote)
synstack->dqvarnest++;
}
- STPUTC('=', out);
+ if (!chkeofmark) {
+ *((char *)stackblock() + typeloc) = subtype | VSBIT;
+ STPUTC('=', out);
+ }
}
goto parsesub_return;
}
@@ -1353,14 +1372,19 @@ badsub:
*/
parsebackq: {
- struct nodelist **nlpp;
- union node *n;
- char *str;
- size_t savelen;
- struct heredoc *saveheredoclist;
int uninitialized_var(saveprompt);
+ struct heredoc *saveheredoclist;
+ struct nodelist **nlpp;
+ size_t psavelen;
+ size_t savelen;
+ union node *n;
+ char *pstr;
+ char *str;
- USTPUTC(CTLBACKQ, out);
+ if (!chkeofmark) {
+ STADJUST(oldstyle - 1, out);
+ out[-1] = CTLBACKQ;
+ }
str = stackblock();
savelen = out - (char *)stackblock();
grabstackblock(savelen);
@@ -1370,9 +1394,6 @@ parsebackq: {
reread it as input, interpreting it normally. */
char *pout;
int pc;
- size_t psavelen;
- char *pstr;
-
STARTSTACKSTR(pout);
for (;;) {
@@ -1405,10 +1426,8 @@ parsebackq: {
done:
STPUTC('\0', pout);
psavelen = pout - (char *)stackblock();
- if (psavelen > 0) {
- pstr = grabstackstr(pout);
- setinputstring(pstr);
- }
+ pstr = grabstackstr(pout);
+ setinputstring(pstr);
}
nlpp = &bqlist;
while (*nlpp)
@@ -1440,14 +1459,26 @@ done:
(*nlpp)->n = n;
/* Start reading from old file again. */
popfile();
- /* Ignore any pushed back tokens left from the backquote parsing. */
- if (oldstyle)
- tokpushback = 0;
+
out = stnputs(str, savelen, stackblock());
- if (oldstyle)
+
+ if (oldstyle) {
+ /* Ignore any pushed back tokens left from the backquote
+ * parsing.
+ */
+ tokpushback = 0;
+ if (chkeofmark) {
+ pstr[psavelen - 1] = '`';
+ out = stnputs(pstr, psavelen, out);
+ }
goto parsebackq_oldreturn;
- else
+ } else {
+ if (chkeofmark) {
+ out = commandtextcont(n, out);
+ USTPUTC(')', out);
+ }
goto parsebackq_newreturn;
+ }
}
/*
@@ -1459,7 +1490,12 @@ parsearith: {
synstack->prev ?: alloca(sizeof(*synstack)),
ARISYNTAX);
synstack->dblquote = 1;
- USTPUTC(CTLARI, out);
+ if (chkeofmark)
+ USTPUTC(c, out);
+ else {
+ STADJUST(-1, out);
+ out[-1] = CTLARI;
+ }
goto parsearith_return;
}
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-04-15 8:47 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-01-11 2:01 alias confusion due to internal word representation Harald van Dijk
2023-01-12 15:56 ` Michael Greenberg
2023-01-13 13:41 ` Harald van Dijk
2024-04-15 8:47 ` [PATCH] parser: Extend coverage of CHKEOFMARK Herbert Xu
2024-04-12 9:51 ` [PATCH] alias: Disallow non-CWORD characters Herbert Xu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).