Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] banned: mark `strok()`, `strtok_r()` as banned
@ 2023-04-13 23:31 Taylor Blau
  2023-04-13 23:31 ` [PATCH 1/5] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
                   ` (6 more replies)
  0 siblings, 7 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-13 23:31 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

This series adds `strtok()`, and its reentrant variant to the list of
banned functions.

The rationale is described in the final patch, but the gist is that even
though `strtok_r()` is thread-safe, it imposes a burden on the caller
and has confusing dataflow. For more details, the final patch has a full
explanation.

The series is structured as follows:

  - The first patch introduces `string_list_split_in_place_multi()`,
    which allows us to split token delimited strings in place where
    the set of accepted tokens is more than a single character.

  - The next three patches replace the only in-tree uses of strtok() we
    have, which are all in test helpers.

  - The final patch marks the two functions as banned.

Thanks in advance for your review!

Taylor Blau (5):
  string-list: introduce `string_list_split_in_place_multi()`
  t/helper/test-hashmap.c: avoid using `strtok()`
  t/helper/test-oidmap.c: avoid using `strtok()`
  t/helper/test-json-writer.c: avoid using `strtok()`
  banned.h: mark `strtok()`, `strtok_r()` as banned

 banned.h                    |  6 +++
 string-list.c               | 15 ++++++--
 string-list.h               |  6 +++
 t/helper/test-hashmap.c     | 30 +++++++++++----
 t/helper/test-json-writer.c | 76 +++++++++++++++++++++++--------------
 t/helper/test-oidmap.c      | 20 +++++++---
 6 files changed, 109 insertions(+), 44 deletions(-)

-- 
2.38.0.16.g393fd4c6db

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/5] string-list: introduce `string_list_split_in_place_multi()`
  2023-04-13 23:31 [PATCH 0/5] banned: mark `strok()`, `strtok_r()` as banned Taylor Blau
@ 2023-04-13 23:31 ` Taylor Blau
  2023-04-18 10:10   ` Jeff King
  2023-04-13 23:31 ` [PATCH 2/5] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Taylor Blau @ 2023-04-13 23:31 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

Introduce a variant of the `string_list_split_in_place()` function that
takes a string of accepted delimiters.

By contrast to its cousin `string_list_split_in_place()` which splits
the given string at every instance of the single character `delim`, the
`_multi` variant splits the given string any any character appearing in
the string `delim`.

Instead of using `strchr(2)` to locate the first occurrence of the given
delimiter character, `string_list_split_in_place_multi()` uses
`strpbrk(2)` to find the first occurrence of *any* character in the given
delimiter string.

Since the `_multi` variant is a generalization of the original
implementation, reimplement `string_list_split_in_place()` in terms of
the more general function by providing a single-character string for the
list of accepted delimiters.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 string-list.c | 15 ++++++++++++---
 string-list.h |  6 ++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/string-list.c b/string-list.c
index db473f273e1..67f9ff18904 100644
--- a/string-list.c
+++ b/string-list.c
@@ -300,8 +300,8 @@ int string_list_split(struct string_list *list, const char *string,
 	}
 }
 
-int string_list_split_in_place(struct string_list *list, char *string,
-			       int delim, int maxsplit)
+int string_list_split_in_place_multi(struct string_list *list, char *string,
+				     const char *delim, int maxsplit)
 {
 	int count = 0;
 	char *p = string, *end;
@@ -315,7 +315,7 @@ int string_list_split_in_place(struct string_list *list, char *string,
 			string_list_append(list, p);
 			return count;
 		}
-		end = strchr(p, delim);
+		end = strpbrk(p, delim);
 		if (end) {
 			*end = '\0';
 			string_list_append(list, p);
@@ -326,3 +326,12 @@ int string_list_split_in_place(struct string_list *list, char *string,
 		}
 	}
 }
+
+int string_list_split_in_place(struct string_list *list, char *string,
+			       int delim, int maxsplit)
+{
+	char delim_s[2] = { delim, 0 };
+
+	return string_list_split_in_place_multi(list, string, delim_s,
+						maxsplit);
+}
diff --git a/string-list.h b/string-list.h
index c7b0d5d0008..670d4fc8fb7 100644
--- a/string-list.h
+++ b/string-list.h
@@ -268,7 +268,13 @@ int string_list_split(struct string_list *list, const char *string,
  * new string_list_items point into string (which therefore must not
  * be modified or freed while the string_list is in use).
  * list->strdup_strings must *not* be set.
+ *
+ * The "_multi" variant splits the given string on any character
+ * appearing in "delim", and the non-"_multi" variant splits only on the
+ * given character.
  */
+int string_list_split_in_place_multi(struct string_list *list, char *string,
+				     const char *delim, int maxsplit);
 int string_list_split_in_place(struct string_list *list, char *string,
 			       int delim, int maxsplit);
 #endif /* STRING_LIST_H */
-- 
2.38.0.16.g393fd4c6db


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 2/5] t/helper/test-hashmap.c: avoid using `strtok()`
  2023-04-13 23:31 [PATCH 0/5] banned: mark `strok()`, `strtok_r()` as banned Taylor Blau
  2023-04-13 23:31 ` [PATCH 1/5] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
@ 2023-04-13 23:31 ` Taylor Blau
  2023-04-18 10:23   ` Jeff King
  2023-04-13 23:31 ` [PATCH 3/5] t/helper/test-oidmap.c: " Taylor Blau
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Taylor Blau @ 2023-04-13 23:31 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

Avoid using the non-reentrant `strtok()` to separate the parts of each
incoming command. Instead of replacing it with `strtok_r()`, let's
instead use the more friendly `string_list_split_in_place_multi()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hashmap.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/t/helper/test-hashmap.c b/t/helper/test-hashmap.c
index 36ff07bd4be..902ceb55113 100644
--- a/t/helper/test-hashmap.c
+++ b/t/helper/test-hashmap.c
@@ -2,6 +2,7 @@
 #include "git-compat-util.h"
 #include "hashmap.h"
 #include "strbuf.h"
+#include "string-list.h"
 
 struct test_entry
 {
@@ -150,6 +151,7 @@ static void perf_hashmap(unsigned int method, unsigned int rounds)
  */
 int cmd__hashmap(int argc, const char **argv)
 {
+	struct string_list parts = STRING_LIST_INIT_NODUP;
 	struct strbuf line = STRBUF_INIT;
 	int icase;
 	struct hashmap map = HASHMAP_INIT(test_entry_cmp, &icase);
@@ -159,21 +161,34 @@ int cmd__hashmap(int argc, const char **argv)
 
 	/* process commands from stdin */
 	while (strbuf_getline(&line, stdin) != EOF) {
-		char *cmd, *p1 = NULL, *p2 = NULL;
+		char *cmd, *p1, *p2;
 		unsigned int hash = 0;
 		struct test_entry *entry;
 
+		/*
+		 * Because we memdup() the arguments out of the
+		 * string_list before inserting them into the hashmap,
+		 * it's OK to set its length back to zero to avoid
+		 * re-allocating the items array once per line.
+		 *
+		 * By doing so, we'll instead overwrite the existing
+		 * entries and avoid re-allocating.
+		 */
+		parts.nr = 0;
 		/* break line into command and up to two parameters */
-		cmd = strtok(line.buf, DELIM);
+		string_list_split_in_place_multi(&parts, line.buf, DELIM, 2);
+
 		/* ignore empty lines */
-		if (!cmd || *cmd == '#')
+		if (!parts.nr)
+			continue;
+		if (!*parts.items[0].string || *parts.items[0].string == '#')
 			continue;
 
-		p1 = strtok(NULL, DELIM);
-		if (p1) {
+		cmd = parts.items[0].string;
+		p1 = parts.nr >= 1 ? parts.items[1].string : NULL;
+		p2 = parts.nr >= 2 ? parts.items[2].string : NULL;
+		if (p1)
 			hash = icase ? strihash(p1) : strhash(p1);
-			p2 = strtok(NULL, DELIM);
-		}
 
 		if (!strcmp("add", cmd) && p1 && p2) {
 
@@ -260,6 +275,7 @@ int cmd__hashmap(int argc, const char **argv)
 		}
 	}
 
+	string_list_clear(&parts, 0);
 	strbuf_release(&line);
 	hashmap_clear_and_free(&map, struct test_entry, ent);
 	return 0;
-- 
2.38.0.16.g393fd4c6db


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 3/5] t/helper/test-oidmap.c: avoid using `strtok()`
  2023-04-13 23:31 [PATCH 0/5] banned: mark `strok()`, `strtok_r()` as banned Taylor Blau
  2023-04-13 23:31 ` [PATCH 1/5] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
  2023-04-13 23:31 ` [PATCH 2/5] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
@ 2023-04-13 23:31 ` Taylor Blau
  2023-04-13 23:31 ` [PATCH 4/5] t/helper/test-json-writer.c: " Taylor Blau
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-13 23:31 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

Apply similar treatment as in the previous commit to remove usage of
`strtok()` from the "oidmap" test helper.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-oidmap.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/t/helper/test-oidmap.c b/t/helper/test-oidmap.c
index a7b7b38df1f..cf4d2f7c085 100644
--- a/t/helper/test-oidmap.c
+++ b/t/helper/test-oidmap.c
@@ -4,6 +4,7 @@
 #include "oidmap.h"
 #include "setup.h"
 #include "strbuf.h"
+#include "string-list.h"
 
 /* key is an oid and value is a name (could be a refname for example) */
 struct test_entry {
@@ -25,6 +26,7 @@ struct test_entry {
  */
 int cmd__oidmap(int argc UNUSED, const char **argv UNUSED)
 {
+	struct string_list parts = STRING_LIST_INIT_NODUP;
 	struct strbuf line = STRBUF_INIT;
 	struct oidmap map = OIDMAP_INIT;
 
@@ -35,19 +37,24 @@ int cmd__oidmap(int argc UNUSED, const char **argv UNUSED)
 
 	/* process commands from stdin */
 	while (strbuf_getline(&line, stdin) != EOF) {
-		char *cmd, *p1 = NULL, *p2 = NULL;
+		char *cmd, *p1, *p2;
 		struct test_entry *entry;
 		struct object_id oid;
 
+		/* see the comment in cmd__hashmap() */
+		parts.nr = 0;
 		/* break line into command and up to two parameters */
-		cmd = strtok(line.buf, DELIM);
+		string_list_split_in_place_multi(&parts, line.buf, DELIM, 2);
+
 		/* ignore empty lines */
-		if (!cmd || *cmd == '#')
+		if (!parts.nr)
+			continue;
+		if (!*parts.items[0].string || *parts.items[0].string == '#')
 			continue;
 
-		p1 = strtok(NULL, DELIM);
-		if (p1)
-			p2 = strtok(NULL, DELIM);
+		cmd = parts.items[0].string;
+		p1 = parts.nr >= 1 ? parts.items[1].string : NULL;
+		p2 = parts.nr >= 2 ? parts.items[2].string : NULL;
 
 		if (!strcmp("put", cmd) && p1 && p2) {
 
@@ -108,6 +115,7 @@ int cmd__oidmap(int argc UNUSED, const char **argv UNUSED)
 		}
 	}
 
+	string_list_clear(&parts, 0);
 	strbuf_release(&line);
 	oidmap_free(&map, 1);
 	return 0;
-- 
2.38.0.16.g393fd4c6db


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 4/5] t/helper/test-json-writer.c: avoid using `strtok()`
  2023-04-13 23:31 [PATCH 0/5] banned: mark `strok()`, `strtok_r()` as banned Taylor Blau
                   ` (2 preceding siblings ...)
  2023-04-13 23:31 ` [PATCH 3/5] t/helper/test-oidmap.c: " Taylor Blau
@ 2023-04-13 23:31 ` Taylor Blau
  2023-04-13 23:31 ` [PATCH 5/5] banned.h: mark `strtok()`, `strtok_r()` as banned Taylor Blau
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-13 23:31 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

Apply similar treatment as in the previous commit to remove usage of
`strtok()` from the "oidmap" test helper.

Each of the different commands that the "json-writer" helper accepts
pops the next space-delimited token from the current line and interprets
it as a string, integer, or double (with the exception of the very first
token, which is the command itself).

To accommodate this, split the line in place by the space character, and
pass the corresponding string_list to each of the specialized `get_s()`,
`get_i()`, and `get_d()` functions.

`get_i()` and `get_d()` are thin wrappers around `get_s()` that convert
their result into the appropriate type by either calling `strtol()` or
`strtod()`, respectively. In `get_s()`, we mark the token as "consumed"
by incrementing the `consumed_nr` counter, indicating how many tokens we
have read up to that point.

Because each of these functions needs the string-list parts, the number
of tokens consumed, and the line number, these three are wrapped up in
to a struct representing the line state.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-json-writer.c | 76 +++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 28 deletions(-)

diff --git a/t/helper/test-json-writer.c b/t/helper/test-json-writer.c
index 86887f53203..64bfa1c5f3c 100644
--- a/t/helper/test-json-writer.c
+++ b/t/helper/test-json-writer.c
@@ -1,5 +1,6 @@
 #include "test-tool.h"
 #include "json-writer.h"
+#include "string-list.h"
 
 static const char *expect_obj1 = "{\"a\":\"abc\",\"b\":42,\"c\":true}";
 static const char *expect_obj2 = "{\"a\":-1,\"b\":2147483647,\"c\":0}";
@@ -394,35 +395,41 @@ static int unit_tests(void)
 	return 0;
 }
 
-static void get_s(int line_nr, char **s_in)
+struct line {
+	struct string_list *parts;
+	size_t consumed_nr;
+	int nr;
+};
+
+static void get_s(struct line *line, char **s_in)
 {
-	*s_in = strtok(NULL, " ");
-	if (!*s_in)
-		die("line[%d]: expected: <s>", line_nr);
+	if (line->consumed_nr > line->parts->nr)
+		die("line[%d]: expected: <s>", line->nr);
+	*s_in = line->parts->items[line->consumed_nr++].string;
 }
 
-static void get_i(int line_nr, intmax_t *s_in)
+static void get_i(struct line *line, intmax_t *s_in)
 {
 	char *s;
 	char *endptr;
 
-	get_s(line_nr, &s);
+	get_s(line, &s);
 
 	*s_in = strtol(s, &endptr, 10);
 	if (*endptr || errno == ERANGE)
-		die("line[%d]: invalid integer value", line_nr);
+		die("line[%d]: invalid integer value", line->nr);
 }
 
-static void get_d(int line_nr, double *s_in)
+static void get_d(struct line *line, double *s_in)
 {
 	char *s;
 	char *endptr;
 
-	get_s(line_nr, &s);
+	get_s(line, &s);
 
 	*s_in = strtod(s, &endptr);
 	if (*endptr || errno == ERANGE)
-		die("line[%d]: invalid float value", line_nr);
+		die("line[%d]: invalid float value", line->nr);
 }
 
 static int pretty;
@@ -453,6 +460,7 @@ static char *get_trimmed_line(char *buf, int buf_size)
 
 static int scripted(void)
 {
+	struct string_list parts = STRING_LIST_INIT_NODUP;
 	struct json_writer jw = JSON_WRITER_INIT;
 	char buf[MAX_LINE_LENGTH];
 	char *line;
@@ -470,66 +478,77 @@ static int scripted(void)
 		die("expected first line to be 'object' or 'array'");
 
 	while ((line = get_trimmed_line(buf, MAX_LINE_LENGTH)) != NULL) {
+		struct line state = { 0 };
 		char *verb;
 		char *key;
 		char *s_value;
 		intmax_t i_value;
 		double d_value;
 
-		line_nr++;
+		state.parts = &parts;
+		state.nr = ++line_nr;
 
-		verb = strtok(line, " ");
+		/* see the comment in cmd__hashmap() */
+		parts.nr = 0;
+		/* break line into command and zero or more tokens */
+		string_list_split_in_place(&parts, line, ' ', -1);
+
+		/* ignore empty lines */
+		if (!parts.nr || !*parts.items[0].string)
+			continue;
+
+		verb = parts.items[state.consumed_nr++].string;
 
 		if (!strcmp(verb, "end")) {
 			jw_end(&jw);
 		}
 		else if (!strcmp(verb, "object-string")) {
-			get_s(line_nr, &key);
-			get_s(line_nr, &s_value);
+			get_s(&state, &key);
+			get_s(&state, &s_value);
 			jw_object_string(&jw, key, s_value);
 		}
 		else if (!strcmp(verb, "object-int")) {
-			get_s(line_nr, &key);
-			get_i(line_nr, &i_value);
+			get_s(&state, &key);
+			get_i(&state, &i_value);
 			jw_object_intmax(&jw, key, i_value);
 		}
 		else if (!strcmp(verb, "object-double")) {
-			get_s(line_nr, &key);
-			get_i(line_nr, &i_value);
-			get_d(line_nr, &d_value);
+			get_s(&state, &key);
+			get_i(&state, &i_value);
+			get_d(&state, &d_value);
 			jw_object_double(&jw, key, i_value, d_value);
 		}
 		else if (!strcmp(verb, "object-true")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_true(&jw, key);
 		}
 		else if (!strcmp(verb, "object-false")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_false(&jw, key);
 		}
 		else if (!strcmp(verb, "object-null")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_null(&jw, key);
 		}
 		else if (!strcmp(verb, "object-object")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_inline_begin_object(&jw, key);
 		}
 		else if (!strcmp(verb, "object-array")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_inline_begin_array(&jw, key);
 		}
 		else if (!strcmp(verb, "array-string")) {
-			get_s(line_nr, &s_value);
+			get_s(&state, &s_value);
 			jw_array_string(&jw, s_value);
 		}
 		else if (!strcmp(verb, "array-int")) {
-			get_i(line_nr, &i_value);
+			get_i(&state, &i_value);
 			jw_array_intmax(&jw, i_value);
 		}
 		else if (!strcmp(verb, "array-double")) {
-			get_i(line_nr, &i_value);
-			get_d(line_nr, &d_value);
+			get_i(&state, &i_value);
+			get_d(&state, &d_value);
 			jw_array_double(&jw, i_value, d_value);
 		}
 		else if (!strcmp(verb, "array-true"))
@@ -552,6 +571,7 @@ static int scripted(void)
 	printf("%s\n", jw.json.buf);
 
 	jw_release(&jw);
+	string_list_clear(&parts, 0);
 	return 0;
 }
 
-- 
2.38.0.16.g393fd4c6db


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 5/5] banned.h: mark `strtok()`, `strtok_r()` as banned
  2023-04-13 23:31 [PATCH 0/5] banned: mark `strok()`, `strtok_r()` as banned Taylor Blau
                   ` (3 preceding siblings ...)
  2023-04-13 23:31 ` [PATCH 4/5] t/helper/test-json-writer.c: " Taylor Blau
@ 2023-04-13 23:31 ` Taylor Blau
  2023-04-14  1:39   ` Junio C Hamano
  2023-04-18 19:18 ` [PATCH v2 0/6] banned: mark `strok()` " Taylor Blau
  2023-04-24 22:20 ` [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` " Taylor Blau
  6 siblings, 1 reply; 47+ messages in thread
From: Taylor Blau @ 2023-04-13 23:31 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

`strtok_r()` is reentrant, but `strtok()` is not, meaning that using it
is not thread-safe.

We could ban `strtok()` and force callers to use its reentrant
counterpart, but there are a few drawbacks to doing so:

  - `strtok_r()` forces the caller to maintain an extra string pointer
    to pass as its `saveptr` value

  - `strtok_r()` also requires that its `saveptr` value be unmodified
    between calls.

  - `strtok()` (and by extension, `strtok_r()`) is confusing when used
    across multiple functions, since the caller is supposed to pass NULL
    as its first argument after the first call. This makes it difficult
    to determine what string is actually being tokenized without clear
    dataflow.

So while we could ban only `strtok()`, there really is no good reason to
use either when callers could instead use the much friendlier
`string_list_split_in_place()` API, which avoids the above issues.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 banned.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/banned.h b/banned.h
index 6ccf46bc197..9bd23ce5732 100644
--- a/banned.h
+++ b/banned.h
@@ -18,6 +18,12 @@
 #define strncpy(x,y,n) BANNED(strncpy)
 #undef strncat
 #define strncat(x,y,n) BANNED(strncat)
+#if 0
+#undef strtok
+#define strtok(x,y) BANNED(strtok)
+#undef strtok_r
+#define strtok_r(x,y,z) BANNED(strtok_r)
+#endif
 
 #undef sprintf
 #undef vsprintf
-- 
2.38.0.16.g393fd4c6db

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/5] banned.h: mark `strtok()`, `strtok_r()` as banned
  2023-04-13 23:31 ` [PATCH 5/5] banned.h: mark `strtok()`, `strtok_r()` as banned Taylor Blau
@ 2023-04-14  1:39   ` Junio C Hamano
  2023-04-14  2:08     ` Chris Torek
  2023-04-14 13:41     ` Taylor Blau
  0 siblings, 2 replies; 47+ messages in thread
From: Junio C Hamano @ 2023-04-14  1:39 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King

Taylor Blau <me@ttaylorr.com> writes:

>   - `strtok_r()` forces the caller to maintain an extra string pointer
>     to pass as its `saveptr` value
>
>   - `strtok_r()` also requires that its `saveptr` value be unmodified
>     between calls.
>
>   - `strtok()` (and by extension, `strtok_r()`) is confusing when used
>     across multiple functions, since the caller is supposed to pass NULL
>     as its first argument after the first call. This makes it difficult
>     to determine what string is actually being tokenized without clear
>     dataflow.

It seems that the only existing users of strtok() are all in
t/helper/ directory, so I personally do not think it is a huge loss
if these two are forbidden.  While I do not see why we should use
strtok(), none of the above sound like sensible reasons to ban
strtok_r().  At best, they may point out awkwardness of the function
to make you try finding an alternative that is easier-to-use before
choosing strtok_r() for your application on a case-by-case basis.

If your application wants to chomp a string into tokens from left to
right, inspecting the resulting token one-by-one as it goes until it
hits a token that satisfies some condition and then terminate
without wasting cycles on the rest, string_list_split_in_place() is
a poor choice.  In such a use case, you do not know upfront where in
the string the sought-after token would be, so you have to split the
string in full without taking an early exit via maxsplit.  Also, you
are restricted to a single byte value for the delimiter, and unlike
strtok[_r](), string_list_split_in_place() does not squash a run of
delimiter bytes into one inter-token delimiter.

One gripe I have against use of strtok() is actually not with
threading but because people often misuse it when strcspn() is what
they want (i.e. measure the length of the "first token", so that
they can then xmemdupz() a copy out), forgetting that strtok[_r]()
is destructive.

So, I dunno.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/5] banned.h: mark `strtok()`, `strtok_r()` as banned
  2023-04-14  1:39   ` Junio C Hamano
@ 2023-04-14  2:08     ` Chris Torek
  2023-04-14 13:41     ` Taylor Blau
  1 sibling, 0 replies; 47+ messages in thread
From: Chris Torek @ 2023-04-14  2:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, Jeff King

On Thu, Apr 13, 2023 at 7:00 PM Junio C Hamano <gitster@pobox.com> wrote:
> ... At best, they may point out awkwardness of the function [strtok_r]

The awkward interface for strtok as extended to strtok_r is why
we (Keith Bostic and I) came up with what's now strsep. Of course
strsep is not standard and has its own issues but it can be used
to parse $PATH for instance, or old-style /etc/passwd entries.

In this case (tests) it's probably not that important to be efficient
anyway,

Chris

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/5] banned.h: mark `strtok()`, `strtok_r()` as banned
  2023-04-14  1:39   ` Junio C Hamano
  2023-04-14  2:08     ` Chris Torek
@ 2023-04-14 13:41     ` Taylor Blau
  1 sibling, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-14 13:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King

On Thu, Apr 13, 2023 at 06:39:18PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >   - `strtok_r()` forces the caller to maintain an extra string pointer
> >     to pass as its `saveptr` value
> >
> >   - `strtok_r()` also requires that its `saveptr` value be unmodified
> >     between calls.
> >
> >   - `strtok()` (and by extension, `strtok_r()`) is confusing when used
> >     across multiple functions, since the caller is supposed to pass NULL
> >     as its first argument after the first call. This makes it difficult
> >     to determine what string is actually being tokenized without clear
> >     dataflow.
>
> It seems that the only existing users of strtok() are all in
> t/helper/ directory, so I personally do not think it is a huge loss
> if these two are forbidden.  While I do not see why we should use
> strtok(), none of the above sound like sensible reasons to ban
> strtok_r().  At best, they may point out awkwardness of the function
> to make you try finding an alternative that is easier-to-use before
> choosing strtok_r() for your application on a case-by-case basis.

For what it's worth, I could certainly live if we accomplished getting
strtok(2) on the banned list, but left strtok_r(2) un-banned.

TBH, I think that leaving the reenterant version of a banned function as
un-banned is a little awkward, I don't mind it if you don't feel like
the above are sufficient reasons to ban it.

> If your application wants to chomp a string into tokens from left to
> right, inspecting the resulting token one-by-one as it goes until it
> hits a token that satisfies some condition and then terminate
> without wasting cycles on the rest, string_list_split_in_place() is
> a poor choice.  In such a use case, you do not know upfront where in
> the string the sought-after token would be, so you have to split the
> string in full without taking an early exit via maxsplit.  Also, you
> are restricted to a single byte value for the delimiter, and unlike
> strtok[_r](), string_list_split_in_place() does not squash a run of
> delimiter bytes into one inter-token delimiter.

I don't quite agree with this. In practice, you could repeatedly call
`string_list_split_in_place()` with maxsplit of "1", using the tail of
the string list you're splitting into as the string to split. That would
allow you to split tokens one at a time into the string list without
having to split the whole line up front.

That all said, I don't think that we have such a use case in the tree,
at least from my searching for strtok() and
string_list_split_in_place().

It may be that using strtok_r() for such a thing would be less awkward:
not having tried both (and having no examples to reference) I honestly
do not know for certain.

> One gripe I have against use of strtok() is actually not with
> threading but because people often misuse it when strcspn() is what
> they want (i.e. measure the length of the "first token", so that
> they can then xmemdupz() a copy out), forgetting that strtok[_r]()
> is destructive.

Heh, I'm happy to add that to this, if you want ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] string-list: introduce `string_list_split_in_place_multi()`
  2023-04-13 23:31 ` [PATCH 1/5] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
@ 2023-04-18 10:10   ` Jeff King
  2023-04-18 17:08     ` Taylor Blau
  0 siblings, 1 reply; 47+ messages in thread
From: Jeff King @ 2023-04-18 10:10 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano

On Thu, Apr 13, 2023 at 07:31:43PM -0400, Taylor Blau wrote:

> Instead of using `strchr(2)` to locate the first occurrence of the given
> delimiter character, `string_list_split_in_place_multi()` uses
> `strpbrk(2)` to find the first occurrence of *any* character in the given
> delimiter string.
> 
> Since the `_multi` variant is a generalization of the original
> implementation, reimplement `string_list_split_in_place()` in terms of
> the more general function by providing a single-character string for the
> list of accepted delimiters.

I'd imagine that strpbrk() is potentially a lot slower than strchr().
But I kind of doubt that splitting is such a hot path that it will
matter. If we want to care, I think we could do something like:

  end = delim[1] ? strpbrk(p, delim) : strchr(p, delim[0]);

It's entirely possible that a half-decent strpbrk() implementation does
this already.

So I mention it mostly in case we need to revisit this later. I think
it's OK to ignore for now.

-Peff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/5] t/helper/test-hashmap.c: avoid using `strtok()`
  2023-04-13 23:31 ` [PATCH 2/5] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
@ 2023-04-18 10:23   ` Jeff King
  2023-04-18 18:06     ` Taylor Blau
  0 siblings, 1 reply; 47+ messages in thread
From: Jeff King @ 2023-04-18 10:23 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano

On Thu, Apr 13, 2023 at 07:31:46PM -0400, Taylor Blau wrote:

> Avoid using the non-reentrant `strtok()` to separate the parts of each
> incoming command. Instead of replacing it with `strtok_r()`, let's
> instead use the more friendly `string_list_split_in_place_multi()`.

Junio mentioned this offhand in his response, but I wanted to highlight
one difference in before/after behavior here.

  [before]
  $ printf 'add foo    bar\niterate\n' | t/helper/test-tool hashmap
  foo bar

  [after]
  $ printf 'add foo    bar\niterate\n' | t/helper/test-tool hashmap
  foo    bar

I think that's fine for this test script, but it may be an indication
that we want string-list's split to support different semantics.

> @@ -159,21 +161,34 @@ int cmd__hashmap(int argc, const char **argv)
>  
>  	/* process commands from stdin */
>  	while (strbuf_getline(&line, stdin) != EOF) {
> -		char *cmd, *p1 = NULL, *p2 = NULL;
> +		char *cmd, *p1, *p2;
>  		unsigned int hash = 0;
>  		struct test_entry *entry;
>  
> +		/*
> +		 * Because we memdup() the arguments out of the
> +		 * string_list before inserting them into the hashmap,
> +		 * it's OK to set its length back to zero to avoid
> +		 * re-allocating the items array once per line.
> +		 *
> +		 * By doing so, we'll instead overwrite the existing
> +		 * entries and avoid re-allocating.
> +		 */
> +		parts.nr = 0;

I think this is OK, but I wonder if we should wrap it in a function that
makes sure the strings aren't owned by the strdup (and thus aren't being
leaked). Something like:

  void string_list_setlen(struct string_list *sl, size_t nr)
  {
	/* alternatively, I guess we could actually free nr..sl->nr */
	if (sl->strdup_strings)
		BUG("you can't setlen on a string-list which owns its strings");
	if (nr > sl->nr)
		BUG("you can't grow a string-list with setlen");
	sl->nr = nr;
  }

That is probably overkill for these two test helpers, but I wonder if
the pattern might escape into "real" code (especially if we suggest
using string-list instead of strtok).

-Peff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] string-list: introduce `string_list_split_in_place_multi()`
  2023-04-18 10:10   ` Jeff King
@ 2023-04-18 17:08     ` Taylor Blau
  0 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-18 17:08 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

On Tue, Apr 18, 2023 at 06:10:58AM -0400, Jeff King wrote:
> On Thu, Apr 13, 2023 at 07:31:43PM -0400, Taylor Blau wrote:
>
> > Instead of using `strchr(2)` to locate the first occurrence of the given
> > delimiter character, `string_list_split_in_place_multi()` uses
> > `strpbrk(2)` to find the first occurrence of *any* character in the given
> > delimiter string.
> >
> > Since the `_multi` variant is a generalization of the original
> > implementation, reimplement `string_list_split_in_place()` in terms of
> > the more general function by providing a single-character string for the
> > list of accepted delimiters.
>
> I'd imagine that strpbrk() is potentially a lot slower than strchr().
> But I kind of doubt that splitting is such a hot path that it will
> matter. If we want to care, I think we could do something like:
>
>   end = delim[1] ? strpbrk(p, delim) : strchr(p, delim[0]);
>
> It's entirely possible that a half-decent strpbrk() implementation does
> this already.

I did a little research here, and it looks like both glibc[1] and
musl[2] already do this. Those links are for their implementations of
strcspn(), but which both implement[^1] strpbrk() as "p += strcspn(p,
d)".

> So I mention it mostly in case we need to revisit this later. I think
> it's OK to ignore for now.

In addition to being OK from a performance standpoint I think solving
the semantics concern you noted lower down in the thread[3] is fairly
straightforward.

When in "_multi" mode, I think this boils down to moving past any number
of characters in `delim` before each iteration of the loop. And that's
doable with "p += strspn(p, delim)".

That avoids any behavior change, and more closely matches what strtok()
does, so I think it's a better path.

Thanks,
Taylor

[1]: https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcspn.c;hb=glibc-2.37#l35
[2]: https://git.musl-libc.org/cgit/musl/tree/src/string/strcspn.c?h=v1.2.3#n11
[3]: https://lore.kernel.org/git/20230418102320.GB508219@coredump.intra.peff.net/

[^1]: Not entirely. If strcspn() takes you to the end of the string,
  strpbrk() will return NULL, instead of a pointer to the end of the
  string.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/5] t/helper/test-hashmap.c: avoid using `strtok()`
  2023-04-18 10:23   ` Jeff King
@ 2023-04-18 18:06     ` Taylor Blau
  0 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-18 18:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

On Tue, Apr 18, 2023 at 06:23:20AM -0400, Jeff King wrote:
> On Thu, Apr 13, 2023 at 07:31:46PM -0400, Taylor Blau wrote:
>
> > Avoid using the non-reentrant `strtok()` to separate the parts of each
> > incoming command. Instead of replacing it with `strtok_r()`, let's
> > instead use the more friendly `string_list_split_in_place_multi()`.
>
> Junio mentioned this offhand in his response, but I wanted to highlight
> one difference in before/after behavior here.
>
>   [before]
>   $ printf 'add foo    bar\niterate\n' | t/helper/test-tool hashmap
>   foo bar
>
>   [after]
>   $ printf 'add foo    bar\niterate\n' | t/helper/test-tool hashmap
>   foo    bar
>
> I think that's fine for this test script, but it may be an indication
> that we want string-list's split to support different semantics.

I agree that it's OK for the test scripts, but probably isn't the right
thing for all cases. In the reroll I'll send shortly, I reimplemented
this in a way that (a) doesn't change the behavior of
`string_list_split_in_place()`, and (b) allows us to easily change the
semantics for `string_list_split_in_place_multi()`.

> > @@ -159,21 +161,34 @@ int cmd__hashmap(int argc, const char **argv)
> >
> >  	/* process commands from stdin */
> >  	while (strbuf_getline(&line, stdin) != EOF) {
> > -		char *cmd, *p1 = NULL, *p2 = NULL;
> > +		char *cmd, *p1, *p2;
> >  		unsigned int hash = 0;
> >  		struct test_entry *entry;
> >
> > +		/*
> > +		 * Because we memdup() the arguments out of the
> > +		 * string_list before inserting them into the hashmap,
> > +		 * it's OK to set its length back to zero to avoid
> > +		 * re-allocating the items array once per line.
> > +		 *
> > +		 * By doing so, we'll instead overwrite the existing
> > +		 * entries and avoid re-allocating.
> > +		 */
> > +		parts.nr = 0;
>
> I think this is OK, but I wonder if we should wrap it in a function that
> makes sure the strings aren't owned by the strdup (and thus aren't being
> leaked). Something like:
>
>   void string_list_setlen(struct string_list *sl, size_t nr)
>   {
> 	/* alternatively, I guess we could actually free nr..sl->nr */
> 	if (sl->strdup_strings)
> 		BUG("you can't setlen on a string-list which owns its strings");
> 	if (nr > sl->nr)
> 		BUG("you can't grow a string-list with setlen");
> 	sl->nr = nr;
>   }
>
> That is probably overkill for these two test helpers, but I wonder if
> the pattern might escape into "real" code (especially if we suggest
> using string-list instead of strtok).

I like this suggestion. I added a new patch (for which you are listed
under Co-authored-by) which adds this function as-is.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 0/6] banned: mark `strok()` as banned
  2023-04-13 23:31 [PATCH 0/5] banned: mark `strok()`, `strtok_r()` as banned Taylor Blau
                   ` (4 preceding siblings ...)
  2023-04-13 23:31 ` [PATCH 5/5] banned.h: mark `strtok()`, `strtok_r()` as banned Taylor Blau
@ 2023-04-18 19:18 ` Taylor Blau
  2023-04-18 19:18   ` [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
                     ` (5 more replies)
  2023-04-24 22:20 ` [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` " Taylor Blau
  6 siblings, 6 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-18 19:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Junio C Hamano

Here is a medium-sized reroll of my series to add `strtok()` to the list
of banned functions.

Some notable things that have changed since the first round[1] include:

  - The behavior of `string_list_split_in_place()` is unchanged, and
    `string_list_split_in_place_multi()` now has the more sane behavior
    of removing all delimiter characters from its output, similar to
    strtok().

  - We now guard against dangerous cases of assigning `list->nr` with a
    new `string_list_setlen()` function.

  - More test cases are added in t0063.

  - `strtok_r()` is no longer on the banned list, and `strtok()` now
    *is* on the list, after I forgot to remove a stray `#if 0` left over
    from debugging.

As always, a range-diff is included below for convenience. Thanks in
advance for your review!

[1]: https://lore.kernel.org/git/cover.1681428696.git.me@ttaylorr.com/

Taylor Blau (6):
  string-list: introduce `string_list_split_in_place_multi()`
  string-list: introduce `string_list_setlen()`
  t/helper/test-hashmap.c: avoid using `strtok()`
  t/helper/test-oidmap.c: avoid using `strtok()`
  t/helper/test-json-writer.c: avoid using `strtok()`
  banned.h: mark `strtok()` as banned

 banned.h                    |   2 +
 string-list.c               |  35 ++++++++++--
 string-list.h               |  17 ++++++
 t/helper/test-hashmap.c     |  30 ++++++++---
 t/helper/test-json-writer.c |  76 ++++++++++++++++----------
 t/helper/test-oidmap.c      |  20 ++++---
 t/helper/test-string-list.c |  15 ++++++
 t/t0063-string-list.sh      | 105 +++++++++++++++++++++++++-----------
 8 files changed, 224 insertions(+), 76 deletions(-)

Range-diff against v1:
1:  dda218c8c1 ! 1:  6658b231a9 string-list: introduce `string_list_split_in_place_multi()`
    @@ Commit message
         `_multi` variant splits the given string any any character appearing in
         the string `delim`.
     
    +    Like `strtok()`, the `_multi` variant skips past sequential delimiting
    +    characters. For example:
    +
    +        string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
    +
    +    would place in `xs` the elements "foo", "bar", and "baz".
    +
         Instead of using `strchr(2)` to locate the first occurrence of the given
         delimiter character, `string_list_split_in_place_multi()` uses
    -    `strpbrk(2)` to find the first occurrence of *any* character in the given
    -    delimiter string.
    +    `strcspn(2)` to move past the initial segment of characters comprised of
    +    any characters in the delimiting set.
    +
    +    When only a single delimiting character is provided, `strcspn(2)` has
    +    equivalent performance to `strchr(2)`. Modern `strcspn(2)`
    +    implementations treat an empty delimiter or the singleton delimiter as a
    +    special case and fall back to calling strchrnul(). Both glibc[1] and
    +    musl[2] implement `strcspn(2)` this way.
     
         Since the `_multi` variant is a generalization of the original
         implementation, reimplement `string_list_split_in_place()` in terms of
         the more general function by providing a single-character string for the
         list of accepted delimiters.
     
    +    To avoid regressions, update t0063 in this patch as well. Any "common"
    +    test cases (i.e., those that produce the same result whether you call
    +    `string_list_split()` or `string_list_split_in_place_multi()`) are
    +    grouped into a loop which is parameterized over the function to test.
    +
    +    Any cases which aren't common (of which there is one existing case, and
    +    a handful of new ones added which are specific to the `_multi` variant)
    +    are tested independently.
    +
    +    [1]: https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcspn.c;hb=glibc-2.37#l35
    +    [2]: https://git.musl-libc.org/cgit/musl/tree/src/string/strcspn.c?h=v1.2.3#n11
    +
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## string-list.c ##
    @@ string-list.c: int string_list_split(struct string_list *list, const char *strin
      
     -int string_list_split_in_place(struct string_list *list, char *string,
     -			       int delim, int maxsplit)
    -+int string_list_split_in_place_multi(struct string_list *list, char *string,
    -+				     const char *delim, int maxsplit)
    ++static int string_list_split_in_place_1(struct string_list *list, char *string,
    ++					const char *delim, int maxsplit,
    ++					unsigned runs)
      {
      	int count = 0;
      	char *p = string, *end;
     @@ string-list.c: int string_list_split_in_place(struct string_list *list, char *string,
    + 		die("internal error in string_list_split_in_place(): "
    + 		    "list->strdup_strings must not be set");
    + 	for (;;) {
    ++		if (runs)
    ++			p += strspn(p, delim);
    ++
    + 		count++;
    + 		if (maxsplit >= 0 && count > maxsplit) {
      			string_list_append(list, p);
      			return count;
      		}
     -		end = strchr(p, delim);
    -+		end = strpbrk(p, delim);
    - 		if (end) {
    +-		if (end) {
    ++		end = p + strcspn(p, delim);
    ++		if (end && *end) {
      			*end = '\0';
      			string_list_append(list, p);
    + 			p = end + 1;
     @@ string-list.c: int string_list_split_in_place(struct string_list *list, char *string,
      		}
      	}
      }
     +
    ++int string_list_split_in_place_multi(struct string_list *list, char *string,
    ++				     const char *delim, int maxsplit)
    ++{
    ++	return string_list_split_in_place_1(list, string, delim, maxsplit, 1);
    ++}
    ++
     +int string_list_split_in_place(struct string_list *list, char *string,
     +			       int delim, int maxsplit)
     +{
     +	char delim_s[2] = { delim, 0 };
     +
    -+	return string_list_split_in_place_multi(list, string, delim_s,
    -+						maxsplit);
    ++	return string_list_split_in_place_1(list, string, delim_s, maxsplit, 0);
     +}
     
      ## string-list.h ##
    @@ string-list.h: int string_list_split(struct string_list *list, const char *strin
     + *
     + * The "_multi" variant splits the given string on any character
     + * appearing in "delim", and the non-"_multi" variant splits only on the
    -+ * given character.
    ++ * given character. The "_multi" variant behaves like `strtok()` where
    ++ * no element contains the delimiting byte(s).
       */
     +int string_list_split_in_place_multi(struct string_list *list, char *string,
     +				     const char *delim, int maxsplit);
      int string_list_split_in_place(struct string_list *list, char *string,
      			       int delim, int maxsplit);
      #endif /* STRING_LIST_H */
    +
    + ## t/helper/test-string-list.c ##
    +@@ t/helper/test-string-list.c: int cmd__string_list(int argc, const char **argv)
    + 		return 0;
    + 	}
    + 
    ++	if (argc == 5 && !strcmp(argv[1], "split_in_place_multi")) {
    ++		struct string_list list = STRING_LIST_INIT_NODUP;
    ++		int i;
    ++		char *s = xstrdup(argv[2]);
    ++		const char *delim = argv[3];
    ++		int maxsplit = atoi(argv[4]);
    ++
    ++		i = string_list_split_in_place_multi(&list, s, delim, maxsplit);
    ++		printf("%d\n", i);
    ++		write_list(&list);
    ++		string_list_clear(&list, 0);
    ++		free(s);
    ++		return 0;
    ++	}
    ++
    + 	if (argc == 4 && !strcmp(argv[1], "filter")) {
    + 		/*
    + 		 * Retain only the items that have the specified prefix.
    +
    + ## t/t0063-string-list.sh ##
    +@@ t/t0063-string-list.sh: test_split () {
    + 	"
    + }
    + 
    +-test_split "foo:bar:baz" ":" "-1" <<EOF
    +-3
    +-[0]: "foo"
    +-[1]: "bar"
    +-[2]: "baz"
    +-EOF
    ++test_split_in_place_multi () {
    ++	cat >expected &&
    ++	test_expect_success "split_in_place_multi $1 at $2, max $3" "
    ++		test-tool string-list split_in_place_multi '$1' '$2' '$3' >actual &&
    ++		test_cmp expected actual
    ++	"
    ++}
    + 
    +-test_split "foo:bar:baz" ":" "0" <<EOF
    +-1
    +-[0]: "foo:bar:baz"
    +-EOF
    ++for test_fn in test_split test_split_in_place_multi
    ++do
    ++	$test_fn "foo:bar:baz" ":" "-1" <<-\EOF
    ++	3
    ++	[0]: "foo"
    ++	[1]: "bar"
    ++	[2]: "baz"
    ++	EOF
    + 
    +-test_split "foo:bar:baz" ":" "1" <<EOF
    +-2
    +-[0]: "foo"
    +-[1]: "bar:baz"
    +-EOF
    ++	$test_fn "foo:bar:baz" ":" "0" <<-\EOF
    ++	1
    ++	[0]: "foo:bar:baz"
    ++	EOF
    + 
    +-test_split "foo:bar:baz" ":" "2" <<EOF
    +-3
    +-[0]: "foo"
    +-[1]: "bar"
    +-[2]: "baz"
    +-EOF
    ++	$test_fn "foo:bar:baz" ":" "1" <<-\EOF
    ++	2
    ++	[0]: "foo"
    ++	[1]: "bar:baz"
    ++	EOF
    + 
    +-test_split "foo:bar:" ":" "-1" <<EOF
    +-3
    +-[0]: "foo"
    +-[1]: "bar"
    +-[2]: ""
    +-EOF
    ++	$test_fn "foo:bar:baz" ":" "2" <<-\EOF
    ++	3
    ++	[0]: "foo"
    ++	[1]: "bar"
    ++	[2]: "baz"
    ++	EOF
    + 
    +-test_split "" ":" "-1" <<EOF
    +-1
    +-[0]: ""
    +-EOF
    ++	$test_fn "foo:bar:" ":" "-1" <<-\EOF
    ++	3
    ++	[0]: "foo"
    ++	[1]: "bar"
    ++	[2]: ""
    ++	EOF
    ++
    ++	$test_fn "" ":" "-1" <<-\EOF
    ++	1
    ++	[0]: ""
    ++	EOF
    ++done
    + 
    + test_split ":" ":" "-1" <<EOF
    + 2
    +@@ t/t0063-string-list.sh: test_split ":" ":" "-1" <<EOF
    + [1]: ""
    + EOF
    + 
    ++test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF
    ++3
    ++[0]: "foo"
    ++[1]: "bar"
    ++[2]: "baz"
    ++EOF
    ++
    ++test_split_in_place_multi "foo:;:bar:;:baz" ":;" "0" <<-\EOF
    ++1
    ++[0]: "foo:;:bar:;:baz"
    ++EOF
    ++
    ++test_split_in_place_multi "foo:;:bar:;:baz" ":;" "1" <<-\EOF
    ++2
    ++[0]: "foo"
    ++[1]: "bar:;:baz"
    ++EOF
    ++
    ++test_split_in_place_multi "foo:;:bar:;:baz" ":;" "2" <<-\EOF
    ++3
    ++[0]: "foo"
    ++[1]: "bar"
    ++[2]: "baz"
    ++EOF
    ++
    ++test_split_in_place_multi "foo:;:bar:;:" ":;" "-1" <<-\EOF
    ++3
    ++[0]: "foo"
    ++[1]: "bar"
    ++[2]: ""
    ++EOF
    ++
    + test_expect_success "test filter_string_list" '
    + 	test "x-" = "x$(test-tool string-list filter - y)" &&
    + 	test "x-" = "x$(test-tool string-list filter no y)" &&
-:  ---------- > 2:  2a20ad8bc5 string-list: introduce `string_list_setlen()`
2:  0f199468a3 ! 3:  0ae07dec36 t/helper/test-hashmap.c: avoid using `strtok()`
    @@ t/helper/test-hashmap.c: int cmd__hashmap(int argc, const char **argv)
     +		 * By doing so, we'll instead overwrite the existing
     +		 * entries and avoid re-allocating.
     +		 */
    -+		parts.nr = 0;
    ++		string_list_setlen(&parts, 0);
      		/* break line into command and up to two parameters */
     -		cmd = strtok(line.buf, DELIM);
     +		string_list_split_in_place_multi(&parts, line.buf, DELIM, 2);
3:  135222d04e ! 4:  a659431e9c t/helper/test-oidmap.c: avoid using `strtok()`
    @@ t/helper/test-oidmap.c: int cmd__oidmap(int argc UNUSED, const char **argv UNUSE
      		struct object_id oid;
      
     +		/* see the comment in cmd__hashmap() */
    -+		parts.nr = 0;
    ++		string_list_setlen(&parts, 0);
      		/* break line into command and up to two parameters */
     -		cmd = strtok(line.buf, DELIM);
     +		string_list_split_in_place_multi(&parts, line.buf, DELIM, 2);
4:  ae29d4d892 ! 5:  fc6cd23698 t/helper/test-json-writer.c: avoid using `strtok()`
    @@ t/helper/test-json-writer.c: static int scripted(void)
      
     -		verb = strtok(line, " ");
     +		/* see the comment in cmd__hashmap() */
    -+		parts.nr = 0;
    ++		string_list_setlen(&parts, 0);
     +		/* break line into command and zero or more tokens */
     +		string_list_split_in_place(&parts, line, ' ', -1);
     +
5:  1d955f8bc6 ! 6:  56d2318a6d banned.h: mark `strtok()`, `strtok_r()` as banned
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    banned.h: mark `strtok()`, `strtok_r()` as banned
    +    banned.h: mark `strtok()` as banned
     
         `strtok_r()` is reentrant, but `strtok()` is not, meaning that using it
         is not thread-safe.
     
    -    We could ban `strtok()` and force callers to use its reentrant
    -    counterpart, but there are a few drawbacks to doing so:
    +    `strtok()` has a couple of drawbacks that make it undesirable to have
    +    any new instances. In addition to being thread-unsafe, it also
    +    encourages confusing data flows, where `strtok()` may be called from
    +    multiple functions with its first argument as NULL, making it unclear
    +    from the immediate context which string is being tokenized.
     
    -      - `strtok_r()` forces the caller to maintain an extra string pointer
    -        to pass as its `saveptr` value
    +    Now that we have removed all instances of `strtok()` from the tree,
    +    let's ban `strtok()` to avoid introducing new ones in the future. If new
    +    callers should arise, they can either use:
     
    -      - `strtok_r()` also requires that its `saveptr` value be unmodified
    -        between calls.
    +      - `string_list_split_in_place()`,
    +      - `string_list_split_in_place_multi()`, or
    +      - `strtok_r()`.
     
    -      - `strtok()` (and by extension, `strtok_r()`) is confusing when used
    -        across multiple functions, since the caller is supposed to pass NULL
    -        as its first argument after the first call. This makes it difficult
    -        to determine what string is actually being tokenized without clear
    -        dataflow.
    +    Callers are encouraged to use either of the string_list functions when
    +    appropriate over `strtok_r()`, since the latter suffers from the same
    +    confusing data-flow problem as `strtok()` does.
     
    -    So while we could ban only `strtok()`, there really is no good reason to
    -    use either when callers could instead use the much friendlier
    -    `string_list_split_in_place()` API, which avoids the above issues.
    +    But callers may prefer `strtok_r()` when the number of tokens in a given
    +    string is unknown, and they want to split and process them one at a
    +    time, so `strtok_r()` is left off the banned.h list.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ banned.h
      #define strncpy(x,y,n) BANNED(strncpy)
      #undef strncat
      #define strncat(x,y,n) BANNED(strncat)
    -+#if 0
     +#undef strtok
     +#define strtok(x,y) BANNED(strtok)
    -+#undef strtok_r
    -+#define strtok_r(x,y,z) BANNED(strtok_r)
    -+#endif
      
      #undef sprintf
      #undef vsprintf
-- 
2.40.0.358.g56d2318a6d

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`
  2023-04-18 19:18 ` [PATCH v2 0/6] banned: mark `strok()` " Taylor Blau
@ 2023-04-18 19:18   ` Taylor Blau
  2023-04-18 19:39     ` Junio C Hamano
  2023-04-22 11:12     ` Jeff King
  2023-04-18 19:18   ` [PATCH v2 2/6] string-list: introduce `string_list_setlen()` Taylor Blau
                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-18 19:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Junio C Hamano

Introduce a variant of the `string_list_split_in_place()` function that
takes a string of accepted delimiters.

By contrast to its cousin `string_list_split_in_place()` which splits
the given string at every instance of the single character `delim`, the
`_multi` variant splits the given string any any character appearing in
the string `delim`.

Like `strtok()`, the `_multi` variant skips past sequential delimiting
characters. For example:

    string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);

would place in `xs` the elements "foo", "bar", and "baz".

Instead of using `strchr(2)` to locate the first occurrence of the given
delimiter character, `string_list_split_in_place_multi()` uses
`strcspn(2)` to move past the initial segment of characters comprised of
any characters in the delimiting set.

When only a single delimiting character is provided, `strcspn(2)` has
equivalent performance to `strchr(2)`. Modern `strcspn(2)`
implementations treat an empty delimiter or the singleton delimiter as a
special case and fall back to calling strchrnul(). Both glibc[1] and
musl[2] implement `strcspn(2)` this way.

Since the `_multi` variant is a generalization of the original
implementation, reimplement `string_list_split_in_place()` in terms of
the more general function by providing a single-character string for the
list of accepted delimiters.

To avoid regressions, update t0063 in this patch as well. Any "common"
test cases (i.e., those that produce the same result whether you call
`string_list_split()` or `string_list_split_in_place_multi()`) are
grouped into a loop which is parameterized over the function to test.

Any cases which aren't common (of which there is one existing case, and
a handful of new ones added which are specific to the `_multi` variant)
are tested independently.

[1]: https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcspn.c;hb=glibc-2.37#l35
[2]: https://git.musl-libc.org/cgit/musl/tree/src/string/strcspn.c?h=v1.2.3#n11

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 string-list.c               |  26 +++++++--
 string-list.h               |   7 +++
 t/helper/test-string-list.c |  15 ++++++
 t/t0063-string-list.sh      | 105 +++++++++++++++++++++++++-----------
 4 files changed, 118 insertions(+), 35 deletions(-)

diff --git a/string-list.c b/string-list.c
index db473f273e..b27a53f2e1 100644
--- a/string-list.c
+++ b/string-list.c
@@ -300,8 +300,9 @@ int string_list_split(struct string_list *list, const char *string,
 	}
 }
 
-int string_list_split_in_place(struct string_list *list, char *string,
-			       int delim, int maxsplit)
+static int string_list_split_in_place_1(struct string_list *list, char *string,
+					const char *delim, int maxsplit,
+					unsigned runs)
 {
 	int count = 0;
 	char *p = string, *end;
@@ -310,13 +311,16 @@ int string_list_split_in_place(struct string_list *list, char *string,
 		die("internal error in string_list_split_in_place(): "
 		    "list->strdup_strings must not be set");
 	for (;;) {
+		if (runs)
+			p += strspn(p, delim);
+
 		count++;
 		if (maxsplit >= 0 && count > maxsplit) {
 			string_list_append(list, p);
 			return count;
 		}
-		end = strchr(p, delim);
-		if (end) {
+		end = p + strcspn(p, delim);
+		if (end && *end) {
 			*end = '\0';
 			string_list_append(list, p);
 			p = end + 1;
@@ -326,3 +330,17 @@ int string_list_split_in_place(struct string_list *list, char *string,
 		}
 	}
 }
+
+int string_list_split_in_place_multi(struct string_list *list, char *string,
+				     const char *delim, int maxsplit)
+{
+	return string_list_split_in_place_1(list, string, delim, maxsplit, 1);
+}
+
+int string_list_split_in_place(struct string_list *list, char *string,
+			       int delim, int maxsplit)
+{
+	char delim_s[2] = { delim, 0 };
+
+	return string_list_split_in_place_1(list, string, delim_s, maxsplit, 0);
+}
diff --git a/string-list.h b/string-list.h
index c7b0d5d000..f01bbb0bb6 100644
--- a/string-list.h
+++ b/string-list.h
@@ -268,7 +268,14 @@ int string_list_split(struct string_list *list, const char *string,
  * new string_list_items point into string (which therefore must not
  * be modified or freed while the string_list is in use).
  * list->strdup_strings must *not* be set.
+ *
+ * The "_multi" variant splits the given string on any character
+ * appearing in "delim", and the non-"_multi" variant splits only on the
+ * given character. The "_multi" variant behaves like `strtok()` where
+ * no element contains the delimiting byte(s).
  */
+int string_list_split_in_place_multi(struct string_list *list, char *string,
+				     const char *delim, int maxsplit);
 int string_list_split_in_place(struct string_list *list, char *string,
 			       int delim, int maxsplit);
 #endif /* STRING_LIST_H */
diff --git a/t/helper/test-string-list.c b/t/helper/test-string-list.c
index 2123dda85b..119bc9e1c9 100644
--- a/t/helper/test-string-list.c
+++ b/t/helper/test-string-list.c
@@ -73,6 +73,21 @@ int cmd__string_list(int argc, const char **argv)
 		return 0;
 	}
 
+	if (argc == 5 && !strcmp(argv[1], "split_in_place_multi")) {
+		struct string_list list = STRING_LIST_INIT_NODUP;
+		int i;
+		char *s = xstrdup(argv[2]);
+		const char *delim = argv[3];
+		int maxsplit = atoi(argv[4]);
+
+		i = string_list_split_in_place_multi(&list, s, delim, maxsplit);
+		printf("%d\n", i);
+		write_list(&list);
+		string_list_clear(&list, 0);
+		free(s);
+		return 0;
+	}
+
 	if (argc == 4 && !strcmp(argv[1], "filter")) {
 		/*
 		 * Retain only the items that have the specified prefix.
diff --git a/t/t0063-string-list.sh b/t/t0063-string-list.sh
index 46d4839194..9c5094616a 100755
--- a/t/t0063-string-list.sh
+++ b/t/t0063-string-list.sh
@@ -18,42 +18,53 @@ test_split () {
 	"
 }
 
-test_split "foo:bar:baz" ":" "-1" <<EOF
-3
-[0]: "foo"
-[1]: "bar"
-[2]: "baz"
-EOF
+test_split_in_place_multi () {
+	cat >expected &&
+	test_expect_success "split_in_place_multi $1 at $2, max $3" "
+		test-tool string-list split_in_place_multi '$1' '$2' '$3' >actual &&
+		test_cmp expected actual
+	"
+}
 
-test_split "foo:bar:baz" ":" "0" <<EOF
-1
-[0]: "foo:bar:baz"
-EOF
+for test_fn in test_split test_split_in_place_multi
+do
+	$test_fn "foo:bar:baz" ":" "-1" <<-\EOF
+	3
+	[0]: "foo"
+	[1]: "bar"
+	[2]: "baz"
+	EOF
 
-test_split "foo:bar:baz" ":" "1" <<EOF
-2
-[0]: "foo"
-[1]: "bar:baz"
-EOF
+	$test_fn "foo:bar:baz" ":" "0" <<-\EOF
+	1
+	[0]: "foo:bar:baz"
+	EOF
 
-test_split "foo:bar:baz" ":" "2" <<EOF
-3
-[0]: "foo"
-[1]: "bar"
-[2]: "baz"
-EOF
+	$test_fn "foo:bar:baz" ":" "1" <<-\EOF
+	2
+	[0]: "foo"
+	[1]: "bar:baz"
+	EOF
 
-test_split "foo:bar:" ":" "-1" <<EOF
-3
-[0]: "foo"
-[1]: "bar"
-[2]: ""
-EOF
+	$test_fn "foo:bar:baz" ":" "2" <<-\EOF
+	3
+	[0]: "foo"
+	[1]: "bar"
+	[2]: "baz"
+	EOF
 
-test_split "" ":" "-1" <<EOF
-1
-[0]: ""
-EOF
+	$test_fn "foo:bar:" ":" "-1" <<-\EOF
+	3
+	[0]: "foo"
+	[1]: "bar"
+	[2]: ""
+	EOF
+
+	$test_fn "" ":" "-1" <<-\EOF
+	1
+	[0]: ""
+	EOF
+done
 
 test_split ":" ":" "-1" <<EOF
 2
@@ -61,6 +72,38 @@ test_split ":" ":" "-1" <<EOF
 [1]: ""
 EOF
 
+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF
+3
+[0]: "foo"
+[1]: "bar"
+[2]: "baz"
+EOF
+
+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "0" <<-\EOF
+1
+[0]: "foo:;:bar:;:baz"
+EOF
+
+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "1" <<-\EOF
+2
+[0]: "foo"
+[1]: "bar:;:baz"
+EOF
+
+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "2" <<-\EOF
+3
+[0]: "foo"
+[1]: "bar"
+[2]: "baz"
+EOF
+
+test_split_in_place_multi "foo:;:bar:;:" ":;" "-1" <<-\EOF
+3
+[0]: "foo"
+[1]: "bar"
+[2]: ""
+EOF
+
 test_expect_success "test filter_string_list" '
 	test "x-" = "x$(test-tool string-list filter - y)" &&
 	test "x-" = "x$(test-tool string-list filter no y)" &&
-- 
2.40.0.358.g56d2318a6d


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 2/6] string-list: introduce `string_list_setlen()`
  2023-04-18 19:18 ` [PATCH v2 0/6] banned: mark `strok()` " Taylor Blau
  2023-04-18 19:18   ` [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
@ 2023-04-18 19:18   ` Taylor Blau
  2023-04-22 11:14     ` Jeff King
  2023-04-18 19:18   ` [PATCH v2 3/6] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 47+ messages in thread
From: Taylor Blau @ 2023-04-18 19:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Junio C Hamano

It is sometimes useful to reduce the size of a `string_list`'s list of
items without having to re-allocate them. For example, doing the
following:

    struct strbuf buf = STRBUF_INIT;
    struct string_list parts = STRING_LIST_INIT_NO_DUP;
    while (strbuf_getline(&buf, stdin) != EOF) {
      parts.nr = 0;
      string_list_split_in_place(&parts, buf.buf, ":", -1);
      /* ... */
    }
    string_list_clear(&parts, 0);

is preferable over calling `string_list_clear()` on every iteration of
the loop. This is because `string_list_clear()` causes us free our
existing `items` array. This means that every time we call
`string_list_split_in_place()`, the string-list internals re-allocate
the same size array.

Since in the above example we do not care about the individual parts
after processing each line, it is much more efficient to pretend that
there aren't any elements in the `string_list` by setting `list->nr` to
0 while leaving the list of elements allocated as-is.

This allows `string_list_split_in_place()` to overwrite any existing
entries without needing to free and re-allocate them.

However, setting `list->nr` manually is not safe in all instances. There
are a couple of cases worth worrying about:

  - If the `string_list` is initialized with `strdup_strings`,
    truncating the list can lead to overwriting strings which are
    allocated elsewhere. If there aren't any other pointers to those
    strings other than the ones inside of the `items` array, they will
    become unreachable and leak.

    (We could ourselves free the truncated items between
    string_list->items[nr] and `list->nr`, but no present or future
    callers would benefit from this additional complexity).

  - If the given `nr` is larger than the current value of `list->nr`,
    we'll trick the `string_list` into a state where it thinks there are
    more items allocated than there actually are, which can lead to
    undefined behavior if we try to read or write those entries.

Guard against both of these by introducing a helper function which
guards assignment of `list->nr` against each of the above conditions.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 string-list.c |  9 +++++++++
 string-list.h | 10 ++++++++++
 2 files changed, 19 insertions(+)

diff --git a/string-list.c b/string-list.c
index b27a53f2e1..f0b3cdae94 100644
--- a/string-list.c
+++ b/string-list.c
@@ -203,6 +203,15 @@ void string_list_clear_func(struct string_list *list, string_list_clear_func_t c
 	list->nr = list->alloc = 0;
 }
 
+void string_list_setlen(struct string_list *list, size_t nr)
+{
+	if (list->strdup_strings)
+		BUG("cannot setlen a string_list which owns its entries");
+	if (nr > list->nr)
+		BUG("cannot grow a string_list with setlen");
+	list->nr = nr;
+}
+
 struct string_list_item *string_list_append_nodup(struct string_list *list,
 						  char *string)
 {
diff --git a/string-list.h b/string-list.h
index f01bbb0bb6..b41ecda6f4 100644
--- a/string-list.h
+++ b/string-list.h
@@ -134,6 +134,16 @@ typedef void (*string_list_clear_func_t)(void *p, const char *str);
 /** Call a custom clear function on each util pointer */
 void string_list_clear_func(struct string_list *list, string_list_clear_func_t clearfunc);
 
+/*
+ * Set the length of a string_list to `nr`, provided that (a) `list`
+ * does not own its own storage, and (b) that `nr` is no larger than
+ * `list->nr`.
+ *
+ * Useful when "shrinking" `list` to write over existing entries that
+ * are no longer used without reallocating.
+ */
+void string_list_setlen(struct string_list *list, size_t nr);
+
 /**
  * Apply `func` to each item. If `func` returns nonzero, the
  * iteration aborts and the return value is propagated.
-- 
2.40.0.358.g56d2318a6d


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 3/6] t/helper/test-hashmap.c: avoid using `strtok()`
  2023-04-18 19:18 ` [PATCH v2 0/6] banned: mark `strok()` " Taylor Blau
  2023-04-18 19:18   ` [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
  2023-04-18 19:18   ` [PATCH v2 2/6] string-list: introduce `string_list_setlen()` Taylor Blau
@ 2023-04-18 19:18   ` Taylor Blau
  2023-04-22 11:16     ` Jeff King
  2023-04-18 19:18   ` [PATCH v2 4/6] t/helper/test-oidmap.c: " Taylor Blau
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 47+ messages in thread
From: Taylor Blau @ 2023-04-18 19:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Junio C Hamano

Avoid using the non-reentrant `strtok()` to separate the parts of each
incoming command. Instead of replacing it with `strtok_r()`, let's
instead use the more friendly `string_list_split_in_place_multi()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hashmap.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/t/helper/test-hashmap.c b/t/helper/test-hashmap.c
index 36ff07bd4b..5a3e74a3e5 100644
--- a/t/helper/test-hashmap.c
+++ b/t/helper/test-hashmap.c
@@ -2,6 +2,7 @@
 #include "git-compat-util.h"
 #include "hashmap.h"
 #include "strbuf.h"
+#include "string-list.h"
 
 struct test_entry
 {
@@ -150,6 +151,7 @@ static void perf_hashmap(unsigned int method, unsigned int rounds)
  */
 int cmd__hashmap(int argc, const char **argv)
 {
+	struct string_list parts = STRING_LIST_INIT_NODUP;
 	struct strbuf line = STRBUF_INIT;
 	int icase;
 	struct hashmap map = HASHMAP_INIT(test_entry_cmp, &icase);
@@ -159,21 +161,34 @@ int cmd__hashmap(int argc, const char **argv)
 
 	/* process commands from stdin */
 	while (strbuf_getline(&line, stdin) != EOF) {
-		char *cmd, *p1 = NULL, *p2 = NULL;
+		char *cmd, *p1, *p2;
 		unsigned int hash = 0;
 		struct test_entry *entry;
 
+		/*
+		 * Because we memdup() the arguments out of the
+		 * string_list before inserting them into the hashmap,
+		 * it's OK to set its length back to zero to avoid
+		 * re-allocating the items array once per line.
+		 *
+		 * By doing so, we'll instead overwrite the existing
+		 * entries and avoid re-allocating.
+		 */
+		string_list_setlen(&parts, 0);
 		/* break line into command and up to two parameters */
-		cmd = strtok(line.buf, DELIM);
+		string_list_split_in_place_multi(&parts, line.buf, DELIM, 2);
+
 		/* ignore empty lines */
-		if (!cmd || *cmd == '#')
+		if (!parts.nr)
+			continue;
+		if (!*parts.items[0].string || *parts.items[0].string == '#')
 			continue;
 
-		p1 = strtok(NULL, DELIM);
-		if (p1) {
+		cmd = parts.items[0].string;
+		p1 = parts.nr >= 1 ? parts.items[1].string : NULL;
+		p2 = parts.nr >= 2 ? parts.items[2].string : NULL;
+		if (p1)
 			hash = icase ? strihash(p1) : strhash(p1);
-			p2 = strtok(NULL, DELIM);
-		}
 
 		if (!strcmp("add", cmd) && p1 && p2) {
 
@@ -260,6 +275,7 @@ int cmd__hashmap(int argc, const char **argv)
 		}
 	}
 
+	string_list_clear(&parts, 0);
 	strbuf_release(&line);
 	hashmap_clear_and_free(&map, struct test_entry, ent);
 	return 0;
-- 
2.40.0.358.g56d2318a6d


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 4/6] t/helper/test-oidmap.c: avoid using `strtok()`
  2023-04-18 19:18 ` [PATCH v2 0/6] banned: mark `strok()` " Taylor Blau
                     ` (2 preceding siblings ...)
  2023-04-18 19:18   ` [PATCH v2 3/6] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
@ 2023-04-18 19:18   ` Taylor Blau
  2023-04-18 19:18   ` [PATCH v2 5/6] t/helper/test-json-writer.c: " Taylor Blau
  2023-04-18 19:18   ` [PATCH v2 6/6] banned.h: mark `strtok()` as banned Taylor Blau
  5 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-18 19:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Junio C Hamano

Apply similar treatment as in the previous commit to remove usage of
`strtok()` from the "oidmap" test helper.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-oidmap.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/t/helper/test-oidmap.c b/t/helper/test-oidmap.c
index a7b7b38df1..bca2ca0e06 100644
--- a/t/helper/test-oidmap.c
+++ b/t/helper/test-oidmap.c
@@ -4,6 +4,7 @@
 #include "oidmap.h"
 #include "setup.h"
 #include "strbuf.h"
+#include "string-list.h"
 
 /* key is an oid and value is a name (could be a refname for example) */
 struct test_entry {
@@ -25,6 +26,7 @@ struct test_entry {
  */
 int cmd__oidmap(int argc UNUSED, const char **argv UNUSED)
 {
+	struct string_list parts = STRING_LIST_INIT_NODUP;
 	struct strbuf line = STRBUF_INIT;
 	struct oidmap map = OIDMAP_INIT;
 
@@ -35,19 +37,24 @@ int cmd__oidmap(int argc UNUSED, const char **argv UNUSED)
 
 	/* process commands from stdin */
 	while (strbuf_getline(&line, stdin) != EOF) {
-		char *cmd, *p1 = NULL, *p2 = NULL;
+		char *cmd, *p1, *p2;
 		struct test_entry *entry;
 		struct object_id oid;
 
+		/* see the comment in cmd__hashmap() */
+		string_list_setlen(&parts, 0);
 		/* break line into command and up to two parameters */
-		cmd = strtok(line.buf, DELIM);
+		string_list_split_in_place_multi(&parts, line.buf, DELIM, 2);
+
 		/* ignore empty lines */
-		if (!cmd || *cmd == '#')
+		if (!parts.nr)
+			continue;
+		if (!*parts.items[0].string || *parts.items[0].string == '#')
 			continue;
 
-		p1 = strtok(NULL, DELIM);
-		if (p1)
-			p2 = strtok(NULL, DELIM);
+		cmd = parts.items[0].string;
+		p1 = parts.nr >= 1 ? parts.items[1].string : NULL;
+		p2 = parts.nr >= 2 ? parts.items[2].string : NULL;
 
 		if (!strcmp("put", cmd) && p1 && p2) {
 
@@ -108,6 +115,7 @@ int cmd__oidmap(int argc UNUSED, const char **argv UNUSED)
 		}
 	}
 
+	string_list_clear(&parts, 0);
 	strbuf_release(&line);
 	oidmap_free(&map, 1);
 	return 0;
-- 
2.40.0.358.g56d2318a6d


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 5/6] t/helper/test-json-writer.c: avoid using `strtok()`
  2023-04-18 19:18 ` [PATCH v2 0/6] banned: mark `strok()` " Taylor Blau
                     ` (3 preceding siblings ...)
  2023-04-18 19:18   ` [PATCH v2 4/6] t/helper/test-oidmap.c: " Taylor Blau
@ 2023-04-18 19:18   ` Taylor Blau
  2023-04-18 19:18   ` [PATCH v2 6/6] banned.h: mark `strtok()` as banned Taylor Blau
  5 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-18 19:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Junio C Hamano

Apply similar treatment as in the previous commit to remove usage of
`strtok()` from the "oidmap" test helper.

Each of the different commands that the "json-writer" helper accepts
pops the next space-delimited token from the current line and interprets
it as a string, integer, or double (with the exception of the very first
token, which is the command itself).

To accommodate this, split the line in place by the space character, and
pass the corresponding string_list to each of the specialized `get_s()`,
`get_i()`, and `get_d()` functions.

`get_i()` and `get_d()` are thin wrappers around `get_s()` that convert
their result into the appropriate type by either calling `strtol()` or
`strtod()`, respectively. In `get_s()`, we mark the token as "consumed"
by incrementing the `consumed_nr` counter, indicating how many tokens we
have read up to that point.

Because each of these functions needs the string-list parts, the number
of tokens consumed, and the line number, these three are wrapped up in
to a struct representing the line state.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-json-writer.c | 76 +++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 28 deletions(-)

diff --git a/t/helper/test-json-writer.c b/t/helper/test-json-writer.c
index 86887f5320..af0a34aa04 100644
--- a/t/helper/test-json-writer.c
+++ b/t/helper/test-json-writer.c
@@ -1,5 +1,6 @@
 #include "test-tool.h"
 #include "json-writer.h"
+#include "string-list.h"
 
 static const char *expect_obj1 = "{\"a\":\"abc\",\"b\":42,\"c\":true}";
 static const char *expect_obj2 = "{\"a\":-1,\"b\":2147483647,\"c\":0}";
@@ -394,35 +395,41 @@ static int unit_tests(void)
 	return 0;
 }
 
-static void get_s(int line_nr, char **s_in)
+struct line {
+	struct string_list *parts;
+	size_t consumed_nr;
+	int nr;
+};
+
+static void get_s(struct line *line, char **s_in)
 {
-	*s_in = strtok(NULL, " ");
-	if (!*s_in)
-		die("line[%d]: expected: <s>", line_nr);
+	if (line->consumed_nr > line->parts->nr)
+		die("line[%d]: expected: <s>", line->nr);
+	*s_in = line->parts->items[line->consumed_nr++].string;
 }
 
-static void get_i(int line_nr, intmax_t *s_in)
+static void get_i(struct line *line, intmax_t *s_in)
 {
 	char *s;
 	char *endptr;
 
-	get_s(line_nr, &s);
+	get_s(line, &s);
 
 	*s_in = strtol(s, &endptr, 10);
 	if (*endptr || errno == ERANGE)
-		die("line[%d]: invalid integer value", line_nr);
+		die("line[%d]: invalid integer value", line->nr);
 }
 
-static void get_d(int line_nr, double *s_in)
+static void get_d(struct line *line, double *s_in)
 {
 	char *s;
 	char *endptr;
 
-	get_s(line_nr, &s);
+	get_s(line, &s);
 
 	*s_in = strtod(s, &endptr);
 	if (*endptr || errno == ERANGE)
-		die("line[%d]: invalid float value", line_nr);
+		die("line[%d]: invalid float value", line->nr);
 }
 
 static int pretty;
@@ -453,6 +460,7 @@ static char *get_trimmed_line(char *buf, int buf_size)
 
 static int scripted(void)
 {
+	struct string_list parts = STRING_LIST_INIT_NODUP;
 	struct json_writer jw = JSON_WRITER_INIT;
 	char buf[MAX_LINE_LENGTH];
 	char *line;
@@ -470,66 +478,77 @@ static int scripted(void)
 		die("expected first line to be 'object' or 'array'");
 
 	while ((line = get_trimmed_line(buf, MAX_LINE_LENGTH)) != NULL) {
+		struct line state = { 0 };
 		char *verb;
 		char *key;
 		char *s_value;
 		intmax_t i_value;
 		double d_value;
 
-		line_nr++;
+		state.parts = &parts;
+		state.nr = ++line_nr;
 
-		verb = strtok(line, " ");
+		/* see the comment in cmd__hashmap() */
+		string_list_setlen(&parts, 0);
+		/* break line into command and zero or more tokens */
+		string_list_split_in_place(&parts, line, ' ', -1);
+
+		/* ignore empty lines */
+		if (!parts.nr || !*parts.items[0].string)
+			continue;
+
+		verb = parts.items[state.consumed_nr++].string;
 
 		if (!strcmp(verb, "end")) {
 			jw_end(&jw);
 		}
 		else if (!strcmp(verb, "object-string")) {
-			get_s(line_nr, &key);
-			get_s(line_nr, &s_value);
+			get_s(&state, &key);
+			get_s(&state, &s_value);
 			jw_object_string(&jw, key, s_value);
 		}
 		else if (!strcmp(verb, "object-int")) {
-			get_s(line_nr, &key);
-			get_i(line_nr, &i_value);
+			get_s(&state, &key);
+			get_i(&state, &i_value);
 			jw_object_intmax(&jw, key, i_value);
 		}
 		else if (!strcmp(verb, "object-double")) {
-			get_s(line_nr, &key);
-			get_i(line_nr, &i_value);
-			get_d(line_nr, &d_value);
+			get_s(&state, &key);
+			get_i(&state, &i_value);
+			get_d(&state, &d_value);
 			jw_object_double(&jw, key, i_value, d_value);
 		}
 		else if (!strcmp(verb, "object-true")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_true(&jw, key);
 		}
 		else if (!strcmp(verb, "object-false")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_false(&jw, key);
 		}
 		else if (!strcmp(verb, "object-null")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_null(&jw, key);
 		}
 		else if (!strcmp(verb, "object-object")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_inline_begin_object(&jw, key);
 		}
 		else if (!strcmp(verb, "object-array")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_inline_begin_array(&jw, key);
 		}
 		else if (!strcmp(verb, "array-string")) {
-			get_s(line_nr, &s_value);
+			get_s(&state, &s_value);
 			jw_array_string(&jw, s_value);
 		}
 		else if (!strcmp(verb, "array-int")) {
-			get_i(line_nr, &i_value);
+			get_i(&state, &i_value);
 			jw_array_intmax(&jw, i_value);
 		}
 		else if (!strcmp(verb, "array-double")) {
-			get_i(line_nr, &i_value);
-			get_d(line_nr, &d_value);
+			get_i(&state, &i_value);
+			get_d(&state, &d_value);
 			jw_array_double(&jw, i_value, d_value);
 		}
 		else if (!strcmp(verb, "array-true"))
@@ -552,6 +571,7 @@ static int scripted(void)
 	printf("%s\n", jw.json.buf);
 
 	jw_release(&jw);
+	string_list_clear(&parts, 0);
 	return 0;
 }
 
-- 
2.40.0.358.g56d2318a6d


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 6/6] banned.h: mark `strtok()` as banned
  2023-04-18 19:18 ` [PATCH v2 0/6] banned: mark `strok()` " Taylor Blau
                     ` (4 preceding siblings ...)
  2023-04-18 19:18   ` [PATCH v2 5/6] t/helper/test-json-writer.c: " Taylor Blau
@ 2023-04-18 19:18   ` Taylor Blau
  5 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-18 19:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Chris Torek, Junio C Hamano

`strtok_r()` is reentrant, but `strtok()` is not, meaning that using it
is not thread-safe.

`strtok()` has a couple of drawbacks that make it undesirable to have
any new instances. In addition to being thread-unsafe, it also
encourages confusing data flows, where `strtok()` may be called from
multiple functions with its first argument as NULL, making it unclear
from the immediate context which string is being tokenized.

Now that we have removed all instances of `strtok()` from the tree,
let's ban `strtok()` to avoid introducing new ones in the future. If new
callers should arise, they can either use:

  - `string_list_split_in_place()`,
  - `string_list_split_in_place_multi()`, or
  - `strtok_r()`.

Callers are encouraged to use either of the string_list functions when
appropriate over `strtok_r()`, since the latter suffers from the same
confusing data-flow problem as `strtok()` does.

But callers may prefer `strtok_r()` when the number of tokens in a given
string is unknown, and they want to split and process them one at a
time, so `strtok_r()` is left off the banned.h list.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 banned.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/banned.h b/banned.h
index 6ccf46bc19..dd43ab3178 100644
--- a/banned.h
+++ b/banned.h
@@ -18,6 +18,8 @@
 #define strncpy(x,y,n) BANNED(strncpy)
 #undef strncat
 #define strncat(x,y,n) BANNED(strncat)
+#undef strtok
+#define strtok(x,y) BANNED(strtok)
 
 #undef sprintf
 #undef vsprintf
-- 
2.40.0.358.g56d2318a6d

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`
  2023-04-18 19:18   ` [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
@ 2023-04-18 19:39     ` Junio C Hamano
  2023-04-18 20:54       ` Taylor Blau
  2023-04-22 11:12     ` Jeff King
  1 sibling, 1 reply; 47+ messages in thread
From: Junio C Hamano @ 2023-04-18 19:39 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Chris Torek

Taylor Blau <me@ttaylorr.com> writes:

> Introduce a variant of the `string_list_split_in_place()` function that
> takes a string of accepted delimiters.
>
> By contrast to its cousin `string_list_split_in_place()` which splits
> the given string at every instance of the single character `delim`, the
> `_multi` variant splits the given string any any character appearing in
> the string `delim`.
>
> Like `strtok()`, the `_multi` variant skips past sequential delimiting
> characters. For example:
>
>     string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
>
> would place in `xs` the elements "foo", "bar", and "baz".

strtok() also skips leading and trailing delimiters, i.e. the above
will give you identical result for ":foo:bar:baz:".

It would be useful to test that here in addition to the existing ones.

> +for test_fn in test_split test_split_in_place_multi
> +do
> +	$test_fn "foo:bar:baz" ":" "-1" <<-\EOF
> +	3
> +	[0]: "foo"
> +	[1]: "bar"
> +	[2]: "baz"
> +	EOF
>  
> +	$test_fn "foo:bar:baz" ":" "0" <<-\EOF
> +	1
> +	[0]: "foo:bar:baz"
> +	EOF
>  
> +	$test_fn "foo:bar:baz" ":" "1" <<-\EOF
> +	2
> +	[0]: "foo"
> +	[1]: "bar:baz"
> +	EOF
>  
> +	$test_fn "foo:bar:baz" ":" "2" <<-\EOF
> +	3
> +	[0]: "foo"
> +	[1]: "bar"
> +	[2]: "baz"
> +	EOF

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`
  2023-04-18 19:39     ` Junio C Hamano
@ 2023-04-18 20:54       ` Taylor Blau
  0 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-18 20:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Chris Torek

On Tue, Apr 18, 2023 at 12:39:48PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Introduce a variant of the `string_list_split_in_place()` function that
> > takes a string of accepted delimiters.
> >
> > By contrast to its cousin `string_list_split_in_place()` which splits
> > the given string at every instance of the single character `delim`, the
> > `_multi` variant splits the given string any any character appearing in
> > the string `delim`.
> >
> > Like `strtok()`, the `_multi` variant skips past sequential delimiting
> > characters. For example:
> >
> >     string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
> >
> > would place in `xs` the elements "foo", "bar", and "baz".
>
> strtok() also skips leading and trailing delimiters, i.e. the above
> will give you identical result for ":foo:bar:baz:".

I'm not sure the results are identical. Adding this test case for
testing the behavior of string_list_split_in_place() passes before and
after this series:

--- 8< ---
diff --git a/t/t0063-string-list.sh b/t/t0063-string-list.sh
index 9c5094616a..dfe970a566 100755
--- a/t/t0063-string-list.sh
+++ b/t/t0063-string-list.sh
@@ -72,6 +72,15 @@ test_split ":" ":" "-1" <<EOF
 [1]: ""
 EOF

+test_split ":foo:bar:baz:" ":" "-1" <<-\EOF
+5
+[0]: ""
+[1]: "foo"
+[2]: "bar"
+[3]: "baz"
+[4]: ""
+EOF
+
 test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF
 3
 [0]: "foo"
--- >8 ---

> It would be useful to test that here in addition to the existing ones.

Sure. FWIW, the behavior for string_list_split_in_place_multi() is
slightly different, since it will eat up all of the leading delimiter
tokens and treat the first token as "foo".

Here's a diff that could be squashed into this patch which captures both
cases:

--- 8< ---
diff --git a/t/t0063-string-list.sh b/t/t0063-string-list.sh
index 9c5094616a..efc84dc124 100755
--- a/t/t0063-string-list.sh
+++ b/t/t0063-string-list.sh
@@ -72,6 +72,15 @@ test_split ":" ":" "-1" <<EOF
 [1]: ""
 EOF

+test_split ":foo:bar:baz:" ":" "-1" <<-\EOF
+5
+[0]: ""
+[1]: "foo"
+[2]: "bar"
+[3]: "baz"
+[4]: ""
+EOF
+
 test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF
 3
 [0]: "foo"
@@ -104,6 +113,14 @@ test_split_in_place_multi "foo:;:bar:;:" ":;" "-1" <<-\EOF
 [2]: ""
 EOF

+test_split_in_place_multi ":;:foo:;:bar:;:baz:;:" ":;" "-1" <<-\EOF
+4
+[0]: "foo"
+[1]: "bar"
+[2]: "baz"
+[3]: ""
+EOF
+
 test_expect_success "test filter_string_list" '
 	test "x-" = "x$(test-tool string-list filter - y)" &&
 	test "x-" = "x$(test-tool string-list filter no y)" &&
--- >8 ---


Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`
  2023-04-18 19:18   ` [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
  2023-04-18 19:39     ` Junio C Hamano
@ 2023-04-22 11:12     ` Jeff King
  2023-04-22 15:53       ` René Scharfe
  2023-04-23  2:38       ` [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`t Taylor Blau
  1 sibling, 2 replies; 47+ messages in thread
From: Jeff King @ 2023-04-22 11:12 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Chris Torek, Junio C Hamano

On Tue, Apr 18, 2023 at 03:18:43PM -0400, Taylor Blau wrote:

> Introduce a variant of the `string_list_split_in_place()` function that
> takes a string of accepted delimiters.
> 
> By contrast to its cousin `string_list_split_in_place()` which splits
> the given string at every instance of the single character `delim`, the
> `_multi` variant splits the given string any any character appearing in
> the string `delim`.
> 
> Like `strtok()`, the `_multi` variant skips past sequential delimiting
> characters. For example:
> 
>     string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
> 
> would place in `xs` the elements "foo", "bar", and "baz".

I have mixed feelings on this.

There are multiple orthogonal options here: single/multiple delimiter,
and how to deal with sequential delimiters (you call it "runs" here,
though I think of it inverted as "allow empty fields"). Plus existing
ones like maxsplit.

Your underlying function, string_list_split_in_place_1(), handles these
independently. But it seems like a subtle gotcha that
string_list_split_in_place() and its _multi() variant, which purport to
differ only in one dimension (representation of delimiter list), also
subtly differ in another dimension. Especially because it's a facet that
might not come up in simple tests, I can see somebody incorrectly
applying one or the other.

Obviously one solution is to add the "runs" option to all variants. But
I'd be hesitant to burden existing callers. So I'd propose one of:

  1. Make your _1() function public, with a name like _with_options() or
     something (though the function name is sadly already quite long).
     Leave string_list_split_in_place() as a wrapper that behaves as
     now, and have the few new callers use the with_options() variant.

  2. Don't bother implementing the "runs" version. The only users would
     be test programs, and nobody cares much either way for their cases.
     Document in the commit message (and possibly above the function)
     that this isn't a strict replacement for strtok(). That would
     hopefully be enough for a potential caller to think about the
     behavior, and we can add "runs" if and when somebody actually wants
     it.

-Peff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 2/6] string-list: introduce `string_list_setlen()`
  2023-04-18 19:18   ` [PATCH v2 2/6] string-list: introduce `string_list_setlen()` Taylor Blau
@ 2023-04-22 11:14     ` Jeff King
  0 siblings, 0 replies; 47+ messages in thread
From: Jeff King @ 2023-04-22 11:14 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Chris Torek, Junio C Hamano

On Tue, Apr 18, 2023 at 03:18:46PM -0400, Taylor Blau wrote:

> This allows `string_list_split_in_place()` to overwrite any existing
> entries without needing to free and re-allocate them.
> 
> However, setting `list->nr` manually is not safe in all instances. There
> are a couple of cases worth worrying about:
> [...]

Thanks for adding this patch. Your explanation very nicely covers why
this is a reasonable pattern, and what pitfalls we're worried about by
doing it inline.

> Co-authored-by: Jeff King <peff@peff.net>
> Signed-off-by: Jeff King <peff@peff.net>

Acking my forged sign-off. :)

-Peff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 3/6] t/helper/test-hashmap.c: avoid using `strtok()`
  2023-04-18 19:18   ` [PATCH v2 3/6] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
@ 2023-04-22 11:16     ` Jeff King
  2023-04-24 21:19       ` Taylor Blau
  0 siblings, 1 reply; 47+ messages in thread
From: Jeff King @ 2023-04-22 11:16 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Chris Torek, Junio C Hamano

On Tue, Apr 18, 2023 at 03:18:49PM -0400, Taylor Blau wrote:

> @@ -159,21 +161,34 @@ int cmd__hashmap(int argc, const char **argv)
>  
>  	/* process commands from stdin */
>  	while (strbuf_getline(&line, stdin) != EOF) {
> -		char *cmd, *p1 = NULL, *p2 = NULL;
> +		char *cmd, *p1, *p2;
>  		unsigned int hash = 0;
>  		struct test_entry *entry;
>  
> +		/*
> +		 * Because we memdup() the arguments out of the
> +		 * string_list before inserting them into the hashmap,
> +		 * it's OK to set its length back to zero to avoid
> +		 * re-allocating the items array once per line.
> +		 *
> +		 * By doing so, we'll instead overwrite the existing
> +		 * entries and avoid re-allocating.
> +		 */
> +		string_list_setlen(&parts, 0);
>  		/* break line into command and up to two parameters */
> -		cmd = strtok(line.buf, DELIM);
> +		string_list_split_in_place_multi(&parts, line.buf, DELIM, 2);
> +

I'd argue we can drop this comment now. Having string_list_setlen()
makes it a blessed pattern, and I don't think there's anything special
about this caller that makes it more or less so. Obviously yes, the
string list items won't be valid as we enter a new loop iteration. But
that is always true of split_in_place(), not to mention strtok(),
because we are overwriting the buffer in each loop.

Ditto for the later commits which have similar (if shorter) comments.

-Peff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`
  2023-04-22 11:12     ` Jeff King
@ 2023-04-22 15:53       ` René Scharfe
  2023-04-23  0:35         ` Jeff King
  2023-04-23  2:38       ` [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`t Taylor Blau
  1 sibling, 1 reply; 47+ messages in thread
From: René Scharfe @ 2023-04-22 15:53 UTC (permalink / raw)
  To: Jeff King, Taylor Blau; +Cc: git, Chris Torek, Junio C Hamano

Am 22.04.23 um 13:12 schrieb Jeff King:
> On Tue, Apr 18, 2023 at 03:18:43PM -0400, Taylor Blau wrote:
>
>> Introduce a variant of the `string_list_split_in_place()` function that
>> takes a string of accepted delimiters.
>>
>> By contrast to its cousin `string_list_split_in_place()` which splits
>> the given string at every instance of the single character `delim`, the
>> `_multi` variant splits the given string any any character appearing in
>> the string `delim`.
>>
>> Like `strtok()`, the `_multi` variant skips past sequential delimiting
>> characters. For example:
>>
>>     string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
>>
>> would place in `xs` the elements "foo", "bar", and "baz".

Which one is it here really, string_list_split_in_place() or _multi()?

> I have mixed feelings on this.
>
> There are multiple orthogonal options here: single/multiple delimiter,
> and how to deal with sequential delimiters (you call it "runs" here,
> though I think of it inverted as "allow empty fields"). Plus existing
> ones like maxsplit.
>
> Your underlying function, string_list_split_in_place_1(), handles these
> independently. But it seems like a subtle gotcha that
> string_list_split_in_place() and its _multi() variant, which purport to
> differ only in one dimension (representation of delimiter list), also
> subtly differ in another dimension. Especially because it's a facet that
> might not come up in simple tests, I can see somebody incorrectly
> applying one or the other.
>
> Obviously one solution is to add the "runs" option to all variants. But
> I'd be hesitant to burden existing callers. So I'd propose one of:
>
>   1. Make your _1() function public, with a name like _with_options() or
>      something (though the function name is sadly already quite long).
>      Leave string_list_split_in_place() as a wrapper that behaves as
>      now, and have the few new callers use the with_options() variant.
>
>   2. Don't bother implementing the "runs" version. The only users would
>      be test programs, and nobody cares much either way for their cases.
>      Document in the commit message (and possibly above the function)
>      that this isn't a strict replacement for strtok(). That would
>      hopefully be enough for a potential caller to think about the
>      behavior, and we can add "runs" if and when somebody actually wants
>      it.

You can call string_list_remove_empty_items() to get rid of empty
strings.  And if the single-char version calls a multi-char version
under the hood then its only advantage is backward compatibility -- but
converting all callers isn't that hard.  So the only string_list split
function version you really need accepts multiple delimiters and
preserves empty items and can be called string_list_split_in_place().

René


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`
  2023-04-22 15:53       ` René Scharfe
@ 2023-04-23  0:35         ` Jeff King
  2023-04-24 16:24           ` Junio C Hamano
  0 siblings, 1 reply; 47+ messages in thread
From: Jeff King @ 2023-04-23  0:35 UTC (permalink / raw)
  To: René Scharfe; +Cc: Taylor Blau, git, Chris Torek, Junio C Hamano

On Sat, Apr 22, 2023 at 05:53:05PM +0200, René Scharfe wrote:

> > Obviously one solution is to add the "runs" option to all variants. But
> > I'd be hesitant to burden existing callers. So I'd propose one of:
> >
> >   1. Make your _1() function public, with a name like _with_options() or
> >      something (though the function name is sadly already quite long).
> >      Leave string_list_split_in_place() as a wrapper that behaves as
> >      now, and have the few new callers use the with_options() variant.
> >
> >   2. Don't bother implementing the "runs" version. The only users would
> >      be test programs, and nobody cares much either way for their cases.
> >      Document in the commit message (and possibly above the function)
> >      that this isn't a strict replacement for strtok(). That would
> >      hopefully be enough for a potential caller to think about the
> >      behavior, and we can add "runs" if and when somebody actually wants
> >      it.
> 
> You can call string_list_remove_empty_items() to get rid of empty
> strings.  And if the single-char version calls a multi-char version
> under the hood then its only advantage is backward compatibility -- but
> converting all callers isn't that hard.  So the only string_list split
> function version you really need accepts multiple delimiters and
> preserves empty items and can be called string_list_split_in_place().

Ooh, I like that. I didn't think of processing after the fact. And
indeed, we have several spots already that do the split/remove_empty
pair.

And yes, it looks like there are only 7 callers which would need a
trivial update if we just switched to the multi-char version.

That's very compelling.

-Peff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`t
  2023-04-22 11:12     ` Jeff King
  2023-04-22 15:53       ` René Scharfe
@ 2023-04-23  2:38       ` Taylor Blau
  2023-04-23  2:40         ` Taylor Blau
  1 sibling, 1 reply; 47+ messages in thread
From: Taylor Blau @ 2023-04-23  2:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Chris Torek, Junio C Hamano

On Sat, Apr 22, 2023 at 07:12:13AM -0400, Jeff King wrote:
> On Tue, Apr 18, 2023 at 03:18:43PM -0400, Taylor Blau wrote:
>
> > Introduce a variant of the `string_list_split_in_place()` function that
> > takes a string of accepted delimiters.
> >
> > By contrast to its cousin `string_list_split_in_place()` which splits
> > the given string at every instance of the single character `delim`, the
> > `_multi` variant splits the given string any any character appearing in
> > the string `delim`.
> >
> > Like `strtok()`, the `_multi` variant skips past sequential delimiting
> > characters. For example:
> >
> >     string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
> >
> > would place in `xs` the elements "foo", "bar", and "baz".
>
> I have mixed feelings on this.

Hmm. I implemented it this way partially after reading your suggestion
in [1], but also to allow using `string_list_split_in_place()` as a
strict replacement for strtok().

And I agree that for the existing users of strtok(), which are all in the
test helpers, this probably doesn't matter much either way. Though I
feel like since we are banning strtok() over the whole tree, that we
should provide a suitable replacement at the time we ban strtok(), not
at the time somebody needs to rely on its non-empty fields behavior.

[1]: https://lore.kernel.org/git/20230418102320.GB508219@coredump.intra.peff.net/

> Obviously one solution is to add the "runs" option to all variants. But
> I'd be hesitant to burden existing callers. So I'd propose one of:
>
>   1. Make your _1() function public, with a name like _with_options() or
>      something (though the function name is sadly already quite long).
>      Leave string_list_split_in_place() as a wrapper that behaves as
>      now, and have the few new callers use the with_options() variant.

I think that in general I'd prefer (2) to avoid polluting the list of
declarations in string-list.h, but in this case I think that in this
case it is the right thing to do.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`t
  2023-04-23  2:38       ` [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`t Taylor Blau
@ 2023-04-23  2:40         ` Taylor Blau
  0 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-23  2:40 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Chris Torek, Junio C Hamano

On Sat, Apr 22, 2023 at 10:38:13PM -0400, Taylor Blau wrote:
> > Obviously one solution is to add the "runs" option to all variants. But
> > I'd be hesitant to burden existing callers. So I'd propose one of:
> >
> >   1. Make your _1() function public, with a name like _with_options() or
> >      something (though the function name is sadly already quite long).
> >      Leave string_list_split_in_place() as a wrapper that behaves as
> >      now, and have the few new callers use the with_options() variant.
>
> I think that in general I'd prefer (2) to avoid polluting the list of
> declarations in string-list.h, but in this case I think that in this
> case it is the right thing to do.

Oh, nevermind. I just read René's solution below and it is much cleaner.
I take it back ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`
  2023-04-23  0:35         ` Jeff King
@ 2023-04-24 16:24           ` Junio C Hamano
  0 siblings, 0 replies; 47+ messages in thread
From: Junio C Hamano @ 2023-04-24 16:24 UTC (permalink / raw)
  To: Jeff King; +Cc: René Scharfe, Taylor Blau, git, Chris Torek

Jeff King <peff@peff.net> writes:

> And yes, it looks like there are only 7 callers which would need a
> trivial update if we just switched to the multi-char version.
>
> That's very compelling.

Sounds good.  Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 3/6] t/helper/test-hashmap.c: avoid using `strtok()`
  2023-04-22 11:16     ` Jeff King
@ 2023-04-24 21:19       ` Taylor Blau
  0 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-24 21:19 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Chris Torek, Junio C Hamano

On Sat, Apr 22, 2023 at 07:16:57AM -0400, Jeff King wrote:
> On Tue, Apr 18, 2023 at 03:18:49PM -0400, Taylor Blau wrote:
>
> > @@ -159,21 +161,34 @@ int cmd__hashmap(int argc, const char **argv)
> >
> >  	/* process commands from stdin */
> >  	while (strbuf_getline(&line, stdin) != EOF) {
> > -		char *cmd, *p1 = NULL, *p2 = NULL;
> > +		char *cmd, *p1, *p2;
> >  		unsigned int hash = 0;
> >  		struct test_entry *entry;
> >
> > +		/*
> > +		 * Because we memdup() the arguments out of the
> > +		 * string_list before inserting them into the hashmap,
> > +		 * it's OK to set its length back to zero to avoid
> > +		 * re-allocating the items array once per line.
> > +		 *
> > +		 * By doing so, we'll instead overwrite the existing
> > +		 * entries and avoid re-allocating.
> > +		 */
> > +		string_list_setlen(&parts, 0);
> >  		/* break line into command and up to two parameters */
> > -		cmd = strtok(line.buf, DELIM);
> > +		string_list_split_in_place_multi(&parts, line.buf, DELIM, 2);
> > +
>
> I'd argue we can drop this comment now. Having string_list_setlen()
> makes it a blessed pattern, and I don't think there's anything special
> about this caller that makes it more or less so. Obviously yes, the
> string list items won't be valid as we enter a new loop iteration. But
> that is always true of split_in_place(), not to mention strtok(),
> because we are overwriting the buffer in each loop.

Agreed, I think that part of the point of string_list_setlen() is that
this is a blessed pattern, so shouldn't need a comment.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` as banned
  2023-04-13 23:31 [PATCH 0/5] banned: mark `strok()`, `strtok_r()` as banned Taylor Blau
                   ` (5 preceding siblings ...)
  2023-04-18 19:18 ` [PATCH v2 0/6] banned: mark `strok()` " Taylor Blau
@ 2023-04-24 22:20 ` Taylor Blau
  2023-04-24 22:20   ` [PATCH v3 1/6] string-list: multi-delimiter `string_list_split_in_place()` Taylor Blau
                     ` (6 more replies)
  6 siblings, 7 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-24 22:20 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

Here is another medium-sized reroll of my series to add `strtok()` (and
`strtok_r()`!) to the list of banned functions.

Notable changes include:

  - Dropped `string_list_split_in_place_multi()` in favor of a
    combination of `string_list_split_in_place()` and
    `string_list_remove_empty_items()`.

  - `strtok_r()` is back on the banned list, with a more realistic sales
    pitch.

As always, a range-diff is included below for convenience. Thanks in
advance for your review!

[1]: https://lore.kernel.org/git/cover.1681428696.git.me@ttaylorr.com/

Taylor Blau (6):
  string-list: multi-delimiter `string_list_split_in_place()`
  string-list: introduce `string_list_setlen()`
  t/helper/test-hashmap.c: avoid using `strtok()`
  t/helper/test-oidmap.c: avoid using `strtok()`
  t/helper/test-json-writer.c: avoid using `strtok()`
  banned.h: mark `strtok()` and `strtok_r()` as banned

 banned.h                    |  4 ++
 builtin/gc.c                |  4 +-
 diff.c                      |  2 +-
 notes.c                     |  2 +-
 refs/packed-backend.c       |  2 +-
 string-list.c               | 13 ++++++-
 string-list.h               | 12 +++++-
 t/helper/test-hashmap.c     | 22 +++++++----
 t/helper/test-json-writer.c | 76 +++++++++++++++++++++++--------------
 t/helper/test-oidmap.c      | 20 +++++++---
 t/helper/test-string-list.c |  4 +-
 t/t0063-string-list.sh      | 51 +++++++++++++++++++++++++
 12 files changed, 161 insertions(+), 51 deletions(-)

Range-diff against v2:
1:  6658b231a9 ! 1:  59d3e778b6 string-list: introduce `string_list_split_in_place_multi()`
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    string-list: introduce `string_list_split_in_place_multi()`
    +    string-list: multi-delimiter `string_list_split_in_place()`
     
    -    Introduce a variant of the `string_list_split_in_place()` function that
    -    takes a string of accepted delimiters.
    -
    -    By contrast to its cousin `string_list_split_in_place()` which splits
    -    the given string at every instance of the single character `delim`, the
    -    `_multi` variant splits the given string any any character appearing in
    -    the string `delim`.
    -
    -    Like `strtok()`, the `_multi` variant skips past sequential delimiting
    -    characters. For example:
    -
    -        string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
    -
    -    would place in `xs` the elements "foo", "bar", and "baz".
    +    Enhance `string_list_split_in_place()` to accept multiple characters as
    +    delimiters instead of a single character.
     
         Instead of using `strchr(2)` to locate the first occurrence of the given
         delimiter character, `string_list_split_in_place_multi()` uses
         `strcspn(2)` to move past the initial segment of characters comprised of
         any characters in the delimiting set.
     
    -    When only a single delimiting character is provided, `strcspn(2)` has
    -    equivalent performance to `strchr(2)`. Modern `strcspn(2)`
    -    implementations treat an empty delimiter or the singleton delimiter as a
    -    special case and fall back to calling strchrnul(). Both glibc[1] and
    -    musl[2] implement `strcspn(2)` this way.
    +    When only a single delimiting character is provided, `strpbrk(2)` (which
    +    is implemented with `strcspn(2)`) has equivalent performance to
    +    `strchr(2)`. Modern `strcspn(2)` implementations treat an empty
    +    delimiter or the singleton delimiter as a special case and fall back to
    +    calling strchrnul(). Both glibc[1] and musl[2] implement `strcspn(2)`
    +    this way.
     
    -    Since the `_multi` variant is a generalization of the original
    -    implementation, reimplement `string_list_split_in_place()` in terms of
    -    the more general function by providing a single-character string for the
    -    list of accepted delimiters.
    +    This change is one step to removing `strtok(2)` from the tree. Note that
    +    `string_list_split_in_place()` is not a strict replacement for
    +    `strtok()`, since it will happily turn sequential delimiter characters
    +    into empty entries in the resulting string_list. For example:
     
    -    To avoid regressions, update t0063 in this patch as well. Any "common"
    -    test cases (i.e., those that produce the same result whether you call
    -    `string_list_split()` or `string_list_split_in_place_multi()`) are
    -    grouped into a loop which is parameterized over the function to test.
    +        string_list_split_in_place(&xs, "foo:;:bar:;:baz", ":;", -1)
     
    -    Any cases which aren't common (of which there is one existing case, and
    -    a handful of new ones added which are specific to the `_multi` variant)
    -    are tested independently.
    +    would yield a string list of:
    +
    +        ["foo", "", "", "bar", "", "", "baz"]
    +
    +    Callers that wish to emulate the behavior of strtok(2) more directly
    +    should call `string_list_remove_empty_items()` after splitting.
    +
    +    To avoid regressions for the new multi-character delimter cases, update
    +    t0063 in this patch as well.
     
         [1]: https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcspn.c;hb=glibc-2.37#l35
         [2]: https://git.musl-libc.org/cgit/musl/tree/src/string/strcspn.c?h=v1.2.3#n11
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    + ## builtin/gc.c ##
    +@@ builtin/gc.c: static int get_schedule_cmd(const char **cmd, int *is_available)
    + 	if (is_available)
    + 		*is_available = 0;
    + 
    +-	string_list_split_in_place(&list, testing, ',', -1);
    ++	string_list_split_in_place(&list, testing, ",", -1);
    + 	for_each_string_list_item(item, &list) {
    + 		struct string_list pair = STRING_LIST_INIT_NODUP;
    + 
    +-		if (string_list_split_in_place(&pair, item->string, ':', 2) != 2)
    ++		if (string_list_split_in_place(&pair, item->string, ":", 2) != 2)
    + 			continue;
    + 
    + 		if (!strcmp(*cmd, pair.items[0].string)) {
    +
    + ## diff.c ##
    +@@ diff.c: static int parse_dirstat_params(struct diff_options *options, const char *params
    + 	int i;
    + 
    + 	if (*params_copy)
    +-		string_list_split_in_place(&params, params_copy, ',', -1);
    ++		string_list_split_in_place(&params, params_copy, ",", -1);
    + 	for (i = 0; i < params.nr; i++) {
    + 		const char *p = params.items[i].string;
    + 		if (!strcmp(p, "changes")) {
    +
    + ## notes.c ##
    +@@ notes.c: void string_list_add_refs_from_colon_sep(struct string_list *list,
    + 	char *globs_copy = xstrdup(globs);
    + 	int i;
    + 
    +-	string_list_split_in_place(&split, globs_copy, ':', -1);
    ++	string_list_split_in_place(&split, globs_copy, ":", -1);
    + 	string_list_remove_empty_items(&split, 0);
    + 
    + 	for (i = 0; i < split.nr; i++)
    +
    + ## refs/packed-backend.c ##
    +@@ refs/packed-backend.c: static struct snapshot *create_snapshot(struct packed_ref_store *refs)
    + 					 snapshot->buf,
    + 					 snapshot->eof - snapshot->buf);
    + 
    +-		string_list_split_in_place(&traits, p, ' ', -1);
    ++		string_list_split_in_place(&traits, p, " ", -1);
    + 
    + 		if (unsorted_string_list_has_string(&traits, "fully-peeled"))
    + 			snapshot->peeled = PEELED_FULLY;
    +
      ## string-list.c ##
     @@ string-list.c: int string_list_split(struct string_list *list, const char *string,
    - 	}
      }
      
    --int string_list_split_in_place(struct string_list *list, char *string,
    + int string_list_split_in_place(struct string_list *list, char *string,
     -			       int delim, int maxsplit)
    -+static int string_list_split_in_place_1(struct string_list *list, char *string,
    -+					const char *delim, int maxsplit,
    -+					unsigned runs)
    ++			       const char *delim, int maxsplit)
      {
      	int count = 0;
      	char *p = string, *end;
     @@ string-list.c: int string_list_split_in_place(struct string_list *list, char *string,
    - 		die("internal error in string_list_split_in_place(): "
    - 		    "list->strdup_strings must not be set");
    - 	for (;;) {
    -+		if (runs)
    -+			p += strspn(p, delim);
    -+
    - 		count++;
    - 		if (maxsplit >= 0 && count > maxsplit) {
      			string_list_append(list, p);
      			return count;
      		}
     -		end = strchr(p, delim);
    --		if (end) {
    -+		end = p + strcspn(p, delim);
    -+		if (end && *end) {
    ++		end = strpbrk(p, delim);
    + 		if (end) {
      			*end = '\0';
      			string_list_append(list, p);
    - 			p = end + 1;
    -@@ string-list.c: int string_list_split_in_place(struct string_list *list, char *string,
    - 		}
    - 	}
    - }
    -+
    -+int string_list_split_in_place_multi(struct string_list *list, char *string,
    -+				     const char *delim, int maxsplit)
    -+{
    -+	return string_list_split_in_place_1(list, string, delim, maxsplit, 1);
    -+}
    -+
    -+int string_list_split_in_place(struct string_list *list, char *string,
    -+			       int delim, int maxsplit)
    -+{
    -+	char delim_s[2] = { delim, 0 };
    -+
    -+	return string_list_split_in_place_1(list, string, delim_s, maxsplit, 0);
    -+}
     
      ## string-list.h ##
     @@ string-list.h: int string_list_split(struct string_list *list, const char *string,
    -  * new string_list_items point into string (which therefore must not
    -  * be modified or freed while the string_list is in use).
       * list->strdup_strings must *not* be set.
    -+ *
    -+ * The "_multi" variant splits the given string on any character
    -+ * appearing in "delim", and the non-"_multi" variant splits only on the
    -+ * given character. The "_multi" variant behaves like `strtok()` where
    -+ * no element contains the delimiting byte(s).
       */
    -+int string_list_split_in_place_multi(struct string_list *list, char *string,
    -+				     const char *delim, int maxsplit);
      int string_list_split_in_place(struct string_list *list, char *string,
    - 			       int delim, int maxsplit);
    +-			       int delim, int maxsplit);
    ++			       const char *delim, int maxsplit);
      #endif /* STRING_LIST_H */
     
      ## t/helper/test-string-list.c ##
     @@ t/helper/test-string-list.c: int cmd__string_list(int argc, const char **argv)
    - 		return 0;
    - 	}
    - 
    -+	if (argc == 5 && !strcmp(argv[1], "split_in_place_multi")) {
    -+		struct string_list list = STRING_LIST_INIT_NODUP;
    -+		int i;
    -+		char *s = xstrdup(argv[2]);
    + 		struct string_list list = STRING_LIST_INIT_NODUP;
    + 		int i;
    + 		char *s = xstrdup(argv[2]);
    +-		int delim = *argv[3];
     +		const char *delim = argv[3];
    -+		int maxsplit = atoi(argv[4]);
    -+
    -+		i = string_list_split_in_place_multi(&list, s, delim, maxsplit);
    -+		printf("%d\n", i);
    -+		write_list(&list);
    -+		string_list_clear(&list, 0);
    -+		free(s);
    -+		return 0;
    -+	}
    -+
    - 	if (argc == 4 && !strcmp(argv[1], "filter")) {
    - 		/*
    - 		 * Retain only the items that have the specified prefix.
    + 		int maxsplit = atoi(argv[4]);
    + 
    + 		i = string_list_split_in_place(&list, s, delim, maxsplit);
    +@@ t/helper/test-string-list.c: int cmd__string_list(int argc, const char **argv)
    + 		 */
    + 		if (sb.len && sb.buf[sb.len - 1] == '\n')
    + 			strbuf_setlen(&sb, sb.len - 1);
    +-		string_list_split_in_place(&list, sb.buf, '\n', -1);
    ++		string_list_split_in_place(&list, sb.buf, "\n", -1);
    + 
    + 		string_list_sort(&list);
    + 
     
      ## t/t0063-string-list.sh ##
     @@ t/t0063-string-list.sh: test_split () {
      	"
      }
      
    --test_split "foo:bar:baz" ":" "-1" <<EOF
    --3
    --[0]: "foo"
    --[1]: "bar"
    --[2]: "baz"
    --EOF
    -+test_split_in_place_multi () {
    ++test_split_in_place() {
     +	cat >expected &&
    -+	test_expect_success "split_in_place_multi $1 at $2, max $3" "
    -+		test-tool string-list split_in_place_multi '$1' '$2' '$3' >actual &&
    ++	test_expect_success "split (in place) $1 at $2, max $3" "
    ++		test-tool string-list split_in_place '$1' '$2' '$3' >actual &&
     +		test_cmp expected actual
     +	"
     +}
    - 
    --test_split "foo:bar:baz" ":" "0" <<EOF
    --1
    --[0]: "foo:bar:baz"
    --EOF
    -+for test_fn in test_split test_split_in_place_multi
    -+do
    -+	$test_fn "foo:bar:baz" ":" "-1" <<-\EOF
    -+	3
    -+	[0]: "foo"
    -+	[1]: "bar"
    -+	[2]: "baz"
    -+	EOF
    - 
    --test_split "foo:bar:baz" ":" "1" <<EOF
    --2
    --[0]: "foo"
    --[1]: "bar:baz"
    --EOF
    -+	$test_fn "foo:bar:baz" ":" "0" <<-\EOF
    -+	1
    -+	[0]: "foo:bar:baz"
    -+	EOF
    - 
    --test_split "foo:bar:baz" ":" "2" <<EOF
    --3
    --[0]: "foo"
    --[1]: "bar"
    --[2]: "baz"
    --EOF
    -+	$test_fn "foo:bar:baz" ":" "1" <<-\EOF
    -+	2
    -+	[0]: "foo"
    -+	[1]: "bar:baz"
    -+	EOF
    - 
    --test_split "foo:bar:" ":" "-1" <<EOF
    --3
    --[0]: "foo"
    --[1]: "bar"
    --[2]: ""
    --EOF
    -+	$test_fn "foo:bar:baz" ":" "2" <<-\EOF
    -+	3
    -+	[0]: "foo"
    -+	[1]: "bar"
    -+	[2]: "baz"
    -+	EOF
    - 
    --test_split "" ":" "-1" <<EOF
    --1
    --[0]: ""
    --EOF
    -+	$test_fn "foo:bar:" ":" "-1" <<-\EOF
    -+	3
    -+	[0]: "foo"
    -+	[1]: "bar"
    -+	[2]: ""
    -+	EOF
     +
    -+	$test_fn "" ":" "-1" <<-\EOF
    -+	1
    -+	[0]: ""
    -+	EOF
    -+done
    - 
    - test_split ":" ":" "-1" <<EOF
    - 2
    + test_split "foo:bar:baz" ":" "-1" <<EOF
    + 3
    + [0]: "foo"
     @@ t/t0063-string-list.sh: test_split ":" ":" "-1" <<EOF
      [1]: ""
      EOF
      
    -+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF
    -+3
    ++test_split_in_place "foo:;:bar:;:baz:;:" ":;" "-1" <<EOF
    ++10
     +[0]: "foo"
    -+[1]: "bar"
    -+[2]: "baz"
    ++[1]: ""
    ++[2]: ""
    ++[3]: "bar"
    ++[4]: ""
    ++[5]: ""
    ++[6]: "baz"
    ++[7]: ""
    ++[8]: ""
    ++[9]: ""
     +EOF
     +
    -+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "0" <<-\EOF
    ++test_split_in_place "foo:;:bar:;:baz" ":;" "0" <<EOF
     +1
     +[0]: "foo:;:bar:;:baz"
     +EOF
     +
    -+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "1" <<-\EOF
    ++test_split_in_place "foo:;:bar:;:baz" ":;" "1" <<EOF
     +2
     +[0]: "foo"
    -+[1]: "bar:;:baz"
    ++[1]: ";:bar:;:baz"
     +EOF
     +
    -+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "2" <<-\EOF
    ++test_split_in_place "foo:;:bar:;:baz" ":;" "2" <<EOF
     +3
     +[0]: "foo"
    -+[1]: "bar"
    -+[2]: "baz"
    ++[1]: ""
    ++[2]: ":bar:;:baz"
     +EOF
     +
    -+test_split_in_place_multi "foo:;:bar:;:" ":;" "-1" <<-\EOF
    -+3
    ++test_split_in_place "foo:;:bar:;:" ":;" "-1" <<EOF
    ++7
     +[0]: "foo"
    -+[1]: "bar"
    ++[1]: ""
     +[2]: ""
    ++[3]: "bar"
    ++[4]: ""
    ++[5]: ""
    ++[6]: ""
     +EOF
     +
      test_expect_success "test filter_string_list" '
2:  2a20ad8bc5 = 2:  ae8d0ce1f2 string-list: introduce `string_list_setlen()`
3:  0ae07dec36 ! 3:  78ecf13cb0 t/helper/test-hashmap.c: avoid using `strtok()`
    @@ Commit message
     
         Avoid using the non-reentrant `strtok()` to separate the parts of each
         incoming command. Instead of replacing it with `strtok_r()`, let's
    -    instead use the more friendly `string_list_split_in_place_multi()`.
    +    instead use the more friendly pair of `string_list_split_in_place()` and
    +    `string_list_remove_empty_items()`.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ t/helper/test-hashmap.c: int cmd__hashmap(int argc, const char **argv)
      		unsigned int hash = 0;
      		struct test_entry *entry;
      
    -+		/*
    -+		 * Because we memdup() the arguments out of the
    -+		 * string_list before inserting them into the hashmap,
    -+		 * it's OK to set its length back to zero to avoid
    -+		 * re-allocating the items array once per line.
    -+		 *
    -+		 * By doing so, we'll instead overwrite the existing
    -+		 * entries and avoid re-allocating.
    -+		 */
    -+		string_list_setlen(&parts, 0);
      		/* break line into command and up to two parameters */
     -		cmd = strtok(line.buf, DELIM);
    -+		string_list_split_in_place_multi(&parts, line.buf, DELIM, 2);
    ++		string_list_setlen(&parts, 0);
    ++		string_list_split_in_place(&parts, line.buf, DELIM, 2);
    ++		string_list_remove_empty_items(&parts, 0);
     +
      		/* ignore empty lines */
     -		if (!cmd || *cmd == '#')
4:  a659431e9c ! 4:  c9b929406a t/helper/test-oidmap.c: avoid using `strtok()`
    @@ t/helper/test-oidmap.c: int cmd__oidmap(int argc UNUSED, const char **argv UNUSE
      		struct test_entry *entry;
      		struct object_id oid;
      
    -+		/* see the comment in cmd__hashmap() */
    -+		string_list_setlen(&parts, 0);
      		/* break line into command and up to two parameters */
     -		cmd = strtok(line.buf, DELIM);
    -+		string_list_split_in_place_multi(&parts, line.buf, DELIM, 2);
    ++		string_list_setlen(&parts, 0);
    ++		string_list_split_in_place(&parts, line.buf, DELIM, 2);
    ++		string_list_remove_empty_items(&parts, 0);
     +
      		/* ignore empty lines */
     -		if (!cmd || *cmd == '#')
5:  fc6cd23698 ! 5:  201fcac6c4 t/helper/test-json-writer.c: avoid using `strtok()`
    @@ t/helper/test-json-writer.c: static int scripted(void)
     +		state.nr = ++line_nr;
      
     -		verb = strtok(line, " ");
    -+		/* see the comment in cmd__hashmap() */
    -+		string_list_setlen(&parts, 0);
     +		/* break line into command and zero or more tokens */
    -+		string_list_split_in_place(&parts, line, ' ', -1);
    ++		string_list_setlen(&parts, 0);
    ++		string_list_split_in_place(&parts, line, " ", -1);
    ++		string_list_remove_empty_items(&parts, 0);
     +
     +		/* ignore empty lines */
     +		if (!parts.nr || !*parts.items[0].string)
6:  56d2318a6d ! 6:  da896aa358 banned.h: mark `strtok()` as banned
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    banned.h: mark `strtok()` as banned
    -
    -    `strtok_r()` is reentrant, but `strtok()` is not, meaning that using it
    -    is not thread-safe.
    +    banned.h: mark `strtok()` and `strtok_r()` as banned
     
         `strtok()` has a couple of drawbacks that make it undesirable to have
         any new instances. In addition to being thread-unsafe, it also
    @@ Commit message
     
         Now that we have removed all instances of `strtok()` from the tree,
         let's ban `strtok()` to avoid introducing new ones in the future. If new
    -    callers should arise, they can either use:
    +    callers should arise, they are encouraged to use
    +    `string_list_split_in_place()` (and `string_list_remove_empty_items()`,
    +    if applicable).
     
    -      - `string_list_split_in_place()`,
    -      - `string_list_split_in_place_multi()`, or
    -      - `strtok_r()`.
    +    string_list_split_in_place() is not a perfect drop-in replacement
    +    for `strtok_r()`, particularly if the caller is processing a string with
    +    an arbitrary number of tokens, and wants to process each token one at a
    +    time.
     
    -    Callers are encouraged to use either of the string_list functions when
    -    appropriate over `strtok_r()`, since the latter suffers from the same
    -    confusing data-flow problem as `strtok()` does.
    -
    -    But callers may prefer `strtok_r()` when the number of tokens in a given
    -    string is unknown, and they want to split and process them one at a
    -    time, so `strtok_r()` is left off the banned.h list.
    +    But there are no instances of this in Git's tree which are more
    +    well-suited to `strtok_r()` than the friendlier
    +    `string_list_in_place()`, so ban `strtok_r()`, too.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ banned.h
      #define strncat(x,y,n) BANNED(strncat)
     +#undef strtok
     +#define strtok(x,y) BANNED(strtok)
    ++#undef strtok_r
    ++#define strtok_r(x,y,z) BANNED(strtok_r)
      
      #undef sprintf
      #undef vsprintf
-- 
2.40.0.380.gd2df7d2365

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v3 1/6] string-list: multi-delimiter `string_list_split_in_place()`
  2023-04-24 22:20 ` [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` " Taylor Blau
@ 2023-04-24 22:20   ` Taylor Blau
  2023-04-24 22:20   ` [PATCH v3 2/6] string-list: introduce `string_list_setlen()` Taylor Blau
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-24 22:20 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

Enhance `string_list_split_in_place()` to accept multiple characters as
delimiters instead of a single character.

Instead of using `strchr(2)` to locate the first occurrence of the given
delimiter character, `string_list_split_in_place_multi()` uses
`strcspn(2)` to move past the initial segment of characters comprised of
any characters in the delimiting set.

When only a single delimiting character is provided, `strpbrk(2)` (which
is implemented with `strcspn(2)`) has equivalent performance to
`strchr(2)`. Modern `strcspn(2)` implementations treat an empty
delimiter or the singleton delimiter as a special case and fall back to
calling strchrnul(). Both glibc[1] and musl[2] implement `strcspn(2)`
this way.

This change is one step to removing `strtok(2)` from the tree. Note that
`string_list_split_in_place()` is not a strict replacement for
`strtok()`, since it will happily turn sequential delimiter characters
into empty entries in the resulting string_list. For example:

    string_list_split_in_place(&xs, "foo:;:bar:;:baz", ":;", -1)

would yield a string list of:

    ["foo", "", "", "bar", "", "", "baz"]

Callers that wish to emulate the behavior of strtok(2) more directly
should call `string_list_remove_empty_items()` after splitting.

To avoid regressions for the new multi-character delimter cases, update
t0063 in this patch as well.

[1]: https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcspn.c;hb=glibc-2.37#l35
[2]: https://git.musl-libc.org/cgit/musl/tree/src/string/strcspn.c?h=v1.2.3#n11

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/gc.c                |  4 +--
 diff.c                      |  2 +-
 notes.c                     |  2 +-
 refs/packed-backend.c       |  2 +-
 string-list.c               |  4 +--
 string-list.h               |  2 +-
 t/helper/test-string-list.c |  4 +--
 t/t0063-string-list.sh      | 51 +++++++++++++++++++++++++++++++++++++
 8 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index edd98d35a5..f68e976704 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1687,11 +1687,11 @@ static int get_schedule_cmd(const char **cmd, int *is_available)
 	if (is_available)
 		*is_available = 0;
 
-	string_list_split_in_place(&list, testing, ',', -1);
+	string_list_split_in_place(&list, testing, ",", -1);
 	for_each_string_list_item(item, &list) {
 		struct string_list pair = STRING_LIST_INIT_NODUP;
 
-		if (string_list_split_in_place(&pair, item->string, ':', 2) != 2)
+		if (string_list_split_in_place(&pair, item->string, ":", 2) != 2)
 			continue;
 
 		if (!strcmp(*cmd, pair.items[0].string)) {
diff --git a/diff.c b/diff.c
index 78b0fdd8ca..378a0248e1 100644
--- a/diff.c
+++ b/diff.c
@@ -134,7 +134,7 @@ static int parse_dirstat_params(struct diff_options *options, const char *params
 	int i;
 
 	if (*params_copy)
-		string_list_split_in_place(&params, params_copy, ',', -1);
+		string_list_split_in_place(&params, params_copy, ",", -1);
 	for (i = 0; i < params.nr; i++) {
 		const char *p = params.items[i].string;
 		if (!strcmp(p, "changes")) {
diff --git a/notes.c b/notes.c
index 45fb7f22d1..eee806f626 100644
--- a/notes.c
+++ b/notes.c
@@ -963,7 +963,7 @@ void string_list_add_refs_from_colon_sep(struct string_list *list,
 	char *globs_copy = xstrdup(globs);
 	int i;
 
-	string_list_split_in_place(&split, globs_copy, ':', -1);
+	string_list_split_in_place(&split, globs_copy, ":", -1);
 	string_list_remove_empty_items(&split, 0);
 
 	for (i = 0; i < split.nr; i++)
diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 1eba1015dd..cc903baa7e 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -650,7 +650,7 @@ static struct snapshot *create_snapshot(struct packed_ref_store *refs)
 					 snapshot->buf,
 					 snapshot->eof - snapshot->buf);
 
-		string_list_split_in_place(&traits, p, ' ', -1);
+		string_list_split_in_place(&traits, p, " ", -1);
 
 		if (unsorted_string_list_has_string(&traits, "fully-peeled"))
 			snapshot->peeled = PEELED_FULLY;
diff --git a/string-list.c b/string-list.c
index db473f273e..5f5b60fe1c 100644
--- a/string-list.c
+++ b/string-list.c
@@ -301,7 +301,7 @@ int string_list_split(struct string_list *list, const char *string,
 }
 
 int string_list_split_in_place(struct string_list *list, char *string,
-			       int delim, int maxsplit)
+			       const char *delim, int maxsplit)
 {
 	int count = 0;
 	char *p = string, *end;
@@ -315,7 +315,7 @@ int string_list_split_in_place(struct string_list *list, char *string,
 			string_list_append(list, p);
 			return count;
 		}
-		end = strchr(p, delim);
+		end = strpbrk(p, delim);
 		if (end) {
 			*end = '\0';
 			string_list_append(list, p);
diff --git a/string-list.h b/string-list.h
index c7b0d5d000..77854840f7 100644
--- a/string-list.h
+++ b/string-list.h
@@ -270,5 +270,5 @@ int string_list_split(struct string_list *list, const char *string,
  * list->strdup_strings must *not* be set.
  */
 int string_list_split_in_place(struct string_list *list, char *string,
-			       int delim, int maxsplit);
+			       const char *delim, int maxsplit);
 #endif /* STRING_LIST_H */
diff --git a/t/helper/test-string-list.c b/t/helper/test-string-list.c
index 2123dda85b..63df88575c 100644
--- a/t/helper/test-string-list.c
+++ b/t/helper/test-string-list.c
@@ -62,7 +62,7 @@ int cmd__string_list(int argc, const char **argv)
 		struct string_list list = STRING_LIST_INIT_NODUP;
 		int i;
 		char *s = xstrdup(argv[2]);
-		int delim = *argv[3];
+		const char *delim = argv[3];
 		int maxsplit = atoi(argv[4]);
 
 		i = string_list_split_in_place(&list, s, delim, maxsplit);
@@ -111,7 +111,7 @@ int cmd__string_list(int argc, const char **argv)
 		 */
 		if (sb.len && sb.buf[sb.len - 1] == '\n')
 			strbuf_setlen(&sb, sb.len - 1);
-		string_list_split_in_place(&list, sb.buf, '\n', -1);
+		string_list_split_in_place(&list, sb.buf, "\n", -1);
 
 		string_list_sort(&list);
 
diff --git a/t/t0063-string-list.sh b/t/t0063-string-list.sh
index 46d4839194..1fee6d9010 100755
--- a/t/t0063-string-list.sh
+++ b/t/t0063-string-list.sh
@@ -18,6 +18,14 @@ test_split () {
 	"
 }
 
+test_split_in_place() {
+	cat >expected &&
+	test_expect_success "split (in place) $1 at $2, max $3" "
+		test-tool string-list split_in_place '$1' '$2' '$3' >actual &&
+		test_cmp expected actual
+	"
+}
+
 test_split "foo:bar:baz" ":" "-1" <<EOF
 3
 [0]: "foo"
@@ -61,6 +69,49 @@ test_split ":" ":" "-1" <<EOF
 [1]: ""
 EOF
 
+test_split_in_place "foo:;:bar:;:baz:;:" ":;" "-1" <<EOF
+10
+[0]: "foo"
+[1]: ""
+[2]: ""
+[3]: "bar"
+[4]: ""
+[5]: ""
+[6]: "baz"
+[7]: ""
+[8]: ""
+[9]: ""
+EOF
+
+test_split_in_place "foo:;:bar:;:baz" ":;" "0" <<EOF
+1
+[0]: "foo:;:bar:;:baz"
+EOF
+
+test_split_in_place "foo:;:bar:;:baz" ":;" "1" <<EOF
+2
+[0]: "foo"
+[1]: ";:bar:;:baz"
+EOF
+
+test_split_in_place "foo:;:bar:;:baz" ":;" "2" <<EOF
+3
+[0]: "foo"
+[1]: ""
+[2]: ":bar:;:baz"
+EOF
+
+test_split_in_place "foo:;:bar:;:" ":;" "-1" <<EOF
+7
+[0]: "foo"
+[1]: ""
+[2]: ""
+[3]: "bar"
+[4]: ""
+[5]: ""
+[6]: ""
+EOF
+
 test_expect_success "test filter_string_list" '
 	test "x-" = "x$(test-tool string-list filter - y)" &&
 	test "x-" = "x$(test-tool string-list filter no y)" &&
-- 
2.40.0.380.gd2df7d2365


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 2/6] string-list: introduce `string_list_setlen()`
  2023-04-24 22:20 ` [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` " Taylor Blau
  2023-04-24 22:20   ` [PATCH v3 1/6] string-list: multi-delimiter `string_list_split_in_place()` Taylor Blau
@ 2023-04-24 22:20   ` Taylor Blau
  2023-04-25  6:21     ` Jeff King
  2023-04-24 22:20   ` [PATCH v3 3/6] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Taylor Blau @ 2023-04-24 22:20 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

It is sometimes useful to reduce the size of a `string_list`'s list of
items without having to re-allocate them. For example, doing the
following:

    struct strbuf buf = STRBUF_INIT;
    struct string_list parts = STRING_LIST_INIT_NO_DUP;
    while (strbuf_getline(&buf, stdin) != EOF) {
      parts.nr = 0;
      string_list_split_in_place(&parts, buf.buf, ":", -1);
      /* ... */
    }
    string_list_clear(&parts, 0);

is preferable over calling `string_list_clear()` on every iteration of
the loop. This is because `string_list_clear()` causes us free our
existing `items` array. This means that every time we call
`string_list_split_in_place()`, the string-list internals re-allocate
the same size array.

Since in the above example we do not care about the individual parts
after processing each line, it is much more efficient to pretend that
there aren't any elements in the `string_list` by setting `list->nr` to
0 while leaving the list of elements allocated as-is.

This allows `string_list_split_in_place()` to overwrite any existing
entries without needing to free and re-allocate them.

However, setting `list->nr` manually is not safe in all instances. There
are a couple of cases worth worrying about:

  - If the `string_list` is initialized with `strdup_strings`,
    truncating the list can lead to overwriting strings which are
    allocated elsewhere. If there aren't any other pointers to those
    strings other than the ones inside of the `items` array, they will
    become unreachable and leak.

    (We could ourselves free the truncated items between
    string_list->items[nr] and `list->nr`, but no present or future
    callers would benefit from this additional complexity).

  - If the given `nr` is larger than the current value of `list->nr`,
    we'll trick the `string_list` into a state where it thinks there are
    more items allocated than there actually are, which can lead to
    undefined behavior if we try to read or write those entries.

Guard against both of these by introducing a helper function which
guards assignment of `list->nr` against each of the above conditions.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 string-list.c |  9 +++++++++
 string-list.h | 10 ++++++++++
 2 files changed, 19 insertions(+)

diff --git a/string-list.c b/string-list.c
index 5f5b60fe1c..0f8ac117fd 100644
--- a/string-list.c
+++ b/string-list.c
@@ -203,6 +203,15 @@ void string_list_clear_func(struct string_list *list, string_list_clear_func_t c
 	list->nr = list->alloc = 0;
 }
 
+void string_list_setlen(struct string_list *list, size_t nr)
+{
+	if (list->strdup_strings)
+		BUG("cannot setlen a string_list which owns its entries");
+	if (nr > list->nr)
+		BUG("cannot grow a string_list with setlen");
+	list->nr = nr;
+}
+
 struct string_list_item *string_list_append_nodup(struct string_list *list,
 						  char *string)
 {
diff --git a/string-list.h b/string-list.h
index 77854840f7..122b318641 100644
--- a/string-list.h
+++ b/string-list.h
@@ -134,6 +134,16 @@ typedef void (*string_list_clear_func_t)(void *p, const char *str);
 /** Call a custom clear function on each util pointer */
 void string_list_clear_func(struct string_list *list, string_list_clear_func_t clearfunc);
 
+/*
+ * Set the length of a string_list to `nr`, provided that (a) `list`
+ * does not own its own storage, and (b) that `nr` is no larger than
+ * `list->nr`.
+ *
+ * Useful when "shrinking" `list` to write over existing entries that
+ * are no longer used without reallocating.
+ */
+void string_list_setlen(struct string_list *list, size_t nr);
+
 /**
  * Apply `func` to each item. If `func` returns nonzero, the
  * iteration aborts and the return value is propagated.
-- 
2.40.0.380.gd2df7d2365


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 3/6] t/helper/test-hashmap.c: avoid using `strtok()`
  2023-04-24 22:20 ` [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` " Taylor Blau
  2023-04-24 22:20   ` [PATCH v3 1/6] string-list: multi-delimiter `string_list_split_in_place()` Taylor Blau
  2023-04-24 22:20   ` [PATCH v3 2/6] string-list: introduce `string_list_setlen()` Taylor Blau
@ 2023-04-24 22:20   ` Taylor Blau
  2023-04-24 22:20   ` [PATCH v3 4/6] t/helper/test-oidmap.c: " Taylor Blau
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-24 22:20 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

Avoid using the non-reentrant `strtok()` to separate the parts of each
incoming command. Instead of replacing it with `strtok_r()`, let's
instead use the more friendly pair of `string_list_split_in_place()` and
`string_list_remove_empty_items()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hashmap.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/t/helper/test-hashmap.c b/t/helper/test-hashmap.c
index 36ff07bd4b..0eb0b3d49c 100644
--- a/t/helper/test-hashmap.c
+++ b/t/helper/test-hashmap.c
@@ -2,6 +2,7 @@
 #include "git-compat-util.h"
 #include "hashmap.h"
 #include "strbuf.h"
+#include "string-list.h"
 
 struct test_entry
 {
@@ -150,6 +151,7 @@ static void perf_hashmap(unsigned int method, unsigned int rounds)
  */
 int cmd__hashmap(int argc, const char **argv)
 {
+	struct string_list parts = STRING_LIST_INIT_NODUP;
 	struct strbuf line = STRBUF_INIT;
 	int icase;
 	struct hashmap map = HASHMAP_INIT(test_entry_cmp, &icase);
@@ -159,21 +161,26 @@ int cmd__hashmap(int argc, const char **argv)
 
 	/* process commands from stdin */
 	while (strbuf_getline(&line, stdin) != EOF) {
-		char *cmd, *p1 = NULL, *p2 = NULL;
+		char *cmd, *p1, *p2;
 		unsigned int hash = 0;
 		struct test_entry *entry;
 
 		/* break line into command and up to two parameters */
-		cmd = strtok(line.buf, DELIM);
+		string_list_setlen(&parts, 0);
+		string_list_split_in_place(&parts, line.buf, DELIM, 2);
+		string_list_remove_empty_items(&parts, 0);
+
 		/* ignore empty lines */
-		if (!cmd || *cmd == '#')
+		if (!parts.nr)
+			continue;
+		if (!*parts.items[0].string || *parts.items[0].string == '#')
 			continue;
 
-		p1 = strtok(NULL, DELIM);
-		if (p1) {
+		cmd = parts.items[0].string;
+		p1 = parts.nr >= 1 ? parts.items[1].string : NULL;
+		p2 = parts.nr >= 2 ? parts.items[2].string : NULL;
+		if (p1)
 			hash = icase ? strihash(p1) : strhash(p1);
-			p2 = strtok(NULL, DELIM);
-		}
 
 		if (!strcmp("add", cmd) && p1 && p2) {
 
@@ -260,6 +267,7 @@ int cmd__hashmap(int argc, const char **argv)
 		}
 	}
 
+	string_list_clear(&parts, 0);
 	strbuf_release(&line);
 	hashmap_clear_and_free(&map, struct test_entry, ent);
 	return 0;
-- 
2.40.0.380.gd2df7d2365


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 4/6] t/helper/test-oidmap.c: avoid using `strtok()`
  2023-04-24 22:20 ` [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` " Taylor Blau
                     ` (2 preceding siblings ...)
  2023-04-24 22:20   ` [PATCH v3 3/6] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
@ 2023-04-24 22:20   ` Taylor Blau
  2023-04-24 22:20   ` [PATCH v3 5/6] t/helper/test-json-writer.c: " Taylor Blau
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-24 22:20 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

Apply similar treatment as in the previous commit to remove usage of
`strtok()` from the "oidmap" test helper.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-oidmap.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/t/helper/test-oidmap.c b/t/helper/test-oidmap.c
index a7b7b38df1..5ce9eb3334 100644
--- a/t/helper/test-oidmap.c
+++ b/t/helper/test-oidmap.c
@@ -4,6 +4,7 @@
 #include "oidmap.h"
 #include "setup.h"
 #include "strbuf.h"
+#include "string-list.h"
 
 /* key is an oid and value is a name (could be a refname for example) */
 struct test_entry {
@@ -25,6 +26,7 @@ struct test_entry {
  */
 int cmd__oidmap(int argc UNUSED, const char **argv UNUSED)
 {
+	struct string_list parts = STRING_LIST_INIT_NODUP;
 	struct strbuf line = STRBUF_INIT;
 	struct oidmap map = OIDMAP_INIT;
 
@@ -35,19 +37,24 @@ int cmd__oidmap(int argc UNUSED, const char **argv UNUSED)
 
 	/* process commands from stdin */
 	while (strbuf_getline(&line, stdin) != EOF) {
-		char *cmd, *p1 = NULL, *p2 = NULL;
+		char *cmd, *p1, *p2;
 		struct test_entry *entry;
 		struct object_id oid;
 
 		/* break line into command and up to two parameters */
-		cmd = strtok(line.buf, DELIM);
+		string_list_setlen(&parts, 0);
+		string_list_split_in_place(&parts, line.buf, DELIM, 2);
+		string_list_remove_empty_items(&parts, 0);
+
 		/* ignore empty lines */
-		if (!cmd || *cmd == '#')
+		if (!parts.nr)
+			continue;
+		if (!*parts.items[0].string || *parts.items[0].string == '#')
 			continue;
 
-		p1 = strtok(NULL, DELIM);
-		if (p1)
-			p2 = strtok(NULL, DELIM);
+		cmd = parts.items[0].string;
+		p1 = parts.nr >= 1 ? parts.items[1].string : NULL;
+		p2 = parts.nr >= 2 ? parts.items[2].string : NULL;
 
 		if (!strcmp("put", cmd) && p1 && p2) {
 
@@ -108,6 +115,7 @@ int cmd__oidmap(int argc UNUSED, const char **argv UNUSED)
 		}
 	}
 
+	string_list_clear(&parts, 0);
 	strbuf_release(&line);
 	oidmap_free(&map, 1);
 	return 0;
-- 
2.40.0.380.gd2df7d2365


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 5/6] t/helper/test-json-writer.c: avoid using `strtok()`
  2023-04-24 22:20 ` [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` " Taylor Blau
                     ` (3 preceding siblings ...)
  2023-04-24 22:20   ` [PATCH v3 4/6] t/helper/test-oidmap.c: " Taylor Blau
@ 2023-04-24 22:20   ` Taylor Blau
  2023-04-25 13:57     ` Jeff Hostetler
  2023-04-24 22:20   ` [PATCH v3 6/6] banned.h: mark `strtok()` and `strtok_r()` as banned Taylor Blau
  2023-04-25  6:27   ` [PATCH v3 0/6] banned: mark `strok()`, " Jeff King
  6 siblings, 1 reply; 47+ messages in thread
From: Taylor Blau @ 2023-04-24 22:20 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

Apply similar treatment as in the previous commit to remove usage of
`strtok()` from the "oidmap" test helper.

Each of the different commands that the "json-writer" helper accepts
pops the next space-delimited token from the current line and interprets
it as a string, integer, or double (with the exception of the very first
token, which is the command itself).

To accommodate this, split the line in place by the space character, and
pass the corresponding string_list to each of the specialized `get_s()`,
`get_i()`, and `get_d()` functions.

`get_i()` and `get_d()` are thin wrappers around `get_s()` that convert
their result into the appropriate type by either calling `strtol()` or
`strtod()`, respectively. In `get_s()`, we mark the token as "consumed"
by incrementing the `consumed_nr` counter, indicating how many tokens we
have read up to that point.

Because each of these functions needs the string-list parts, the number
of tokens consumed, and the line number, these three are wrapped up in
to a struct representing the line state.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-json-writer.c | 76 +++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 28 deletions(-)

diff --git a/t/helper/test-json-writer.c b/t/helper/test-json-writer.c
index 86887f5320..afe393f597 100644
--- a/t/helper/test-json-writer.c
+++ b/t/helper/test-json-writer.c
@@ -1,5 +1,6 @@
 #include "test-tool.h"
 #include "json-writer.h"
+#include "string-list.h"
 
 static const char *expect_obj1 = "{\"a\":\"abc\",\"b\":42,\"c\":true}";
 static const char *expect_obj2 = "{\"a\":-1,\"b\":2147483647,\"c\":0}";
@@ -394,35 +395,41 @@ static int unit_tests(void)
 	return 0;
 }
 
-static void get_s(int line_nr, char **s_in)
+struct line {
+	struct string_list *parts;
+	size_t consumed_nr;
+	int nr;
+};
+
+static void get_s(struct line *line, char **s_in)
 {
-	*s_in = strtok(NULL, " ");
-	if (!*s_in)
-		die("line[%d]: expected: <s>", line_nr);
+	if (line->consumed_nr > line->parts->nr)
+		die("line[%d]: expected: <s>", line->nr);
+	*s_in = line->parts->items[line->consumed_nr++].string;
 }
 
-static void get_i(int line_nr, intmax_t *s_in)
+static void get_i(struct line *line, intmax_t *s_in)
 {
 	char *s;
 	char *endptr;
 
-	get_s(line_nr, &s);
+	get_s(line, &s);
 
 	*s_in = strtol(s, &endptr, 10);
 	if (*endptr || errno == ERANGE)
-		die("line[%d]: invalid integer value", line_nr);
+		die("line[%d]: invalid integer value", line->nr);
 }
 
-static void get_d(int line_nr, double *s_in)
+static void get_d(struct line *line, double *s_in)
 {
 	char *s;
 	char *endptr;
 
-	get_s(line_nr, &s);
+	get_s(line, &s);
 
 	*s_in = strtod(s, &endptr);
 	if (*endptr || errno == ERANGE)
-		die("line[%d]: invalid float value", line_nr);
+		die("line[%d]: invalid float value", line->nr);
 }
 
 static int pretty;
@@ -453,6 +460,7 @@ static char *get_trimmed_line(char *buf, int buf_size)
 
 static int scripted(void)
 {
+	struct string_list parts = STRING_LIST_INIT_NODUP;
 	struct json_writer jw = JSON_WRITER_INIT;
 	char buf[MAX_LINE_LENGTH];
 	char *line;
@@ -470,66 +478,77 @@ static int scripted(void)
 		die("expected first line to be 'object' or 'array'");
 
 	while ((line = get_trimmed_line(buf, MAX_LINE_LENGTH)) != NULL) {
+		struct line state = { 0 };
 		char *verb;
 		char *key;
 		char *s_value;
 		intmax_t i_value;
 		double d_value;
 
-		line_nr++;
+		state.parts = &parts;
+		state.nr = ++line_nr;
 
-		verb = strtok(line, " ");
+		/* break line into command and zero or more tokens */
+		string_list_setlen(&parts, 0);
+		string_list_split_in_place(&parts, line, " ", -1);
+		string_list_remove_empty_items(&parts, 0);
+
+		/* ignore empty lines */
+		if (!parts.nr || !*parts.items[0].string)
+			continue;
+
+		verb = parts.items[state.consumed_nr++].string;
 
 		if (!strcmp(verb, "end")) {
 			jw_end(&jw);
 		}
 		else if (!strcmp(verb, "object-string")) {
-			get_s(line_nr, &key);
-			get_s(line_nr, &s_value);
+			get_s(&state, &key);
+			get_s(&state, &s_value);
 			jw_object_string(&jw, key, s_value);
 		}
 		else if (!strcmp(verb, "object-int")) {
-			get_s(line_nr, &key);
-			get_i(line_nr, &i_value);
+			get_s(&state, &key);
+			get_i(&state, &i_value);
 			jw_object_intmax(&jw, key, i_value);
 		}
 		else if (!strcmp(verb, "object-double")) {
-			get_s(line_nr, &key);
-			get_i(line_nr, &i_value);
-			get_d(line_nr, &d_value);
+			get_s(&state, &key);
+			get_i(&state, &i_value);
+			get_d(&state, &d_value);
 			jw_object_double(&jw, key, i_value, d_value);
 		}
 		else if (!strcmp(verb, "object-true")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_true(&jw, key);
 		}
 		else if (!strcmp(verb, "object-false")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_false(&jw, key);
 		}
 		else if (!strcmp(verb, "object-null")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_null(&jw, key);
 		}
 		else if (!strcmp(verb, "object-object")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_inline_begin_object(&jw, key);
 		}
 		else if (!strcmp(verb, "object-array")) {
-			get_s(line_nr, &key);
+			get_s(&state, &key);
 			jw_object_inline_begin_array(&jw, key);
 		}
 		else if (!strcmp(verb, "array-string")) {
-			get_s(line_nr, &s_value);
+			get_s(&state, &s_value);
 			jw_array_string(&jw, s_value);
 		}
 		else if (!strcmp(verb, "array-int")) {
-			get_i(line_nr, &i_value);
+			get_i(&state, &i_value);
 			jw_array_intmax(&jw, i_value);
 		}
 		else if (!strcmp(verb, "array-double")) {
-			get_i(line_nr, &i_value);
-			get_d(line_nr, &d_value);
+			get_i(&state, &i_value);
+			get_d(&state, &d_value);
 			jw_array_double(&jw, i_value, d_value);
 		}
 		else if (!strcmp(verb, "array-true"))
@@ -552,6 +571,7 @@ static int scripted(void)
 	printf("%s\n", jw.json.buf);
 
 	jw_release(&jw);
+	string_list_clear(&parts, 0);
 	return 0;
 }
 
-- 
2.40.0.380.gd2df7d2365


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 6/6] banned.h: mark `strtok()` and `strtok_r()` as banned
  2023-04-24 22:20 ` [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` " Taylor Blau
                     ` (4 preceding siblings ...)
  2023-04-24 22:20   ` [PATCH v3 5/6] t/helper/test-json-writer.c: " Taylor Blau
@ 2023-04-24 22:20   ` Taylor Blau
  2023-04-24 22:25     ` Chris Torek
  2023-04-25  6:26     ` Jeff King
  2023-04-25  6:27   ` [PATCH v3 0/6] banned: mark `strok()`, " Jeff King
  6 siblings, 2 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-24 22:20 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

`strtok()` has a couple of drawbacks that make it undesirable to have
any new instances. In addition to being thread-unsafe, it also
encourages confusing data flows, where `strtok()` may be called from
multiple functions with its first argument as NULL, making it unclear
from the immediate context which string is being tokenized.

Now that we have removed all instances of `strtok()` from the tree,
let's ban `strtok()` to avoid introducing new ones in the future. If new
callers should arise, they are encouraged to use
`string_list_split_in_place()` (and `string_list_remove_empty_items()`,
if applicable).

string_list_split_in_place() is not a perfect drop-in replacement
for `strtok_r()`, particularly if the caller is processing a string with
an arbitrary number of tokens, and wants to process each token one at a
time.

But there are no instances of this in Git's tree which are more
well-suited to `strtok_r()` than the friendlier
`string_list_in_place()`, so ban `strtok_r()`, too.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 banned.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/banned.h b/banned.h
index 6ccf46bc19..44e76bd90a 100644
--- a/banned.h
+++ b/banned.h
@@ -18,6 +18,10 @@
 #define strncpy(x,y,n) BANNED(strncpy)
 #undef strncat
 #define strncat(x,y,n) BANNED(strncat)
+#undef strtok
+#define strtok(x,y) BANNED(strtok)
+#undef strtok_r
+#define strtok_r(x,y,z) BANNED(strtok_r)
 
 #undef sprintf
 #undef vsprintf
-- 
2.40.0.380.gd2df7d2365

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 6/6] banned.h: mark `strtok()` and `strtok_r()` as banned
  2023-04-24 22:20   ` [PATCH v3 6/6] banned.h: mark `strtok()` and `strtok_r()` as banned Taylor Blau
@ 2023-04-24 22:25     ` Chris Torek
  2023-04-24 23:00       ` Taylor Blau
  2023-04-25  6:26     ` Jeff King
  1 sibling, 1 reply; 47+ messages in thread
From: Chris Torek @ 2023-04-24 22:25 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Jeff King, Junio C Hamano, Jeff Hostetler, René Scharfe

I eyeballed the whole thing (not the same as real tests of course)
and it looks good, but there's one typo (missing-word-o) here of note:

On Mon, Apr 24, 2023 at 3:20 PM Taylor Blau <me@ttaylorr.com> wrote:
...
> But there are no instances of this in Git's tree which are more
> well-suited to `strtok_r()` than the friendlier
> `string_list_in_place()`, so ban `strtok_r()`, too.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>

Missing `split_`. Probably not worth a re-roll...

Chris

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 6/6] banned.h: mark `strtok()` and `strtok_r()` as banned
  2023-04-24 22:25     ` Chris Torek
@ 2023-04-24 23:00       ` Taylor Blau
  0 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-24 23:00 UTC (permalink / raw)
  To: Chris Torek
  Cc: git, Jeff King, Junio C Hamano, Jeff Hostetler, René Scharfe

On Mon, Apr 24, 2023 at 03:25:55PM -0700, Chris Torek wrote:
> I eyeballed the whole thing (not the same as real tests of course)
> and it looks good, but there's one typo (missing-word-o) here of note:
>
> On Mon, Apr 24, 2023 at 3:20 PM Taylor Blau <me@ttaylorr.com> wrote:
> ...
> > But there are no instances of this in Git's tree which are more
> > well-suited to `strtok_r()` than the friendlier
> > `string_list_in_place()`, so ban `strtok_r()`, too.
> >
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
>
> Missing `split_`. Probably not worth a re-roll...

Oops, thanks for noticing.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 2/6] string-list: introduce `string_list_setlen()`
  2023-04-24 22:20   ` [PATCH v3 2/6] string-list: introduce `string_list_setlen()` Taylor Blau
@ 2023-04-25  6:21     ` Jeff King
  2023-04-25 21:00       ` Taylor Blau
  0 siblings, 1 reply; 47+ messages in thread
From: Jeff King @ 2023-04-25  6:21 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

On Mon, Apr 24, 2023 at 06:20:14PM -0400, Taylor Blau wrote:

> However, setting `list->nr` manually is not safe in all instances. There
> are a couple of cases worth worrying about:
> 
>   - If the `string_list` is initialized with `strdup_strings`,
>     truncating the list can lead to overwriting strings which are
>     allocated elsewhere. If there aren't any other pointers to those
>     strings other than the ones inside of the `items` array, they will
>     become unreachable and leak.
> 
>     (We could ourselves free the truncated items between
>     string_list->items[nr] and `list->nr`, but no present or future
>     callers would benefit from this additional complexity).

I wondered how bad it would be to just free those truncated entries when
strdup_strings is set. But that led me to another interesting point: the
util fields. The regular string_list_clear() will optionally free the
util entries, too. We'd potentially need to deal with those, too.

We don't do anything with them here. So code like:

  struct string_list foo = STRING_LIST_INIT_NODUP;

  string_list_append(&foo, "bar")->util = xstrdup("something else");
  string_list_setlen(&foo, 0);

would leak that util field. To be clear, to me this definitely falls
under "if it hurts, don't do it", and I think code like above is pretty
unlikely. But since the point of our function is to prevent mistakes, I
thought it was worth mentioning.

I think we _could_ do something like:

  for (i = nr; i < list->nr; i++) {
	if (list->items[i].util)
		BUG("truncated string list item has non-NULL util field");
  }

though that is technically tighter than we need to be (it could be an
unowned util field, after all; we don't know what it means here). So I'm
inclined to leave your patch as-is.

This would all be easier if the string_list had a field for "we own the
util fields, too" just like it has strdup_strings. Or even a free-ing
function. But instead we have ad-hoc solutions like "free_util" and
string_list_clear_func(). But that's really outside the scope of your
series. </rant> :)

-Peff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 6/6] banned.h: mark `strtok()` and `strtok_r()` as banned
  2023-04-24 22:20   ` [PATCH v3 6/6] banned.h: mark `strtok()` and `strtok_r()` as banned Taylor Blau
  2023-04-24 22:25     ` Chris Torek
@ 2023-04-25  6:26     ` Jeff King
  2023-04-25 21:02       ` Taylor Blau
  1 sibling, 1 reply; 47+ messages in thread
From: Jeff King @ 2023-04-25  6:26 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

On Mon, Apr 24, 2023 at 06:20:26PM -0400, Taylor Blau wrote:

> string_list_split_in_place() is not a perfect drop-in replacement
> for `strtok_r()`, particularly if the caller is processing a string with
> an arbitrary number of tokens, and wants to process each token one at a
> time.
> 
> But there are no instances of this in Git's tree which are more
> well-suited to `strtok_r()` than the friendlier
> `string_list_in_place()`, so ban `strtok_r()`, too.

For true incremental left-to-right parsing, strcspn() is probably a
better solution. We could mention that here in case anybody digs up the
commit after getting a "banned function" error.

I'm tempted to say that this thread could serve the same function, but
I'm not sure where people turn to for answers (I find searching the list
about as easy as "git log -S", but then I've invested a lot of effort in
my list archive tooling :) ).

I'm happy with it either the way, though.

-Peff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` as banned
  2023-04-24 22:20 ` [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` " Taylor Blau
                     ` (5 preceding siblings ...)
  2023-04-24 22:20   ` [PATCH v3 6/6] banned.h: mark `strtok()` and `strtok_r()` as banned Taylor Blau
@ 2023-04-25  6:27   ` Jeff King
  2023-04-25 21:03     ` Taylor Blau
  6 siblings, 1 reply; 47+ messages in thread
From: Jeff King @ 2023-04-25  6:27 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

On Mon, Apr 24, 2023 at 06:20:07PM -0400, Taylor Blau wrote:

> Here is another medium-sized reroll of my series to add `strtok()` (and
> `strtok_r()`!) to the list of banned functions.
> 
> Notable changes include:
> 
>   - Dropped `string_list_split_in_place_multi()` in favor of a
>     combination of `string_list_split_in_place()` and
>     `string_list_remove_empty_items()`.
> 
>   - `strtok_r()` is back on the banned list, with a more realistic sales
>     pitch.

This all looks good to me. I left two comments which are on the border
between "minor" and "philosophizing", so I'd be happy to see the series
go in as-is.

-Peff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 5/6] t/helper/test-json-writer.c: avoid using `strtok()`
  2023-04-24 22:20   ` [PATCH v3 5/6] t/helper/test-json-writer.c: " Taylor Blau
@ 2023-04-25 13:57     ` Jeff Hostetler
  0 siblings, 0 replies; 47+ messages in thread
From: Jeff Hostetler @ 2023-04-25 13:57 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff Hostetler



On 4/24/23 6:20 PM, Taylor Blau wrote:
> Apply similar treatment as in the previous commit to remove usage of
> `strtok()` from the "oidmap" test helper.
> 
> Each of the different commands that the "json-writer" helper accepts
> pops the next space-delimited token from the current line and interprets
> it as a string, integer, or double (with the exception of the very first
> token, which is the command itself).
> 
> To accommodate this, split the line in place by the space character, and
> pass the corresponding string_list to each of the specialized `get_s()`,
> `get_i()`, and `get_d()` functions.
> 
> `get_i()` and `get_d()` are thin wrappers around `get_s()` that convert
> their result into the appropriate type by either calling `strtol()` or
> `strtod()`, respectively. In `get_s()`, we mark the token as "consumed"
> by incrementing the `consumed_nr` counter, indicating how many tokens we
> have read up to that point.
> 
> Because each of these functions needs the string-list parts, the number
> of tokens consumed, and the line number, these three are wrapped up in
> to a struct representing the line state.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>

LGTM

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 2/6] string-list: introduce `string_list_setlen()`
  2023-04-25  6:21     ` Jeff King
@ 2023-04-25 21:00       ` Taylor Blau
  0 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-25 21:00 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

On Tue, Apr 25, 2023 at 02:21:07AM -0400, Jeff King wrote:
> I think we _could_ do something like:
>
>   for (i = nr; i < list->nr; i++) {
> 	if (list->items[i].util)
> 		BUG("truncated string list item has non-NULL util field");
>   }
>
> though that is technically tighter than we need to be (it could be an
> unowned util field, after all; we don't know what it means here). So I'm
> inclined to leave your patch as-is.

I think there are two ways to do it, either:

  - something like what you wrote above, perhaps with an additional
    `free_util` bit on the string_list itself (which might make it
    convenient to drop all of the `free_util` parameters that permeate
    its API)

  - have string_list_setlen() take a `free_util` argument itself, in
    which case your code would change to:

    if (free_util) {
      for (i = nr; i < list->nr; i++)
        free(list->items[i].util)
    }

> This would all be easier if the string_list had a field for "we own the
> util fields, too" just like it has strdup_strings. Or even a free-ing
> function. But instead we have ad-hoc solutions like "free_util" and
> string_list_clear_func(). But that's really outside the scope of your
> series. </rant> :)

;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 6/6] banned.h: mark `strtok()` and `strtok_r()` as banned
  2023-04-25  6:26     ` Jeff King
@ 2023-04-25 21:02       ` Taylor Blau
  0 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-25 21:02 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

On Tue, Apr 25, 2023 at 02:26:17AM -0400, Jeff King wrote:
> On Mon, Apr 24, 2023 at 06:20:26PM -0400, Taylor Blau wrote:
>
> > string_list_split_in_place() is not a perfect drop-in replacement
> > for `strtok_r()`, particularly if the caller is processing a string with
> > an arbitrary number of tokens, and wants to process each token one at a
> > time.
> >
> > But there are no instances of this in Git's tree which are more
> > well-suited to `strtok_r()` than the friendlier
> > `string_list_in_place()`, so ban `strtok_r()`, too.
>
> For true incremental left-to-right parsing, strcspn() is probably a
> better solution. We could mention that here in case anybody digs up the
> commit after getting a "banned function" error.
>
> I'm tempted to say that this thread could serve the same function, but
> I'm not sure where people turn to for answers (I find searching the list
> about as easy as "git log -S", but then I've invested a lot of effort in
> my list archive tooling :) ).

Personally, between what's already in the patch message and this
discussion on the list, I think that folks would have enough guidance on
how to do things right.

But if others feel like we could or should be more rigid here and update
the commit message to say something like "if you're scanning from
left-to-right, you could use strtok_r(), or strcspn() instead", but TBH
I think there are a gazillion different ways to do this task, so I don't
know that adding another item to that list substantially changes things.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` as banned
  2023-04-25  6:27   ` [PATCH v3 0/6] banned: mark `strok()`, " Jeff King
@ 2023-04-25 21:03     ` Taylor Blau
  0 siblings, 0 replies; 47+ messages in thread
From: Taylor Blau @ 2023-04-25 21:03 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Chris Torek, Junio C Hamano, Jeff Hostetler,
	René Scharfe

On Tue, Apr 25, 2023 at 02:27:08AM -0400, Jeff King wrote:
> On Mon, Apr 24, 2023 at 06:20:07PM -0400, Taylor Blau wrote:
>
> > Here is another medium-sized reroll of my series to add `strtok()` (and
> > `strtok_r()`!) to the list of banned functions.
> >
> > Notable changes include:
> >
> >   - Dropped `string_list_split_in_place_multi()` in favor of a
> >     combination of `string_list_split_in_place()` and
> >     `string_list_remove_empty_items()`.
> >
> >   - `strtok_r()` is back on the banned list, with a more realistic sales
> >     pitch.
>
> This all looks good to me. I left two comments which are on the border
> between "minor" and "philosophizing", so I'd be happy to see the series
> go in as-is.

Thanks, I agree that this version is ready to go, absent any other
show-stopping reviews.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2023-04-25 21:03 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-13 23:31 [PATCH 0/5] banned: mark `strok()`, `strtok_r()` as banned Taylor Blau
2023-04-13 23:31 ` [PATCH 1/5] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
2023-04-18 10:10   ` Jeff King
2023-04-18 17:08     ` Taylor Blau
2023-04-13 23:31 ` [PATCH 2/5] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
2023-04-18 10:23   ` Jeff King
2023-04-18 18:06     ` Taylor Blau
2023-04-13 23:31 ` [PATCH 3/5] t/helper/test-oidmap.c: " Taylor Blau
2023-04-13 23:31 ` [PATCH 4/5] t/helper/test-json-writer.c: " Taylor Blau
2023-04-13 23:31 ` [PATCH 5/5] banned.h: mark `strtok()`, `strtok_r()` as banned Taylor Blau
2023-04-14  1:39   ` Junio C Hamano
2023-04-14  2:08     ` Chris Torek
2023-04-14 13:41     ` Taylor Blau
2023-04-18 19:18 ` [PATCH v2 0/6] banned: mark `strok()` " Taylor Blau
2023-04-18 19:18   ` [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()` Taylor Blau
2023-04-18 19:39     ` Junio C Hamano
2023-04-18 20:54       ` Taylor Blau
2023-04-22 11:12     ` Jeff King
2023-04-22 15:53       ` René Scharfe
2023-04-23  0:35         ` Jeff King
2023-04-24 16:24           ` Junio C Hamano
2023-04-23  2:38       ` [PATCH v2 1/6] string-list: introduce `string_list_split_in_place_multi()`t Taylor Blau
2023-04-23  2:40         ` Taylor Blau
2023-04-18 19:18   ` [PATCH v2 2/6] string-list: introduce `string_list_setlen()` Taylor Blau
2023-04-22 11:14     ` Jeff King
2023-04-18 19:18   ` [PATCH v2 3/6] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
2023-04-22 11:16     ` Jeff King
2023-04-24 21:19       ` Taylor Blau
2023-04-18 19:18   ` [PATCH v2 4/6] t/helper/test-oidmap.c: " Taylor Blau
2023-04-18 19:18   ` [PATCH v2 5/6] t/helper/test-json-writer.c: " Taylor Blau
2023-04-18 19:18   ` [PATCH v2 6/6] banned.h: mark `strtok()` as banned Taylor Blau
2023-04-24 22:20 ` [PATCH v3 0/6] banned: mark `strok()`, `strtok_r()` " Taylor Blau
2023-04-24 22:20   ` [PATCH v3 1/6] string-list: multi-delimiter `string_list_split_in_place()` Taylor Blau
2023-04-24 22:20   ` [PATCH v3 2/6] string-list: introduce `string_list_setlen()` Taylor Blau
2023-04-25  6:21     ` Jeff King
2023-04-25 21:00       ` Taylor Blau
2023-04-24 22:20   ` [PATCH v3 3/6] t/helper/test-hashmap.c: avoid using `strtok()` Taylor Blau
2023-04-24 22:20   ` [PATCH v3 4/6] t/helper/test-oidmap.c: " Taylor Blau
2023-04-24 22:20   ` [PATCH v3 5/6] t/helper/test-json-writer.c: " Taylor Blau
2023-04-25 13:57     ` Jeff Hostetler
2023-04-24 22:20   ` [PATCH v3 6/6] banned.h: mark `strtok()` and `strtok_r()` as banned Taylor Blau
2023-04-24 22:25     ` Chris Torek
2023-04-24 23:00       ` Taylor Blau
2023-04-25  6:26     ` Jeff King
2023-04-25 21:02       ` Taylor Blau
2023-04-25  6:27   ` [PATCH v3 0/6] banned: mark `strok()`, " Jeff King
2023-04-25 21:03     ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).