* [bug] git diff --word-diff gives wrong result for utf-8 chinese
@ 2022-11-29 3:46 Ping Yin
2022-11-29 3:49 ` Ping Yin
2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
0 siblings, 2 replies; 12+ messages in thread
From: Ping Yin @ 2022-11-29 3:46 UTC (permalink / raw)
To: mailinggit list
[-- Attachment #1: Type: text/plain, Size: 359 bytes --]
Result of "git diff"
- 为1
+ 为2
or (if chinese can not be displayed correctly)
- <E4><B8><BA>1
+ <E4><B8><BA>2
Actual result of "git diff --color-words"
<E4><B8>[-<BA>1-]{+<BA>2+}
Expected result of "git diff --color-words"
为[-1-]{+2+}
or (if chinese can not be displayed correctly)
<E4><B8><BA>[-1-]{+2+}
Ping Yin
[-- Attachment #2: image.png --]
[-- Type: image/png, Size: 5682 bytes --]
[-- Attachment #3: image.png --]
[-- Type: image/png, Size: 9992 bytes --]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-11-29 3:46 [bug] git diff --word-diff gives wrong result for utf-8 chinese Ping Yin
@ 2022-11-29 3:49 ` Ping Yin
2022-11-29 8:18 ` Bagas Sanjaya
2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
1 sibling, 1 reply; 12+ messages in thread
From: Ping Yin @ 2022-11-29 3:49 UTC (permalink / raw)
To: mailinggit list
sorry, typo, s/--color-words/--word-diff/g
Ping Yin
On Tue, Nov 29, 2022 at 11:46 AM Ping Yin <pkufranky@gmail.com> wrote:
>
> Result of "git diff"
>
> - 为1
> + 为2
>
> or (if chinese can not be displayed correctly)
>
> - <E4><B8><BA>1
> + <E4><B8><BA>2
>
> Actual result of "git diff --color-words"
>
> <E4><B8>[-<BA>1-]{+<BA>2+}
>
> Expected result of "git diff --color-words"
>
> 为[-1-]{+2+}
>
> or (if chinese can not be displayed correctly)
>
> <E4><B8><BA>[-1-]{+2+}
>
>
> Ping Yin
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-11-29 3:49 ` Ping Yin
@ 2022-11-29 8:18 ` Bagas Sanjaya
0 siblings, 0 replies; 12+ messages in thread
From: Bagas Sanjaya @ 2022-11-29 8:18 UTC (permalink / raw)
To: Ping Yin, mailinggit list
On 11/29/22 10:49, Ping Yin wrote:
> sorry, typo, s/--color-words/--word-diff/g
>
Hi, welcome to Git mailing list!
Please remind yourself:
* Do not send HTML mails, send plain-text ones instead. Many mailing
lists (including vger.kernel.org that powers Git ML) reject HTML
emails for these are likely spam. Make sure your email isn't mangled
(tabs and spaces as-is, no line wrapping).
* Do not top-post, reply inline with appropriate context instead. I
have to cut the reply context as a result.
* When you submit a patch and people reply with their reviews, engage
with them (either sending revised patch addressing the reviews or
reply with justification). They will ignore you if you ignore them.
Thanks.
--
An old man doll... just what I always wanted! - Clara
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-11-29 3:46 [bug] git diff --word-diff gives wrong result for utf-8 chinese Ping Yin
2022-11-29 3:49 ` Ping Yin
@ 2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
2022-11-29 11:32 ` Junio C Hamano
1 sibling, 1 reply; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-11-29 10:52 UTC (permalink / raw)
To: Ping Yin; +Cc: mailinggit list
On Tue, Nov 29 2022, Ping Yin wrote:
> Result of "git diff"
>
> - 为1
> + 为2
>
> or (if chinese can not be displayed correctly)
>
> - <E4><B8><BA>1
> + <E4><B8><BA>2
>
> Actual result of "git diff --color-words"
>
> <E4><B8>[-<BA>1-]{+<BA>2+}
>
> Expected result of "git diff --color-words"
>
> 为[-1-]{+2+}
>
> or (if chinese can not be displayed correctly)
I think we could provide new ways to do per-language diffs, right now
you can use --word-diff-regex, but it would be handy to e.g. have a
built-in collection of those (or other non-regex boundary algorithms)
for Chinese etc.
But as for considering this a bug, or changing the existing behavior I
think we'd need to deal with:
* We (approximately) split on space now, which is certainly
ASCII-biased, and outside of CJK fairly somewhat universal.
* If we're going to split on "real words" in some cross-language aware
way, are we going to run into conflicts between what different
languages would consider sensible rules?
* We probably don't want to make the "diff" dependent on the user's
locale, but e.g. saying "I want a Chinese diff" via a CLI option
would be OK.
* Even for say Chinese, there's probably interesting edge cases when
it's combined with other languages or character sets (e.g. Chinese +
HTML).
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
@ 2022-11-29 11:32 ` Junio C Hamano
2022-11-29 18:23 ` Jeff King
0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2022-11-29 11:32 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason; +Cc: Ping Yin, mailinggit list
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>> or (if chinese can not be displayed correctly)
>>
>> - <E4><B8><BA>1
>> + <E4><B8><BA>2
>>
>> Actual result of "git diff --color-words"
>>
>> <E4><B8>[-<BA>1-]{+<BA>2+}
>> ...
> I think we could provide new ways to do per-language diffs, right now
> you can use --word-diff-regex, but it would be handy to e.g. have a
> built-in collection of those (or other non-regex boundary algorithms)
> for Chinese etc.
I think you are thinking it with unnecessaarily complexity.
The only thing that needs noticing in the above example, I think is,
that the three-byte sequence E4-B8-BA in the example is supposed to
be a single unicode character, and the actual result depicted can
happen only if we (incorrectly) chomp that single character in the
middle.
No matter what language we are using, we shouldn't do that.
I suspect that "--word-diff" internal is not even aware what a
character is, but if you assume UTF-8 (precomposed), then you should
be able to tell where the character boundary is by only looking at
the high-bit patterns to avoid producing such an output.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-11-29 11:32 ` Junio C Hamano
@ 2022-11-29 18:23 ` Jeff King
2022-11-29 18:54 ` Jeff King
0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2022-11-29 18:23 UTC (permalink / raw)
To: Junio C Hamano
Cc: Ævar Arnfjörð Bjarmason, Ping Yin, mailinggit list
On Tue, Nov 29, 2022 at 08:32:58PM +0900, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
> >> or (if chinese can not be displayed correctly)
> >>
> >> - <E4><B8><BA>1
> >> + <E4><B8><BA>2
> >>
> >> Actual result of "git diff --color-words"
> >>
> >> <E4><B8>[-<BA>1-]{+<BA>2+}
> >> ...
> > I think we could provide new ways to do per-language diffs, right now
> > you can use --word-diff-regex, but it would be handy to e.g. have a
> > built-in collection of those (or other non-regex boundary algorithms)
> > for Chinese etc.
>
> I think you are thinking it with unnecessaarily complexity.
>
> The only thing that needs noticing in the above example, I think is,
> that the three-byte sequence E4-B8-BA in the example is supposed to
> be a single unicode character, and the actual result depicted can
> happen only if we (incorrectly) chomp that single character in the
> middle.
>
> No matter what language we are using, we shouldn't do that.
>
> I suspect that "--word-diff" internal is not even aware what a
> character is, but if you assume UTF-8 (precomposed), then you should
> be able to tell where the character boundary is by only looking at
> the high-bit patterns to avoid producing such an output.
Agreed that we should probably avoid breaking characters. But what
puzzles me more is that we break it between B8 and BA, and not
elsewhere. Why not between E4 and B8? Why not between BA and "1"?
If the rule is "break on ascii whitespace", then I'd have expected the
whole four-character sequence to be taken as a unit. In other words, it
does should not have to care that a character is, as long as the bytes
for space characters cannot appear inside other characters (which is
true of utf8).
-Peff
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-11-29 18:23 ` Jeff King
@ 2022-11-29 18:54 ` Jeff King
2022-12-01 7:08 ` Ping Yin
2022-12-01 7:33 ` Ping Yin
0 siblings, 2 replies; 12+ messages in thread
From: Jeff King @ 2022-11-29 18:54 UTC (permalink / raw)
To: Junio C Hamano
Cc: Ævar Arnfjörð Bjarmason, Ping Yin, mailinggit list
On Tue, Nov 29, 2022 at 01:23:27PM -0500, Jeff King wrote:
> > I suspect that "--word-diff" internal is not even aware what a
> > character is, but if you assume UTF-8 (precomposed), then you should
> > be able to tell where the character boundary is by only looking at
> > the high-bit patterns to avoid producing such an output.
>
> Agreed that we should probably avoid breaking characters. But what
> puzzles me more is that we break it between B8 and BA, and not
> elsewhere. Why not between E4 and B8? Why not between BA and "1"?
>
> If the rule is "break on ascii whitespace", then I'd have expected the
> whole four-character sequence to be taken as a unit. In other words, it
> does should not have to care that a character is, as long as the bytes
> for space characters cannot appear inside other characters (which is
> true of utf8).
Even more puzzling is that it produces the expected output for me:
[note that \x is a bash-ism]
$ printf '\xe4\xb8\xba1' >one
$ printf '\xe4\xb8\xba2' >two
$ git diff --no-index --word-diff one two
diff --git a/one b/two
index 9ae469fc41..576e6e32d8 100644
--- a/one
+++ b/two
@@ -1 +1 @@
[-为1-]{+为2+}
I wonder if OP has diff.wordRegex config (or attributes triggering a
diff.*.wordRegex) that is doing something else.
-Peff
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-11-29 18:54 ` Jeff King
@ 2022-12-01 7:08 ` Ping Yin
2022-12-01 7:33 ` Ping Yin
1 sibling, 0 replies; 12+ messages in thread
From: Ping Yin @ 2022-12-01 7:08 UTC (permalink / raw)
To: Jeff King
Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
mailinggit list
> I wonder if OP has diff.wordRegex config (or attributes triggering a
> diff.*.wordRegex) that is doing something else.
Wow, you are right, sorry for the noise.
$ git config -l | grep word
diff.wordregex=[[:alnum:]_]+|[^[:space:]]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-11-29 18:54 ` Jeff King
2022-12-01 7:08 ` Ping Yin
@ 2022-12-01 7:33 ` Ping Yin
2022-12-01 14:51 ` Phillip Wood
1 sibling, 1 reply; 12+ messages in thread
From: Ping Yin @ 2022-12-01 7:33 UTC (permalink / raw)
To: Jeff King
Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
mailinggit list
> > If the rule is "break on ascii whitespace",
Is there a way to achieve this: break english by word, and break
chinese by utf-8 character
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-12-01 7:33 ` Ping Yin
@ 2022-12-01 14:51 ` Phillip Wood
2022-12-01 15:51 ` Ping Yin
2022-12-01 20:06 ` Jeff King
0 siblings, 2 replies; 12+ messages in thread
From: Phillip Wood @ 2022-12-01 14:51 UTC (permalink / raw)
To: Ping Yin, Jeff King
Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
mailinggit list
Hi Ping
On 01/12/2022 07:33, Ping Yin wrote:
>>> If the rule is "break on ascii whitespace",
>
> Is there a way to achieve this: break english by word, and break
> chinese by utf-8 character
You could extend your current regex so that it matches whole utf-8
codepoints which is what git does for the builtin userdiff regexes. I've
not tested it but I think
git config --global diff.wordregex "[[:alnum:]_]+|[^[:space:]]|$(printf
'[\xc0-\xff][\x80-\xbf]+')"
should work. The downside is that you end up with a .gitconfig that is
not valid utf-8. Perhaps someone else has a clever idea to get around that.
Best Wishes
Phillip
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-12-01 14:51 ` Phillip Wood
@ 2022-12-01 15:51 ` Ping Yin
2022-12-01 20:06 ` Jeff King
1 sibling, 0 replies; 12+ messages in thread
From: Ping Yin @ 2022-12-01 15:51 UTC (permalink / raw)
To: phillip.wood
Cc: Jeff King, Junio C Hamano, Ævar Arnfjörð Bjarmason,
mailinggit list
Ping Yin
On Thu, Dec 1, 2022 at 10:51 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Ping
>
> On 01/12/2022 07:33, Ping Yin wrote:
>
> git config --global diff.wordregex "[[:alnum:]_]+|[^[:space:]]|$(printf
> '[\xc0-\xff][\x80-\xbf]+')"
>
> should work. The downside is that you end up with a .gitconfig that is
> not valid utf-8. Perhaps someone else has a clever idea to get around that.
Wow, it works. Thanks very much.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
2022-12-01 14:51 ` Phillip Wood
2022-12-01 15:51 ` Ping Yin
@ 2022-12-01 20:06 ` Jeff King
1 sibling, 0 replies; 12+ messages in thread
From: Jeff King @ 2022-12-01 20:06 UTC (permalink / raw)
To: phillip.wood
Cc: Ping Yin, Junio C Hamano, Ævar Arnfjörð Bjarmason,
mailinggit list
On Thu, Dec 01, 2022 at 02:51:29PM +0000, Phillip Wood wrote:
> On 01/12/2022 07:33, Ping Yin wrote:
> > > > If the rule is "break on ascii whitespace",
> >
> > Is there a way to achieve this: break english by word, and break
> > chinese by utf-8 character
>
> You could extend your current regex so that it matches whole utf-8
> codepoints which is what git does for the builtin userdiff regexes. I've not
> tested it but I think
>
> git config --global diff.wordregex "[[:alnum:]_]+|[^[:space:]]|$(printf
> '[\xc0-\xff][\x80-\xbf]+')"
>
> should work. The downside is that you end up with a .gitconfig that is not
> valid utf-8. Perhaps someone else has a clever idea to get around that.
I think in more advanced regular expression engines you can do stuff
like matching "[\x{4e00}-\x{9fcc}]", or even "\p{Han}". But I don't know
that the stock libc regex is capable of anything like this, even with
EREs. That's the only option Git provides for matching word regexes, but
in theory we could support libpcre. We already can optionally build
against it; we would just need config/plumbing to get it into
diff.c:find_word_boundaries().
-Peff
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2022-12-01 20:07 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-11-29 3:46 [bug] git diff --word-diff gives wrong result for utf-8 chinese Ping Yin
2022-11-29 3:49 ` Ping Yin
2022-11-29 8:18 ` Bagas Sanjaya
2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
2022-11-29 11:32 ` Junio C Hamano
2022-11-29 18:23 ` Jeff King
2022-11-29 18:54 ` Jeff King
2022-12-01 7:08 ` Ping Yin
2022-12-01 7:33 ` Ping Yin
2022-12-01 14:51 ` Phillip Wood
2022-12-01 15:51 ` Ping Yin
2022-12-01 20:06 ` Jeff King
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).