Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* similarity index vs. whitespaces
@ 2023-04-13  1:56 Mike Hommey
  2023-04-13  2:35 ` Chris Torek
  0 siblings, 1 reply; 2+ messages in thread
From: Mike Hommey @ 2023-04-13  1:56 UTC (permalink / raw
  To: git

Hi,

I hit an interesting case of copy-detection "failure" after renaming a
python script and changing indentation in it at the same time: git show
-C wouldn't detect it as a rename unless I lowered the similarity index
significantly. But from a human perspective, the similarity index feels
wrong.

Here's a small illustrative example:
$ git init qux
$ cd qux
$ cat > qux <<EOF
a
b
c
d
e
f
EOF
$ git add qux
$ git commit -a -m qux
$ git mv qux hoge
$ sed -i 's/[cde]/  \0/' hoge
$ git commit -a -m hoge
$ git diff -C HEAD^!
diff --git a/hoge b/hoge
new file mode 100644
index 0000000..9ab6fcc
--- /dev/null
+++ b/hoge
@@ -0,0 +1,6 @@
+a
+b
+  c
+  d
+  e
+f
diff --git a/qux b/qux
deleted file mode 100644
index 0fdf397..0000000
--- a/qux
+++ /dev/null
@@ -1,6 +0,0 @@
-a
-b
-c
-d
-e
-f

$ git diff -C10% HEAD^!
diff --git a/qux b/hoge
similarity index 33%
rename from qux
rename to hoge
index 0fdf397..9ab6fcc 100644
--- a/qux
+++ b/hoge
@@ -1,6 +1,6 @@
 a
 b
-c
-d
-e
+  c
+  d
+  e
 f

From a human perspective 33% similarity feels way too low here. I know
it's essentially counting lines in the diff, but that feels limited.

What do you think?

Mike

^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: similarity index vs. whitespaces
  2023-04-13  1:56 similarity index vs. whitespaces Mike Hommey
@ 2023-04-13  2:35 ` Chris Torek
  0 siblings, 0 replies; 2+ messages in thread
From: Chris Torek @ 2023-04-13  2:35 UTC (permalink / raw
  To: Mike Hommey; +Cc: git

On Wed, Apr 12, 2023 at 7:01 PM Mike Hommey <mh@glandium.org> wrote:
[example of Python script diff snipped]
> From a human perspective 33% similarity feels way too low here. I know
> it's essentially counting lines in the diff, but that feels limited.

Technically, the similarity index isn't based on lines.  Instead
it's based on byte-by-byte matching, broken into segments (these
do use LF or CRLF to break up segments as appropriate but very
*long* lines get broken up without such line terminators).  This
shares some code with the delta compression algorithm used to
pack objects.  One of the goals here is to consider a lot of
*moved* lines to be "very similar", despite such moves generally
producing large-ish diffs.

CRs are, if I recall correctly, discarded during similarty index
computation.  It would be somewhat easy to discard leading white
space -- and even easier to discard *all* white space, but I'd
suggest that would be wrong -- due to the special case already
in place for line terminators here.

It's not clear to me that discarding leading white space would
be correct in *all* cases, but it does seem very appropriate for
Python code.  Whether the earlier discussion about diff algorithms
being adjusted based on .gitattributes entries might apply here,
I'll leave for others to argue about. :-)  (That is, I'm just
tossing it out as an idea for the mix.)

Chris

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2023-04-13  2:36 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-13  1:56 similarity index vs. whitespaces Mike Hommey
2023-04-13  2:35 ` Chris Torek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).