Re: icase pathspec magic support in ls-tree

Git Mailing List Archive mirror
 help / color / mirror / Atom feed

From: Tao Klerks <tao@klerks.biz>
To: Elijah Newren <newren@gmail.com>
Cc: "brian m. carlson" <sandals@crustytoothpaste.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	git <git@vger.kernel.org>
Subject: Re: icase pathspec magic support in ls-tree
Date: Mon, 17 Oct 2022 17:46:44 +0200	[thread overview]
Message-ID: <CAPMMpognADkh7ZJ+qYwE2GOoQ+knAthXJC-K6QAjN8qLwuS=7Q@mail.gmail.com> (raw)
In-Reply-To: <CAPMMpoictABXUhVCASZOiYZ4nydGtqiDYRpAELEBvbPn_5jRWA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 7541 bytes --]

Thanks again for the code samples above!

I've spent some time productizing this for my environment, and I think
it's pretty close to optimal for my environment at least.

In case this can be of value to anyone else, I've attached the final thing.

This version is genericized a little - two specific specializations
for my environment not included here are:
* a specific strategy for determining the most-likely-correct upstream
branch to use as a default for "diff from" for new branches, to avoid
having to do a full-tree dupe search; in most environments choosing
"master" or "main" would likely be the right thing most of the time
* a specific list of branch paths that we should validate (vs others
that we should not)

Best regards,
Tao

On Sun, Oct 16, 2022 at 12:06 AM Tao Klerks <tao@klerks.biz> wrote:
>
> This seems to be working, thank you!!!
>
> Two updates I had to make, in case this is useful to anyone else:
>
> 1: I'm getting some weird behavior I can't explain yet, where some
> paths are returned from the ls-tree call twice: The input to ls-tree
> is all unique paths, but the output somehow includes a relatively
> small subset of paths twice.
> This mysterious issue is easily addressed by adding an extra "uniq"
> call to remove the "trivial dupes" before hunting for the
> "case-insensitive dupes" we're interested in:
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> -t HEAD | sort | uniq | uniq -i -d
>
> 2: The xargs call has issues with paths with spaces in them. Adding
> -d"\n" seems to be a clean way to fix this
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> all-leading-dirs.py | xargs -d"\n" --no-run-if-empty git ls-tree --name-only
> -t HEAD | sort | uniq | uniq -i -d
>
>
> Not only does this approach seem to work well, but it also has far
> better performance characteristics than I was expecting!
>
> Simple small commit (10 files): 20ms
> Reasonably large commit (10,000 files): 250ms
> Diff from empty on a smaller branch (100,000 files): 2,800ms
> Diff from empty on a larger branch (200,000 files): 5,400ms
>
>
> It still makes sense to check the number of files/lines after doing
> the diff, and do a "simple" 800ms full-tree (no-pathspec) dupe check
> instead of this when your diff size goes past some file count
> threshold, but it looks like that threshold would be quite high in my
> environment - 30k files maybe?
>
> I will have a go at writing a full update hook, and (without knowing
> whether it will make sense from a performance perspective) I'd like to
> try to implement the "all-leading-dirs" logic in bash 4 using
> associative arrays, to remove the python dependency. If I make it work
> I'll post back here.
>
> This seems to cover what I needed icase pathspec magic for in ls-tree,
> without having to implement it - so thanks again!
>
> Tao
>
> On Fri, Oct 14, 2022 at 7:06 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@klerks.biz> wrote:
> > >
> > > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote:
> > > >
> > [...]
> > > > I don't see why you need to do full-tree with existing options, nor
> > > > why the ls-tree option you want would somehow make it easier to avoid.
> > > > I think you can avoid the full-tree search with something like:
> > > >
> > > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> > > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
> > > > \
> > > >    sort | uniq -i -d
> > > >
> > > > The final "sort | uniq -i -d" is taken from Torsten's suggestion.
> > > >
> > > > The git diff ... xargs git ls-tree section on the first line will
> > > > provide a list of all files (& subdirs) in the same directory as any
> > > > added file.  (Although, it has a blind spot for paths in the toplevel
> > > > directory.)
> > >
> > > The theoretical problem with this approach is that it only addresses
> > > case-insensitive-duplicate files, not directories.
> >
> > It'll catch some case-insensitive-duplicate directories too -- note
> > that I did call out that it'd print subdirs.  But to be more cautious,
> > you would need to carefully grab all leading directories of any added
> > path, not just the immediate leading directory.
> >
> > > Directories have been the problem, in "my" repo, around one-third of
> > > the time - typically someone does a directory rename, and someone else
> > > does a bad merge and reintroduces the old directory.
> > >
> > > That said, what "icase pathspec magic" actually *does*, is break down
> > > the pathspec into iteratively more complete paths, level by level,
> > > looking for case-duplicates at each level. That's something I could
> > > presumably do in shell scripting, collecting all the interesting
> > > sub-paths first, and then getting ls-tree to tell me about the
> > > immediate children for each sub-path, doing case-insensitive dupe
> > > searches across children for each of these sub-paths.
> > >
> > > ls-tree supporting icase pathspec magic would clearly be more
> > > efficient (I wouldn't need N ls-tree git processes, where N is the
> > > number of sub-paths in the diff), but this should be plenty efficient
> > > for normal commits, with a fallback to the full search
> > >
> > > This seems like a sensible direction, I'll have a play.
> >
> > If you create a script that gives you all leading directories of any
> > listed path (plus replacing the toplevel dir with ':/'), such as this
> > (which I'm calling 'all-leading-dirs.py'):
> >
> > """
> > #!/usr/bin/env python3
> >
> > import os
> > import sys
> >
> > paths = sys.stdin.read().splitlines()
> > dirs_seen = set()
> > for path in paths:
> >   dir = path
> >   while dir:
> >     dir = os.path.dirname(dir)
> >     if dir in dirs_seen:
> >       continue
> >     dirs_seen.add(dir)
> > if dirs_seen:
> >   # Replace top-level dir of "" with ":"; we'll add the trailing '/'
> > below when adding it to all other dirs
> >   dirs_seen.remove("")
> >   dirs_seen.add(':')
> >   for dir in dirs_seen:
> >     print(dir+'/')  # ls-tree wants the trailing '/' if we are going
> > to list contents within that tree rather than just the tree itself
> > """
> >
> > Then the following will catch duplicates files and directories for you:
> >
> > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> > -t HEAD | sort | uniq -i -d
> >
> > and it no longer has problems catching duplicates in the toplevel
> > directory either.  It does have (at most) two git invocations, but
> > only one invocation of ls-tree.  Here's a test script to prove it
> > works:
> >
> > """
> > #!/bin/bash
> >
> > git init -b main nukeme
> > cd nukeme
> > mkdir -p dir1/subdir/whatever
> > mkdir -p dir2/subdir/whatever
> > >dir1/subdir/whatever/foo
> > >dir2/subdir/whatever/foo
> > git add .
> > git commit -m initial
> >
> > mkdir -p dir1/SubDir/whatever
> > >dir1/SubDir/whatever/foo
> > git add .
> > git commit -m stuff
> >
> > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> > -t HEAD | sort | uniq -i -d
> > """
> >
> > The output of this script is
> > """
> > dir1/subdir
> > """
> > which correctly notifies on the duplicate (dir1/SubDir being the
> > other; uniq is the one that picks which of the two duplicate names to
> > print)

[-- Attachment #2: case-insensitive-duplicate-check-update-hook.bash --]
[-- Type: application/octet-stream, Size: 8047 bytes --]

#!/bin/bash

UPDATING_REF="$1"
FROM_HASH="$2"
TO_HASH="$3"

# GENERAL Problem Background:
# - In git, repo paths are inherently case-sensitive
# - In two of the three major dev workstation OSs (Windows, OSX), filesystems are typically case-insensitive
# - When files are added in the repo that are duplicates in the filesystem, but not in git, usability problems abound
# - A particularly onerous issue occurs when case-insensitive-duplicate files have differing content; after a checkout
#       of such a tree, the case-insensitive working tree always looks like it has a change to one or the other of the
#       VCS files. In such an "unavoidably/unfixably dirty tree" state, a casual user cannot even "pull" a fixed branch
#       state, as git will will refuse to pull in a dirty working tree.
# - Git currently offers no built-in mechanism to prevent this
# - It is preferable to avoid this issue in a server hook, rather than in CI validations (although the latter can also
#       work) - the earlier the issue is "nipped in the bud" the better.
# - There is a reasonably trivial shell scripting test for "Tree has duplicates", but validating simply on the basis of
#       this test presents two challenges:
#   - It might be too expensive to run on every commit in a large repo
#   - It might trigger "false positives" when creating new refs pointing to old already-problematic commits

# SPECIFIC Problem Repo Background:
# - 200K files, in deeply nested (java) folder structures (+50k directories)
# - Doing a "naive" full-tree case-insensitive duplicate search takes around 2s
#   - This is not a price we're willing to pay on every branch update
# - Case-insensitive duplicates are sometimes folders, sometimes files
# - Repo has lots of refs - 110k - so scanning them is costly
# - New refs are created *very* often - most often as "keeparound" refs in GitLab
#   - Validations don't make sense / don't add value for some or many classes of ref (those created by automated
#       systems, where a violation is not the "fault" of the pusher)
# - Some (active) branches are very distant from each other - there is no single "master"
 
# Test Cases / situations:
# - Single "normal" change, a dozen added files
# - Single "large-ish" change, a few hundred added files
# - on an existing branch (eg simple change, merge from upstream)
# - on a new branch, varying the original/upstream branch
# - on new refs that are excluded from validation by path/pattern

# Expected & tested performance characteristics of this hook script:
# - For non-branch or otherwise disqualified refs, close to no overhead.
# - For updates to existing refs, it is reasonably optimal
#   - There are four non-trivial operations, which all scale with the diff size:
#     - Finding the files added in the ref update (git diff)
#     - Finding all the ancestor-paths for these files (output_all_ancestor_gitpaths)
#       - This *can* be made slightly more efficient by using eg python
#     - Finding all items in these paths (git ls-tree)
#     - Sorting the resulting paths/items (sort)
#   - When the number of added files is too high, we fall back to a full-tree search
# - For new non-disqualified refs, there are additional overheads:
#   - Finding the best/closest "standard branch" to diff from, for a new ref, might be non-trivial in some repos
#     - (in the generic implementation, it is a trivial/cheap check for "master" and "main")
#   - Depending on the strategy used to find a "diff from" ref, diffs on new refs will likely be much larger/more expensive than they should ideally need to be

# Future work / possible improvements:
# - If git provided an easy/cheap way to detect "commits new to this repo" in an update hook, the diffs could be consistently much smaller/more efficient.
#   - (it is possible to find out whether a commit is net-new with something like "git rev-list MYREF --not --exclude MYREF --all", but this is *expensive* in a large repo)
# - If "git ls-tree" supported icase pathspec magic, we could pass in full added-file pathspecs, and subsequent dupe-detection could be more efficient
#   - (there would be a functional difference: the current strategy finds dupe directories even if they have non-overlapping contents; this other approach would not)

# Requirements:
# - Bash 4 for associative arrays


#---------
#FUNCTIONS - called via subshell
#---------

output_all_ancestor_gitpaths() {
  declare -A INTERESTING_PATHS
  while IFS= read -r INPUT_LINE; do
    CURRENT_PATH="$INPUT_LINE"
    while [ "$CURRENT_PATH" != ":" ]; do
      [[ "$CURRENT_PATH" == */* ]] || CURRENT_PATH=":/$CURRENT_PATH"
      PARENT_PATH="${CURRENT_PATH%/*}"
      if [ "${INTERESTING_PATHS["${PARENT_PATH}/"]}" = "1" ]; then
        break
      fi
      INTERESTING_PATHS["${PARENT_PATH}/"]="1"
      CURRENT_PATH="$PARENT_PATH"
    done
  done <<< "$1"
  for key in ${!INTERESTING_PATHS[@]}; do echo "$key"; done
  debug_log "FOUND ${#INTERESTING_PATHS[@]} INTERESTING PATHS" 
}

find_closest_standard_branch() {
  # Simplest implementation, look for "master" or "main" or give up.
  # there will likely be better repo-specific strategies in specific repos.
  git for-each-ref --format="%(refname)" --count=1 refs/heads/master refs/heads/main
}

START_TIME="$(date +%s%N)"
debug_log() {
  if [ -n "$DEBUG_DUPE_HOOK" ]; then
    CURRENT_TIME="$(date +%s%N)"
    echo "Debug at $(($(($CURRENT_TIME-$START_TIME))/1000000))ms: $1" >&2
  fi
}

#-----------
#MAIN SCRIPT
#-----------

if [ -z "$TO_HASH" ]; then
  # branch deletion, no possibility of new dupes
  debug_log "BRANCH DELETION - SKIPPING DUPE VALIDATION"
  exit 0
fi

VALIDATING_PATH_PREFIXES=("refs/heads/")
REF_IS_VALIDATABLE=0
for PATH_PREFIX in "${VALIDATING_PATH_PREFIXES[@]}"
do
  if [[ "$UPDATING_REF" = "$PATH_PREFIX"* ]]; then
    REF_IS_VALIDATABLE=1
  fi
done

if [[ "$REF_IS_VALIDATABLE" = "0" ]]; then
  # we're not interested in validating for dupes on this ref.
  debug_log "NON-VALIDATABLE REF - SKIPPING DUPE VALIDATION"
  exit 0
fi

FULL_SEARCH=false
if [ -z "$FROM_HASH" ]; then
  # New ref; in most repos new "interesting" (not-excluded-from-dupe-check)
  # refs will likely be closely related to master/main most of the time, so
  # diff from there rather than starting from scratch.
  debug_log "NEW REF - LOOKING FOR STANDARD BRANCH"
  FROM_HASH=$(find_closest_standard_branch "$TO_HASH")
fi

if [ -z "$FROM_HASH" ]; then
  # If this is a new ref and no "standard upstream ref" was found, fall back
  # to full dupe search.
  debug_log "NO STANDARD UPSTREAM REF FOUND; DOING FULL SEARCH"
  FULL_SEARCH=true
else
  debug_log "CHECKING DIFF: git diff --diff-filter=A --no-renames --name-only '$FROM_HASH' '$TO_HASH'"
  NEW_FILES=$(git diff --diff-filter=A --no-renames --name-only "$FROM_HASH" "$TO_HASH")
  NEW_FILE_LINE_COUNT=$(wc -l <<< "$NEW_FILES")
  
  if [ "$NEW_FILE_LINE_COUNT" -le "1" ] && [ -z "$NEW_FILES" ]; then
    # no newly added files, no possibility of new dupes
	# (count "1" can be an empty line)
    debug_log "EXITING WITH 0 ADDED FILES"
    exit 0
  fi

  if [ "$NEW_FILE_LINE_COUNT" -gt 10000 ]; then
    # If there are too many files, then generating all the sub-paths and
    # running ls-tree with a huge pathspec will be slower than simply
    # checking the whole tree.
    debug_log "FILE COUNT $NEW_FILE_LINE_COUNT OVER LIMIT; DOING FULL SEARCH"
    FULL_SEARCH=true
  fi
fi

if [ "$FULL_SEARCH" = "true" ]; then
  debug_log "DOING FULL DUPE SEARCH"
  DUPE_PATHS=$(git ls-tree --name-only -r "$TO_HASH" | sort | uniq -i -D)
else
  debug_log "DOING DIFF SEARCH OF $NEW_FILE_LINE_COUNT ADDED FILES BETWEEN $FROM_HASH AND $TO_HASH"
  DUPE_PATHS=$(output_all_ancestor_gitpaths "$NEW_FILES" | sort | xargs -d"\n" --no-run-if-empty git ls-tree --name-only -t "$TO_HASH" | sort | uniq | uniq -i -D)
fi

if [ -z "$DUPE_PATHS" ]; then
  debug_log "EXITING - NO DUPES FOUND"
  exit 0
else
  debug_log "EXITING WITH DUPES ERROR"
  echo "This ref cannot be accepted - duplicate folders or files found:"
  echo "$DUPE_PATHS"
  exit 1
fi

     prev parent reply	other threads:[~2022-10-17 15:47 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-30 12:04 icase pathspec magic support in ls-tree Tao Klerks
2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason
2022-10-02 19:07   ` brian m. carlson
2022-10-13  6:35     ` Tao Klerks
2022-10-14  4:51       ` Torsten Bögershausen
2022-10-14  8:31         ` Tao Klerks
2022-10-14  8:37           ` Erik Cervin Edin
2022-10-14  7:41       ` Elijah Newren
2022-10-14  8:03         ` Erik Cervin Edin
2022-10-14  8:57           ` Tao Klerks
2022-10-14  8:48         ` Tao Klerks
2022-10-14  9:07           ` Tao Klerks
2022-10-14 12:00             ` Erik Cervin Edin
2022-10-14 17:06           ` Elijah Newren
2022-10-15 22:06             ` Tao Klerks
2022-10-17 15:46               ` Tao Klerks [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPMMpognADkh7ZJ+qYwE2GOoQ+knAthXJC-K6QAjN8qLwuS=7Q@mail.gmail.com' \
    --to=tao@klerks.biz \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=newren@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).