path: root/lib/PublicInbox/Git.pm
Date Commit message
2020-09-16 treewide: relax allow >=40 chars for git OID
This will help with eventual git SHA-256 transitions.
2020-08-27 git: show more context info on failures
I'm seeing "read: Connection timed out" in my syslog from -httpd. The fail() calls in PublicInbox::Git seem to be the only code path of ours which could trigger it... ETIMEDOUT shouldn't happen on pipes, only sockets; and all of our socket operations are non-blocking. So this could be cgit-wwwhighlight-filter.lua, but that's connecting over localhost, though on fairly loaded HW.
2020-07-26 imap: introduce and use Git->async_prefetch
We can keep the git process more active by sending another request to it while fetch_run_ops() is running. This parallelization speeds up mutt's initial FETCH for headers by around 35%(!).
2020-07-25 searchidx: support async git check
This allows v1 indexing to run while the `cat-file --batch-check' process is waiting on high-latency storage.
2020-07-06 git: use v5.10.1, parent.pm and Time::HiRes::stat
parent.pm is leaner than base.pm, and Time::HiRes::stat is more accurate, so take advantage of these Perl 5.10+-isms since it's been over a year since we left 5.8 behind.
2020-06-25 git_async_cat: remove circular reference
While this circular reference was carefully managed to not leak memory, it was still triggering a warning at -imapd/-nntpd shutdown due to the EPOLL_CTL_DEL op failing after the $Epoll FD gets closed. So remove the circular reference by providing a ref to `undef', instead.
2020-06-13 git: async: automatic retry on alternates change
This matches the behavior of the existing synchronous ->cat_file method. In fact, ->cat_file now becomes a small wrapper around the ->cat_async method.
2020-06-13 git: move async_cat reference to PublicInbox::Git
Trying to avoid a circular reference by relying on $ibx object here makes no sense, since skipping GitCatAsync::close will result in an FD leak, anyways. So keep GitAsyncCat contained to git-only operations, since we'll be using it for Solver in the distant future.
2020-06-13 git: cat_async: provide requested OID + "missing" on missing blobs
This will make it easier to implement the retries on alternates_changed() of the synchronous ->cat_file API.
2020-06-13 git: idle rbuf for async
We do this for the C10K-oriented HTTP/NNTP/IMAP processes, and we may support thousands of git-cat-file processes in the future.
2020-06-13 imap: use git-cat-file asynchronously
This ought to improve overall performance with multiple clients. Single client performance suffers a tiny bit due to extra syscall overhead from epoll. This also makes the existing async interface easier-to-use, since calling cat_async_begin is no longer required.
2020-06-13 git: do our own read buffering for cat-file
To work with our event loop, we must perform read buffering ourselves or risk starvation, as there doesn't appear to be a way to check the amount of data buffered in userspace by the PerlIO layers without resorting to C or XS. This lets us perform fewer syscalls at the expense of more Perl ops. As it stands, there seems to be a tiny performance improvement, but more will be possible in the future.
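A minimal sketch of userspace read buffering, not the module's actual code: the helper name `buffered_readline`, the 8192-byte read size, and the pipe used as a stand-in for the cat-file process are all assumptions made for illustration. Because the buffer is an ordinary scalar, `length($rbuf)` always shows how much data is pending.

```perl
use strict;
use warnings;

# Hypothetical helper: return one line from $fh, keeping unread bytes in
# $$rbuf (our own buffer) instead of inside PerlIO.
sub buffered_readline {
    my ($fh, $rbuf) = @_;
    while (1) {
        my $nl = index($$rbuf, "\n");
        return substr($$rbuf, 0, $nl + 1, '') if $nl >= 0;
        # Append to the end of our buffer with a single read(2) call.
        my $n = sysread($fh, $$rbuf, 8192, length($$rbuf));
        die "sysread: $!" unless defined $n;
        return if $n == 0; # EOF
    }
}

# A plain pipe stands in for the process we would be reading from.
pipe(my $rd, my $wr) or die "pipe: $!";
syswrite($wr, "first\nsecond\n") or die "syswrite: $!";
close $wr or die "close: $!";

my $rbuf = '';
my @lines;
while (defined(my $line = buffered_readline($rd, \$rbuf))) {
    push @lines, $line;
}
print @lines;
```

One read(2) call fills the buffer with both lines here, so the second line is served without another syscall.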
2020-06-13 git: async: flatten the inflight array
Small array refs have considerable overhead in Perl, so reduce AV/SV overhead and instead allow the inflight array to grow twice as large.
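The difference between the two layouts can be sketched as follows. This is an illustration only: the three-slot layout (OID, callback, argument) and all names are assumptions, not the module's actual fields.

```perl
use strict;
use warnings;

my @seen;
my $cb = sub { my ($oid, $arg) = @_; push @seen, "$oid:$arg" };

# Nested layout: one small array ref (an extra AV plus an RV) per request.
my @nested = (['oid1', $cb, 'arg1'], ['oid2', $cb, 'arg2']);

# Flat layout: three consecutive slots per request in a single array.
my @inflight;
push @inflight, 'oid1', $cb, 'arg1';
push @inflight, 'oid2', $cb, 'arg2';

# Consume one request (three slots) at a time, oldest first.
while (my ($oid, $f, $arg) = splice(@inflight, 0, 3)) {
    $f->($oid, $arg);
}
print "@seen\n";
```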
2020-05-06 git: warn on ->cat_async callback errors
This will help us track down bugs in our own code when it comes to missing error checking.
2020-04-29 git: various minor speedups
While testing performance improvements elsewhere, I noticed some micro-optimizations could give a small ~2-3% speedup in my test using the git async API to parse a large inbox. The `read' perlfunc already has read-in-full behavior (unless git is killed unexpectedly), so there's no point in using a loop. SearchIdxShard in the parallel v2 indexing code path never looped on `read', either. Furthermore, we can avoid method dispatch overhead on ->getline and ->print by using `readline' and `print' as ops which can be resolved during the Perl compilation phase. Finally, avoid passing the IO handle around as a parameter, since avoiding hash lookups with a local variable has its own costs in stack and refcount bumping. Best of all, there's less code :>
2020-04-05 git: reduce stat buffer storage overhead
The stat() array is a whopping 480 bytes (on x86-64, Perl 5.28), while the new packed representation of two 64-bit doubles as a scalar is "only" 56 bytes. This can add up when there are many inboxes. Just use a string comparison on the packed representation. Some 32-bit Perl builds (IIRC OpenBSD) lack quad support, so doubles were chosen for pack() portability.
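A minimal sketch of the packed representation. The helper name `stat_sig` and the choice of stat fields (ctime and size) are assumptions for illustration; the commit message only says that two doubles are packed and compared as a string.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: pack two stat() fields into one scalar using native
# doubles ('d'), which are available even on perls lacking quad ('Q').
sub stat_sig {
    my ($path) = @_;
    my @st = stat($path) or return;
    pack('dd', $st[10], $st[7]); # ctime, size (an assumption)
}

my ($fh, $path) = tempfile(UNLINK => 1);
my $before = stat_sig($path);
print $fh 'x' x 100;
close $fh or die "close: $!";
my $after = stat_sig($path);

# A plain string comparison on the packed bytes detects the change.
print "changed\n" if $before ne $after;
```

The size field alone changes from 0 to 100 here, so the comparison does not depend on timestamp granularity.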
2020-03-04 git: remove POSIX::dup2 import
We rely on spawn/popen_rd for redirects, nowadays.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-13 use popen_rd for bidirectional pipes
popen_rd accepts arbitrary redirects, so we can reuse its code to set up the pipe end we want to read, saving each caller a few lines of code compared to calling pipe+spawn.
2020-01-13 git: packed_bytes: use GLOB_NOSORT
File::Glob is loaded by perl for the "glob()" op, anyways, so call bsd_glob with GLOB_NOSORT to avoid needless sorting of the output.
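A small illustration of the call, assuming a temporary directory with made-up file names in place of a real objects/pack directory. Summing sizes does not depend on the order of results, so sorting would be wasted work.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob GLOB_NOSORT);
use File::Temp qw(tempdir);

# Made-up files standing in for a pack directory.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(pack-b.pack pack-a.pack pack-c.idx)) {
    open(my $fh, '>', "$dir/$name") or die "open: $!";
    print $fh 'data'; # 4 bytes each
    close $fh or die "close: $!";
}

# Only *.pack files are counted; their order is irrelevant to the sum.
my $bytes = 0;
$bytes += (-s $_) for bsd_glob("$dir/*.pack", GLOB_NOSORT);
print "$bytes\n";
```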
2020-01-13 git: modified: don't slurp `rev-parse --branches'
While v1 inboxes typically only have one branch, code repositories may have dozens or even hundreds. Slurping those into memory is a waste.
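The general pattern, sketched with an in-memory filehandle standing in for the pipe from `git rev-parse --branches` (the fake OIDs are made up): read one line at a time in scalar context rather than assigning the whole output to an array.

```perl
use strict;
use warnings;

# Fake command output: five 40-character hex OIDs, one per line.
my $output = join('', map { sprintf("%040x\n", $_) } 1..5);
open(my $fh, '<', \$output) or die "open: $!";

# Scalar-context readline keeps only one line in memory at a time,
# unlike `my @all = <$fh>`, which slurps everything.
my $count = 0;
while (my $oid = <$fh>) {
    chomp $oid;
    $count++ if $oid =~ /\A[a-f0-9]{40,}\z/;
}
close $fh or die "close: $!";
print "$count\n";
```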
2020-01-11 spawn (and thus popen_rd) die on failure
Most spawn and popen_rd callers die on failure to spawn, anyways, and some are missing checks entirely. This saves us a bunch of verbose error-checking code in callers. This also makes popen_rd more consistent, since it already dies on pipe creation failures.
2020-01-11 git: remove ->commit_title method
We haven't used it in SolverGit, yet, and I'll be reworking it to work with ->cat_async, instead.
2020-01-11 git: ->modified uses cat_async
While v1 inboxes are typically only a single branch, coderepos will have many branches and being able to pipeline requests to "git cat-file --batch" can help us mask seek times.
2020-01-11 allow HTTP_HOST to be '0' via defined() checks
'0' is a valid value for HTTP_HOST, and maybe some folks will want to hit that as port 80 where the HTTP client won't send the ":$PORT" suffix.
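To illustrate why a truthiness check is wrong here, a short sketch with a hypothetical PSGI-style environment hash (the fallback to SERVER_NAME and its value are assumptions for the example):

```perl
use strict;
use warnings;

# Hypothetical environment; only HTTP_HOST => '0' matters for the example.
my $env = { HTTP_HOST => '0', SERVER_NAME => 'example.com' };

# Truthiness check: '0' is false in Perl, so the fallback wins (wrong).
my $by_truth = $env->{HTTP_HOST} || $env->{SERVER_NAME};

# Definedness check: '0' is defined, so it is kept (right).
my $by_defined = $env->{HTTP_HOST} // $env->{SERVER_NAME};

print "$by_truth $by_defined\n";
```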
2020-01-06 treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it gives us extra compile-time checking.
2019-12-30 spawn: allow passing GLOB handles for redirects
We can save callers the trouble of {-hold} and {-dev_null} refs as well as the trouble of calling fileno().
2019-12-26 git: allow async_cat to pass arg to callback
This allows callers to avoid allocating several KB for every call to ->async_cat.
2019-12-12 git: async batch interface
This is a transitionary interface which does NOT require an event loop. It can be plugged into current synchronous code without major surgery. It allows HTTP/1.1 pipelining-like functionality by taking advantage of predictable and well-specified POSIX pipe semantics by stuffing multiple git cat-file requests into the --batch pipe. With xt/git_async_cmp.t and GIANT_GIT_DIR=git.git, the async interface is 10-25% faster than the synchronous interface since it can keep the "git cat-file" process busier. This is expected to improve performance on systems with slower storage (but multiple cores).
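The pipelining idea can be sketched without git at all. In the example below, a tiny child Perl process stands in for `git cat-file --batch`; it is an assumption made so the sketch runs anywhere, and its one-line replies are not git's output format.

```perl
use strict;
use warnings;
use IPC::Open2 qw(open2);

# Stand-in for "git cat-file --batch": answers each request line with one
# response line, in order.  This child process is NOT git.
my @cmd = ($^X, '-ne', 'BEGIN { $| = 1 } chomp; print "$_ ok\n"');
my $pid = open2(my $from_child, my $to_child, @cmd);

# Queue every request before reading any response; the pipe buffers them,
# so the child always has work waiting.
my @requests = qw(a b c);
print $to_child "$_\n" for @requests;
close $to_child or die "close: $!";

# Responses arrive in the same order the requests were written.
my @replies;
while (my $line = <$from_child>) {
    chomp $line;
    push @replies, $line;
}
waitpid($pid, 0);
print "@replies\n";
```

With a real process and large responses, the number of requests in flight would need a bound so that neither side blocks on a full pipe; the three tiny requests here stay far below any pipe buffer size.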
2019-10-22 git: remove src_blob_url
This was intended for solver, but it's unused since commit 915cd090798069a4 ("solver: switch patch application to use a callback")
2019-09-14 tmpfile: give temporary files meaningful names
Although we always unlink temporary files, give them a meaningful name so that we can still make sense of the pre-unlink name when using lsof(8) or similar tools on Linux.
2019-09-09 run update-copyrights from gnulib for 2019
2019-07-08 ds: use WNOHANG with waitpid if inside event loop
While we're usually not stuck waiting on waitpid after seeing a pipe EOF or even triggering SIGPIPE in the process (e.g. git-http-backend) we're reading from, it MAY happen and we should be careful to never hang the daemon process on waitpid calls. v2: use "eq" for string comparison against 'DEFAULT'
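A minimal sketch of the non-blocking call, assuming a child that simply sleeps for one second; it is not the daemon's actual reaping code. With WNOHANG, waitpid returns 0 immediately while the child is still running instead of blocking the caller.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

my $pid = fork;
die "fork: $!" unless defined $pid;
if ($pid == 0) { # child: stay alive briefly, then exit
    sleep 1;
    POSIX::_exit(0);
}

# Non-blocking: returns 0 because the child has not exited yet.
my $early = waitpid($pid, WNOHANG);

# Blocking: returns the PID once the child has exited.
my $late = waitpid($pid, 0);
print "early=$early reaped=", ($late == $pid ? 1 : 0), "\n";
```

An event loop would retry the WNOHANG call later (for example on SIGCHLD) rather than fall back to the blocking form shown on the last call.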
2019-06-14 Merge remote-tracking branch 'origin/manifest' into next
* origin/manifest:
  git: ensure ->modified returns an integer
  www: support $INBOX/git/$EPOCH.git for v2 cloning
  www: wire up /$INBOX/manifest.js.gz, too
  wwwlisting: generate grokmirror-compatible manifest.js.gz
  wwwlisting: allow hiding entries from manifest
2019-06-14 git: remove cat_file sub callback interface
We weren't using it, and in retrospect, it makes no sense to use the cat_file API for giant responses which can't be read quickly with minimal context-switching (or sanely fit into memory for Email::Simple/Email::MIME). For giant blobs which we don't want slurped in memory, we'll spawn a short-lived git-cat-file process like we do in ViewVCS. Otherwise, monopolizing a git-cat-file process for a giant blob is harmful to other PSGI/NNTP users. A better interface is coming which will be more suitable for batch processing of "small" objects such as commits and email blobs.
2019-06-10 git: ensure ->modified returns an integer
We don't want to serialize timestamps as strings to JSON. I only noticed this bug on a 32-bit system.
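The string-versus-number distinction can be seen with the core JSON::PP module. This is a sketch: the timestamp value and the `modified` key are made up, and the module may use a different JSON encoder.

```perl
use strict;
use warnings;
use JSON::PP ();

# A value captured from command output is a string as far as Perl cares.
my ($mtime) = "1560124800\n" =~ /([0-9]+)/;

my $json = JSON::PP->new->canonical;
my $as_str = $json->encode({ modified => $mtime });     # quoted
my $as_int = $json->encode({ modified => $mtime + 0 }); # bare number
print "$as_str\n$as_int\n";
```

Adding 0 yields a new numeric value, which the encoder emits without quotes.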
2019-06-05 tighten up digit matches to ASCII for git output
While I don't expect git to suddenly start spewing non-ASCII digits in places I'd expect ASCII, this would make things easier for future hackers and reviewers.
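A short demonstration of the difference, using an Arabic-Indic digit as the non-ASCII example. The sample `--batch-check`-style line is made up for illustration.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}"; # ARABIC-INDIC DIGIT THREE

# \d matches any Unicode decimal digit; [0-9] matches ASCII digits only.
my $loose = ($arabic_three =~ /\A\d\z/) ? 1 : 0;
my $tight = ($arabic_three =~ /\A[0-9]\z/) ? 1 : 0;

# Parsing a made-up "<oid> <type> <size>" line with the tighter class:
my $line = ('a' x 40) . ' blob 123';
my ($oid, $type, $size) = $line =~ /\A([a-f0-9]{40,}) (\S+) ([0-9]+)\z/;

print "loose=$loose tight=$tight size=$size\n";
```

The `/a` regexp modifier gives `\d` the same ASCII-only meaning, but it requires Perl 5.14 or later.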
2019-06-01 git: drop the deleted err_c file
No reason to leave that (usually) empty file open after killing off "cat-file --batch-check". This wasn't an unbounded leak, though, as respawning the --batch-check process would've clobbered the old err_c file.
2019-06-01 git: unconditional expiry
A constant stream of traffic to either httpd/nntpd would mean git-cat-file processes never expire. Things can go bad after a full repack, as a full repack will unlink old pack indices and git-cat-file does not currently detect unlinked files. We could do something complicated by recursively stat-ing objects/pack of every git directory and alternate; but that's probably not worth the trouble compared to occasionally restarting the cat-file process. So simplify the code and let httpd/nntpd expire them periodically, since spawning a "git-cat-file --batch" process isn't too expensive. We already spawn for every request which hits git-http-backend, cgit, and git-apply. In the future, we may optionally support the Git::Raw module to avoid IPC; but we must remain careful to not leave lingering FDs open to unlinked files after repack.
2019-05-22 git: workaround old git-rev-parse(1) (--git-path)
git < 2.5.0 was missing --git-path support. This means any users relying on some rare environment variables will need git 2.5.0+
2019-04-18 git: calculate modified time of repository
This will be used for generating an HTML listing for v1 inboxes, at least. The logic for this follows that of grokmirror, and we may dynamically generate manifest.js.gz natively...
2019-04-04 support publicinbox.cgitrc directive
We can save admins the trouble of declaring [coderepo "..."] sections in the public-inbox config by parsing the cgitrc directly. Macro expansion (e.g. $HTTP_HOST) is not supported, yet; but may be in the future.
2019-04-04 git: add "commit_title" method
This will be useful for extracting titles/subjects from commit objects when displaying commits.
2019-01-31 inbox: perform cleanup of Git objects for coderepos
Otherwise, long-running but idle git processes may keep unlinked packs around indefinitely and waste disk space.
2019-01-30 git: use "git rev-parse --git-path"
Using git worktrees was causing t/solver_git.t to fail on me.
2019-01-27 solver: hold patches in temporary directory
We can avoid bumping up RLIMIT_NOFILE too much by storing patches in a temporary directory. And we can share this top-level directory with our temporary git repository. Since we no longer rely on a working-tree for git, we are free to rearrange the layout and avoid relying on the ".git" convention and relying on "git -C" for chdir. This may also ease porting public-inbox to older systems where git does not support "-C" for chdir.
2019-01-20 git: support 'ambiguous' result from --batch-check
David Turner's patch to return "ambiguous" seems like a reasonable patch for future versions of git: https://public-inbox.org/git/672a6fb9e480becbfcb5df23ae37193784811b6b.camel@novalis.org/
2019-01-19 git: disable abbreviations with cat-file hints
Ambiguity is not worth it for internal usage with the solver.
2019-01-19 git: check saves error on disambiguation
This will be useful for disambiguating short OIDs in older emails when abbreviations were shorter.

Tested against the following script with /path/to/git.git

==> t.perl <==
    use strict;
    use PublicInbox::Git;
    use Data::Dumper;
    my $dir = shift or die "Usage: $0 GIT_DIR # (of git.git)";
    my $git = PublicInbox::Git->new($dir);
    my @res = $git->check('dead');
    print Dumper({res => \@res, err=> $git->last_check_err});
    @res = $git->check('5335669531d83d7d6c905bcfca9b5f8e182dc4d4');
    print Dumper({res => \@res, err=> $git->last_check_err});
2019-01-19 git: add git_quote
It'll be helpful for displaying progress in SolverGit output.