path: root/lib/PublicInbox/GitAsyncCat.pm
Date Commit message
2023-11-30 git_async_cat: use git from "all" extindex if possible
For inboxes associated with an extindex (currently only the special "all" one), we can share the git process across all those inboxes unambiguously when retrieving full SHA-1 blobs. The comment for my proposed patch is also out-of-date as that git speedup has been a part of git since 2.33.
2023-11-26 git: improve coupling with {sock} and {inflight} fields
While the {inflight} array should be tied to the IO object even more tightly, that's not an easy task with our current code. So take some small steps by introducing a gcf_inflight helper to validate the ownership of the process and to drain the inflight array via the awaitpid callback. This hopefully fixes problems with t/lei-q-save.t (still) hanging occasionally on v2 outputs since git->cleanup/->DESTROY was getting called in v2 shard workers.
2023-10-01 gcf2: take non-ref scalar request arg
Asking callers to pass a scalar reference is awkward and doesn't benefit modern Perl with CoW support. Unlike some constant error messages, it can't save any allocations at all since there are no constant strings being passed to libgit2.
2023-10-01 git: use Unix stream sockets for `cat-file --batch-*'
The benefit of 1MB potential pipe buffer size (on Linux) doesn't seem noticeable when reading from git (unlike when writing to v2 shards), so Unix stream sockets seem fine, here. This allows us to simplify our process management by using the same socket FD for reads and writes and enables us to use our ProcessPipe class for reaping (as we can do with Gcf2Client). Gcf2Client no longer relies on PublicInbox::DS for write buffering, and instead just waits for requests to complete once the number of inflight requests hits the MAX_INFLIGHT threshold as we do with PublicInbox::Git. We reuse the existing MAX_INFLIGHT limit (18) that was determined by the minimum allowed PIPE_BUF (512). (AFAIK) Unix stream sockets have no analogy to PIPE_BUF, but all *BSDs and Linux I've checked have default SO_RCVBUF and SO_SNDBUF values larger than the previously-required PIPE_BUF size of 512 bytes.
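The inflight limit described above can be sketched roughly as follows. This is a Python illustration, not the project's Perl code: `LineClient`, `echo_server` and `demo` are hypothetical names, and a thread that echoes upper-cased lines stands in for the `git cat-file` child. Only the value 18 for `MAX_INFLIGHT` comes from the commit message. The sketch assumes a platform with Unix stream sockets.

```python
import socket
import threading
from collections import deque

MAX_INFLIGHT = 18  # the limit named in the commit message


class LineClient:
    """Pipelines line-based requests over one bidirectional socket."""

    def __init__(self, sock):
        self.sock = sock
        self.rd = sock.makefile("rb")  # reads share the FD used for writes
        self.inflight = deque()        # callbacks awaiting replies, oldest first

    def request(self, line, cb):
        # Wait for the oldest reply instead of buffering past the limit.
        while len(self.inflight) >= MAX_INFLIGHT:
            self.step()
        self.sock.sendall(line.encode() + b"\n")
        self.inflight.append(cb)

    def step(self):
        reply = self.rd.readline()
        if not reply:
            raise EOFError("peer closed with requests inflight")
        cb = self.inflight.popleft()
        cb(reply.rstrip(b"\n").decode())

    def drain(self):
        while self.inflight:
            self.step()


def echo_server(sock):
    """Hypothetical stand-in for the child: echoes each line upper-cased."""
    with sock, sock.makefile("rb") as rd:
        for line in rd:
            sock.sendall(line.upper())


def demo(n=50):
    ours, theirs = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    t = threading.Thread(target=echo_server, args=(theirs,), daemon=True)
    t.start()
    client = LineClient(ours)
    got, peak = [], 0
    for i in range(n):
        client.request(f"oid{i}", got.append)
        peak = max(peak, len(client.inflight))
    client.drain()
    client.rd.close()
    ours.close()  # the server thread sees EOF and exits
    t.join()
    return got, peak
```

Calling `demo()` returns the replies in request order along with the highest number of requests that were ever outstanding, which never exceeds `MAX_INFLIGHT`.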
2023-02-20 git_async_cat: don't mis-abort replaced process
When a git process gets replaced (e.g. due to new epochs/alternates), we must be careful and not abort the wrong one. I suspect this fixes the problem exacerbated by --batch-command. It was theoretically possible w/o --batch-command, but it seems to have made it surface more readily. This should fix "Failed to retrieve generated blob" errors from PublicInbox/ViewVCS.pm appearing in syslog. Link: https://public-inbox.org/meta/20230209012932.M934961@dcvr/
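One way to express such a guard is an identity check between the socket a watcher registered and the socket the git object currently owns. The Python sketch below is illustrative only: the `Git` and `Watcher` classes and their methods are hypothetical, and whether the actual Perl fix takes this shape is an assumption.

```python
class Git:
    """Hypothetical owner of a cat-file process and its inflight queue."""

    def __init__(self):
        self.sock = None
        self.inflight = []

    def start(self):
        # A fresh object stands in for the socket of a newly spawned process.
        self.sock = object()
        self.inflight = []
        return self.sock

    def abort(self):
        self.inflight.clear()
        self.sock = None


class Watcher:
    """Hypothetical event-loop watcher tied to one specific process."""

    def __init__(self, git):
        self.git = git
        self.sock = git.sock  # remember which socket we registered

    def on_error(self):
        # Abort only if the process we were watching is still the current one.
        if self.git.sock is self.sock:
            self.git.abort()
            return True
        return False  # process was replaced; leave the new one alone
```

A stale watcher whose process has since been replaced returns `False` from `on_error` and leaves the new process's inflight requests untouched.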
2023-02-10 git_async_cat: use awaitpid
While awaitpid already registered a no-op callback in _bidi_pipe, we can still call it again when registering it into our event loop to ensure EPOLL_CTL_DEL fires.
2023-01-27 git: use --batch-command in git 2.36+ to save processes
`git cat-file --batch-command' combines the functionality of `--batch' and `--batch-check' into a single process. This reduces the number of running processes and is primarily useful for coderepos (e.g. solver). This also fixes prior use of `print { $git->{out} }' which is a potential (but unlikely) bug since commit d4ba8828ab23f278 (git: fix asynchronous batching for deep pipelines, 2023-01-04). Lack of libgit2 on one of my test machines also uncovered fixes necessary for t/imapd.t, t/nntpd.t and t/nntpd-v2.t.
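For illustration, the framing used by `--batch-command` can be sketched as below, following the format documented in git-cat-file(1): an `info <object>` request yields a `<oid> <type> <size>` header line, a `contents <object>` request yields the same header followed by the object data and a newline, and an unknown object yields `<object> missing`. This is a Python sketch rather than the project's Perl. The helper names are hypothetical and the parser ignores other replies such as `ambiguous`.

```python
import io


def info_request(oid):
    """Request a header only (what --batch-check would print)."""
    return f"info {oid}\n".encode()


def contents_request(oid):
    """Request header plus object data (what --batch would print)."""
    return f"contents {oid}\n".encode()


def read_response(rd, want_contents):
    """Parse one reply from a binary stream.

    Returns (oid, type, size, data), with data set to None for info
    requests, or None if the object is missing.
    """
    head = rd.readline()
    if not head.endswith(b"\n"):
        raise EOFError("truncated header")
    fields = head.rstrip(b"\n").split(b" ")
    if fields[-1] == b"missing":
        return None
    if len(fields) != 3:
        raise ValueError(f"unexpected header: {head!r}")
    oid, typ, size = fields[0].decode(), fields[1].decode(), int(fields[2])
    data = None
    if want_contents:
        data = rd.read(size)
        # The object data is always followed by a single newline.
        if len(data) != size or rd.read(1) != b"\n":
            raise EOFError("truncated contents")
    return oid, typ, size, data
```

Because both request types share one stream, the caller has to remember which kind of request each pending reply belongs to, which is why `want_contents` is passed in rather than inferred from the header.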
2023-01-13 www_coderepo: /tree/ redirects to /$OID/s/
This is for compatibility with cgit to ease migration.
2022-10-05 www_coderepo: wire up snapshot support
These should be compatible with cgit results.
2022-10-05 www_coderepo: an alternative to cgit
This will allow it to easily map a single coderepo to multiple inboxes (or multiple coderepos to any number of inboxes). For now, this is just a summary, but $REPO/$OID/s/ support will be added, along with archive downloads. Indexing of coderepos will probably be supported via -extindex, only.
2022-10-01 www_stream: use DESTROY to cleanup temporary gits
Relying on a timer to handle cleanup in f9ac22a4b485 was sub-optimal since the delay could prove expensive under heavy traffic. So rely on ->DESTROY instead since we no longer hold reference cycles by the time the show_blob callback executes. Fixes: f9ac22a4b485 ("git_async_cat: automatically cleanup temporary gits")
2022-10-01 git_async_cat: automatically cleanup temporary gits
This prevents temporary directories and git processes from lingering around after WWW solver requests.
2022-09-12 git_async_cat: don't use Gcf2 for temporary git dirs
We don't want to be holding references to temporary directories longer than necessary, and Gcf2 is intended to be long-lived.
2022-08-23 ibx_async_cat: access ->{git} directly
This will enable callers to pass non-Inbox-ish hashrefs as the arg. This benefits existing Inbox-ish objects, too, as it avoids a slow method dispatch for both ExtSearch and Inbox.
2022-08-09 imap: limit ibx_async_prefetch to idle git processes
This improves fairness while having no measurable performance impact for a single uncached IMAP client (mutt) opening a folder for the first time. I noticed this problem with the public-inbox.org IMAP server where a few IMAP clients were unfairly monopolizing the -netd process.
2021-10-08 git: async_abort includes --batch-check requests
We need to abort both check-only and cat requests when aborting, since we'll be aborting more aggressively in read-write paths.
2021-06-24 favor git(1) rather than libgit2 for ExtSearch
While both git and libgit2 take around 16 minutes to load 100K alternates, there's already a proposed patch to make git faster: <https://lore.kernel.org/git/20210624005806.12079-1-e@80x24.org/> It's also easier to patch and install git locally since the git.git build system defaults to prefix=$HOME and dealing with dynamic linking with libgit2 is more difficult for end users relying on Inline::C. libgit2 remains in use for the non-ALL.git case, but maybe it's not necessary (libgit2 is significantly slower than git in Debian 10 due to SHA-1 collision checking).
2021-01-03 gcf2client: split out request API from regular git
While Gcf2Client is designed to mimic what git-cat-file writes to stdout, its request format is different to support requests with a git repository path included. We'll highlight the distinction and make the GitAsyncCat support code easier-to-follow as a result. Since Gcf2Client relies on DS, we can rely on DS-specific code here, too, and use a single Unix socket instead of separate input and output pipes, reducing memory overhead in both user and kernel space. Due to the interactive nature of requests and responses, the buffer size limitations of Unix sockets on Linux seem inconsequential here (just like they are for existing "git cat-file --batch" use).
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-11-30 git: set non-blocking flag in case of other bugs
This makes GitAsyncCat more resilient to bugs in Gcf2 or even git-cat-file itself. I noticed -imapd stuck on read(2) from the Gcf2 pipe, so there may be a bug somewhere in Gcf2 or PublicInbox::Git. This should make us more resilient to them and hopefully help us notice and fix them.
2020-09-28 gcf2: improve error handling and do not ->fail on wbuf
For historical reasons, both Danga::Socket::write and PublicInbox::DS::write will return 0 when data is buffered; so Gcf2Client must not call ->fail when DS::write returns 0. We'll also improve robustness by recreating the entire Gcf2Client object if it does die for other reasons, instead of risking mismatched fields due to deferred close. We also need to ensure we only get one EPOLLERR wakeup and issue EPOLL_CTL_DEL if ->event_step is triggered by a dying Gcf2 process, so always register the FD with EPOLLONESHOT.
2020-09-19 gcf2: wire up read-only daemons and rm -gcf2 script
It seems easiest to have a singleton Gcf2Client client object per daemon worker for all inboxes to use. This reduces overall FD usage from pipes. The `public-inbox-gcf2' command + manpage are gone and a `$^X' one-liner is used, instead. This saves inodes for internal commands and hopefully makes it easier to avoid mismatched PERL5LIB include paths (as noticed during development :x). We'll also make the existing cat-file process management infrastructure more resilient to BOFHs on process killing sprees (or in case our libgit2-based code fails on us). (Rare) PublicInbox::WWW PSGI users NOT using public-inbox-httpd won't automatically benefit from this change, and extra configuration will be required (to be documented later).
2020-09-19 gcf2: require git dir with OID
This amortizes the cost of recreating PublicInbox::Gcf2 objects when alternates change in v2 all.git.
2020-09-18 git_async_cat: inline + drop redundant batch_prepare call
$git->cat_async already calls $git->batch_prepare iff needed, so we can reduce subroutine calls and inline a one-off subroutine to save some memory, here.
2020-09-16 git_async_cat: fix outdated comment
We replaced Danga::Socket with PublicInbox::DS roughly a year before GitAsyncCat was introduced into our git history.
2020-07-06 git_async_cat: unref pipes on EOF from git->cleanup
We avoided a managed circular reference in 10ee3548084c125f but introduced a pipe FD leak, instead. So handle the EOF we get when the "git cat-file --batch" process exits and closes its stdout FD. v2: remove ->close entirely. PublicInbox::Git->cleanup handles all cleanup. This prevents us from inadvertently deleting the {async_cat} field associated with a different pipe than the one GAC is monitoring. Fixes: 10ee3548084c125f ("git_async_cat: remove circular reference")
2020-06-28 ds: remove fields.pm usage
Since the removal of pseudo-hash support in Perl 5.10, the "fields" module no longer provides the space or speed benefits it did in 5.8. It also does not allow for compile-time checks, only run-time checks. To me, the extra developer overhead in maintaining "use fields" args has become a hassle. None of our non-DS-related code uses fields.pm, nor do any of our current dependencies. In fact, Danga::Socket (which DS was originally forked from) and its subclasses are the only fields.pm users I've ever encountered in the wild. Removing fields may make our code more approachable to other Perl hackers. So stop using fields.pm and locked hashes, but continue to document what fields do for non-trivial classes.
2020-06-25 git_async_cat: remove circular reference
While this circular reference was carefully managed to not leak memory, it was still triggering a warning at -imapd/-nntpd shutdown due to the EPOLL_CTL_DEL op failing after the $Epoll FD gets closed. So remove the circular reference by providing a ref to `undef', instead.
2020-06-13 git: move async_cat reference to PublicInbox::Git
Trying to avoid a circular reference by relying on $ibx object here makes no sense, since skipping GitCatAsync::close will result in an FD leak, anyways. So keep GitAsyncCat contained to git-only operations, since we'll be using it for Solver in the distant future.
2020-06-13 git: idle rbuf for async
We do this for the C10K-oriented HTTP/NNTP/IMAP processes, and we may support thousands of git-cat-file processes in the future.
2020-06-13 imap: use git-cat-file asynchronously
This ought to improve overall performance with multiple clients. Single client performance suffers a tiny bit due to extra syscall overhead from epoll. This also makes the existing async interface easier-to-use, since calling cat_async_begin is no longer required.