Git Mailing List Archive mirror
* Feature request: secondary index by path fragment
@ 2024-05-06 23:11 Robin H. Johnson
  2024-05-06 23:22 ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Robin H. Johnson @ 2024-05-06 23:11 UTC
  To: Git Mailing List

Maybe this already happens in the code, but if not, please consider it
as a feature request.

Gentoo has some tooling that boils down to repeated runs of 'git log -- somepath/'
via cgit as well as other shell tooling.

If the path is relatively deep for the tree (e.g. to a specific file or
sub-directory), the size of history [1] makes that a very slow operation
to go all the way back to the initial repo commit: ~12 seconds per
operation on fast hardware, ~45 seconds on slower hardware - even with the
packs cached.

I was wondering if Git could gain a secondary index of commits, based on
path prefixes, that would speed up the 'git log' run.

It would need to be fast to append to the secondary index, because
Gentoo gets a steady flow of commits 24/7.

[1] 825k+ commits based on GitHub stats.
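
For concreteness, the kind of index being asked about could look roughly like the sketch below. It is only an illustration of the idea, not anything Git implements: the class `PathPrefixIndex`, its methods, and the file names in the usage lines are all hypothetical. Each new commit is appended under every prefix of the paths it changed, so appending costs work proportional to that one commit and a lookup needs no history walk.

```python
from collections import defaultdict


def path_prefixes(path):
    """Yield every directory prefix of a path, then the path itself.

    "a/b/c" yields "a", "a/b", "a/b/c".
    """
    parts = path.strip("/").split("/")
    for depth in range(1, len(parts) + 1):
        yield "/".join(parts[:depth])


class PathPrefixIndex:
    """Hypothetical append-only map from path prefix to commit IDs."""

    def __init__(self):
        self._commits_by_prefix = defaultdict(list)

    def append(self, commit_id, changed_paths):
        """Record one new commit; cost depends only on that commit's paths."""
        prefixes = set()
        for path in changed_paths:
            prefixes.update(path_prefixes(path))
        for prefix in prefixes:
            self._commits_by_prefix[prefix].append(commit_id)

    def lookup(self, path):
        """Return commits that touched the path, in the order they were appended."""
        return list(self._commits_by_prefix.get(path.strip("/"), []))


# Usage with made-up commit IDs and file names.
index = PathPrefixIndex()
index.append("c1", ["sys-apps/pv/pv-1.8.ebuild"])
index.append("c2", ["dev-lang/python/python-3.12.ebuild"])
index.append("c3", ["sys-apps/pv/Manifest", "sys-apps/less/Manifest"])
print(index.lookup("sys-apps/pv/"))  # ['c1', 'c3']
```

The obvious cost of such an exact index is size: every commit is stored once per prefix it touches, which is the trade-off a probabilistic structure avoids.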

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


* Re: Feature request: secondary index by path fragment
  2024-05-06 23:11 Feature request: secondary index by path fragment Robin H. Johnson
@ 2024-05-06 23:22 ` Junio C Hamano
  2024-05-07  4:25   ` Patrick Steinhardt
  0 siblings, 1 reply; 4+ messages in thread
From: Junio C Hamano @ 2024-05-06 23:22 UTC
  To: Robin H. Johnson; +Cc: Git Mailing List

"Robin H. Johnson" <robbat2@gentoo.org> writes:

> Gentoo has some tooling that boils down to repeated runs of 'git log -- somepath/'
> via cgit as well as other shell tooling.
> ...
> I was wondering if Git could gain a secondary index of commits, based on
> path prefixes, that would speed up the 'git log' run.

Perhaps Bloom filters are a good fit for the use case?

* Re: Feature request: secondary index by path fragment
  2024-05-06 23:22 ` Junio C Hamano
@ 2024-05-07  4:25   ` Patrick Steinhardt
  2024-05-07  5:38     ` Robin H. Johnson
  0 siblings, 1 reply; 4+ messages in thread
From: Patrick Steinhardt @ 2024-05-07  4:25 UTC
  To: Junio C Hamano; +Cc: Robin H. Johnson, Git Mailing List

On Mon, May 06, 2024 at 04:22:11PM -0700, Junio C Hamano wrote:
> "Robin H. Johnson" <robbat2@gentoo.org> writes:
> 
> > Gentoo has some tooling that boils down to repeated runs of 'git log -- somepath/'
> > via cgit as well as other shell tooling.
> > ...
> > I was wondering if Git could gain a secondary index of commits, based on
> > path prefixes, that would speed up the 'git log' run.
> 
> Perhaps Bloom filters are a good fit for the use case?

Yes, Bloom filters are the first thing that pops into my mind here, as
they are designed to solve exactly this problem. So if you rewrite your
commit graphs with `git commit-graph write --changed-paths --reachable`
you should hopefully see a significant speedup.
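
To illustrate why this helps, here is a minimal sketch of the idea, not Git's actual implementation. With `--changed-paths`, the commit-graph stores a small per-commit filter of the paths changed relative to the first parent, so a path-limited walk can skip the expensive tree diff for commits whose filter says "definitely not touched". Everything below is an assumption made for illustration: the class `ChangedPathFilter`, the helper `commits_touching`, the SHA-256 based hashing, the filter size, and the sample history are hypothetical and do not reflect Git's on-disk format.

```python
import hashlib


class ChangedPathFilter:
    """Toy Bloom filter over the paths changed by a single commit."""

    def __init__(self, changed_paths, num_bits=256, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array
        for path in changed_paths:
            # Add each leading directory too, so directory queries also match.
            parts = path.split("/")
            for depth in range(1, len(parts) + 1):
                self._add("/".join(parts[:depth]))

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def _add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_touch(self, path):
        """False means definitely untouched; True means maybe touched."""
        key = path.strip("/")
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def commits_touching(history, filters, path):
    """Walk history, doing the expensive check only when the filter says maybe."""
    path = path.strip("/")
    hits = []
    for commit_id, changed_paths in history:
        if not filters[commit_id].might_touch(path):
            continue  # definite miss: no tree diff needed
        # Stand-in for the real tree diff, which weeds out false positives.
        if any(p == path or p.startswith(path + "/") for p in changed_paths):
            hits.append(commit_id)
    return hits


# Made-up history, newest first: (commit ID, paths changed vs. first parent).
history = [
    ("c3", ["sys-apps/pv/Manifest"]),
    ("c2", ["dev-lang/python/python-3.12.ebuild"]),
    ("c1", ["sys-apps/pv/pv-1.8.ebuild", "README"]),
]
filters = {cid: ChangedPathFilter(paths) for cid, paths in history}
print(commits_touching(history, filters, "sys-apps/pv"))  # ['c3', 'c1']
```

A Bloom filter never gives a false negative, so skipping on "no" is always safe. A false positive only costs one unnecessary diff, which is why the result stays exact.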

It does surface some usability issues though:

  - There is no easy way to enable the computation of bloom filters via
    configuration, to the best of my knowledge.

  - How would a non-Git-expert know?

It makes me wonder whether we can maybe enable generation of Bloom
filters by default. The biggest downside is of course that writing
commit graphs becomes slower. But that should happen in the background
for normal users anyway, and most forges probably hand-roll maintenance
and thus wouldn't care.

Is there any other reason I'm missing why those are not written by
default?

Patrick


* Re: Feature request: secondary index by path fragment
  2024-05-07  4:25   ` Patrick Steinhardt
@ 2024-05-07  5:38     ` Robin H. Johnson
  0 siblings, 0 replies; 4+ messages in thread
From: Robin H. Johnson @ 2024-05-07  5:38 UTC
  To: Patrick Steinhardt, Git Mailing List; +Cc: Junio C Hamano, Robin H. Johnson

On Tue, May 07, 2024 at 06:25:08AM +0200, Patrick Steinhardt wrote:
> On Mon, May 06, 2024 at 04:22:11PM -0700, Junio C Hamano wrote:
> > "Robin H. Johnson" <robbat2@gentoo.org> writes:
> > 
> > > Gentoo has some tooling that boils down to repeated runs of 'git log -- somepath/'
> > > via cgit as well as other shell tooling.
> > > ...
> > > I was wondering if Git could gain a secondary index of commits, based on
> > > path prefixes, that would speed up the 'git log' run.
> > 
> > Perhaps Bloom filters are a good fit for the use case?
> 
> Yes, Bloom filters are the first thing that pops into my mind here, as
> they are designed to solve exactly this problem. So if you rewrite your
> commit graphs with `git commit-graph write --changed-paths --reachable`
> you should hopefully see a significant speedup.

Good news & bad news.
"git log -- sys-apps/pv >/dev/null" as my testcase from before:
The fast system (2.45.0) went from 11 seconds to ~1 second!
The slow system (2.44.0) went from 45 seconds to 49 seconds :-(.

I'll try to trace down why one system slowed down.

commit-graph command:
fast: 1m10s
slow: 3m43s
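
A comparison like the one above could be scripted roughly as follows. This is a sketch under stated assumptions rather than the procedure actually used for these numbers: the helper names `time_command` and `compare` are hypothetical, it assumes `git` is on `PATH`, and the "before" figure is only meaningful if the repository has no changed-path filters yet.

```python
import subprocess
import time


def time_command(argv, cwd=None, runs=3):
    """Run argv `runs` times, discard stdout, return the best wall time in seconds."""
    best = None
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, cwd=cwd, stdout=subprocess.DEVNULL, check=True)
        elapsed = time.perf_counter() - start
        best = elapsed if best is None else min(best, elapsed)
    return best


def compare(repo, path):
    """Time `git log -- path` before and after writing changed-path filters."""
    log_cmd = ["git", "log", "--", path]
    before = time_command(log_cmd, cwd=repo)
    # Same command as suggested earlier in the thread.
    subprocess.run(
        ["git", "commit-graph", "write", "--changed-paths", "--reachable"],
        cwd=repo,
        check=True,
    )
    after = time_command(log_cmd, cwd=repo)
    return before, after
```

Calling `compare(repo, "sys-apps/pv")` with `repo` set to a local clone returns the two timings as a `(before, after)` tuple. Taking the best of several runs reduces noise from cold caches.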

> It makes me wonder whether we can maybe enable generation of Bloom
> filters by default. The biggest downside is of course that writing
> commit graphs becomes slower. But that should happen in the background
> for normal users anyway, and most forges probably hand-roll maintenance
> and thus wouldn't care.
Most repos are also MUCH smaller than this, so it should be safe to
enable.


-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
