Git Mailing List Archive mirror
* [RFC] Available disk space when automatic garbage collection kicks in
@ 2024-04-08 16:29 Dragan Simic
  2024-06-12 16:25 ` Dragan Simic
  0 siblings, 1 reply; 4+ messages in thread
From: Dragan Simic @ 2024-04-08 16:29 UTC
  To: git

Hello all,

A few days ago I noticed a rather unusual issue, but still
a realistic one.  When automatic garbage collection kicks in,
as a result of gc.auto > 0, which is also the default, the
local repository can be left in a rather strange state if there
isn't enough free space available on the respective filesystem
for writing the objects, etc.

It might be a good idea to estimate the required amount of free
filesystem space before starting the garbage collection, be it
automatic or manual, and refuse the operation if there isn't
enough free space available.
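
As a rough illustration of such a pre-check, here is a minimal Python sketch.
It is not how Git works today, and the heuristic is only an assumption made
for the example: it treats the current size of the object store as the
amount of free space a repack might need, on the reasoning that new packs
are written before the old ones are removed.  The function names and the
safety_factor parameter are hypothetical.

```python
import os
import shutil


def object_store_size(git_dir):
    """Total size in bytes of the files under <git_dir>/objects."""
    total = 0
    for root, _dirs, files in os.walk(os.path.join(git_dir, "objects")):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # A file may vanish while we walk (e.g. a concurrent repack).
                pass
    return total


def enough_space_for_gc(git_dir, safety_factor=1.0):
    """Assumed heuristic: free space must cover the object store size
    multiplied by safety_factor.  Returns (ok, needed, free)."""
    needed = int(object_store_size(git_dir) * safety_factor)
    # disk_usage() reports the space available on the filesystem
    # that holds git_dir.
    free = shutil.disk_usage(git_dir).free
    return free >= needed, needed, free
```

A caller could run enough_space_for_gc(".git") and skip the collection when
the first element of the result is False.  The number reported by the
operating system is only a snapshot, so other writers can still consume the
space afterwards.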

As a note, the need_to_gc() function already does something a bit
similar with the available system RAM.

Any thoughts?


* Re: [RFC] Available disk space when automatic garbage collection kicks in
  2024-04-08 16:29 [RFC] Available disk space when automatic garbage collection kicks in Dragan Simic
@ 2024-06-12 16:25 ` Dragan Simic
  2024-06-12 17:04   ` rsbecker
  0 siblings, 1 reply; 4+ messages in thread
From: Dragan Simic @ 2024-06-12 16:25 UTC
  To: git

[Maybe this RFC deserves a "bump", so let me try.]



* RE: [RFC] Available disk space when automatic garbage collection kicks in
  2024-06-12 16:25 ` Dragan Simic
@ 2024-06-12 17:04   ` rsbecker
  2024-06-12 17:25     ` Dragan Simic
  0 siblings, 1 reply; 4+ messages in thread
From: rsbecker @ 2024-06-12 17:04 UTC
  To: 'Dragan Simic', git


I am not sure there is a good portable way of reliably doing this using OS
APIs, particularly with virtual disks and shared file sets. An edge
condition would be setting up a separate file set for content inside .git
for massive repositories, so taking an estimate in the working index would
not fix the above.

It might be useful to add a configuration item like:

gc.reserve = size   # possibly with mb, kb, gb, tb, or some other suffix

indicating how much space must be available to reserve prior to starting the
operation.
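
To make the suffix idea concrete, here is a small Python sketch of a size
parser.  The gc.reserve setting is only a proposal in this thread, and the
accepted spellings below (an optional k, m, g, or t, optionally followed by
"b", case-insensitive, scaling by powers of 1024) are an assumption chosen
for the example.  Git's existing size-valued settings document the k, m, and
g suffixes.

```python
import re

# Multipliers for the assumed suffixes (powers of 1024).
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_size(text):
    """Parse strings such as '512', '64k', '10mb' or '1 GB' into bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([kmgt]?)b?\s*", text.lower())
    if match is None:
        raise ValueError(f"invalid size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]
```

For instance, parse_size("10mb") yields 10 * 1024 * 1024 bytes, and an
unrecognized value such as "ten" raises ValueError instead of being guessed.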

Then create a file (with real content) inside .git (or .git/objects) with
the reserved size. If the file cannot be constructed, gc gets suppressed.
This can happen for reasons other than size - permissions, for example. Note
also that some file systems do not actually allocate the entire space when
only EOF is set, so that technique, while fast, will not work portably.

After the reserve works, it can be removed (and hopefully NFS will properly
close it), provided a lock is put in place, followed by gc running. It
might be useful to do this even on a non-auto gc. While this can be
expensive (writing a block of stuff twice), it is safer this way.
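
The reserve-then-remove step could look roughly like the Python sketch
below.  It is only an illustration under stated assumptions: the file name
"gc-reserve.tmp" and the function try_reserve are hypothetical, the lock
mentioned above is not implemented, and random bytes are written rather
than just extending the file so that a sparse file is not created.

```python
import os


def try_reserve(directory, size, chunk=1 << 20):
    """Write `size` bytes of real data into a temporary file inside
    `directory`, then delete it.  Returns True if the write succeeded."""
    path = os.path.join(directory, "gc-reserve.tmp")  # hypothetical name
    # One random block, reused for every write.  Filesystems that
    # deduplicate blocks may still store it only once.
    block = os.urandom(chunk)
    try:
        with open(path, "wb") as f:
            remaining = size
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(block[:n])
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # ask the OS to actually write the data out
        return True
    except OSError:
        # Out of space, missing directory, permissions, and so on.
        return False
    finally:
        try:
            os.remove(path)
        except OSError:
            pass
```

A wrapper would call try_reserve(".git/objects", reserve_bytes) while
holding the gc lock and skip the collection on False.  Without the lock,
another process could use up the space between the removal and the start of
gc.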

Just a thought.

Randall.



* Re: [RFC] Available disk space when automatic garbage collection kicks in
  2024-06-12 17:04   ` rsbecker
@ 2024-06-12 17:25     ` Dragan Simic
  0 siblings, 0 replies; 4+ messages in thread
From: Dragan Simic @ 2024-06-12 17:25 UTC
  To: rsbecker; +Cc: git

Hello Randall,


Thanks for your response!

One of the troubles with introducing "gc.reserve" is that it would
probably be used by advanced users only, who may already turn automatic
garbage collection off for their repositories on filesystems without
enough free space for the garbage collection to succeed.  Another issue
is that the on-disk footprint of large repositories can grow significantly
over time, so rather frequent updates to the "gc.reserve" values would be
needed.

There are even more issues, which you already mentioned...  One of them
is the additional time required to create a large file, and another is
the additional wear that creating a large temporary file puts on
flash-based storage.  Moreover, if the total block usage of an underlying
SSD gets close to 100% after the large temporary file is created, we'd be
putting that SSD in a rather unfavorable position because no TRIM operation
may be performed on that large file when it gets removed, and we'd then
"hammer" the SSD with a whole lot of small writes.

