Util-Linux Archive mirror
From: Karel Zak <kzak@redhat.com>
To: Mikko Rantalainen <mikko.rantalainen@peda.net>
Cc: util-linux@vger.kernel.org
Subject: Re: RFE: hardlink: support specifying max_size too?
Date: Wed, 24 Apr 2024 11:37:03 +0200
Message-ID: <20240424093703.tiyoxp2guzfboq3h@ws.net.home>
In-Reply-To: <0aa615c2-8e17-4eb3-9a25-c8af39b35d81@peda.net>


 Hi Mikko,

On Tue, Apr 23, 2024 at 04:58:10PM +0300, Mikko Rantalainen wrote:
> I have huge directory hierarchies that I would like to run hardlink
> against but comparing a lot of files against each other results in high
> RAM usage because so much of the file metadata is kept in memory.

Good point. I have tried to optimize the content comparison (using the
kernel crypto API), but the binary tree is still the original
implementation and there is probably room for further optimization.

Perhaps storing all 'struct stat' information for every file is
excessive, as there is information that we do not need (such as atime,
ctime, st_blksize, st_blocks). Some information is only necessary if
respect_{mode,owner,time,xattrs} are enabled.
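
As a rough sketch of what a slimmer record could look like (this is not
the current hardlink code; struct file_meta and file_meta_from_stat()
are hypothetical names, and the exact field set is an assumption):

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical compact record; not the struct hardlink uses today. */
struct file_meta {
	dev_t   dev;	/* needed: hard links cannot cross devices */
	ino_t   ino;	/* needed: detect files that are already linked */
	off_t   size;	/* needed: first comparison key */
	nlink_t nlink;
	/* The fields below matter only with respect_{mode,owner,time}. */
	mode_t  mode;
	uid_t   uid;
	gid_t   gid;
	time_t  mtime;
};

/* Copy only the fields we keep; atime, ctime, st_blksize and
 * st_blocks are dropped on purpose. */
static void file_meta_from_stat(struct file_meta *m, const struct stat *st)
{
	assert(m && st);
	memset(m, 0, sizeof(*m));
	m->dev   = st->st_dev;
	m->ino   = st->st_ino;
	m->size  = st->st_size;
	m->nlink = st->st_nlink;
	m->mode  = st->st_mode;
	m->uid   = st->st_uid;
	m->gid   = st->st_gid;
	m->mtime = st->st_mtime;
}
```

The optional fields could be moved to a separately allocated block
that exists only when the matching respect_* option is set.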

The tree also contains paths for all the files. If you have many
subdirectories or long directory names, there is a lot of duplicate
data in the binary tree. One possible solution could be to keep
directory paths in a separate hash table and only store pointers to
the names table in the metadata tree.
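
A minimal sketch of such a names table, assuming a fixed-size chained
hash table (dir_table, dir_intern() and file_entry are hypothetical
names, nothing like this exists in hardlink yet):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DIR_BUCKETS 1024

struct dir_name {
	struct dir_name *next;	/* next entry in the same bucket */
	char path[];		/* directory path, NUL-terminated */
};

struct dir_table {
	struct dir_name *buckets[DIR_BUCKETS];
};

/* Simple multiplicative string hash (the well-known "times 33" variant). */
static unsigned long hash_str(const char *s)
{
	unsigned long h = 5381;

	while (*s)
		h = h * 33 + (unsigned char) *s++;
	return h;
}

/* Return the shared copy of 'dir', adding it on first use.
 * Returns NULL if memory allocation fails. */
static const char *dir_intern(struct dir_table *t, const char *dir)
{
	size_t idx, len;
	struct dir_name *d;

	assert(t && dir);
	idx = hash_str(dir) % DIR_BUCKETS;

	for (d = t->buckets[idx]; d; d = d->next)
		if (strcmp(d->path, dir) == 0)
			return d->path;

	len = strlen(dir) + 1;
	d = malloc(sizeof(*d) + len);
	if (!d)
		return NULL;
	memcpy(d->path, dir, len);
	d->next = t->buckets[idx];
	t->buckets[idx] = d;
	return d->path;
}

/* A tree node would then hold a pointer plus the basename only. */
struct file_entry {
	const char *dir;	/* shared, owned by the dir_table */
	char *name;		/* basename */
};
```

With this layout every file in the same directory shares one copy of
the directory path, so the cost per file is a pointer and the basename.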

Another problem I see is that hardlink keeps the entire binary tree
in memory during the second stage, when it compares file contents in
the visitor() function. At that point we no longer need the tree
entries that are already known to be unique and will never take part
in a content comparison.
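
To illustrate the idea only: the sketch below uses a plain array
instead of the real binary tree and treats "unique" as "no other file
has the same size", which is a simplification (struct cand and
drop_unique_sizes() are hypothetical names). Entries with a unique
size are dropped before any content comparison would start:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>

struct cand {
	off_t size;
	const char *path;
};

static int cmp_size(const void *a, const void *b)
{
	const struct cand *x = a, *y = b;

	return (x->size > y->size) - (x->size < y->size);
}

/* Sort by size and keep only entries whose size occurs more than once.
 * Returns the new number of entries at the start of 'v'. */
static size_t drop_unique_sizes(struct cand *v, size_t n)
{
	size_t i = 0, out = 0;

	assert(v || n == 0);
	if (n == 0)
		return 0;
	qsort(v, n, sizeof(*v), cmp_size);

	while (i < n) {
		size_t j = i + 1;

		/* find the end of the run of equal sizes */
		while (j < n && v[j].size == v[i].size)
			j++;
		if (j - i > 1) {
			while (i < j)
				v[out++] = v[i++];
		} else {
			i = j;	/* unique size, never compared by content */
		}
	}
	return out;
}
```

After the call the caller could shrink the array with realloc() so the
memory for the dropped entries is really returned.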

> Could you add max_size (--maximum-size) option in addition to min_size
> (--minimum-size)? This would allow splitting the work into small
> fragments where hardlink only needs to process files in given range and
> immediately ignore all other files. Or it could be used to run full
> linking in multiple parallel tasks with sensible RAM requirements if you
> can run hardlink without size limitations (e.g. one task for 1–1MB
> files, another for 1MB–10MB and third task for files bigger than 10MB).

This is not a trivial task. It would be better to begin with
optimizing memory usage before implementing more invasive changes. 

I am unsure how you plan to compare all files if the metadata is
stored in multiple independent trees.

> It might also make sense to reorder the test for filesize and regex
> processing in inserter() because testing for size is probably faster
> because the stat() has already been made. Currently the stats.files is
> also increased for files that get ignored by size filter which may not
> be intentional.

Good point, please send a patch :-)
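
Something along these lines, as a sketch only (struct filter and
file_wanted() are hypothetical and do not match the current inserter()
code; max_size is the proposed option, not an existing one). The cheap
size checks run first, the regex second, and the counter is increased
only for files that pass both:

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

struct filter {
	off_t min_size;
	off_t max_size;		/* 0 means no upper limit (assumption) */
	const regex_t *exclude;	/* compiled exclude pattern, may be NULL */
};

/* Return true if the file should be inserted into the tree.
 * '*counted' is increased only for files that pass all filters. */
static bool file_wanted(const struct filter *f, const char *path,
			off_t size, unsigned long *counted)
{
	assert(f && path && counted);

	/* size is already known from stat(), so test it first */
	if (size < f->min_size)
		return false;
	if (f->max_size > 0 && size > f->max_size)
		return false;

	/* regex matching is the more expensive test, do it last */
	if (f->exclude && regexec(f->exclude, path, 0, NULL, 0) == 0)
		return false;

	(*counted)++;
	return true;
}
```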

> I think I could provide patches if I just know which Git repo I should
> use as the basis. Is https://github.com/util-linux/util-linux the
> correct one?

Yes, GitHub is the best repository. You can also use it for pull
requests and reviews.

My suggestion is to add debug messages to see where the problem is,
calculate the size of metadata, the size of paths, and the size of
calculated data checksums. Please share the results.
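
For example, a few counters like these would be enough to start with
(a sketch; struct mem_stats and both functions are hypothetical, and
the caller has to pass the real sizes of whatever it allocates):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct mem_stats {
	size_t n_files;
	size_t meta_bytes;	/* per-file metadata, e.g. struct stat */
	size_t path_bytes;	/* path strings including the final NUL */
	size_t csum_bytes;	/* cached content checksums */
};

/* Account for one file that was added to the tree. */
static void mem_stats_add(struct mem_stats *s, size_t meta_size,
			  const char *path, size_t csum_size)
{
	assert(s && path);
	s->n_files++;
	s->meta_bytes += meta_size;
	s->path_bytes += strlen(path) + 1;
	s->csum_bytes += csum_size;
}

/* Print a one-line summary, for example to stderr at exit. */
static void mem_stats_print(const struct mem_stats *s, FILE *out)
{
	assert(s && out);
	fprintf(out, "files: %zu, metadata: %zu B, paths: %zu B, "
		     "checksums: %zu B\n",
		s->n_files, s->meta_bytes, s->path_bytes, s->csum_bytes);
}
```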

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

