From: Calvin Wan <calvinwan@google.com>
To: Git Mailing List <git@vger.kernel.org>
Subject: [RFD] Libification proposal: separate internal and external interfaces
Date: Tue, 2 Apr 2024 07:18:51 -0700
Message-ID: <CAFySSZAB09QB7U6UxntK2jRJF0df5R7YGnnLSsYc9MYhHsBhWA@mail.gmail.com>
This proposal was originally written by Kyle Lippincott, but he’s
currently on vacation for the next two weeks so I’m helping start this
discussion for him (from here on out Kyle is the “I”).
TL;DR: I'm proposing that when creating a library for code from the Git
codebase, we have two interfaces to this library: the "internal" one
that the rest of the Git codebase uses, and the "external" one for use
by other projects. The external interface will have a different coding
style and platform support than the rest of the codebase.
When thinking about potential issues and complications with
libification, I encountered a few broad categories of issues, and I'd
like to list them briefly (edit: turns out I can't be brief to save my
life) and float a proposal that may help minimize them.
Definitions
-----------
- When I say "Git" or "the git executable/binary" or whatever, I'm
referring to "the collection of binaries, tests, etc. that are part of
the main git repo" unless I say otherwise.
- Similarly, when I say "internal" I mean "for use by <that collection
of programs>". When I say "external" I mean for use by stuff that's
not part of the Git repository.
Assumptions
-----------
- Libraries that we're providing can be either statically or dynamically
linked. Git will link statically to its own Git libraries. External
projects may use either.
- Git must continue to be compilable and usable on all platforms it's
currently supported on. Libification can't take that away. However,
since libification is producing new interfaces for new use cases,
there is no requirement that we make these new interfaces usable on
all platforms, especially at first.
- We'd like as little churn and "uglification" of the main codebase as
possible.
Issues
------
- Symbol name collisions: Since C doesn't have namespacing or other
official name mangling mechanisms, all of the symbols inside of the
library that aren't static are going to be at risk of colliding with
symbols in the external project. This is especially a problem for
common symbols like "error()".
- Header files: This is actually several related problems:
  - The Git codebase's header files assume that anything that's brought
    in via <git-compat-util.h> is available; this includes system header
    files, but also macro definitions, including ones that change how
    various headers behave. Example: _GNU_SOURCE and
    _FILE_OFFSET_BITS=64 cause headers like <unistd.h> to change
    behavior; _GNU_SOURCE makes it provide different/additional
    functionality, and _FILE_OFFSET_BITS=64 makes types like `off_t` be
    64-bit (on some platforms it might be 32-bit without this define).
  - <git-compat-util.h> is expected to be included as the first header
    file in the translation unit, so as to make _GNU_SOURCE and similar
    #defines have the desired effect. If a translation unit (in an
    external library consumer) has already included <unistd.h>, we can't
    rely on them having had _GNU_SOURCE defined ahead of time.
  - We can't just `#include <git-compat-util.h>` at the top of our
    external interface headers.
  - Git's header files make regular use of inlining. We can't assume
    that external projects are going to use static linking, and we can't
    assume that external projects are going to use a C-compatible
    language (they might not use our header files at all), so inline
    functions seem risky at the interface layer.
- Compatibility: Since using code from the Git codebase as a library is
a new use case, we do not have the backward-compatibility requirements
that we do for Git itself. We should take full advantage of this, and
explicitly state what compatibility guarantees we are providing (or
not providing).
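To make the `off_t` concern above concrete, here is a minimal sketch in C. The struct and function names are hypothetical and exist only for illustration. It shows why a type whose width depends on the including translation unit's #defines is fragile at a library boundary, while a <stdint.h> type is not.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>	/* off_t; its width may depend on _FILE_OFFSET_BITS */

/*
 * Hypothetical argument struct using off_t. Its layout is whatever the
 * including translation unit's feature macros say it is, so a library
 * and its consumer can disagree about it without any diagnostic.
 */
struct fragile_args {
	off_t offset;
};

/* Same idea with a fixed-width type: identical layout for everyone. */
struct stable_args {
	int64_t offset;
};

/* Width of the offset field as seen by *this* translation unit. */
size_t fragile_offset_width(void)
{
	return sizeof(((struct fragile_args *)0)->offset);
}

size_t stable_offset_width(void)
{
	return sizeof(((struct stable_args *)0)->offset);
}
```

On platforms where `off_t` defaults to 32 bits, compiling this file with and without `-D_FILE_OFFSET_BITS=64` changes what `fragile_offset_width()` returns, while `stable_offset_width()` stays at 8 either way.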
Proposal
--------
Let's have a distinction between the "internal" interface (used by Git),
and the "external" interface (used by everyone else). The "external"
interface has several differences from the rest of the git codebase:
- Minimal. Only include symbols and types that we explicitly want to be
part of the interface.
  - This is both for API evolution abilities and providing a "well-lit
    path" to usage. Internal header files may have a lot of similar
    but slightly different functions that can be very confusing, or
    are highly specialized.
  - Most languages will not be able to include our headers. Reducing
    the interface to the minimum necessary means it's easier to
    identify when the interfaces change and update the
    non-C-compatible-language bindings.
  - The external interface should have as little code/new
    functionality as possible. All actual functionality should be in
    the internal interface(s).
- No inline functions. This is similar to minimal. We should put as
little as possible in the header files, especially since many use
cases involve using the library from a language that can't even
#include them at all.
- Self Contained. The header files must work if they are the first/only
#include in the external project. They must include everything they
need, and not assume it was already handled for them.
- Tolerant. The header files probably won't be the first/only #include
in the external project's translation unit, and they should still
work. This means not using types like `off_t` or `struct stat` in the
interfaces provided, since their sizes are dependent on the execution
environment (what's been included, #defines, CFLAGS, etc.)
- Non-interfering. Our header files must not change fundamental things
about the execution environment. This means they must not do things
like `#define _GNU_SOURCE` or `#define _FILE_OFFSET_BITS 64`.
- Limited Platform Compatibility. The external interfaces are able to
assume that <stdint.h> and other C99 (or maybe even C11+)
functionality exists and use it immediately, without weather balloons
or #ifdefs. If some platform requires special handling, that platform
isn't supported, at least initially.
- Non-colliding. Symbol names in these external interface headers should
have a standard, obvious prefix as a manual namespacing solution.
Something like `gitlib_`. (Prior art: cURL uses a `curl_` prefix for
externally-visible symbols, libgit2 uses `git_<module_name>_foo`
names; for internal symbols, they use `Curl_` and `git_<module>__foo`
(with a double underscore), respectively)
- Translating. The external interface provides "external" symbol names,
and potentially more compatible function interfaces than the internal
interface does, and exists to translate from one domain to another.
Most functions in the external interface will be just a single call to
the internal interface. Examples:
  - Internal interface is `void foo();`; external interface would be
    `void gitlib_foo() { foo(); }`
  - Internal interface is `void foo(off_t val);`; external interface
    could be `void gitlib_foo(int64_t val) { foo(val); }` -- here we
    accept int64_t instead of off_t due to the issues around the size
    of off_t
  - Internal interface is `void foo(struct strbuf *s);`; the external
    interface might be `void gitlib_foo(const char *s, size_t s_len) {
    struct strbuf sb; strbuf_init(&sb, s_len + 1); strbuf_add(&sb, s,
    s_len); foo(&sb); strbuf_release(&sb); }` -- since strbufs own the
    memory they hold, strings that come via the external interface
    might need to be copied to be memory safe.
  - Internal interface was `void foo();` but gained a new parameter.
    We don't need to expose this parameter in the external interface,
    and instead can just use a sensible default. External interface
    can remain `void gitlib_foo() { foo(NULL); }`
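A minimal sketch of the translating layer described in the list above, under stated assumptions: `foo`, `bar`, and the `gitlib_` wrappers are placeholder names, the internal functions are stubbed so the snippet compiles on its own, and the range check with an `int` return value is my addition rather than part of the proposal.

```c
#include <stddef.h>
#include <stdint.h>	/* int64_t: same width no matter what the consumer defined */
#include <sys/types.h>	/* off_t: only the implementation side needs this */

/* ---- internal interface (stubs standing in for real Git code) ---- */

static off_t last_offset;	/* records what foo() received */
static int bar_used_default;	/* set when bar() fell back to its default */

void foo(off_t val)
{
	last_offset = val;
}

/* Pretend bar() recently gained an optional parameter. */
void bar(const char *opts)
{
	bar_used_default = (opts == NULL);
}

/* ---- external interface: prefixed names, fixed-width types ---- */

/*
 * Accept int64_t and narrow to off_t inside the library, where the
 * real width of off_t is known. Returns -1 if the value does not fit
 * (an assumed error convention for this sketch).
 */
int gitlib_foo(int64_t val)
{
	off_t narrowed = (off_t)val;

	if ((int64_t)narrowed != val)
		return -1;
	foo(narrowed);
	return 0;
}

/* The new internal parameter is not exposed; pass a sensible default. */
void gitlib_bar(void)
{
	bar(NULL);
}
```

A consumer would only ever see the `gitlib_foo(int64_t)` and `gitlib_bar(void)` declarations, so changes to the internal signatures do not have to ripple into bindings for other languages.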
Proof of Concept
----------------
I think we should continue with the git-std-lib work as a manual
separation of the .c files and associated header files that comprise the
very lowest level of functionality in the git codebase. This manual
separation would only produce a library with an "internal" interface. We
should also start to apply these ideas by defining an "external"
interface which has a subset of the functionality in git-std-lib.
Automatic symbol hiding
-----------------------
One of the main driving forces behind my proposal above is avoiding
significant churn in the git codebase, for example needing to rename
every function in the codebase that's not static. While many function
names are unlikely to collide, such as `parse_oid_hex`, others are
significantly more likely, like `error` or `hex_to_bytes`. Needing to
rename all "plausible" collisions to things that are unlikely to
collide, like `GIT_error` or `GIT_hex_to_bytes` is tedious, error prone,
and unpleasant.
I have possibly discovered a truly remarkable solution, but this
footnote is too small to contain it. Wait, no it's not. This isn't fully
tested yet, but has shown promise in my initial tests using clang on a
Linux machine.
- Compile the "internal" interface(s) and all supporting code with
`-fvisibility=hidden` to produce .o files for each .c file
- Compile the "external" interface(s) without hiding the symbols
- Produce a .a file that contains that code, for use by git itself
- "Partially link" everything, using `ld -r`, to produce a single .o
file
- Use `objcopy --localize-hidden` to actually hide the internal symbols
from the "partially linked" .o file
This should leave us with two static libraries: one that has the symbols
marked as "hidden" but still usable, for use in git itself, and one that
contains the external interface, but doesn't expose the hidden symbols.
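For readers who have not used symbol visibility before, the following C sketch shows the same idea at the level of individual declarations. It is not the build recipe above: it swaps in the GCC/clang `visibility` attribute in place of the `-fvisibility=hidden` flag, the `GITLIB_HIDDEN`/`GITLIB_EXPORT` macros are hypothetical, and the body of `hex_to_bytes` is a stand-in written for this example, not Git's implementation.

```c
#include <stddef.h>

/* Assumption: GCC or clang. Other compilers get empty macros. */
#if defined(__GNUC__)
#define GITLIB_HIDDEN __attribute__((visibility("hidden")))
#define GITLIB_EXPORT __attribute__((visibility("default")))
#else
#define GITLIB_HIDDEN
#define GITLIB_EXPORT
#endif

/* Internal helper with a collision-prone name, kept out of the exported set. */
GITLIB_HIDDEN int hex_to_bytes(unsigned char *out, const char *hex, size_t len);

/* External wrapper: the only symbol a consumer is meant to see. */
GITLIB_EXPORT int gitlib_hex_to_bytes(unsigned char *out, const char *hex,
				      size_t len);

static int hex_digit(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Decode len bytes (2 * len hex digits); 0 on success, -1 on bad input. */
int hex_to_bytes(unsigned char *out, const char *hex, size_t len)
{
	for (; len; len--, hex += 2) {
		int hi = hex_digit(hex[0]);
		int lo = hi < 0 ? -1 : hex_digit(hex[1]);

		if (hi < 0 || lo < 0)
			return -1;
		*out++ = (unsigned char)((hi << 4) | lo);
	}
	return 0;
}

int gitlib_hex_to_bytes(unsigned char *out, const char *hex, size_t len)
{
	return hex_to_bytes(out, hex, len);
}
```

Hidden symbols remain callable from other object files in the same link, which is what lets git itself keep using the internal names. In a relocatable object they only stop being visible to outside code once something like the `objcopy --localize-hidden` step has run.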
There may be similar solutions possible on other platforms, or there may
not, and we may need to do the great renaming (either in the code
itself, or via something like a giant set of linker scripts). While my
proposal to have a separation between the internal and external
interfaces is a requirement for making this automatic symbol hiding
solution work, I don't think that a failure to make the automatic symbol
hiding solution work means that we shouldn't have the internal/external
split. It's only one contributing point in favor of having the
internal/external split.