path: root/lib/PublicInbox/LeiMailSync.pm
Date       Commit message
2024-01-04 lei: MH: support inotify to detect updates
This should help us deal with MH sequence number packing and invalidating mail_sync.sqlite3.
2023-12-30 lei: support reading MH for convert+import+index
The MH format is widely-supported and used by various MUAs such as mutt and sylpheed, and a MH-like format is used by mlmmj for archives, as well. Locking implementations for writes are inconsistent, so this commit doesn't support writes, yet. inotify|EVFILT_VNODE watches aren't supported, yet, but that'll have to come since MH allows packing unused integers and renaming files.
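As a rough illustration of why packing matters: an MH folder stores each message in a file named by a positive integer, so gaps appear as messages are removed, and packing renames files to close them. The sketch below is Python for illustration only; `mh_message_numbers` is a hypothetical helper, not lei's reader.

```python
import os
import re
import tempfile

# MH message files are named by positive integers without leading zeros.
MH_MSG_RE = re.compile(r"[1-9][0-9]*")

def mh_message_numbers(folder):
    """Return the sorted message numbers found in an MH folder."""
    return sorted(int(n) for n in os.listdir(folder) if MH_MSG_RE.fullmatch(n))

# Demo folder with gaps (2 and 4..9 are unused) and a non-message file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("1", "3", "10", ".mh_sequences"):
        open(os.path.join(folder, name), "w").close()
    numbers = mh_message_numbers(folder)
```

After packing, the same three messages would be named 1, 2 and 3, which is why any sync state keyed on the old numbers has to be invalidated.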
2023-11-03 move read_all, try_cat, and poll_in to PublicInbox::IO
The IO package seems like a better home for I/O subs than the Git package. We lose the 60 second read timeout for `git cat-file --batch-*' processes since it's probably not necessary given how reliable the code has proven and things would fall over hard in other ways if the storage device were completely hosed.
2023-10-18 use read_all in more places to improve safety
`readline' ops may not detect errors on partial reads. This saves us some code to reduce cognitive overhead for readers. We'll also support reusing destination buffers so it can work more nicely with existing code.
2023-06-09 add compat package for List::Util::uniqstr
This will make it easier to switch in the far future while making callers easier-to-read (and more callers will be added). Anyways, Perl 5.26 is a long time away for enterprise users; but isolating compatibility code away can improve readability of code we actually care about in the meantime.
2023-04-20 lei_mail_sync: prepare to support SHA-256
I'm not sure how combining SHA-1 and SHA-256 in a single git repo will work, eventually. But this is an obvious place to do the right thing if we ever see a 64-byte hex string (unless git adds support for another hash which uses 64-byte hex string representations, which would break many assumptions elsewhere, too...).
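The length-based distinction described here can be sketched as follows. This is Python for illustration only; `hash_algo_for` and `oidhex_to_bin` are hypothetical names, and the rule (40 hex characters for SHA-1, 64 for SHA-256) is the only assumption taken from the message above.

```python
import re

HEX_RE = re.compile(r"[0-9a-f]+")

def hash_algo_for(oidhex):
    """Guess the git hash algorithm from the length of a hex object ID."""
    if not HEX_RE.fullmatch(oidhex):
        raise ValueError(f"not a lowercase hex string: {oidhex!r}")
    if len(oidhex) == 40:
        return "sha1"
    if len(oidhex) == 64:
        return "sha256"
    raise ValueError(f"unexpected OID length: {len(oidhex)}")

def oidhex_to_bin(oidhex):
    """Convert a validated hex OID to its binary form (20 or 32 bytes)."""
    hash_algo_for(oidhex)  # raises on anything that is not SHA-1/SHA-256 sized
    return bytes.fromhex(oidhex)
```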
2023-04-13 lei_mail_sync: cleanup stale/dangling fids if possible
I'm not sure how it happens or if/when it was fixed, but my earliest lei installations have hit some "E: fid=$fid for $oidhex unknown" messages on `lei import' invocations. This really should've enabled the foreign keys pragma to begin with; but we'll probably start using that in the future. For now, at least rely on a transaction to keep things consistent in SQLite.
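A minimal sketch of pruning dangling folder IDs inside one transaction, using Python's sqlite3 module. The schema here is a simplified assumption (only the `folders` table, the `blob2num` table and the `fid`, `uid` and `oidbin` names appear in this log), and `prune_dangling_fids` is a hypothetical helper rather than the code from this commit.

```python
import sqlite3

def prune_dangling_fids(dbh):
    """Delete blob2num rows whose fid no longer exists in folders."""
    with dbh:  # commits on success, rolls back if an exception is raised
        cur = dbh.execute(
            "DELETE FROM blob2num WHERE fid NOT IN (SELECT fid FROM folders)")
        return cur.rowcount

dbh = sqlite3.connect(":memory:")
dbh.executescript("""
CREATE TABLE folders (fid INTEGER PRIMARY KEY, loc BLOB NOT NULL UNIQUE);
CREATE TABLE blob2num (oidbin BLOB NOT NULL, fid INTEGER NOT NULL,
                       uid INTEGER NOT NULL);
""")
dbh.execute("INSERT INTO folders (fid, loc) VALUES (1, ?)",
            (b"imaps://example.com/INBOX;UIDVALIDITY=1",))
dbh.executemany("INSERT INTO blob2num VALUES (?, ?, ?)",
                [(b"\x01" * 20, 1, 1),    # valid: fid 1 exists
                 (b"\x02" * 20, 99, 7)])  # dangling: fid 99 does not
dbh.commit()
removed = prune_dangling_fids(dbh)
```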
2022-04-18 lei_mail_sync: explicit bind for old SQL_VARCHAR compat
This avoids repeated work for incremental "lei import" runs when users upgrade from 1.7 to current public-inbox.git (and eventually 1.8). We need the explicit bind_param for fallback calls because previous bind_param calls are "sticky" for a given statement handle. The DBI(3pm) manpage states:

    The data type is 'sticky' in that bind values passed to execute() are bound with the data type specified by earlier bind_param() calls, if any. Portable applications should not rely on being able to change the data type after the first "bind_param" call.
2022-04-05 lei: always open mail_sync.sqlite3 R/W
This will make transparently upgrading from 1.7.0 -> 1.8.x easier. Only a single user has access to mail_sync.sqlite3, and R/W at the kernel-level is required for WAL, anyways.
2022-04-02 lei_mail_sync: store OIDs and Maildir filenames as blobs
DBD::SQLite doesn't seem to use SQL_BLOB automatically, which can lead to ambiguity in some cases (especially interoperating with other tools). Downgrading to lei 1.7.0 will cause problems, but upgrading appears transparent after weeks of tests.
2022-04-02 lei_mail_sync: ensure URLs and folder names are stored as binary
Apparently leaving {sqlite_unicode} unset isn't enough, and there are subtle differences where BLOBs are stored differently than TEXT when dealing with binary data. We also want to avoid odd cases where SQLite will attempt to treat a number-like value as an integer. This should avoid problems in case non-UTF-8 URLs and pathnames are used. They'll automatically be upgraded if not, but downgrades to older lei would cause duplicates to appear.
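The TEXT/BLOB distinction can be observed directly in SQLite: the same bytes stored as TEXT and as BLOB are different values, so a lookup using one storage class never matches a row stored with the other. The sketch below uses Python's sqlite3 module (where `str` binds as TEXT and `bytes` as BLOB) and a simplified, assumed `folders` table; it is not the DBD::SQLite code from this commit.

```python
import sqlite3

dbh = sqlite3.connect(":memory:")
# No declared column type, so SQLite stores values exactly as bound.
dbh.execute("CREATE TABLE folders (fid INTEGER PRIMARY KEY, loc)")

name = "maildir:/home/user/Mail"
dbh.execute("INSERT INTO folders (loc) VALUES (?)", (name,))           # TEXT
dbh.execute("INSERT INTO folders (loc) VALUES (?)", (name.encode(),))  # BLOB

types = [r[0] for r in
         dbh.execute("SELECT typeof(loc) FROM folders ORDER BY fid")]

def count_matching(value):
    """Count rows whose loc equals value, including its storage class."""
    return dbh.execute("SELECT COUNT(*) FROM folders WHERE loc = ?",
                       (value,)).fetchone()[0]

text_hits = count_matching(name)           # matches only the TEXT row
blob_hits = count_matching(name.encode())  # matches only the BLOB row
```

Each lookup finds exactly one of the two rows, which is consistent with the note above that mixing old and new versions could produce duplicates.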
2022-01-31 rewrite Linux nodatacow use in pure Perl w/o system
btrfs is Linux-only at the moment (and likely to remain that way for practical purposes). So rely on Linux ABI stability and use the `syscall' and `ioctl' perlops rather than relying on Inline::C. Inline::C (and gcc||clang) are monstrous dependencies which we can't expect users to have. This makes supporting new architectures more difficult, but new architectures come along rarely and this reduces the burden for the majority of Linux users on popular architectures (while still avoiding the distribution of pre-built binaries).
Link: https://public-inbox.org/meta/YbCPWGaJEkV6eWfo@codewreck.org/
2021-10-22 lei_mail_sync: mv_src: use transaction, check UNIQUE
We need a transaction across two SQL statements so readers (which don't use flock) will see the result as atomic. This may help against some occasional test failures I'm seeing from t/lei-auto-watch.t and t/lei-watch.t, or make the problem more apparent.
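A minimal sketch of wrapping two statements in one transaction so that other connections see either the old name or the new one, never both or neither. It uses Python's sqlite3 module; the `blob2name` table with `oidbin`, `fid` and `name` columns follows names mentioned elsewhere in this log, while the exact schema, the `INSERT OR IGNORE` handling of the UNIQUE constraint and this `mv_src` signature are assumptions for illustration.

```python
import sqlite3

def mv_src(dbh, fid, oidbin, old_name, new_name):
    """Record a rename of one message file; returns rows removed (0 or 1)."""
    if old_name == new_name:
        return 0
    with dbh:  # both statements commit together or not at all
        # The new name may already be known; the UNIQUE constraint makes
        # a duplicate insert a no-op instead of an error.
        dbh.execute("INSERT OR IGNORE INTO blob2name (oidbin, fid, name) "
                    "VALUES (?, ?, ?)", (oidbin, fid, new_name))
        cur = dbh.execute("DELETE FROM blob2name "
                          "WHERE oidbin = ? AND fid = ? AND name = ?",
                          (oidbin, fid, old_name))
        return cur.rowcount

dbh = sqlite3.connect(":memory:")
dbh.execute("CREATE TABLE blob2name (oidbin BLOB NOT NULL, "
            "fid INTEGER NOT NULL, name BLOB NOT NULL, "
            "UNIQUE (oidbin, fid, name))")
oid = bytes.fromhex("aa" * 20)
dbh.execute("INSERT INTO blob2name VALUES (?, ?, ?)",
            (oid, 1, b"1700000000.M1.host"))
dbh.commit()
moved = mv_src(dbh, 1, oid, b"1700000000.M1.host", b"1700000000.M1.host:2,S")
names = [r[0] for r in dbh.execute("SELECT name FROM blob2name")]
```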
2021-10-19 lei_mail_sync: show non-matching SHA
It could prove useful for diagnosing bugs (either on our end or an MUA's), or storage device failures.
2021-10-13 lei: use standard warn() in more places
warn() is easier to augment with context information, and frankly unavoidable in the presence of 3rd-party libraries we don't control.
2021-10-13 index: optimize after all SQLite DB commits
This covers v1 inboxes, as well. We also guard the execution since "PRAGMA optimize" was only introduced in SQLite 3.18.0 (2017-03-30).
2021-10-12 sqlite: PRAGMA optimize on close
As recommended by SQLite documentation[1]:

    To achieve the best long-term query performance without the need to do a detailed engineering analysis of the application schema and SQL, it is recommended that applications run "PRAGMA optimize" (with no arguments) just before closing each database connection.

Hopefully that works for our use cases and can make things faster for us.
[1] https://www.sqlite.org/pragma.html#pragma_optimize
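The recommendation amounts to one extra statement before closing a handle. A minimal Python sketch, assuming a version guard for SQLite releases older than 3.18.0 (the release that introduced the pragma); `close_optimized` is a hypothetical helper name.

```python
import sqlite3

def close_optimized(dbh):
    """Run PRAGMA optimize if the linked SQLite supports it, then close.

    Returns True if the pragma was issued.
    """
    ran = False
    if sqlite3.sqlite_version_info >= (3, 18, 0):
        dbh.execute("PRAGMA optimize")
        ran = True
    dbh.close()
    return ran

dbh = sqlite3.connect(":memory:")
dbh.execute("CREATE TABLE t (a INTEGER PRIMARY KEY, b)")
ran = close_optimized(dbh)
```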
2021-09-21 lei: various completion improvements
"lei export-kw" no longer completes for anonymous sources. More commands use "lei refresh-mail-sync" as a basis for their completion work, as well. ";AUTH=ANONYMOUS@" is stripped from completions since it was preventing bash completion from working on AUTH=ANONYMOUS IMAP URLs. I'm not sure if there's a better way, but all of our code works fine without specifying AUTH=ANONYMOUS as a command-line arg. Finally, we fall back to using more candidates if none can be found, allowing multiple URLs to be completed.
2021-09-21 lei lcat: support NNTP URLs
NNTP URLs are probably more prevalent in public message archives than IMAP URLs.
2021-09-21 lei: simplify internal arg2folder usage
We can set opt->{quiet} for (internal) 'note-event' command to quiet ->qerr, since we use ->qerr everywhere else. And we'll just die() instead of setting a ->{fail} message, since eval + die are more in line with the rest of our Perl code.
2021-09-21 lei_mail_sync: account for non-unique cases
NNTP servers, IMAP servers, and various MUAs may recycle "unique" identifiers due to software bugs or careless BOFHs. Warn about them, but always be prepared to account for them.
2021-09-21 lei inspect: support NNTP URLs
No reason not to support them, since there are more public-inbox-nntpd instances than -imapd instances, currently.
2021-09-18 lei_mail_sync: set nodatacow on btrfs
As with other SQLite3 databases, copy-on-write with files experiencing random writes leads to write amplification and low performance.
2021-09-18 lei_mail_sync: rely on flock(2), avoid IPC
Since 44917fdd24a8bec1 ("lei_mail_sync: do not use transactions"), relying on lei/store to serialize access was a pointless endeavor. Rely on flock(2) to serialize multiple writers since (in my experience) it's the easiest way to deal with parallel writers when using SQLite. This allows us to simplify existing callers while speeding up 'lei refresh-mail-sync --all=local' by 5% or so.
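Serializing writers with flock(2) can be sketched as below: each writer takes an exclusive lock on a lock file before touching the database and releases it afterwards. This is Python for illustration (Unix-only, via `fcntl`); the `.flock` suffix, the `write_lock` helper and the simplified `folders` schema are assumptions, not details from this commit.

```python
import fcntl
import os
import sqlite3
import tempfile
from contextlib import contextmanager

@contextmanager
def write_lock(lock_path):
    """Hold an exclusive flock(2) on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other writers finish
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "mail_sync.sqlite3")

with write_lock(db_path + ".flock"):
    dbh = sqlite3.connect(db_path)
    dbh.execute("CREATE TABLE IF NOT EXISTS folders "
                "(fid INTEGER PRIMARY KEY, loc BLOB UNIQUE)")
    dbh.execute("INSERT OR IGNORE INTO folders (loc) VALUES (?)",
                (b"maildir:/tmp/results",))
    dbh.commit()
    dbh.close()
```

Readers in this sketch would open the database without taking the lock; only writers contend on it.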
2021-09-17 lei_mail_sync: don't hold statement handle into callback
This can cause readers and writers to conflict since the implicit transaction from SELECT in a LeiRefreshMailSync worker would block the LeiStore process.
2021-09-02 lei_mail_sync: do not use transactions
For lei-index to work in parallel with MUA access and upcoming inotify-based updates, mail_sync.sqlite3 needs to always be up-to-date to read-only worker processes (ahead of everything else). So rely on the default auto-commit behavior and hope SQLite WAL can reduce some of the overheads involved with writes.
2021-08-31 lei_mail_sync: set_src uses binary OIDs
Another step towards moving more of our internals to use binary OIDs to avoid needless conversions before hitting disk.
2021-08-31 lei_mail_sync: make rename_folder more robust
We need to account for past canonicalization errors and deal with cases which violate uniqueness constraints in mail_sync.sqlite3.
2021-08-31 lei_mail_sync: simplify group2folders
No need to loop when we can rely on grep.
2021-08-31 lei prune-mail-sync: handle --all (no args)
This still needs tests, but I noticed "--all" w/o "local" or "remote" was not working correctly since split() returned an empty array.
2021-08-31 lei_mail_sync: forget_folder: simplify code
No need to bump refcounts of {dbh} nor declare extra variables for a rarely-called function.
2021-08-25 lei_mail_sync: remove warning message from caller
We can afford to be liberal in what messages we accept internally, since LeiToMail uses a trailing slash internally.
2021-08-21 lei: implicitly watch all Maildirs it knows about
This allows MUA-made flag changes to Maildirs to be instantly read and acknowledged for future search results. In the future, it may be used to speed up --augment and --import-before (the default) with "lei q".
2021-08-05 lei export-kw: workaround race in updating Maildir locations
Inotify updates may simultaneously remove or update the location of a message, so ensure we at least have knowledge of the new location if the old one cannot be updated.
2021-07-25 lei_mail_sync: locations_for API uses oidbin for comparisons
Favor oidbin use internally to reduce internal memory traffic.
2021-06-09 lei_mail_sync: hoist out --all handling from export-kw
We'll be reusing it in other commands, too.
2021-06-09 lei/store: do eidx_init before creating R/W lms dbh
Sharing lms->{dbh} with eidx shards appears to be the cause of the "Issuing rollback() due to DESTROY without explicit disconnect() of DBD::SQLite::db handle" messages I've been seeing from "lei up".
2021-06-08 lei import: speed up repeated Maildir imports
On a 4-core CPU, this speeds up "lei import" on a largish Maildir inbox with 75K messages from ~8 minutes down to ~40s. Parallelizing alone did not bring any improvement and may even hurt performance slightly, depending on CPU availability. However, creating the index on the "fid" and "name" columns in blob2name yields us the same speedup we got. Parallelizing IMAP makes more sense due to the fact most IMAP stores are non-local and subject to network latency.
Followup-to: bdecd7ed8e0dcf0b45491b947cd737ba8cfe38a3 ("lei import: speed up kw updates for old IMAP messages")
2021-06-03 lei import: speed up kw updates for old IMAP messages
On a 4-core CPU, this speeds up "lei import" on a largish IMAP inbox with 75K messages from ~21 minutes down to 40s. Parallelizing with the new LeiImportKw WQ worker class gives a near-linear speedup and brought the runtime down to ~5:40. The new idx_fid_uid index on the "fid" and "uid" columns of blob2num in mail_sync.sqlite3 brought us the final speedup. An additional index on over.sqlite3#xref3(oidbin) did not help, since idx_nntp already exists and speeds up the new ->oidbin_exists internal API. I initially experimented with a separate "lei import-kw" command but decided against it since it's useless outside of IMAP+JMAP and would require extra cognitive overhead for both users and hackers. So LeiImportKw is just a WQ worker used by "lei import" and not its own user-visible command.
v2: fix ikw_done_wait arg handling (ugh, confusing API :x)
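The effect of an index such as idx_fid_uid can be checked with SQLite's EXPLAIN QUERY PLAN, which reports a full table scan before the index exists and an index search afterwards. The sketch below uses Python's sqlite3 module; the `blob2num` column set and the query shape are assumptions for illustration, and only the index name and its two columns come from the message above.

```python
import sqlite3

dbh = sqlite3.connect(":memory:")
dbh.execute("CREATE TABLE blob2num (oidbin BLOB NOT NULL, "
            "fid INTEGER NOT NULL, uid INTEGER NOT NULL)")

def plan(dbh):
    """Return the query plan text for a lookup by folder ID and UID."""
    rows = dbh.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT oidbin FROM blob2num WHERE fid = ? AND uid = ?",
        (1, 42)).fetchall()
    return " ".join(r[3] for r in rows)  # 4th column is the plan detail

before = plan(dbh)  # no index yet: SQLite scans the whole table
dbh.execute("CREATE INDEX idx_fid_uid ON blob2num (fid, uid)")
after = plan(dbh)   # the planner can now search via idx_fid_uid
```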
2021-06-01 lei_mail_sync: more debug info for uncommitted txn
I'm not actually sure if I hit an uncommitted transaction just now, it doesn't seem like it.
2021-05-30 lei import: import IMAP flag changes from old messages
This makes "lei import" behavior with IMAP folders more consistent with that with Maildir. Opening IMAP folders read-write with "SELECT" (instead of read-only with "EXAMINE") was necessary, since it lets an IMAP server communicate to us as to whether or not it's worth refetching IMAP flags of previously imported messages. Fetching UID+FLAGS only is one of the fastest IMAP operations with dovecot, our -imapd and presumably other common IMAP servers. It is issued by common MUAs such as mutt after every SELECT. Users may now rely on "lei import" exclusively to merge mail and keywords into lei/store, and "lei export-kw" to propagate keyword changes back to IMAP servers. A sticks-and-stones workflow for personal mailboxes is currently:

    lei import imaps://$MY_PERSONAL_INBOX
    lei q --mua=$MUA -o /tmp/results SEARCH TERMS...
    # do stuff from within $MUA to /tmp/results
    lei import /tmp/results # read keyword changes from MUA
    lei export-kw imaps://$MY_PERSONAL_INBOX
    # repeat when new stuff shows up in personal inbox

The next goal is to automate repeated imports + export-kw commands with inotify and IMAP IDLE.
2021-05-30 lei import|lcat: improve+fix single message IMAP support
lcat can now dump the memoized contents of entire IMAP folders, not just a single UID. It's now parallelized and pipelined for multiple lei2mail workers. Furthermore, various forms of JSON output work consistently with blob-only output, now. While working on this, I noticed NetReader was passing UID URLs to imap_each callbacks, which was causing mail_sync.sqlite3 to store UIDs in `folders'. That was clearly wrong, so it's now fixed.
2021-05-28 lei: handle a single IMAP message in most places
"lei import" can now import a single IMAP message via <imaps://example.com/MAILBOX/;UID=$UID>. Likewise, "lei inspect" can show the blob information for UID URLs and "lei lcat" can display the blob without network access if imported. "lei lcat" also gets rid of some unused code and supports "blob:$OIDHEX" syntax as described in the comments (and used by our "text" output format).
v2: enforce UID in URL, fail without
v3: fix error reporting (s/fail/child_error/)
2021-05-28 lei_mail_sync: debug code for uncommitted txn
I'm not 100% sure why, but "lei up" seems to cause uncommitted transaction errors. LeiToMail calls sto->set_sync_info, but LeiXSearch should call sto->done and lms_commit, so I'm not sure where the uncommitted transaction is coming from...
2021-05-25 lei forget-mail-sync: new command to drop sync information
Sometimes a user stops caring to sync an IMAP or Maildir folder, or wants to force a resync. Let them run this command to have lei forget all the sync information about the mail folder. This won't delete any stored messages in git, but will leave "lei index" users with dangling references.
2021-05-25 lei_mail_sync: args2folder: common folder lookup sub
This lets us have a more consistent UX for mapping easily-typed command-line arguments to canonical folder locations.
2021-05-24 lei_mail_sync: reject IMAP URLs w/o UIDVALIDITY
It's inappropriate to store sync information without UIDVALIDITY, so add an assertion to prevent it.
2021-05-24 lei inspect: use LeiMailSync->match_imap_url
Move match_imap_url into LeiMailSync so it can be used in more places, such as "lei inspect". Upcoming commands such as "lei forget-mail-sync" and {add,forget,pause,resume}-watch will also support relaxed IMAP matching rules since there's no reasonable way to expect users to use ";UIDVALIDITY=" on the command-line.
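One way to picture relaxed matching: stored folder URLs carry ";UIDVALIDITY=", while a URL typed on the command-line usually does not, so the stored value is compared with that parameter removed. The Python sketch below is a hypothetical reimplementation for illustration; the real match_imap_url may apply different rules, and the example URLs are made up.

```python
import re

UIDVALIDITY_RE = re.compile(r";UIDVALIDITY=\d+", re.IGNORECASE)

def match_imap_url(arg, known):
    """Return the stored folder URLs which a command-line URL refers to."""
    if UIDVALIDITY_RE.search(arg):
        # The user was explicit, so require an exact match.
        return [url for url in known if url == arg]
    want = arg.rstrip("/")
    return [url for url in known
            if UIDVALIDITY_RE.sub("", url).rstrip("/") == want]

known = [
    "imaps://example.com/INBOX;UIDVALIDITY=1613337011",
    "imaps://example.com/lists;UIDVALIDITY=5",
]
relaxed = match_imap_url("imaps://example.com/INBOX", known)
exact = match_imap_url("imaps://example.com/lists;UIDVALIDITY=5", known)
stale = match_imap_url("imaps://example.com/lists;UIDVALIDITY=6", known)
```

Returning a list rather than a single URL leaves room for the case where a mailbox was seen under more than one UIDVALIDITY value.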
2021-05-23 lei export-kw: new command to export keywords to Maildirs
IMAP will eventually be supported.
2021-05-19 lei: relax rules for "new" in Maildir
mbsync and offlineimap both use ":2," suffixes for filenames in "new/", however my interpretation of the Maildir spec at <https://cr.yp.to/proto/maildir.html> is that ":2," is only for files in "cur/". My interpretation also matches that of dovecot, but we'll allow what mbsync and offlineimap do given their popularity.
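The relaxed rule can be sketched as a filename parser which accepts the ":2," info suffix in both "new/" and "cur/". This is Python for illustration only; `maildir_flags` is a hypothetical helper and the example paths are made up.

```python
import os

def maildir_flags(path):
    """Return the flag letters after ":2," for a file in new/ or cur/.

    Returns "" if there is no ":2," suffix, or None if the file is not
    in a "new" or "cur" directory at all.
    """
    subdir = os.path.basename(os.path.dirname(path))
    if subdir not in ("new", "cur"):
        return None
    base = os.path.basename(path)
    _, sep, flags = base.rpartition(":2,")
    return flags if sep else ""

cur_flags = maildir_flags("/home/user/Mail/cur/1621400000.M1.host:2,RS")
new_flags = maildir_flags("/home/user/Mail/new/1621400000.M2.host:2,S")
plain_new = maildir_flags("/home/user/Mail/new/1621400000.M3.host")
in_tmp = maildir_flags("/home/user/Mail/tmp/1621400000.M4.host")
```

A strict reading of the spec would instead ignore the suffix for anything under "new/".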