about summary refs log tree commit homepage
path: root/lib/PublicInbox/LeiImport.pm
DateCommit message (Collapse)
2023-12-30lei: support reading MH for convert+import+index
The MH format is widely-supported and used by various MUAs such as mutt and sylpheed, and a MH-like format is used by mlmmj for archives, as well. Locking implementations for writes are inconsistent, so this commit doesn't support writes, yet. inotify|EVFILT_VNODE watches aren't supported, yet, but that'll have to come since MH allows packing unused integers and renaming files.
2023-10-02lei: do label/keyword parsing in optparse
Calling vmd_mod_extract after optparse causes the implicit stdin-as-input functionality to fail, as the implicit stdin requires a lack of inputs remaining in argv after option parsing (along with a regular file or pipe as stdin). This allows commands such as `lei import -F eml +kw:seen' to work without `--stdin', `-' or any path names when importing a single message. This also ensures commands like `lei import +kw:seen' without any inputs/locations will fail reliably, as the extra +kw: arg won't be a false-positive.
2023-06-09add compat package for List::Util::uniqstr
This will make it easier to switch in the far future while making callers easier-to-read (and more callers will be added). Anyways, Perl 5.26 is a long time away for enterprise users; but isolating compatibility code away can improve readability of code we actually care about in the meantime.
2023-03-25lei: improve bash completion involving colons
This fixes completions of labels (`+L:' for `lei import' and `L:' for `lei q') so they can appear anywhere in the command-line. I mainly wanted this for `lei import $URL +L:label', but this also fixes `lei forget-external' completions for URLs (which involve colons).
2022-05-02lei import: add label completions (+L:$LABEL)
This can probably be added for "lei q", too, but we typically import first. Labels can probably be made persistent on a per-folder basis in the future.
2022-04-05lei: always open mail_sync.sqlite3 R/W
This will make transparently upgrading from 1.7.0 -> 1.8.x easier. Only a single user has access to mail_sync.sqlite3, and R/W at the kernel-level is required for WAL, anyways.
2021-11-02lei: simplify common LeiInput users with ->wq1_start
This method replaces a common pattern of starting workers, preparing internal auth ops, and asynchronous waiting of command completion. It also adds missing LeiAuth support to rediff and rm which rarely need auth.
2021-10-24lei: always pass $lei to LeiAuth->op_merge
This will make future developments easier.
2021-10-13lei: use standard warn() in more places
warn() is easier to augment with context information, and frankly unavoidable in the presence of 3rd-party libraries we don't control.
2021-09-27lei completion: workaround old Perl bug
While `$argv[-1]' is `undef' on an empty @argv, using `$argv[-1]' as a subroutine argument would fail incorrectly with: Modification of non-creatable array value attempted, subscript -1 at ... ...even though we'd never attempt to modify @_ itself in the subroutines being called. Work around the bug (tested on 5.16.3) by passing `undef' explicitly when `$argv[-1]' is already `undef'. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/20210927124056.kj5okiefvs4ztk27@meerkat.local/
2021-09-21lei: various completion improvements
"lei export-kw" no longer completes for anonymous sources. More commands use "lei refresh-mail-sync" as a basis for their completion work, as well. ";AUTH=ANONYMOUS@" is stripped from completions since it was preventing bash completion from working on AUTH=ANONYMOUS IMAP URLs. I'm not sure if there's a better way, but all of our code works fine without specifying AUTH=ANONYMOUS as a command-line arg. Finally, we fallback to using more candidates if none can be found, allowing multiple URLs to be completed.
2021-09-21lei_mail_sync: account for non-unique cases
NNTP servers, IMAP servers, and various MUAs may recycle "unique" identifiers due to software bugs or careless BOFHs. Warn about them, but always be prepared to account for them.
2021-09-19lei/store: use SOCK_SEQPACKET rather than pipe
This has several advantages: * no need to use ipc.lock to protect a pipe for non-atomic writes * ability to pass FDs. In another commit, this will let us simplify lei->sto_done_request and pass newly-created sockets to lei/store directly. disadvantages: - an extra pipe is required for rare messages over several hundred KB, this is probably a non-issue, though The performance delta is unknown, but I expect shards (which remain pipes) to be the primary bottleneck IPC-wise for lei/store.
2021-09-18lei_mail_sync: rely on flock(2), avoid IPC
Since 44917fdd24a8bec1 ("lei_mail_sync: do not use transactions"), relying on lei/store to serialize access was a pointless endeavor. Rely on flock(2) to serialize multiple writers since (in my experience) it's the easiest way to deal with parallel writers when using SQLite. This allows us to simplify existing callers while speeding up 'lei refresh-mail-sync --all=local' by 5% or so.
2021-09-18lei: lock worker counts
It doesn't seem worthwhile to change worker counts dynamically on a per-command-basis with lei, and I don't know how such an interface would even work...
2021-09-06lei_auth: simplify users
There's no need to alias net_merge_all in each WQ class which uses LeiAuth, `$obj->$sub' works even when `$sub' is a fully-qualified subroutine name with `::' in it. perlobj(1) documents it under "Method Call Variations".
2021-06-13lei import: use url_folder_cache for completion
And fix "lei index" completion while we're at it.
2021-06-10lei import: support --new-only for IMAP
Taking ~40s to synchronize a ~75K message IMAP folder is still a lot of time, so support an option to only touch new messages. This is similar to "offlineimap -q" (quick) or "mbsync --new" switches, but lei already accepts "-q" as a shortcut for --quiet. "--new" could work, but "--new-only" might be more descriptive (or "--only-new"?), since the default fetches also fetches new messages. v2: warn for non-IMAP sources, I'm not sure it's worth it for Maildir or other sources, yet. It will also make sense for MH and JMAP once we support them.
2021-06-09lei tag: parallelize Maildir access
Since Maildir isn't guaranteed to have any sort of order, we can parallelize inputs, here. On a 4-core system, this reduced one of my tag invocations from 5.5 to 1.4s.
2021-06-09mdir_reader: maildir_each_file: pass flags, skip Trash
This is a slight behavior change for "lei q": Trashed (but not-yet-expunged) messages no longer get unlinked when --output is used without --augment.
2021-06-08lei import: speed up repeated Maildir imports
On a 4-core CPU, this speeds up "lei import" on a largish Maildir inbox with 75K messages from ~8 minutes down to ~40s. Parallelizing alone did not bring any improvement and may even hurt performance slightly, depending on CPU availability. However, creating the index on the "fid" and "name" columns in blob2name yields us the same speedup we got. Parallelizing IMAP makes more sense due to the fact most IMAP stores are non-local and subject to network latency. Followup-to: bdecd7ed8e0dcf0b45491b947cd737ba8cfe38a3 ("lei import: speed up kw updates for old IMAP messages")
2021-06-08lei: generalize auxiliary WQ handling
op_wait_event is now more lei-specific since we no longer have to care about oneshot and use a synchronous loop. {ikw} (import-keywords) started a trend, but LeiPmdir (parallel Maildir) is an upcoming WQ class that will follow this idea. Eventually, {l2m} usage may be updated to follow this, too.
2021-06-03lei import: speed up kw updates for old IMAP messages
On a 4-core CPU, this speeds up "lei import" on a largish IMAP inbox with 75K messages from ~21 minutes down to 40s. Parallelizing with the new LeiImportKw WQ worker class gives a near-linear speedup and brought the runtime down to ~5:40. The new idx_fid_uid index on the "fid" and "uid" columns of blob2num in mail_sync.sqlite3 brought us the final speedup. An additional index on over.sqlite3#xref3(oidbin) did not help, since idx_nntp already exists and speeds up the new ->oidbin_exists internal API. I initially experimented with a separate "lei import-kw" command but decided against it since it's useless outside of IMAP+JMAP and would require extra cognitive overhead for both users and hackers. So LeiImportKw is just a WQ worker used by "lei import" and not its own user-visible command. v2: fix ikw_done_wait arg handling (ugh, confusing API :x)
2021-06-01lei import: reduce writes to lei/store on IMAP sync
We don't need to write VMD changes to lei/store if local keywords are unchanged.
2021-05-30lei import: import IMAP flag changes from old messages
This makes "lei import" behavior with IMAP folders more consistent with that with Maildir. Opening IMAP folders read-write with "SELECT" (instead of read-only with "EXAMINE") was necessary, since it lets an IMAP server communicate to us as to whether or not it's worth refetching IMAP flags of previously imported messages. Fetching UID+FLAGS only is one of the fastest IMAP operations with dovecot, our -imapd and presumably other common IMAP servers. It is issued by common MUAs such as mutt after every SELECT. Users may now rely on "lei import" exclusively to merge mail and keywords into lei/store, and "lei export-kw" to propagate keyword changes back to IMAP servers. A sticks-and-stones workflow for personal mailboxes is currently: lei import imaps://$MY_PERSONAL_INBOX lei q --mua=$MUA -o /tmp/results SEARCH TERMS... # do stuff from within $MUA to /tmp/results lei import /tmp/results # read keyword changes from MUA lei export-kw imaps://$MY_PERSONAL_INBOX # repeat when new stuff shows up in personal inbox The next goal is to automate repeated imports + export-kw commands with with inotify and IMAP IDLE.
2021-05-23net_reader|net_writer: pass URI refs deeper into callbacks
This will give us more flexibility in the future w.r.t. dealing with UIDVALIDITY and AUTH= info with IMAP. The LoC reduction is welcome, too.
2021-05-04lei index: new command to index mail w/o git storage
Since completely purging blobs from git is slow, users may wish to index messages in Maildirs (and eventually other local storage) without storing data in git. Much code from LeiImport and LeiInput is reused, and a new dummy FakeImport class supplies a non-storing $im->add and minimize changes to LeiStore. The tricky part of this command is to support "lei import" after a message has gone through "lei index". Relying on $smsg->{bytes} == 0 (as we do for external-only vmd storage) does not work here, since it would break searching for "z:" byte-ranges when not using externals. This eventually required PublicInbox::Import::add to use a SharedKV to keep track of imported blobs and prevent duplication.
2021-05-03lei: simplify workers_start API
In most cases, we just name the worker process based on the command. The only change is for LeiMirror vs "lei add-external --mirror", but I doubt it matters.
2021-05-03lei_input: common net_merge_all_done for lei <import|tag>
I suspect there'll be more lei_input-only things in the future.
2021-05-01lei_auth: s/net_merge_complete/net_merge_all_done/
We use the "done" term elsewhere for similar things, and my easily-confused mind equates "complete" with shell completion.
2021-04-30lei import: support shell completion of known folders
This also fixes completion of "lei up" for IMAP folders.
2021-04-30lei import: avoid IMAPTracker, use LeiMailSync more
IMAPTracker has a UNIQUE constraint on the `url' column, which may cause compatibility and/or rollback problems in attempting to deal with UIDVALIDITY changes. Having multiple sources of truth leads to confusion and bugs, so relying on LeiMailSync exclusively ought to simplify things. Furthermore, since LeiMailSync is only written to by LeiStore, it is safer in that it won't mark a UID or article as imported until git-fast-import has seen it, and the SQLite commit always happens after "done\n" is sent to fast-import. This mostly reverts recent commits to IMAPTracker to support lei, those are: 1) commit 7632d8f7590daf70c65d4270e750c36552fa9389 ("net_reader: restart on first UID when UIDVALIDITY changes") 2) commit 311a5d37ad275cd75b1e64d87827c4d13fe4bfab ("imap_tracker: prepare for use with lei"). This means public-inbox-watch will not change between 1.6 and 1.7: -watch stops synching a folder when UIDVALIDITY changes.
2021-04-28lei: simple WQ workers use {wq1} field
This lets us share more code and reduces cognitive overhead when it comes to picking names (because {lsss} was ridiculous). We'll need to ensure the first error set in lei is the actual error we exit with, otherwise things can get confusing and errors may get lost.
2021-04-28lei: quiet down Eml-related warnings consistently
"lei import" is probably the only place where it users might care about warnings.
2021-04-27lei: standardize on _lei_wq_eof callback for workers
Simplify our internals a little bit.
2021-04-24lei import: keep sync info for Maildir and IMAP folders
We aren't using it, yet, but the plan is to be able to use this information to propagate keyword changes back to IMAP and Maildir folders using some to-be-implemented command. "lei inspect" is a half-baked new command to make testing this change easier. It will be updated to support more SQLite+Xapian introspection duties in the future, including public-inbox things independent of lei.
2021-04-23lei import: support adding keywords and labels on import
This saves some work and makes it easier to set volatile metadata on a message at import time.
2021-04-22lei import: --incremental default for NNTP and IMAP
No point in burning through bandwidth to import stuff we already saw. All this logic is shared with -watch but uses a different pathname for lei since it's tied to lei/store (and not a public-inbox).
2021-04-21lei: share common *done_wait callbacks
Code is the enemy, and there's no need to duplicate things, here. There may be further opportunities along these lines to further deduplicate things...
2021-03-31lei_input: reduce IPC traffic with multiple inputs
No point in sending a command for every input when a single one will do. We'll also trigger LeiStore->done sooner in the worker rather than later.
2021-03-31lei: fix IMAP auth failure handling
We must use the $ops hashref returned by lei->workers_start, since it's modified to include extra handlers for auth failures and whatnot. Fixes: 954581b8e575966a ("lei: simplify PktOp callers")
2021-03-29lei_input: avoid special case sub for --stdin
We can consistently open /dev/stdin correctly nowadays, so drop the input_stdin and just use the normal ->path_to_fd code path.
2021-03-28lei: simplify PktOp callers
Provide a consistent ->op_wait_event method instead of forcing callers to loop (or not) at each callsite. This also avoid a leak possibility by avoiding circular references.
2021-03-25lei import: force store, improve test diagnostics
"lei import" should never be without a {sto}, and *_done should not be called multiple times, so ensure we can fail if it's missing. Update some existing tests to complain loudly by introducing a handy "xbail" function which wraps "explain" and BAIL_OUT. BAIL_OUT was painful to type and concatenating the result of "explain" doesn't work as I thought it would since "explain" always returns an array, and BAIL_OUT only accepts a single scalar arg (unlike "die").
2021-03-24lei: improve management around short-lived workers
Instead of creating a short-lived circular reference, ensure they don't exist in the first place. Note the following changes to hold an extra ref to $sto: - $self->_lei_store(1)->write_prepare($self); + my $sto = $self->_lei_store(1); + $sto->write_prepare($self); I'm not a perlguts expert, but I actually wanted to switch to the one-line version for LeiImport, but xt/lei-auth-fail.t was getting stuck for some reason. It seems the extra ref to the LeiStore ($sto) object is necessary.
2021-03-24lei_input: more common code between <mark|convert|import>
"lei convert" is actually a bit of the odd one, since it uses lei2mail for auth, unlike the others.
2021-03-24lei mark: command for (un)setting keywords and labels
Only tested for keywords and labels with file inputs, so far; but it seems to do what it needs to do. There's a bit more redundant code than I'd like, and more opportunities for code sharing in the future "lei import" will be expanded to support +kw:$KEYWORD and +L:$LABEL in the future.
2021-03-23lei import: ignore Status headers in "eml" messages
Those headers only have meaning with for mboxes. Don't surprise users by trying to make sense of a header that is defined for mboxes. It's possible to send email with (Status|X-Status) headers and have those headers show up in a recipient's IMAP mailbox. This was bad because an IMAP user may want to import a single message through their MUA and pipe its contents to "lei import" without noticing a mischievious sender stuck "X-Status: F" (flagged/important) in there.
2021-03-23lei_input: common filehandle reader for eml + mbox
This improve code regularity, and will let us deal with the "RFC822" messages with "From " line that mutt pipes to.
2021-03-23mbox_reader: add ->reads method to avoid nonsensical formats
Relying on UNIVERSAL::can may cause internal helper methods to be used, which can lead to failures or nonsensical results.