Some context to help make sense of requirements further below:
- A lot of code in MediaWiki assumes the presence of ParserCache and works with its internal function-level API to access parser output.
- Parsoid has its content cached (stored) in RESTBase and Parsoid clients interact with the RESTBase HTTP API to access Parsoid output and do transformations. But, some of these clients will switch over to accessing Parsoid internally via a function-level API instead of the HTTP API.
- Our understanding is that Platform Engineering is phasing out RESTBase and transitioning that functionality into other components. Given that, we expect RESTBase functionality to be transitioned over to ParserCache. So, that means:
- ParserCache needs to provide multi-bucket support and the ability to tie the buckets together with a key (revid / tid, etc.). Parsoid produces 3 components per page: HTML, a data-parsoid JSON blob, and a data-mw JSON blob. For networking and computational efficiency reasons, these are stored separately in RESTBase (minor detail: data-mw is not stored separately right now, but will be if RESTBase continues to be around). Not all Parsoid clients need all blobs. So, the API needs to be able to fetch individual blobs.
- ParserCache (or whatever code component it is) needs to support stashing for editing clients. Stashing provides "storage semantics", where the presence of stashed content is guaranteed within a session / time window, instead of caching semantics, where cached content can get evicted arbitrarily as far as clients are concerned. RESTBase provides this.
- The REST API needs to be integrated with ParserCache at some layer so that not every REST API request results in a fresh parse request to Parsoid.
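To make the multi-bucket and stashing requirements above concrete, here is a minimal in-memory sketch. It is not ParserCache's or RESTBase's actual API: the class name `MultiBucketStore`, the bucket names, the `(revid, tid)` key shape, and every method name are assumptions made up for illustration.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical bucket names for the three Parsoid components.
BUCKETS = ("html", "data-parsoid", "data-mw")

# Hypothetical composite key that ties the buckets together: (revid, tid).
RenderKey = Tuple[int, str]


class MultiBucketStore:
    """Illustrative only: one bucket per blob type, plus a separate stash."""

    def __init__(self) -> None:
        # Cache side: entries may be evicted at any time.
        self._cache: Dict[str, Dict[RenderKey, str]] = {b: {} for b in BUCKETS}
        # Stash side: entries carry an expiry time and are only dropped after it.
        self._stash: Dict[Tuple[RenderKey, str], Tuple[float, str]] = {}

    def put(self, key: RenderKey, blobs: Mapping[str, str]) -> None:
        for name, blob in blobs.items():
            if name not in self._cache:
                raise KeyError(f"unknown bucket: {name}")
            self._cache[name][key] = blob

    def get(self, key: RenderKey, bucket: str) -> Optional[str]:
        # Fetch a single blob without touching the other buckets.
        return self._cache[bucket].get(key)

    def evict(self, key: RenderKey) -> None:
        # Models arbitrary cache eviction; the stash is deliberately untouched.
        for entries in self._cache.values():
            entries.pop(key, None)

    def stash(self, key: RenderKey, blobs: Mapping[str, str],
              now: float, ttl: float) -> None:
        for name, blob in blobs.items():
            if name not in BUCKETS:
                raise KeyError(f"unknown bucket: {name}")
            self._stash[(key, name)] = (now + ttl, blob)

    def get_stashed(self, key: RenderKey, bucket: str,
                    now: float) -> Optional[str]:
        entry = self._stash.get((key, bucket))
        if entry is None or now >= entry[0]:
            return None
        return entry[1]
```

The point of the split is that `evict` models what a cache is allowed to do at any time, while a stashed blob stays retrievable until its time window ends. The clock is passed in as `now` only to keep the sketch deterministic.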
In addition to supporting RESTBase functionality, @EvanProdromou has framed this enhanced-ParserCache functionality as a Multi-Parser-Cache (MPC from here on) solution. It has the following constraints / product requirements:
- Switchover from core parser to Parsoid read views is going to be done in a phased manner and there might be reverts, etc. So, for quite a while, MPC needs to support caching of output from both the core parser and Parsoid.
- Parsoid's HTML blob is roughly the same size as the core parser's HTML blob. However, Parsoid produces two additional blobs (data-parsoid & data-mw) which also need to be stored in MPC.
- Because of the two reasons above, MPC will have much higher storage needs compared to ParserCache.
- MPC should provide a unified library interface that supports both ParserCache and RESTBase functionality to minimize code churn for existing ParserCache and RESTBase / Parsoid clients.
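As a rough illustration of the MPC constraints above, the sketch below keeps output from both parsers side by side behind one interface. Everything in it is hypothetical: the class name `MultiParserCache`, the parser labels `"legacy"` and `"parsoid"`, and the method names are assumptions, not an existing MediaWiki API.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical parser labels and the blobs each one produces.
PARSER_BLOBS = {
    "legacy": ("html",),
    "parsoid": ("html", "data-parsoid", "data-mw"),
}


class MultiParserCache:
    """Illustrative only: one store holding output from both parsers."""

    def __init__(self) -> None:
        # Entries are keyed by (parser, revision ID, blob name).
        self._store: Dict[Tuple[str, int, str], str] = {}

    def save(self, parser: str, rev_id: int, blobs: Mapping[str, str]) -> None:
        allowed = PARSER_BLOBS[parser]  # KeyError for an unknown parser
        for name, blob in blobs.items():
            if name not in allowed:
                raise ValueError(f"{parser} does not produce {name}")
            self._store[(parser, rev_id, name)] = blob

    def fetch(self, parser: str, rev_id: int,
              blob_name: str = "html") -> Optional[str]:
        # Existing ParserCache-style callers would only ever ask for "html".
        return self._store.get((parser, rev_id, blob_name))
```

Because the parser is part of the key, a revert from Parsoid read views back to the core parser would not need to invalidate or overwrite anything in this sketch. The extra storage cost is also visible here: a Parsoid entry holds up to three blobs where a core parser entry holds one.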
In addition to legacy HTML / Parsoid HTML / data-mw / data-parsoid, there may be additional derived fields, discussed in the comments below. Without necessarily adding them to the requirements, these fields might include, for example, linter output, "structured comments" (i.e., the output of the DiscussionTools parser), and perhaps even auxiliary data to help track annotations on the DOM (like mappings between node IDs of this revision and previous revisions).
Another fact to consider is whether the cache is tagged on "revision ID", "timestamp", or something else. A given revision ID might have multiple parses, because its dependencies change. RESTBase exposed this via timestamp IDs. This functionality doesn't exist in core parser cache (as far as I know). FlaggedRevisions exposes another sort of distinguishing factor that might cause parses to vary: it renders inclusions using the latest "flagged revision" of that template. This is perhaps a special case of "timestamp". Finally, you might consider (with an eye to the future) a way of combining the various dependencies into a Merkle tree, something like a git commit hash, so that the etag uniquely identifies the set of dependency versions. In theory, parses at different timestamps could then result in the same etag if none of the dependencies were updated between the timestamps.
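The dependency-hash idea can be sketched as follows. This is a simplified, one-level variant (a hash over per-dependency hashes) rather than a full Merkle tree, and the function name `dependency_etag` and the shape of `deps` (title mapped to the revision ID used in the parse) are assumptions for illustration.

```python
import hashlib
from typing import Mapping


def dependency_etag(rev_id: int, deps: Mapping[str, int]) -> str:
    """Derive an etag from a revision and the dependency versions it used.

    `deps` is a hypothetical mapping of dependency title to the revision ID
    of that dependency (e.g. a template) at parse time.
    """
    root = hashlib.sha256()
    root.update(str(rev_id).encode("utf-8"))
    # Sort so the result does not depend on the order dependencies were seen.
    for title, dep_rev in sorted(deps.items()):
        leaf = hashlib.sha256(f"{title}\x00{dep_rev}".encode("utf-8")).digest()
        root.update(leaf)
    return root.hexdigest()
```

Two parses of the same revision at different timestamps get the same etag as long as every dependency resolves to the same version, and a change to any single dependency yields a different etag. A FlaggedRevisions-style render would fit the same scheme by putting the flagged revision ID of each template into `deps`.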