Benchmarking AutoML Vendors and Open Source Time Series Packages
Performance overview of an algorithm that crawls from one live data stream to another


This post shows how to create a crawler at Microprediction.Org to help evaluate vendor or open source libraries that claim general purpose time series prediction capability.

Choosing an automated machine learning vendor

How should you choose an Automated Machine Learning product? There are many products on the market today that attempt to automate all aspects of the model discovery process. They may well be capable of improving your bottom line, but assessing them is time consuming. If you are trying to arrive at a careful decision, the little snippet of Python in this post might help your process in some small way, at least as far as time series prediction is concerned.

Microprediction.Org includes live data streams (like this one) contributed by individuals and companies, which is to say anyone. In turn, those streams are attacked by time series prediction algorithms that can also be authored by anyone, which gives rise to an ever-higher standard of benchmark over time. Benchmarking AutoML vendors on time series using Microprediction.Org has some other advantages. It is:

  1. Free.
  2. Easy. Writing a "crawler" can be accomplished in a few lines of Python (like this).
  3. Ongoing. Once you script a test, you can run it three months from now to keep your algorithm or favorite library honest. Is it keeping up? How does it do against new challenges?
  4. Out of sample. Only live, updating data is used. This eliminates data leakage, which is a huge problem when assessing time series algorithms (discussion).
  5. Less likely to be gamed. You can bet vendor libraries are trained on M4 and other well known canned time series. The only good place to hide data is in the future.
  6. Anonymous. No registration is required.
  7. Extensible. You can add to the test streams by publishing your own live data.
  8. Pragmatic. A few lines of Python (like this) should suffice, but the relative convenience of using a vendor product in a live environment will quickly become apparent.
  9. Realistic. Submitting a time series is free but requires an expenditure of CPU, similar to bitcoin mining. So people tend to publish real world live streams that are interesting. Here we start with a curated list for simplicity, but I remark briefly on how to enlarge that.
  10. Civic. You might accidentally help someone else in a number of ways. You can read about the goals of Microprediction.Org.

To briefly elaborate on points 8 and 9:

  • It is important to discover operational details quickly. Does the library handle online updates? Is the API or client well thought out? Is there going to be a gap between research and production use? How quickly does it estimate models? If fitting is intended to be performed offline or in batch, how much does fitting latency degrade performance? Do edge cases get tripped?
  • There are some interesting purely generative time series at Microprediction.Org such as simulated agent models for epidemics (stream) and physical systems with noise (stream) which test different modeling abilities. It is nice to have a few of these although you are free to disregard them as textbook exercises if you wish.
  • The trickiest tests come from examples like electricity prices (vibrant discussion going on) with their spikes and occasional negative prices, or from the quasi-periodic behavior of bike sharing activity near New York City hospitals (stream). The behavior of an instrumented laboratory helicopter (stream) provides a tough test given the incomplete physics (more remarks here) and is the reason the SciML group chose it for their Julia Day challenge. The lurching and potentially bimodal behavior of a badminton player's neck position during a game (article) might not fall out of an obvious generative model. And so it goes.

Sounds great, right? Full disclosure: I'm the author.

This post covers univariate time series prediction only. Your vendor product should at least get a passing grade on univariate time series, and don't take that for granted: anecdotally, some have a tendency to overfit. I will include exogenous variables in a future post.

Criteria:

  1. Does the product provide distributional predictions, as opposed to single-number point estimates?
  2. Does the model fit in a few seconds?

If yes to both questions, we are off to the races. If not, stay tuned (workarounds are pretty easy and obvious but I wanted to keep this particular post as simple as possible).

Python walkthrough:

$ pip install microprediction

Creating a crawler

You will need to derive from SimpleCrawler and modify the sample method. Perhaps something like the following:

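Here is a minimal sketch. The vendor package and its VendorModel class are made up; substitute whatever your library actually provides, so long as it ends up exposing fit and invcdf methods (the import path for SimpleCrawler may vary slightly by package version).

from microprediction import SimpleCrawler

# Hypothetical vendor import: replace with whatever your library actually provides
from your_vendor_library import VendorModel


class VendorCrawler(SimpleCrawler):

    def sample(self, lagged_values, lagged_times=None, **ignore):
        """ Return 225 values approximating the distribution of the next data point. """
        model = VendorModel()                           # your vendor's model object
        model.fit(list(reversed(lagged_values)))        # lagged values arrive most recent first
        ps = [(i + 0.5) / 225 for i in range(225)]      # evenly spaced percentiles
        return [model.invcdf(p) for p in ps]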

And then run it:

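A minimal run script, assuming (as described further down) that an identity is minted automatically when no write_key is supplied:

if __name__ == '__main__':
    crawler = VendorCrawler()    # a write_key will be created and printed on first run
    crawler.run()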

That's all it takes.

Let it run, as in "run, Forrest, run". If you don't believe that long-running processes are a solved problem, there is at least one suggestion in this article.

Naturally the above is pseudo-code, because the vendor model does not exist; it will need to be supplied by the library or vendor of your choice. The only requirement placed on the model is that it has a method returning the inverse cumulative distribution function for the next data point in the series, here called model.invcdf.

Inverse cumulative distribution method

It is possible that your library does not provide an inverse cumulative distribution function. You may choose to interpret a confidence interval or whatever is supplied (talk to them) and promote it to the status of a distributional estimate. By some means, your crawler must produce a collection of 225 data points that approximate the distribution. If there isn't sufficient information in the output of your vendor product to guide you to that, at least approximately, I'm not entirely sure it really qualifies as automated anything.
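For instance, if the vendor returns only a point forecast and a confidence interval, one crude promotion is to pretend the errors are Gaussian. This helper is purely illustrative and rests entirely on that assumption:

from scipy.stats import norm

def promote_to_samples(point, lower, upper, num=225, coverage=0.9):
    # Treat a point forecast and a confidence interval as if the errors were Gaussian,
    # and return num evenly spaced quantiles of that Gaussian.
    z = norm.ppf(0.5 + coverage / 2.0)             # roughly 1.645 for a 90% interval
    scale = (upper - lower) / (2.0 * z)
    ps = [(i + 0.5) / num for i in range(num)]
    return [point + scale * norm.ppf(p) for p in ps]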

Aside: a brief justification of the importance of distributional estimates, as well as an overview of the mechanics of scoring at Microprediction.Org (nearest the pin), is provided in this article and also discussed specifically with respect to the badminton stream here. Yes, the scoring rule literature is well developed and point estimates can be interpreted carefully in special situations, but this is the real world and you aren't in Kaggle anymore.

What to ask of your vendor

The example code includes a straw man so that it can actually be run. You might even want to run it as-is before modifying it.

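The straw man in the example code may differ in detail; the following stand-in is in the same spirit, namely a Gaussian centred on the last value with a scale estimated from recent differences:

import numpy as np
from scipy.stats import norm

class StrawManModel:
    # Stand-in for a vendor model: a Gaussian centred on the last value,
    # with scale estimated from recent one-step differences.

    def fit(self, chronological_values):
        xs = np.asarray(chronological_values, dtype=float)
        self.center = xs[-1]
        diffs = np.diff(xs) if len(xs) > 1 else np.array([1.0])
        self.scale = max(float(np.std(diffs)), 1e-6)

    def invcdf(self, p):
        return self.center + self.scale * norm.ppf(p)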

This is not the best prediction model. However, it certainly fits very fast and it provides the all-important invcdf method, thus meeting the requirements. Ask your vendor to knock together something fitting this pattern. That can't possibly take more than half an hour, including testing.

Viewing the results

When you run the crawler it will create an identity before initializing (this takes a little time, but only once per crawler) and print a key. Copy that key and, before you lose it, rush over to www.Microprediction.Org and plonk the key into your dashboard (then email it to yourself). After a while you will see some activity. Come back in a day, a week or a month.


You can browse (thanks again Eric Lou) to individual streams and leaderboards.

Oops, did your crawler go bankrupt?

By default your crawler generates a write_key of difficulty 10. To give it a longer shelf life you can supply a previously created write_key that is rarer. Note the difficulty argument:

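A sketch, assuming the new_key helper shipped with the microprediction package (the muid package offers an alternative route to mining keys):

from microprediction import new_key

# Mine a memorable unique identifier to use as a write_key.
write_key = new_key(difficulty=10)
print(write_key)          # store this somewhere safe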

and change the difficulty to 11 or 12. I recommend running that in a separate terminal off to the side, as it can take a long time. But once done, this precious key can be supplied to your crawler:

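For example, reusing the hypothetical VendorCrawler class sketched above:

crawler = VendorCrawler(write_key='paste your hard-won difficulty 11 or 12 key here')
crawler.run()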

Alternatively you can supply a difficulty parameter to the crawler constructor.

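Again using the hypothetical VendorCrawler from above:

# The crawler will mine a key of the requested difficulty before it starts,
# which can itself take a long time for difficulty 11 or 12.
crawler = VendorCrawler(difficulty=11)
crawler.run()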

Whether you use a write_key of difficulty 10, 11 or 12, you'll probably get a sense of which streams your algorithm is performing well on and whether your fancy-pants vendor product is actually better than something I knocked up while binge-watching Line of Duty (it probably is, but remember that anyone can submit algorithms at Microprediction.Org and there are monthly cash prizes, so I'm the least of your worries).

Remarks

Please comment below if there is something I can clarify.

Interpreting and retrieving results

You can view results on the site or retrieve them using the API or the microprediction library (see the get_performance method in MicroReader, and the sketch at the end of this section). Results should be interpreted in light of the mechanics of the site, not preconceptions. However, the following is a reasonably accurate statement:

Positive performance on a stream when a crawler makes predictions (without looking at other algorithms' predictions) indicates that the crawler is doing a better job of estimating the distribution of outcomes than the existing community contributions.

Note, however, that the existing contributions act in a collective fashion. It should not necessarily be inferred that negative performance implies the vendor product is poor, just that it isn't typically adding anything to what is already there. It is entirely possible for an algorithm to sink down the leaderboard even if it is the "best" by some other criterion. The caveat in parentheses is also important, though it holds unless you go out of your way to change it.

These considerations aside, if vendor product A is green and B is red, that might at least provoke a careful look.
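As a sketch of programmatic retrieval, assuming get_performance accepts your write_key (check the MicroReader source for the exact signature):

from microprediction import MicroReader

reader = MicroReader()
performance = reader.get_performance(write_key='your write key here')
print(performance)      # expected: a mapping from stream and horizon to cumulative score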

Broadening the scope

We have derived from SimpleCrawler, which only visits a curated list of streams and only predicts at the shortest time horizon. If you would like to let your crawler wander further afield, you can instead derive from MicroCrawler and program whatever navigation logic you like, as in the sketch below. See my article about modifying stream navigation for a crawler.
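A minimal sketch, assuming candidate_streams is the navigation hook in your version of the package (check the MicroCrawler source for the exact override points) and using made-up stream names:

from microprediction import MicroCrawler

class WanderingCrawler(MicroCrawler):

    def candidate_streams(self):
        # Made-up stream names for illustration: return whatever list you want visited.
        return ['some-electricity-stream.json', 'some-hospital-bikes-stream.json']

    def sample(self, lagged_values, lagged_times=None, **ignore):
        # Plug in your vendor model here, exactly as in the SimpleCrawler example above.
        ps = [(i + 0.5) / 225 for i in range(225)]
        xs = sorted(lagged_values)
        return [xs[int(p * (len(xs) - 1))] for p in ps]    # crude empirical quantiles as a placeholder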

Boasting

If you are a vendor or you have written an open source time series library, then you hold the key, quite literally, to your performance records and leaderboards. You can extract everything you see and report it as you wish. Create your own leaderboard, or reset performance. We will likely introduce optional deanonymization and certified LinkedIn bragging at some point.

Related: ongoing performance analysis for in-production models

Many industries are required to perform ongoing performance analysis. This might consume a lengthy article in itself, but for now here is one suggestion, which of course can also be applied to Automated Machine Learning libraries and products:

  • Publish live model residuals

This is a different use of Microprediction.Org APIs. See the publishing instructions. You'll need a difficulty 12 key or you can beg me for one.
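A publishing sketch with a placeholder stream name, a placeholder key and illustrative numbers:

from microprediction import MicroWriter

writer = MicroWriter(write_key='a difficulty 12 write key')

actual_outcome, model_prediction = 102.3, 100.0              # illustrative numbers from your production model
residual = actual_outcome - model_prediction
writer.set(name='my_model_residuals.json', value=residual)   # repeat each time a prediction is reconciled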

If your in-house modelers have a good grasp of the problem to which they have been assigned, then they should have some distributional view of the (signed) difference between their model predictions and realized outcomes. For instance, perhaps they think their errors are normally distributed with mean 0 and variance 2.5. But whatever their view, they should be able to perform well in a contest to predict the distribution of their own model residuals.

Can they?

(If privacy of model residuals is an issue here, which is unlikely, choose an obscure stream name, take three model residuals and pipe them through a random matrix first.)

Comments

Anna Jarvis (Quant at DUNE AlphaCapture): Hi Peter Cotton, PhD, do you have anything more recent on this subject? I have been spending some time with temporal fusion transformers and it's quite outstanding compared to OLS and random forests. Do you have a view on it?
