Economical Statistics: How to Modify a Prediction Crawler's Navigation

Can affordable, tailored statistical work be delivered far and wide by economically aware statistical algorithms? Can they spin up a real-time feature space that delivers plummeting marginal cost for intelligent applications of all kinds? The experiment is underway.

For most organizations, bespoke statistical work isn't close to being affordable. The problem isn't a lack of powerful, free, open-source statistical algorithms. The problem is that we rely on humans to bring algorithms to real-world, real-time needs. But humans ... well, we are so needy, easily fatigued and expensive.

Crawlers try to address this problem. They find problems themselves. They are economic agents. If you are not familiar with it, Microprediction.Org is a site where prediction algorithms crawl all over data that you publish there. Some live sources of data attract prize money - you'll find $4,000 up for grabs this month, for example.

Crawlers can be created easily by deriving from MicroCrawler, a Python class in the microprediction package on PyPI and GitHub (source). This post is for data scientists looking to understand and improve the default navigation of their crawling algorithms, as distinct from their raw predictive ability.

Outline:

  1. Using MicroCrawler with no changes at all
  2. Deriving from MicroCrawler and overriding sample()
  3. How your crawler navigates
  4. Changing behavior by passing arguments to your crawler's constructor
  5. Overriding other methods to modify how the crawler chooses streams and horizons, exploits free time, and makes rudimentary economic decisions

Some readers may prefer to first read a higher level explanation in this article.

1. Using MicroCrawler with no changes.

Here is your Hello World example of instantiating a default MicroCrawler and running it:

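A minimal sketch follows. I am recalling the helper name new_key and its difficulty argument from memory; check the package README if they have moved:

    from microprediction import MicroCrawler, new_key

    # Mint a write key (higher difficulty keys take longer to create but are harder to bankrupt)
    write_key = new_key(difficulty=10)

    # Instantiate the default crawler and let it loose
    crawler = MicroCrawler(write_key=write_key)
    crawler.run()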

Running the default MicroCrawler might constitute a contribution, though perhaps you should increase the difficulty of that write_key to 11 or 12 if you really want to find out. As is, your crawler is likely to be declared bankrupt quite soon and play no further role.

2. Deriving from MicroCrawler and overriding sample()

More likely you'll want to derive from MicroCrawler in order to modify the prediction algorithm it arrives with. Since we are here, I'll briefly give an example of modifying the all-important sample method, although the focus of this post is more the way the tank drives around and not the gun you choose to mount on top of it.

Let's say we are convinced that taking the average of the last two values of any time series is a perfect prediction of the next. We need only derive from MicroCrawler and override the sample() method.

The sample() method receives a list of lagged values of the time series (most recent first). Here it will return the average of the last two values, repeated 225 times over. The crawler code in its entirety is:

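Something like the following sketch, written from memory. The class name is mine, and the exact sample() signature in the package may include further keyword arguments, so **ignored keeps the override tolerant:

    from microprediction import MicroCrawler, new_key


    class MeanOfLastTwoCrawler(MicroCrawler):

        def sample(self, lagged_values, lagged_times=None, **ignored):
            # lagged_values arrives most recent first
            point_estimate = 0.5 * (lagged_values[0] + lagged_values[1])
            # The system scores 225 submitted scenario values; here they are all identical
            return [point_estimate] * 225


    if __name__ == '__main__':
        crawler = MeanOfLastTwoCrawler(write_key=new_key(difficulty=10))
        crawler.run()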

I would call that an overconfident crawler. For a better example I'd refer you to a previous post, in which a crawler uses an Echo State Network to make predictions. An example crawler is found in the package echochamber on PyPI. There are also some minimalist crawler examples in the microprediction repo.

3. How your crawler navigates

Now we come to our topic. Yes, you are good to go and can issue crawler.run() ... but read on if you want:

  1. To understand your crawler's primitive economic behavior
  2. To modify it by overriding methods
  3. To decide whether you want to use MicroCrawler at all, as compared with using the MicroWriter class directly to create your own crawler with different logic.

I won't say much here about the third possibility, but it seems noble. Note the hierarchy of Python classes on the microprediction package README.

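As a rough sketch of that hierarchy from memory (the README is the authoritative source), MicroCrawler derives from MicroWriter, which in turn derives from MicroReader:

    from microprediction import MicroReader, MicroWriter, MicroCrawler

    # Approximate class hierarchy: reader -> writer -> crawler
    assert issubclass(MicroWriter, MicroReader)
    assert issubclass(MicroCrawler, MicroWriter)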

For now let's content ourselves with the confines of MicroCrawler, and the promise that a tiny amount of economic sense is a lot better than none. In a loop the crawler:

  1. Checks on its performance across all active horizons.
  2. Sends requests to withdraw from some of them if the stop loss is exceeded.
  3. Maybe looks for more things to predict, if there is time.
  4. Submits predictions to some horizons, if the time is right.
  5. Having submitted predictions, predicts when data will next arrive and determines how to update a manifest of "what and when".

What's this talk of horizons, you say? Formally, a horizon is a two-tuple specified by a time interval and a choice of data stream. For instance:

70::mystream.json          

might reference a forecast of mystream.json roughly one minute (70 seconds) ahead. I refer you to this article explaining quarantined predictions for a precise discussion, but loosely speaking the crawler chooses streams and how far ahead to predict, where the choices are roughly 1 min, 5 min, 15 min and 1 hour ahead (the best reference for conventions used at Microprediction.Org is the microconventions code).

As you can see, pretty rudimentary economic logic.

The state maintained by MicroCrawler

The crawler keeps various types of state. Your best source is the code, but at the time of writing it caches the performance record and the active horizon list, just to be considerate. The definition of active is any horizon the crawler has sent predictions for (a small gotcha: withdrawing from a stream is not immediate, as that would be unfair).

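For illustration, you can inspect the same state yourself via MicroWriter. This is a sketch; get_performance and get_active are the method names as I recall them:

    import os
    from microprediction import MicroWriter

    writer = MicroWriter(write_key=os.environ['WRITE_KEY'])  # a previously minted key
    performance = writer.get_performance()  # cumulative reward per horizon
    active = writer.get_active()            # horizons with predictions outstanding
    print(performance)
    print(active)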

Perhaps the most important state in flight is the crawler's schedule of upcoming predictions. As the code comment suggests, this can be thought of as a list of reminder notices the crawler writes to itself, informed by estimates of when data for different streams will next arrive.
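
Purely for illustration (the actual attribute names in the class will differ), you can picture that schedule as something like:

    import time

    # Hypothetical shape of the crawler's self-reminders: one entry per horizon,
    # recording when it expects the next data point and therefore when to act.
    upcoming = [
        {'name': 'mystream.json', 'delay': 70, 'next_check_time': time.time() + 55.0},
        {'name': 'another_stream.json', 'delay': 310, 'next_check_time': time.time() + 240.0},
    ]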

4. Ways to control the crawler with arguments to the constructor

I refer you to the code, but the most important argument is probably stop_loss, which I hope is self-explanatory. Rewards for predictions are incremental, and the performance on each horizon is tracked. If the stop loss is 10 and the performance falls below -10.0, it is time to move on. However, note that you can reset performance any time you want.
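
A sketch of passing it in (the value shown is arbitrary):

    from microprediction import MicroCrawler, new_key

    # Abandon any horizon whose cumulative performance drops below -25.0
    crawler = MicroCrawler(write_key=new_key(difficulty=10), stop_loss=25)
    crawler.run()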

Another important parameter is min_lags, which will prevent the crawler from trying to predict streams for which the system has received insufficient lagged data. Supply min_lags=500 to only look at data streams that have been around for a little while - though be aware that there may be rewards for rushing in like a fool where angels fear to tread.

You might even design algorithms that are likely to be competitive for a short while only (e.g. set max_lags=50, say). Your crawler might try to pick up some pennies and then flee before the Machine Learning steamroller arrives.
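
A sketch of both styles, treating min_lags and max_lags as constructor arguments per the text above (the values are arbitrary):

    from microprediction import MicroCrawler, new_key

    # A patient crawler: only considers streams with at least 500 historical values
    patient = MicroCrawler(write_key=new_key(difficulty=10), min_lags=500)

    # A hit-and-run crawler: only considers young streams, then flees the steamroller
    hit_and_run = MicroCrawler(write_key=new_key(difficulty=10), max_lags=50)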

5. Ways to control the crawler by overriding methods

(Pythonistas will know that you can monkey patch instead of overriding if you really want).

Choosing streams by sponsor

To instruct your crawler to include or exclude streams created by a given sponsor, you can override the include_sponsor and/or exclude_sponsor methods. Each should return a boolean when given a sponsor's short name (an animal description like "Cellose Bobcat").
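
For example, a sketch (the exact keyword arguments these methods receive may vary, so **ignored keeps the overrides tolerant; the sponsor name is just an example):

    from microprediction import MicroCrawler


    class PickyCrawler(MicroCrawler):

        def include_sponsor(self, sponsor=None, **ignored):
            # Only predict streams published by this sponsor
            return sponsor == 'Cellose Bobcat'

        def exclude_sponsor(self, sponsor=None, **ignored):
            # Stay away from everyone else
            return sponsor != 'Cellose Bobcat'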

Choosing streams by name

Similarly, you can provide inclusion/exclusion methods for stream names, or alternatively you can override the method candidate_streams that calls both. Just be careful to ensure the result is a valid list of stream names. For example:

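This sketch overrides candidate_streams; the signature and the call through to the parent are assumptions, so consult the class code for the exact form:

    from microprediction import MicroCrawler


    class DiceOnlyCrawler(MicroCrawler):

        def candidate_streams(self, **ignored):
            # Start from whatever the parent would consider, then keep only dice streams
            names = super().candidate_streams(**ignored)
            return [name for name in names if 'die' in name]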

Choosing how far ahead to predict

As with stream names or sponsors, you can use inclusion or exclusion methods, whose delay parameter expects an integer from the set self.DELAYS = [70, 310, 910, 3555].
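
For instance, a sketch; I am assuming the inclusion method is called include_delay and receives the delay in seconds:

    from microprediction import MicroCrawler


    class ShortHorizonCrawler(MicroCrawler):

        def include_delay(self, delay=None, **ignored):
            # Only compete on the two shortest horizons (roughly 1 min and 5 min ahead)
            return delay in (70, 310)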

Choosing how often to update predictions

A very important point to be aware of is that your algorithm's predictions, once submitted, will continue to be assessed against incoming data until you issue a request to cancel them. This means you do not need to send predictions all the time (say, after each arriving data point) unless you deem it necessary. In fact, if you are predicting a die and you don't think the distribution will ever change, maybe you believe you need only submit once.

There are numerous z-streams (explained in this article) and also some near-martingales (such as changes in stock prices) where submission of an updated distributional estimate every data point is probably overkill. By default your crawler's update frequency is once every data point for primary (regular) streams and once every 20 data points (roughly, on average) for z-streams. Feel free to override.


Note that z-streams contain tildes in their names.

Deciding what to do with spare time

If your crawler has a good way to take advantage of downtime, such as offline estimation or burning more MUIDs, consider overriding the downtime method. Be sure to include the seconds parameter even if you don't use it, and for forward compatibility include **ignored.

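A sketch, following the signature described above:

    from microprediction import MicroCrawler


    class IndustriousCrawler(MicroCrawler):

        def downtime(self, seconds, **ignored):
            # Use idle time for offline work instead of sleeping;
            # replace the print with model fitting, key burning, etc.
            print(f'Roughly {seconds} seconds to spare - doing offline estimation.')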

By default your crawler sleeps.

Other user callbacks

You can also override a number of callback methods that do nothing by default, in order to perform additional tasks on startup, when giving up on a choice of horizon due to bad performance, when submitting predictions, and when retiring (your crawler may eventually lie down and die). For example, the startup callback could be overridden to send yourself an email alerting you to an unexpected restart.

Code for MicroCrawler

Of course, your best reference is the MicroCrawler class code. See the link in the comments. I'm sorry that there is only a Python crawler for now - though I warmly welcome the development of crawlers in other languages such as R and Julia. They should not be too hard to make using api.Microprediction.org.

Good luck. Enjoy.

And remember, you are not in Kaggle anymore!


About me

Hi, I'm the author of Microprediction: Building an Open AI Network, published by MIT Press. I create open-source Python packages such as timemachines, precise and humpday for benchmarking, and I maintain a live prediction exchange at www.microprediction.org which you can participate in (see docs). I also develop portfolio techniques for Intech Investments, unifying hierarchical and optimization perspectives.
