System Parameter Permutation FAQ

  • The SPP process you've described requires the system developer to choose the scan (i.e., parameter) ranges to evaluate. Why did you leave this choice up to the system developer? Doesn't the choice of the parameters to use introduce bias and invalidate the results? 

We understand leaving the choice of parameters up to the system developer is something some people are not comfortable with. Understandably, it would be preferable to use a methodology where all aspects of the process are completely decision-free. As a result, this is probably the most commonly asked question about SPP. 

The short answer is “no,” leaving the choice of parameters to test up to the developer does not invalidate the results. It does, however, force the developer to exercise judgment BEFORE optimizing a system over a huge range of parameters. If you cannot determine a reasonable range of parameters to test before optimizing, then the median result from the full optimization is the most logical “point estimate” (see additional information on point estimates in an answer below), which is why SPP uses it. 

Note, however, that the paper does not advocate relying on any one point estimate, including the median, nor does it discourage using other methods to reduce Data Mining Bias (DMB). Rather, the paper proposes looking at the entire range of SPP results and making probabilistic decisions based on them.

Unfortunately, there is no substitute for a system developer exercising judgment about how the system is expected to work, why, and under what conditions. We believe these decisions should be made before letting the computer churn through many, many combinations to see, in hindsight, the best range of parameters to use. Selecting based on performance using hindsight is the very definition of Data Mining Bias.

The paper refers the reader to other sources that provide some guidance on how to determine appropriate scan ranges: Kaufman (2013) and Pardo (2008).

  • Ok, but isn't there a way to automate the size of the scan ranges so the choice isn't up to the system developer?

Heuristics (“rules of thumb”) may be defined based on experience. However, a given set of heuristics would probably not be broadly applicable to all types of trading systems. It would probably not be reasonable to apply the same size scan ranges to different timeframes or indicator types. If +/- 25% is reasonable to apply to, say, a moving average length, does it make sense to apply it to a normalized indicator such as Wilder’s RSI?

For example, if a system is looking for a long-term uptrend by filtering on a close price above the 200-bar simple moving average, the system may still work when the length of the moving average is varied +/- 25% (between 150 bars and 250 bars). Another system may use RSI(14) > 50 as a trend indicator. Is it valid to vary the RSI length +/- 25% (11 to 17)? And then what about the threshold value of 50? Should that also be varied +/- 25%? And what constitutes 25% variation for the RSI threshold value? Should it be varied by 25% of 50 (between 37.5 and 62.5)? Or should the application be different because the RSI threshold value is normalized (already represents a percentage)? Imagine applying 25% variation to a value of 90. The upper end of the range would be 112.5, which is not possible because RSI is bounded between 0 and 100. Applying this heuristic blindly leads to unexpected behavior.
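The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the blanket +/- 25% heuristic, not a recommended way to set scan ranges; the function name `pct_scan_range` and the labels are hypothetical.

```python
def pct_scan_range(center, pct=0.25):
    """Return the (low, high) scan range for a +/- pct band around center."""
    return center * (1 - pct), center * (1 + pct)

RSI_MAX = 100  # RSI is bounded between 0 and 100

# Apply the same blanket heuristic to every parameter from the example.
candidates = {
    "SMA length 200": pct_scan_range(200),
    "RSI length 14": pct_scan_range(14),
    "RSI threshold 50": pct_scan_range(50),
    "RSI threshold 90": pct_scan_range(90),
}

for name, (low, high) in candidates.items():
    note = ""
    if name.startswith("RSI threshold") and high > RSI_MAX:
        note = "  <-- exceeds the RSI maximum of 100"
    print(f"{name}: {low} to {high}{note}")
```

The heuristic gives a plausible range for the moving average but an impossible upper bound for the RSI threshold of 90, which is the point of the example.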

Again, our opinion is that optimization (and thus SPP) scan ranges should be well thought-out before the optimization process starts.

  • In SPP, the scan range determines the median, which is a key aspect of the evaluation results. If the scan range is too wide, the results you get will be based on data that is essentially irrelevant because the trader would never actually pick that combination of parameters to trade. If the range is too narrow, then the entire exercise is pointless. It seems like leaving the choice of system parameters up to the system developer makes SPP less useful.

At its core, this question is about optimizing the range of parameters to use with SPP in order to get a more favorable outcome. If you would never trade a certain set of parameters, then exclude them from your initial optimization and they will be excluded from SPP. 

The issue is most people scan a huge range of parameters because they don’t know what range to use until they look at all possible values, which is what leads to Data Mining Bias in the first place. In other words, it is very dangerous to say you want to be able to scan a huge range of potential parameters, and then after you get the answer of which parameter range seems to work well, you decide you would only trade a portion of the scan range because that is what gave you the most stable set of results. You only know the stable set of parameters in hindsight (i.e., after the initial, broad optimization). Again, selecting based on performance using hindsight is the very definition of Data Mining Bias. 

Aronson (2007) shows that the more combinations of parameter values you test, the higher the amount of DMB you introduce. This is a critical point. If you want to reduce DMB, you either should not test a huge range of parameters or you should assume that the distribution of results across the entire spectrum of optimization parameters, which is what SPP uses, is the most reasonable range to evaluate.
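The effect of testing more combinations can be illustrated with a small simulation. This is a minimal sketch under our own simplifying assumptions, not a reproduction of Aronson's experiments: every variant is assumed to have zero true edge, its back-test result is modeled as standard normal noise in arbitrary units, and the function name `mean_best_of` is hypothetical.

```python
import random
import statistics

def mean_best_of(n_variants, n_trials=200, seed=0):
    """Average, over n_trials, of the best result among n_variants
    variants whose true edge is zero (every result is pure noise)."""
    rng = random.Random(seed)
    best_results = [
        max(rng.gauss(0.0, 1.0) for _ in range(n_variants))
        for _ in range(n_trials)
    ]
    return statistics.mean(best_results)

# The best observed result grows with the number of variants tested,
# even though no variant has any real edge.
for n in (1, 10, 100, 1000):
    print(f"{n:>5} variants tested -> average best result {mean_best_of(n):+.2f}")
```

Picking the single best variant after the fact rewards luck, and the amount of luck available grows with the size of the search.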

  • How do you choose the optimal SPP range?

The paper refers the reader to other sources that provide some guidance on how to determine appropriate scan ranges: Kaufman (2013) and Pardo (2008).

  • What are the "parameter scan ranges"? Is this the beginning and end of the data, or does it mean using, say, 5-day, 20-day, and 60-day lengths for a moving average?

Parameter scan ranges are simply the minimum and maximum values you select for each parameter. For example, if you want to test a variety of simple moving averages with lengths that vary between 50 and 250 bars, your scan range begins at 50 and ends at 250.

  • The paper says that the scan range is divided into observation points. What is an observation point, and why do the observation points matter?

An observation point is a specific value within the parameter scan range you plan to evaluate. For example, going back to the example in answer to the question above, if your scan range begins at 50 and ends at 250, you could choose to evaluate the following specific moving average values: 50, 75, 100, 150, 200, and 250. The six values are observation points.

Observation points are important because they are the specific values that produce the actual tested combinations for optimization and SPP. The spacing of observation points should also be well thought-out. If the spacing is too wide, performance may vary considerably between points, and you will be blind to that variation because those values are never tested. If the spacing is too narrow, adjacent points may provide largely redundant information. Kaufman (2013) provides guidelines on choosing observation point spacing.

  • Do observation points have anything to do with the time frame selected for back-testing?

No. The SPP method uses the same time frame for all back-tests, regardless of the system parameters and observation points.

  • Can I use any trading system software, or similar back-testing processes to execute the SPP process?  

Any software that allows you to run an exhaustive optimization across a range of parameter values, and allows you to save the results from those optimizations should enable you to perform the SPP process. The vast majority of off-the-shelf software provides this capability. In fact, the ability to perform the process with off-the-shelf software was one of the driving factors behind why I published this particular validation process (we use many other proprietary validation processes that require customized software and code).

  • You talk about “system variants” in the paper. What is a system variant? Is it a variation of the key concept underlying the system? Or a variation of the parameter scan range? 

A system variant is a unique combination of system parameters. You end up with one system variant for each optimization permutation you perform. For example, let’s say you have a system that uses a long-term simple moving average that varies in length between 50 and 250 (i.e., the scan range for the moving average); this system also uses a short term simple moving average that varies in length between 5 and 25.

If you choose to evaluate six specific values within the long-term moving average scan range (e.g., 50, 75, 100, 150, 200, 250), and six values within the short-term moving average scan range (e.g. 5, 7, 9, 12, 16, 20), you will have 36 system variants – one for each combination of parameter values. 
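As a minimal sketch, the system variants in this example are simply the Cartesian product of the two sets of observation points. The variable names below are hypothetical and not tied to any particular back-testing package.

```python
from itertools import product

# Observation points from the example above.
long_ma_points = [50, 75, 100, 150, 200, 250]
short_ma_points = [5, 7, 9, 12, 16, 20]

# One system variant per unique combination of parameter values.
system_variants = [
    {"long_ma": long_len, "short_ma": short_len}
    for long_len, short_len in product(long_ma_points, short_ma_points)
]

print(len(system_variants))  # 36
print(system_variants[0])    # {'long_ma': 50, 'short_ma': 5}
```

Adding a third parameter with six observation points would multiply the count again, giving 216 variants.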

  • In the paper, you used SPP to evaluate [ABC] system metrics. I typically use metrics like the Sharpe Ratio, Sortino Ratio, and Ulcer Index to evaluate my trading systems. Can I use SPP to get a range of estimates of how these metrics might perform in the future?

Absolutely. Any metric your back-test software can produce as part of the exhaustive optimization process can be used with SPP.
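As a rough sketch, suppose your software exports one row per system variant with one column per metric. The column names and the numbers below are invented purely for illustration, and the helper `metric_medians` is hypothetical. The same summary can then be computed for whichever metrics you care about.

```python
import csv
import io
import statistics

# Hypothetical export: one row per system variant. Values are made up.
EXPORT = """\
long_ma,short_ma,sharpe,sortino,ulcer_index
50,5,0.41,0.58,12.3
75,7,0.55,0.80,9.8
100,9,0.62,0.91,8.1
150,12,0.30,0.44,14.0
200,16,0.75,1.10,7.5
"""

def metric_medians(csv_text, metrics):
    """Return the median across all variants for each requested metric column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {m: statistics.median(float(row[m]) for row in rows) for m in metrics}

medians = metric_medians(EXPORT, ["sharpe", "sortino", "ulcer_index"])
print(medians)
```

In practice you would read the file your back-testing software actually saves, with many more rows than the five shown here.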

  • I usually forward test my systems. By that, I mean I will design a system, and then either paper trade it or trade it with very small position sizes. This allows me to get “real” results rather than relying on theoretical estimates of future returns, which is what you get via SPP. Why would I use a theoretical process like SPP when I can just forward test?

I completely agree that forward testing is an invaluable part of a larger system validation process. We forward test all systems before we trade them, and we recommend all of our customers forward test their systems. 

That said, a forward test is one of many validation techniques we use. Every validation technique has its strengths and weaknesses. For example, forward testing gives you results based on out of sample data – in other words, a sample that doesn’t suffer from the Data Mining Bias (DMB), which is great. 

But forward testing also has several weaknesses. First, forward testing only gives you a single data point as an estimate of future performance. For example, if you take 50 trades during your forward test, and your max drawdown during the 50 trades is 10%, you only have a single estimate (10%) of future results. Is that 10% result representative of what you can expect in the future? Hard to tell. You could have just had a certain type of market for those 50 trades that is unlikely to repeat anytime soon.

A related consideration is that forward testing requires a large number of trades to produce realistic, statistically significant results. You could easily forward test for several years to get a statistically valid sample of how your system will perform across different types of markets. While SPP has its own flaws, one of its strengths is that it doesn't suffer from the same weaknesses as forward testing, which makes the two complementary system validation methods.

  • Can you explain in more detail what you mean by the term “point estimate”, and why it isn't a good way to estimate future system performance?

A point estimate is a single number. As an example, let’s say you back-test Trading System A, and the results show that System A produced a 10% Compounded Annual Return (CAR). 10% CAR is a point estimate – a single number. In contrast, a best practice in forecasting anything, including future trading system performance, is to use a range of possible outcomes based on meaningful probabilities.

This is one of the strengths of SPP – it gives you a range of possible performance. For System A, where the point estimate told you the system might provide a 10% CAR going forward, SPP would give you a range of possible CAR values depending on the optimization data you feed into the process. For example, SPP could tell you that System A is expected to return a CAR between 4% and 12% with 95% confidence. While having a range of potential performance is often less satisfying than having a single number, a range is typically more useful and accurate.
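One simple way to turn a set of per-variant results into such a range is to read percentiles off the sorted results. The sketch below is an assumption on our part about how this could be computed, not a restatement of the paper's exact procedure. The CAR values are synthetic placeholders generated from a random number generator; in practice they would be the CAR of each system variant from your optimization run. The function name `central_range` is hypothetical.

```python
import random
import statistics

# Synthetic placeholder data: one CAR value (in %) per system variant.
rng = random.Random(1)
car_by_variant = [rng.gauss(8.0, 2.0) for _ in range(4000)]

def central_range(values, coverage=0.95):
    """Return (low, median, high), where low and high bound the central
    `coverage` share of the sorted values."""
    ordered = sorted(values)
    tail = (1.0 - coverage) / 2.0
    last = len(ordered) - 1
    low = ordered[int(tail * last)]
    high = ordered[int((1.0 - tail) * last)]
    return low, statistics.median(ordered), high

low, median, high = central_range(car_by_variant)
print(f"median CAR {median:.1f}%, central 95% range {low:.1f}% to {high:.1f}%")
```

The median is the point estimate discussed earlier, while the low and high values describe the spread around it.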

  • Using Out of Sample (OOS) testing to validate systems seems like a well-accepted practice. Why not just use OOS testing instead of SPP?

Before getting into the specifics, as we mention above, we fully support using additional methods to validate your system beyond SPP.

OOS is a type of cross-validation testing. Cross-validation testing relies on “virgin” data, meaning the data can be used only during the OOS test and even then, used only once. Even for system developers who try to do ‘pure’ OOS testing, it is very difficult in practice not to modify your rules based on the outcome of OOS testing results. And if you do make changes based on the OOS data, the data is now considered ‘in-sample’, which means your results are just as likely to suffer from DMB.

OOS testing also requires you to split your data set and not use part of it to build your system. Let’s say you have 10 years of data and you decide to make 7 years in-sample and 3 years out-of-sample. You have just reduced the data available for building your system by 30%.

Also, cross-validation does not solve the “point estimate” problem. Instead, it creates an unbiased point estimate on a limited amount of data. Michael Harris wrote a blog article showing that if you search long enough you will find a worthless system idea that easily “passes” cross-validation. Cross-validation produces two point estimates (an in-sample and an out-of-sample) which may be compared; this is still very limiting. Instead, SPP generates a complete sampling distribution of results (thousands of point estimates). You can use the distribution to calculate confidence intervals. Cross-validation does not enable that.

Closely related to the above, what happens when market conditions are completely different in-sample (IS) vs. out-of-sample (OOS)? You could end up throwing away a perfectly good system because the in-sample and out-of-sample results just happen to vary significantly.

  • Can SPP be used to evaluate a single indicator? I’ve tried it, but the results don’t really make sense to me.

SPP is not designed to evaluate indicators outside of a more comprehensive trading system. In brief, just evaluating a single indicator does not enable enough random interactions. For example, the system used in the paper had four parameters and resulted in over 4,000 different combinations. It is unlikely you can get enough meaningful samples just using a single indicator.  

SPP is a method intended for application to a complete system in a portfolio context, meaning inclusion of filters, setups, entries, exits, signal ranking rules, money management rules and also including commissions and slippage. 

One of the finer points I make in the paper is that variation in the parameter values in the optimization ranges leads to variations in the way ALL the system components interact. This variation leads to randomness in trading results. Some entries are earlier, some later, some trades are shorter, some longer, etc. You can do the same thought exercise for all system components. This randomness allows the creation of the SPP distribution from which you calculate probabilities.

SPP is really not appropriate to evaluate an indicator because you are effectively just varying one parameter. This does not allow you to see the random interactions of all the system components I mentioned above. The system component interactions are the key drivers of the randomness that leads to the distribution of results. Testing an indicator by itself does not allow these interactions.