I’ve been working on a side project where the goal is strategy-first trading, not signals or copy trading.
The idea is simple:
build rule-based strategies → run them live in simulation → compare performance before even thinking about execution.
A few things surprised me while building this:
• Many traders think they’re systematic, but can’t clearly explain why a trade triggered
• Real-time simulation is much harder than backtesting — especially around fees, slippage, and partial fills (rough sketch of what I mean below)
• Showing why a trade happened is often more valuable than the PnL itself
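On the fees/slippage/partial-fills point, here’s a stripped-down version of the kind of fill model I mean. A sketch only: the fee rate, slippage, and fill cap are placeholder assumptions, not anyone’s real venue parameters.

    from dataclasses import dataclass

    @dataclass
    class SimFill:
        filled_qty: float   # may be less than requested -> partial fill
        avg_price: float
        fee_paid: float

    def simulate_market_buy(qty: float, mid: float, spread: float,
                            taker_fee: float = 0.0005,   # assumed fee rate
                            slippage_bps: float = 2.0,   # assumed extra slippage
                            max_fill_qty: float = 100.0  # assumed depth cap
                            ) -> SimFill:
        # Cross half the spread, pay some extra slippage, and cap the size
        # so oversized orders come back partially filled.
        filled = min(qty, max_fill_qty)
        price = mid + spread / 2 + mid * slippage_bps / 10_000
        return SimFill(filled, price, filled * price * taker_fee)

Even a toy model like this changes which strategies look viable, which is kind of the point.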
I’m still unsure about a few things and would love perspectives from people here:
• How do you personally decide when a strategy is “ready” for real capital?
• Do you trust live paper trading more than backtests, or vice versa?
• What’s the biggest failure mode you’ve seen when people move from sim → live?
Thanks
Why do you think using a black box model is not being systematic?
Isn’t what you’re doing just paper trading? Or am I missing something?
I should clarify what I mean by “systematic.”
I don’t think black-box models are inherently non-systematic. If the inputs, training regime, risk constraints, and execution logic are fixed, that’s still a system.
The distinction I’m trying to make is more about explainability and iteration at the strategy-development stage. With many black-box approaches, it’s harder for traders to answer why a trade happened or what to adjust when performance degrades.
And yes, what I’m doing is essentially live paper trading, but with an emphasis on:
• Running multiple strategies concurrently
• Enforcing the same fee / execution logic they’d face live
• Making the trigger conditions explicit rather than implicit
I’m curious how you approach validation: do you rely more on walk-forward / backtests, or do you paper trade live before allocating real capital?
Thanks for clarifying.
I prefer stress tests with simulations/data generated from existing past unrelated data. I leave backtesting only to the pre-deployment phase, as the final validation set. I do not paper trade because the time frames I work on nowadays make it unfeasible, as my algo trades between 4 and 8 times per year.
That makes a lot of sense, especially at that trade frequency.
At 4–8 trades per year, paper trading is almost meaningless from a time-to-signal perspective, and I agree that stress-testing on synthetic or regime-shifted data is far more informative than re-running the same historical path.
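For anyone curious what “generated from unrelated data” can look like in practice, a block bootstrap of historical returns is one cheap version. A sketch, with an arbitrary block length:

    import numpy as np

    def block_bootstrap_path(returns: np.ndarray, n: int, block: int = 24,
                             rng=None) -> np.ndarray:
        # Stitch random contiguous blocks of historical returns into a synthetic
        # path: keeps short-range autocorrelation, reshuffles the regime order.
        # Assumes len(returns) > block; block length here is arbitrary.
        rng = rng or np.random.default_rng()
        chunks, total = [], 0
        while total < n:
            start = int(rng.integers(0, len(returns) - block))
            chunks.append(returns[start:start + block])
            total += block
        return np.concatenate(chunks)[:n]

Run a strategy over a few hundred of these paths and you learn more about fragility than from replaying the one historical path it was tuned on.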
The audience I’m seeing benefit most from live simulation is traders operating at much higher frequency (intra-day to short-term swing), where execution details, fee drag, and behavioral tweaks surface quickly.
When you run stress tests on generated or unrelated data, what failure modes tend to disqualify a strategy for you? Regime sensitivity, tail risk, parameter instability… something else?
You named them, that’s about it I would say
Agreed, that tends to cover most of it in practice.
Appreciate you engaging; these are the kinds of trade-offs that don’t get discussed enough outside of implementation.
That’s an approach I’d say I never thought of. Mine is to test the full 2020-to-today range if I want to check whether something really works long term, though I like your approach too. Another thing I do is take the dates of big crashes for my symbols and check how the strategy behaves there too. Thanks though, I’ll keep your approach in mind as well.
What I’ll do is first backtest signals if I want to actually check signals, and always use the next candle open to get into the trade, since past candles are already fully closed. In paper trading I do the same: for example, get the signal from the last fully closed candle and then jump in on the new candle’s open. Example: start the bot, check the current time and candle; if the current time is, say, 1:04:25, wait until a new candle appears at 1:05:00, then check whether the last candle had a signal, call it long_entry.iloc[-2]. If it had a signal, go in; if not, start over until long_entry.iloc[-2] has a signal. So I do it the same way backtesting, the same way paper trading, the same way live with real money.
That’s a clean approach, and I like the consistency you’re enforcing across all phases.
Using next candle open everywhere removes a huge class of lookahead and execution bias that a lot of people unintentionally introduce when switching between backtest, paper, and live.
The .iloc[-2] check on a fully closed candle is exactly the kind of detail that separates “signals that look good” from signals that are actually tradable (rough sketch of that loop at the end of this reply).

One thing I’ve noticed building around this workflow is that most errors don’t come from signal generation anymore, but from:
• Inconsistent candle alignment
• Mismatched fee/slippage assumptions
• State handling when multiple signals cluster
Do you ever run into edge cases around session boundaries or missing candles, or has that been mostly a non-issue in your setup?
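For anyone skimming, here’s roughly the loop being described, as a minimal sketch: fetch_candles and place_market_order are hypothetical stand-ins for whatever exchange client you use, and the EMA crossover is just a placeholder signal.

    import time
    import pandas as pd

    def example_signals(df: pd.DataFrame) -> pd.Series:
        # Placeholder signal only: fast/slow EMA crossover on closes.
        fast = df["close"].ewm(span=9, adjust=False).mean()
        slow = df["close"].ewm(span=21, adjust=False).mean()
        return (fast > slow) & (fast.shift(1) <= slow.shift(1))

    def wait_for_candle_open(interval_sec: int = 60) -> None:
        # e.g. at 1:04:25, sleep until 1:05:00.
        time.sleep(interval_sec - time.time() % interval_sec)

    def run(fetch_candles, place_market_order, interval_sec: int = 60) -> None:
        # fetch_candles() -> DataFrame with the still-forming candle last;
        # place_market_order(side) sends the order. Both are stand-ins.
        while True:
            wait_for_candle_open(interval_sec)
            candles = fetch_candles()
            long_entry = example_signals(candles)
            if long_entry.iloc[-2]:        # last FULLY CLOSED candle only
                place_market_order("buy")  # fills near the new candle's open

The only load-bearing line is the .iloc[-2]: the still-forming candle never gets a vote, in backtest, paper, or live.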
Yup, whatever I do in backtesting, I’m very well aware of what will happen in paper/live, so I build for exact 1:1 consistency in a very detailed way. I’ve never actually had any problems except when I first started; now I already know the workflow when I want to build something, so I’ve never had problems live either, since I follow exactly the same workflow. Also, I’m talking about crypto, dunno about stocks or anything else tradable since I’ve never tried. And I always use market orders. Now you’re gonna tell me about fees like many people do, but I implement my fees directly into the take profit, so my orders are always instant instead of using limits.
That makes sense; once you’ve internalized the full workflow and enforced 1:1 consistency, most of those edge cases disappear.
Especially in crypto, where continuous sessions + market orders simplify a lot of the boundary issues you’d see in equities.
Baking fees directly into TP and sticking to market orders is a pragmatic choice: you trade a bit of price uncertainty for execution certainty, which makes the system behavior far more predictable.
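For readers, the arithmetic behind baking fees into the TP is simple. A sketch, assuming proportional taker fees on both entry and exit (the fee rate is a placeholder):

    def tp_with_fees(entry: float, target_pct: float,
                     taker_fee: float = 0.0005) -> float:
        # Shift the take-profit up so the NET move after paying taker fees
        # on both legs still approximates target_pct (fees are proportional
        # to notional, so roughly 2x the fee rate gets added to the target).
        return entry * (1 + target_pct + 2 * taker_fee)

    # e.g. tp_with_fees(100.0, 0.01) -> 100 * (1 + 0.01 + 0.001) = 101.1

It is an approximation, but at typical crypto taker rates the error is negligible next to spread and slippage.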
At that point it really becomes less about debugging mechanics and more about whether the strategy logic itself survives regime changes.
Appreciate you sharing the details; it’s refreshing to see someone think through execution before worrying about indicators.
You’re welcome bro, I’ve debugged so much of everything that I have the whole workflow memorized lol. Fk indicators, that’s the easiest part. I see people backtesting simple crossovers and they’re like :O, while live the indicators move non-stop on the current candle, so almost nothing works except if you go with how I actually do it. I also never use common logic for entries. For example, for EMA crossovers I’ll implement more things, not just the crossover: step 1, crossover happens, pass; step 2, check the distance of the open from the EMA, if it’s OK, pass; check 3, RSI; and every check you can possibly think of. If everything passes, then go long, short, or whatever; if not, reset the signal and wait. If stopped out at check 1, 2, or whatever, again reset the signal and start over. The only way to be live and currently working lol.
Exactly, that’s the gap I keep seeing as well.
Indicators themselves are rarely the problem. It’s the stateful logic around them (sequencing, resets, distance checks, confirmation windows) that decides whether something survives live conditions or falls apart on the current candle.
Once you think in terms of step-based validation instead of single triggers, a lot of “working backtests” stop looking tradable very quickly.
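To make the step idea concrete, here’s a minimal sketch of that kind of gated entry. The thresholds and the specific checks are placeholders, not a recommendation:

    import pandas as pd

    def long_entry_gated(df: pd.DataFrame,
                         max_dist: float = 0.003,   # placeholder threshold
                         rsi_min: float = 50.0) -> pd.Series:
        # df needs "open" and "close" columns of fully closed candles.
        ema_fast = df["close"].ewm(span=9, adjust=False).mean()
        ema_slow = df["close"].ewm(span=21, adjust=False).mean()

        # Step 1: crossover just happened on this candle.
        crossed = (ema_fast > ema_slow) & (ema_fast.shift(1) <= ema_slow.shift(1))

        # Step 2: open is within max_dist of the fast EMA (no chasing).
        near_ema = (df["open"] - ema_fast).abs() / ema_fast <= max_dist

        # Step 3: RSI confirmation (Wilder's RSI, 14 periods).
        delta = df["close"].diff()
        gain = delta.clip(lower=0).ewm(alpha=1 / 14, adjust=False).mean()
        loss = (-delta.clip(upper=0)).ewm(alpha=1 / 14, adjust=False).mean()
        rsi = 100 - 100 / (1 + gain / loss)
        momentum_ok = rsi >= rsi_min

        # Only when every gate passes does the signal fire; any failed
        # check leaves False, which is the "reset and start over" behavior.
        return crossed & near_ema & momentum_ok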
Appreciate you spelling it out so clearly.
Ya, hopefully more people read the right approaches, since when I started I didn’t have anyone to tell me and I learned everything myself the hard way. Hopefully this helps people see what they should do exactly, instead of looking at wrong backtests, thinking they found the holy grail, going live, and then being disappointed.
Yeah, that’s really it.
Most people don’t fail because they’re bad at indicators, they fail because they trust backtests that never stood a chance of behaving the same way live.
If this thread helps even a few people avoid that “holy grail → live → disappointment” cycle, it’s already done something useful.
Holy shit, all of the OP’s replies are obvious ChatGPT.
You can also look into platforms like StrategyQuant X to assess your strategies.
Yeah, StrategyQuant X is solid, especially for large-scale generation and stress testing of rule-based systems.
The gap I’m focused on is less about strategy discovery and more about the live behavior of already-defined logic: things like state handling, sequencing, fee drag, and “does this behave the same way once candles start moving.”
I see those tools as complementary: generate and stress-test ideas there, then validate whether the logic actually survives live conditions without leaking assumptions.
Appreciate you calling it out.
Good write-up. The live simulation approach is underrated - paper trading catches things backtests miss, especially around execution assumptions.
One thing I'd add: tracking your simulated fills vs what actually happened in the market is gold. Even in paper trading, log the spread at signal time, the price 1 second later, 5 seconds later. You'll quickly see how much slippage your backtest was hiding.
The point about interpretability is well taken too. In my experience, the strategies that survive long-term are ones where you can explain why they should work, not just that they worked historically.
That’s a great addition.
Logging the spread and short-horizon price evolution around the signal is exactly where a lot of “acceptable” backtest assumptions quietly break down. Even small delays or widening spreads add up fast, especially once you move beyond toy position sizes.
I’ve found that once you start looking at those micro-windows (signal +1s, +5s), it becomes very obvious which strategies were only ever viable on paper.
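In case it’s useful to anyone, the logging can be dead simple. A sketch, where get_quote() is a hypothetical stand-in returning (bid, ask) from whatever feed you use:

    import csv
    import time

    def log_signal_snapshots(get_quote, signal_id: str,
                             path: str = "signal_snapshots.csv") -> None:
        # Record bid/ask/mid at signal time, then +1s and +5s later.
        t0 = time.time()
        rows = []
        for delay in (0.0, 1.0, 5.0):
            time.sleep(max(0.0, t0 + delay - time.time()))
            bid, ask = get_quote()
            rows.append((signal_id, delay, bid, ask, (bid + ask) / 2))
        with open(path, "a", newline="") as f:
            # columns: signal_id, delay_sec, bid, ask, mid
            csv.writer(f).writerows(rows)

A few weeks of that and the gap between backtest fills and reality stops being a matter of opinion.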
And completely agree on interpretability: strategies that survive tend to have a mechanism you can articulate, not just a statistical footprint. That usually makes it much easier to reason about when they should stop working as well.
Exactly right on the micro-window analysis. We call it "post-signal decay" internally - how quickly does your edge evaporate after the signal fires? For anything market-making adjacent, even 100ms can be the difference between profitable and underwater.
The "acceptable backtest assumptions" point deserves its own post honestly. I've seen strategies that looked great at 10 lots completely fall apart at 100 because the backtest assumed infinite liquidity at the bid/ask. Reality is messier.
Good luck with the project - sounds like you're building it the right way.
“Post-signal decay” is a great way to put it; that captures the problem perfectly.
Once you start measuring how quickly edge evaporates after the trigger, a lot of strategies stop being about prediction and start being about reaction speed, queue position, and realism around liquidity.
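Measuring it doesn’t need to be fancy, either. Averaging the kind of snapshot log from my earlier reply gives a first-pass decay curve (same hypothetical schema):

    import pandas as pd

    def post_signal_drift(log: pd.DataFrame) -> pd.Series:
        # log columns: signal_id, delay_sec, bid, ask, mid (see logging sketch).
        base = log.loc[log["delay_sec"] == 0].set_index("signal_id")["mid"]
        drift = {}
        for d in sorted(log["delay_sec"].unique()):
            mids = log.loc[log["delay_sec"] == d].set_index("signal_id")["mid"]
            drift[d] = ((mids - base) / base).mean()  # avg move vs signal-time mid
        return pd.Series(drift)

If most of the average move has already happened by +1s, the edge belongs to whoever reacts fastest, not to the signal.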
And completely agree on sizing exposing bad assumptions. Infinite liquidity at the bid/ask is one of those things that feels harmless in a backtest until you cross a threshold and the whole profile changes.
Appreciate the kind words, and the terminology. This thread has been a great reality check.
Exactly - real-time adaptation sounds good until you realize you're fitting to noise. We've seen strategies that "adapted" themselves into negative expectancy because they kept chasing phantom patterns.
The hard truth: if your edge decays fast enough to need real-time adjustment, it might not be an edge at all - just autocorrelation in noise. Stable edges tend to be structural (market microstructure, order flow) rather than statistical.
One heuristic we use: if parameter changes improve backtest but worsen forward test consistently, you're likely overfitting to regime-specific noise rather than capturing actual alpha.
I have spent 3 years landing on a rule-based platform for momentum trading (there are about 100 trades a year). I would never have gotten to the end point backtesting, as the data is just not there sub-minute and tick… so I painstakingly got there with real-time analysis.
I just spent a year building a Python-based algo platform that autonomously trades 4 algos associated with the model.
I am currently testing the platform before going live in January.
My biggest concerns are slippage and partial fills. I am with IBKR. The platform trades options on the SPX.
Trades top out at $10,000 VAR (10-50 contracts). If successful, this will evolve to $50-100K VAR spread over 10 accounts (100-500 contracts).
I am building the platform with two options for trade execution:
1) Adaptive algo limit order with step-out offset to bid/ask, set on Urgent for fill
2) Rel limit order with % offset to bid/ask
I am not sure which will give me the best results w.r.t. slippage, adverse selection, and partial fills. I’m also hoping to engage a ‘plumber’ to help keep my order flow off the radar and to get complete fills as fast as possible.
Any feedback would also be appreciated….
That’s a solid amount of work, especially getting there without reliable sub-minute historical data.
At those sizes and instruments, you’re right to be thinking about slippage, partials, and adverse selection before going live; those tend to dominate outcomes more than signal quality once VAR scales.
In my experience (and from what I’ve seen others run into), the trade-off you’re navigating is roughly:
• More aggressive/adaptive logic → better fill probability, but higher adverse selection risk
• More passive relative limits → cleaner fills when they happen, but higher opportunity cost and partials
A lot of teams I’ve spoken to end up learning more from instrument-specific live simulation and detailed fill logging than from theoretical optimization alone, especially around how often urgency actually helps versus just crossing spread at the wrong moments.
IBKR’s adaptive algos are convenient, but they can also hide micro-decisions that make it harder to reason about why fills behaved a certain way, which becomes important when you start scaling.
Sounds like you’re asking the right questions at the right time. Curious how you’re planning to evaluate the two approaches side-by-side before January.
Thanks for your feedback.
My beta system testing will only annotate the entries and exits on the screen.
I plan to test between the two order approaches during a 12-month pilot by tracking trade execution details and timing (order sent, order fill, fill completion, etc.). I will be manually switching between order types to, I hope, pretty clearly identify the winner. The pilot is capped at $10,000 orders (30-50 contracts).
I am worried most about what will be the best methodology.
That sounds like a sensible plan overall, and the fact that you’re capping size and explicitly logging execution timestamps is already a big step in the right direction.
The main methodological risk I’d watch for is manual switching across time: if the order types aren’t exposed to broadly comparable market conditions, it becomes hard to separate execution quality from regime effects.
One approach I’ve seen work better than pure sequential testing is:
• predefine the evaluation metrics (fill rate, slippage vs mid, adverse selection post-fill, completion time)
• keep those fixed for the entire pilot
• and, where possible, alternate or randomize order type selection so both approaches experience similar conditions over time
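A tiny sketch of that randomize-and-log piece, with placeholder metric names and file format:

    import csv
    import random
    import time

    METRICS = ("fill_rate", "slippage_vs_mid", "adverse_move_5s", "completion_sec")

    def choose_order_type(rng: random.Random) -> str:
        # Randomizing per trade exposes both order types to similar conditions,
        # instead of each one owning a different stretch of the calendar.
        return rng.choice(["adaptive_limit", "rel_limit"])

    def log_execution(order_type: str, values: dict,
                      path: str = "pilot_log.csv") -> None:
        # Keep the metric schema fixed for the whole pilot.
        row = [time.time(), order_type] + [values[m] for m in METRICS]
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow(row)

Even a coin flip per trade beats switching monthly, because regime effects wash out across both arms.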
Even then, the answer is often probabilistic rather than definitive; the goal is usually to understand when each approach degrades, not just which one “wins” on average.
Sounds like you’re asking the right questions before committing real size, which is usually the hardest part.
Good finds for sure. I do think deciding when a strat is “ready” should be objective though, like driven by metrics that don’t lie to you. Same goes for going from paper to live. You can have a Sharpe >3 in sim, but if your backtesting engine isn’t even remotely realistic about fees, slippage, latency, partial fills etc., you’re just setting yourself up to become liquidity. Paper trading is better because it’s at least real market data and timing, but even paper is optimistic on fills (and impact, but only if you’re a whale), so edge often disappears again once you actually go live. So yeah, backtest, paper, and live all lie in different ways, and the trick is getting consistency across the three without overfitting any single layer.
The biggest failure mode moving from paper to live, though, is honestly the f*cking human. Stuff like “hmm, this is trading too little, let’s loosen something” or “let’s add more assets so it finds more opportunities,” or the pinnacle of them all: tweaking risk after a drawdown. Once you do that you basically mess up the experiment itself. You’re no longer observing the system you tested; you’re creating mixed data that’s hard to interpret afterwards because you intervened halfway. Most sim-to-live failures aren’t because the idea is terrible, but because the operator panicked and couldn’t leave it alone long enough to see it behave as intended. I honestly think algotrading is a way to get human emotions out of the loop, bc they fuck shit up... ask my manual portfolio...
The biggest failure mode is that they were not rigorous in their building/testing. They had leaky data. They used data that primarily represented a single market condition. They didn't use enough data. They override the model when it goes live. They can't handle drawdowns emotionally because they are overallocated. All the stuff that happens live that they don't have rules for because they have never seen it before since their experience is six months total.
I did all of these at the beginning (and the not-so-beginning).
That’s a really good summary.
A lot of “sim → live” failures aren’t caused by a single flaw, but by a stack of small shortcuts compounding: leaky data, narrow regimes, unrealistic sizing, and then human overrides once drawdowns show up.
The part about not having rules for situations you’ve never seen really resonates. Live markets surface edge cases you simply don’t encounter in a few months of testing, and without predefined responses, discretion sneaks back in.
Appreciate you calling out both the technical and behavioral sides, most people only learn that the hard way.
When it’s profitable after 1 month of testnet / paper trading.
Yes. The DEX I use (Everstrike) has a very realistic testnet with the same fees/liquidity as mainnet.
Trusting sim too much. For example, Binance testnet doesn’t mimic Binance mainnet at all.
That’s fair, especially if the testnet genuinely mirrors mainnet conditions, which unfortunately isn’t true for a lot of venues.
I think the key nuance is what that month actually contains. A profitable month that only spans one volatility regime or liquidity profile can still be very fragile, whereas a shorter but more diverse sample sometimes tells you more.
I’ve also found that the biggest risk with trusting sims isn’t the PnL itself, but the false confidence they create when subtle things (queue position, partials, latency, changing spreads) aren’t modeled the same way live.
So I tend to treat paper/testnet as a behavioral and execution sanity check, not a green light by itself, especially when scaling beyond small size.
When you talk about a ‘custom strategy’, you’re falling into a typical overfitting problem. You might backtest the strategy, remove unwanted indicators, blacklist less performant patterns or whatever, but when you go live, it will bite you. In simple terms, be careful when developing rule-based strategies.
That’s a valid concern: unconstrained customization is one of the fastest ways to overfit.
When I say “custom strategy,” I’m not thinking in terms of endlessly pruning indicators or blacklisting patterns after the fact. The intent is closer to explicit, constrained rule sets that are defined before testing, then validated across regimes without being tweaked midstream.
In practice, most failures I’ve seen come from treating flexibility as a feature instead of a liability; once you start optimizing logic to recent outcomes, you’re already in trouble.
Appreciate the caution, it’s an important one to keep front and center.
Backtests are useful, but they then need to be substantiated with live demo trading, and that in turn needs to be substantiated with small-equity live trading over a decent period of time. Then scale equity and position size as your model proves real-world practical returns.
The biggest failures are not accounting for the psychology of real money (e.g. losses flashing across the screen), choosing a broker with terrible spreads, and miscalculating leverage into the risk profile.
That's a cool project! I totally agree about the systematic thinking thing. It's way harder to be truly rule-based than most people (including myself sometimes) want to admit. The real-time simulation point is huge too. Backtesting is nice, but doesn't always translate.
For me, a strategy is 'ready' when the live sim PnL is consistently beating buy-and-hold *after* fees and slippage, over a meaningful timeframe (like a few months at least). Even then, I start with tiny positions. As for paper vs. backtesting, I trust live paper trading more, provided your simulation is realistic. Garbage in, garbage out, right? The biggest failure I've seen is people not accounting for black swan events or unexpected volatility. Strategies that appear effective in normal times can be quickly compromised when things become chaotic.
You need faith in your algo. When I first started live testing I used to mess around with my trades, and that reduced profits.