LLM Stock Market Showdown: Eight-Month Backtest
Description
The podcast describes an experiment called the AI Trade Arena, created to evaluate the predictive and analytical capabilities of large language models in the financial markets. Researchers ran an eight-month backtest simulation from February to October 2025, giving five major LLMs, including GPT-5, Grok, and Gemini, $100,000 in paper capital to execute daily stock trades. To keep the results valid, all external information, such as news APIs and market data, was strictly time-filtered so the models could not access future outcomes. The headline finding was that Grok and DeepSeek were the top performers, a result largely attributed to their tendency to build tech-heavy portfolios. The project emphasizes transparency, publishing the reasoning behind every trade, and plans to move from simulation to live paper trading and eventually real-world trading to refine model evaluation.
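
The time-filtering step is the core safeguard against lookahead bias: on each simulated trading day, the model sees only news and price data stamped before that day's cutoff. Below is a minimal Python sketch of that idea, assuming pandas DataFrames of timestamped news items and price bars; the function names (filter_to_cutoff, build_model_context) and column names are hypothetical illustrations, not the project's actual code.

```python
# Minimal sketch of point-in-time filtering to prevent lookahead bias.
# All names here are hypothetical and not taken from the AI Trade Arena codebase.
from datetime import datetime, timezone
import pandas as pd

def filter_to_cutoff(df: pd.DataFrame, cutoff: datetime, ts_col: str) -> pd.DataFrame:
    """Keep only rows whose timestamp is strictly before the simulated 'now'."""
    return df[df[ts_col] < cutoff]

def build_model_context(news: pd.DataFrame, prices: pd.DataFrame, sim_date: datetime) -> dict:
    """Assemble the daily prompt inputs using only data available before sim_date."""
    visible_news = filter_to_cutoff(news, sim_date, ts_col="published_at")
    visible_prices = filter_to_cutoff(prices, sim_date, ts_col="bar_close_time")
    return {
        "as_of": sim_date.isoformat(),
        "news": visible_news.to_dict(orient="records"),
        # Show only a recent window of price history (e.g. the last 30 bars).
        "prices": visible_prices.tail(30).to_dict(orient="records"),
    }

if __name__ == "__main__":
    # Example: simulate the trading day of 2025-03-03. Any article or price bar
    # stamped on or after that moment is excluded from what the model sees.
    news = pd.DataFrame({
        "published_at": [datetime(2025, 3, 1, tzinfo=timezone.utc),
                         datetime(2025, 3, 4, tzinfo=timezone.utc)],
        "headline": ["Chipmaker beats earnings", "Surprise rate decision"],
    })
    prices = pd.DataFrame({
        "bar_close_time": [datetime(2025, 3, 2, tzinfo=timezone.utc),
                           datetime(2025, 3, 5, tzinfo=timezone.utc)],
        "close": [101.2, 97.8],
    })
    ctx = build_model_context(news, prices, datetime(2025, 3, 3, tzinfo=timezone.utc))
    print(ctx["news"])    # only the 2025-03-01 headline survives the filter
    print(ctx["prices"])  # only the 2025-03-02 bar survives
```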





