Adaptive Dynamic Portfolio Optimization via a PPO-DQN Hierarchical Reinforcement Learning Framework
Abstract
As financial markets become increasingly dynamic and complex, traditional quantitative investment models struggle to adapt to high-frequency, fast-changing trading environments, and deep reinforcement learning (DRL), with its adaptive decision-making, has become a focal point of portfolio optimization research. This study combines the policy stability of Proximal Policy Optimization (PPO) with the value-estimation ability of the Deep Q-Network (DQN), aiming to address the large fluctuations in policy updates and the difficulty of balancing risk and return in dynamic asset allocation. The model couples PPO's clipping mechanism with DQN's experience replay, using historical experience to improve long-term value prediction while constraining the magnitude of policy updates, thereby improving the robustness of investment decisions. Experiments constructing a dynamic portfolio of 15 Chinese A-share stocks (backtest period 2020-2025) show that the improved PPO algorithm with an invalid-action masking mechanism achieves a cumulative return of 74.8% and an annualized return of 33.7%, significantly higher than the original PPO (annualized return of only 2.3%). In terms of risk control, the model's maximum drawdown is 5.85% and its annualized Sharpe ratio holds steady at 1.555, outperforming the traditional risk parity model (maximum drawdown of 11.86%). Adjusting the hidden-layer configuration of the neural network raised PPO's cumulative return to 33.7% after adding a single hidden layer, verifying the effectiveness of the structural optimization. Compared with traditional machine learning models such as random forests, the framework improves annualized return by about 12% and recovers faster, showing stronger resilience during periods of extreme volatility. Data were Z-score normalized with 3σ outlier correction and split 7:1.5:1.5 (rolling window of 252 trading days); the PPO module uses a 3-layer fully connected network (128/64/32) with γ=0.95, λ=0.9, and a clipping range of [0.8, 1.2]; the DQN uses a double network (replay buffer of 10⁶ transitions, batch size 256, initial ε=0.9) combined with 4-head attention fusion, trained alternately for 500 rounds (200 episodes per round, 60 decision steps per episode) with the Adam optimizer. The results show that the PPO-DQN collaborative framework can continuously optimize the portfolio by dynamically weighing return against risk, offering an innovative approach to intelligent financial decision-making.
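To make the two headline mechanisms concrete, the sketch below illustrates PPO's ratio clipping (clip range [0.8, 1.2], i.e. ε=0.2) and logit-level invalid-action masking in PyTorch. It is a minimal illustration, not the paper's implementation: the names (PolicyNet, ppo_clip_loss), the boolean-mask convention, and the discrete-action framing are assumptions; only the 128/64/32 layer sizes and the clip range come from the abstract.

    import torch
    import torch.nn as nn

    class PolicyNet(nn.Module):
        """3-layer fully connected policy head (128/64/32), per the abstract."""
        def __init__(self, obs_dim: int, n_actions: int):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(obs_dim, 128), nn.ReLU(),
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, 32), nn.ReLU(),
            )
            self.head = nn.Linear(32, n_actions)

        def forward(self, obs: torch.Tensor, action_mask: torch.Tensor) -> torch.Tensor:
            # Invalid-action shielding: action_mask is a boolean tensor where
            # False marks infeasible trades; driving those logits to -inf makes
            # the softmax assign them zero probability before sampling.
            logits = self.head(self.body(obs))
            return logits.masked_fill(~action_mask, float("-inf"))

    def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
        """PPO surrogate loss with the probability ratio clipped to [0.8, 1.2]."""
        ratio = torch.exp(new_logp - old_logp)        # pi_new(a|s) / pi_old(a|s)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()  # negated to maximize

Masking at the logit level, rather than filtering sampled actions afterward, guarantees that infeasible trades receive zero probability mass and contribute no gradient, which is one common way to realize the invalid-action shielding the abstract credits for the performance gain.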
DOI: https://doi.org/10.31449/inf.v49i27.9966