In this article, we explore the implementation of a Deep Q-Network (DQN) agent in a custom-built trading environment using Python, TensorFlow, and OpenAI Gym. The goal is to simulate an automated trading system that makes decisions based on market data, such as buying, selling, or holding an asset, while optimizing long-term profits.
The Concept Behind the Model
Reinforcement learning (RL) is well-suited for trading systems because it allows agents to learn from interaction with the environment. In our setup:
State: The environment's current condition, represented by three key variables: the asset price, cash in hand, and the number of assets held.
Action: The agent has three possible actions—Buy, Sell, or Hold.
Reward: The reward is based on the net worth, calculated as the sum of cash and the value of held assets.
By interacting with the environment over multiple episodes, the DQN agent learns optimal strategies for trading.
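To make this mapping concrete, here is a minimal sketch of how the observation vector and a net-worth-based reward could be computed. The names make_state, compute_reward, cash, shares_held, and price are illustrative placeholders, not attributes taken from the environment shown later:

    import numpy as np

    # Pack the three state variables into the observation vector described above
    def make_state(price, cash, shares_held):
        return np.array([price, cash, shares_held], dtype=np.float32)

    # Reward based on net worth: cash plus the market value of the held assets
    def compute_reward(price, cash, shares_held):
        return cash + shares_held * price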
Setting Up the Environment
We start by designing a basic trading environment using gym.Env. This custom environment simulates an asset price following a random walk. Here's the key breakdown:
Action Space: The agent can take one of three actions: Hold, Buy, or Sell.
Observation Space: The state of the environment includes the asset price, cash in hand, and the quantity of the asset held by the agent.
Reward Function: The agent's reward is derived from its net worth, with a higher net worth indicating better performance.
    import gym
    import numpy as np
    from gym import spaces

    class TradingEnv(gym.Env):
        def __init__(self):
            # Define action and observation space
            self.action_space = spaces.Discrete(3)  # Hold, Buy, Sell
            self.observation_space = spaces.Box(low=0, high=np.inf, shape=(3,), dtype=np.float32)
            ...
The agent starts with an initial cash balance and no assets. As the agent takes actions (buying, selling, or holding), the asset price changes based on a random walk model, and the agent's balance adjusts accordingly.
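To give a fuller picture of the parts elided above, here is one way the reset and step methods could look, assuming a Gaussian random-walk price and single-unit buy/sell orders. The constants (starting price, initial cash, episode length) and attribute names such as shares_held are illustrative assumptions, not the article's original values:

        # The following methods continue the TradingEnv class above.
        def reset(self):
            self.max_steps = 200                      # episode length (assumed)
            self.current_step = 0
            self.price = 100.0                        # starting asset price (assumed)
            self.cash = 10000.0                       # initial cash balance (assumed)
            self.shares_held = 0
            return np.array([self.price, self.cash, self.shares_held], dtype=np.float32)

        def step(self, action):
            # Asset price follows a random walk
            self.price = max(1e-3, self.price + np.random.normal(0.0, 1.0))

            if action == 1 and self.cash >= self.price:        # Buy one unit
                self.cash -= self.price
                self.shares_held += 1
            elif action == 2 and self.shares_held > 0:         # Sell one unit
                self.cash += self.price
                self.shares_held -= 1
            # action == 0: Hold

            self.current_step += 1
            done = self.current_step >= self.max_steps
            reward = self.cash + self.shares_held * self.price  # net worth as reward
            obs = np.array([self.price, self.cash, self.shares_held], dtype=np.float32)
            return obs, reward, done, {}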
Designing the DQN Agent
A DQN agent learns by approximating the Q-value function, which estimates the expected future reward of each action in a given state. Our DQN agent combines three standard components:
Experience Replay: The agent stores experiences from each episode in a memory buffer to replay and learn from previous actions.
Target Network: A separate target network that updates less frequently to stabilize learning.
Epsilon-Greedy Strategy: A balance between exploration (trying new actions) and exploitation (choosing actions based on learned policies).
    from collections import deque

    class DQNAgent:
        def __init__(self, state_size, action_size):
            self.state_size = state_size
            self.action_size = action_size
            self.memory = deque(maxlen=2000)        # Experience replay buffer
            self.epsilon = 1.0                      # Initial exploration rate
            self.epsilon_min = 0.01
            self.epsilon_decay = 0.995
            self.gamma = 0.95                       # Discount factor
            self.learning_rate = 0.001
            self.model = self._build_model()        # Main model
            self.target_model = self._build_model() # Target model
            self.update_target_model()
The neural network architecture is simple: two hidden layers with 24 neurons each and ReLU activation functions. The final layer has three outputs, corresponding to the three possible actions. The agent uses mean squared error (MSE) as the loss function and the Adam optimizer for efficient training.
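The method names below (_build_model, update_target_model, remember, act, and replay) match the calls that appear in the constructor and in the training loop; their bodies are a hedged sketch consistent with the description above, not the article's exact code. The imports would sit at the top of the file.

    import random
    import numpy as np
    from tensorflow.keras import layers, models, optimizers

        # The following methods continue the DQNAgent class above.
        def _build_model(self):
            # Two hidden layers of 24 ReLU units, one linear Q-value output per action
            model = models.Sequential([
                layers.Dense(24, input_dim=self.state_size, activation='relu'),
                layers.Dense(24, activation='relu'),
                layers.Dense(self.action_size, activation='linear')
            ])
            model.compile(loss='mse',
                          optimizer=optimizers.Adam(learning_rate=self.learning_rate))
            return model

        def update_target_model(self):
            # Copy the main network's weights into the target network
            self.target_model.set_weights(self.model.get_weights())

        def remember(self, state, action, reward, next_state, done):
            # Store one transition in the replay buffer
            self.memory.append((state, action, reward, next_state, done))

        def act(self, state):
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit
            if np.random.rand() <= self.epsilon:
                return random.randrange(self.action_size)
            q_values = self.model.predict(state, verbose=0)
            return int(np.argmax(q_values[0]))

        def replay(self, batch_size):
            # Train on a random minibatch of stored transitions
            if len(self.memory) < batch_size:
                return
            minibatch = random.sample(self.memory, batch_size)
            for state, action, reward, next_state, done in minibatch:
                target = reward
                if not done:
                    target += self.gamma * np.amax(
                        self.target_model.predict(next_state, verbose=0)[0])
                target_q = self.model.predict(state, verbose=0)
                target_q[0][action] = target
                self.model.fit(state, target_q, epochs=1, verbose=0)
            if self.epsilon > self.epsilon_min:
                self.epsilon *= self.epsilon_decay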
Training the Agent
The training loop is organized as follows:
The agent resets the environment and starts in an initial state.
It selects actions based on the current state using an epsilon-greedy policy.
After each step, the agent stores the state, action, reward, and next state in its memory buffer.
Once a sufficient number of experiences are collected, the agent trains on a random batch of experiences, updating the Q-value approximations.
After each episode, the target network is updated to reflect the new learned values.
    for e in range(episodes):
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        total_reward = 0
        for time in range(env.max_steps):
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            next_state = np.reshape(next_state, [1, state_size])  # keep the batch dimension
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
            if done:
                print(f"Episode: {e+1}/{episodes}, Total Reward: {total_reward:.2f}")
                break
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)      # Train the model on a random minibatch
        agent.update_target_model()           # Update the target network after each episode
Results and Performance
At the end of each episode, the agent's total reward is recorded, representing its profitability. The results are then visualized in a plot that shows the evolution of the agent's performance over multiple episodes.
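The plotting code below assumes a results list that accumulates the per-episode totals. Since that bookkeeping step is not shown above, one simple way to collect it would be:

    results = []            # total reward recorded for each episode

    # ... inside the training loop, once an episode ends:
    results.append(total_reward)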
    import matplotlib.pyplot as plt

    # Plotting the results
    plt.plot(results, label='Total Reward per Episode')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title('Profitability of the Strategy Over Time')
    plt.legend()
    plt.show()
Conclusion
This example demonstrates how deep reinforcement learning techniques like DQN can be applied to financial trading. The agent gradually learns to optimize its strategy, balancing exploration with exploitation. This approach can be extended by incorporating more sophisticated features, such as real market data, advanced neural network architectures (e.g., LSTMs), or alternative reward structures.