In this article, we explore the implementation of a Deep Q-Network (DQN) agent in a custom-built trading environment using Python, TensorFlow, and OpenAI Gym. The goal is to simulate an automated trading system that makes decisions based on market data, such as buying, selling, or holding an asset, while optimizing long-term profits.
The Concept Behind the Model
Reinforcement learning (RL) is well-suited for trading systems because it allows agents to learn from interaction with the environment. In our setup:
State: The environment's current condition, represented by three key variables: the asset price, cash in hand, and the number of assets held.
Action: The agent has three possible actions—Buy, Sell, or Hold.
Reward: The reward is based on the net worth, calculated as the sum of cash and the value of held assets.
By interacting with the environment over multiple episodes, the DQN agent learns optimal strategies for trading.
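To make this mapping concrete, here is a minimal sketch of how the observation vector and a net-worth-based reward could be computed. The names make_state, compute_reward, cash, shares_held, and price are illustrative placeholders, not attributes taken from the environment shown later:

    import numpy as np

    # Pack the three state variables into the observation vector described above
    def make_state(price, cash, shares_held):
        return np.array([price, cash, shares_held], dtype=np.float32)

    # Reward based on net worth: cash plus the market value of the held assets
    def compute_reward(price, cash, shares_held):
        return cash + shares_held * price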
Setting Up the Environment
We start by designing a basic trading environment using gym.Env. This custom environment simulates an asset price following a random walk. Here's the key breakdown:
Action Space: The agent can take one of three actions: Hold, Buy, or Sell.
Observation Space: The state of the environment includes the asset price, cash in hand, and the quantity of the asset held by the agent.
Reward Function: The agent's reward is derived from its net worth, with a higher net worth indicating better performance.
    import gym
    import numpy as np
    from gym import spaces

    class TradingEnv(gym.Env):
        def __init__(self):
            # Define action and observation space
            self.action_space = spaces.Discrete(3)  # Hold, Buy, Sell
            self.observation_space = spaces.Box(low=0, high=np.inf, shape=(3,), dtype=np.float32)
            ...
The agent starts with an initial cash balance and no assets. As the agent takes actions (buying, selling, or holding), the asset price changes based on a random walk model, and the agent's balance adjusts accordingly.
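To give a fuller picture of the parts elided above, here is one way the reset and step methods could look, assuming a Gaussian random-walk price and single-unit buy/sell orders. The constants (starting price, initial cash, episode length) and attribute names such as shares_held are illustrative assumptions, not the article's original values:

        # The following methods continue the TradingEnv class above.
        def reset(self):
            self.max_steps = 200                      # episode length (assumed)
            self.current_step = 0
            self.price = 100.0                        # starting asset price (assumed)
            self.cash = 10000.0                       # initial cash balance (assumed)
            self.shares_held = 0
            return np.array([self.price, self.cash, self.shares_held], dtype=np.float32)

        def step(self, action):
            # Asset price follows a random walk
            self.price = max(1e-3, self.price + np.random.normal(0.0, 1.0))

            if action == 1 and self.cash >= self.price:        # Buy one unit
                self.cash -= self.price
                self.shares_held += 1
            elif action == 2 and self.shares_held > 0:         # Sell one unit
                self.cash += self.price
                self.shares_held -= 1
            # action == 0: Hold

            self.current_step += 1
            done = self.current_step >= self.max_steps
            reward = self.cash + self.shares_held * self.price  # net worth as reward
            obs = np.array([self.price, self.cash, self.shares_held], dtype=np.float32)
            return obs, reward, done, {}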
Designing the DQN Agent
A DQN agent learns by approximating the Q-value function, which estimates the expected future reward of each action in a given state. Our DQN agent combines three standard components:
Experience Replay: The agent stores experiences from each episode in a memory buffer to replay and learn from previous actions.
Target Network: A separate target network that updates less frequently to stabilize learning.
Epsilon-Greedy Strategy: A balance between exploration (trying new actions) and exploitation (choosing actions based on learned policies).
    from collections import deque

    class DQNAgent:
        def __init__(self, state_size, action_size):
            self.state_size = state_size
            self.action_size = action_size
            self.memory = deque(maxlen=2000)        # Experience replay buffer
            self.epsilon = 1.0                      # Initial exploration rate
            self.epsilon_min = 0.01
            self.epsilon_decay = 0.995
            self.gamma = 0.95                       # Discount factor
            self.learning_rate = 0.001
            self.model = self._build_model()        # Main model
            self.target_model = self._build_model() # Target model
            self.update_target_model()
The neural network architecture is simple: two hidden layers with 24 neurons each and ReLU activation functions. The final layer has three outputs, corresponding to the three possible actions. The agent uses mean squared error (MSE) as the loss function and the Adam optimizer for efficient training.
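The method names below (_build_model, update_target_model, remember, act, and replay) match the calls that appear in the constructor and in the training loop; their bodies are a hedged sketch consistent with the description above, not the article's exact code. The imports would sit at the top of the file.

    import random
    import numpy as np
    from tensorflow.keras import layers, models, optimizers

        # The following methods continue the DQNAgent class above.
        def _build_model(self):
            # Two hidden layers of 24 ReLU units, one linear Q-value output per action
            model = models.Sequential([
                layers.Dense(24, input_dim=self.state_size, activation='relu'),
                layers.Dense(24, activation='relu'),
                layers.Dense(self.action_size, activation='linear')
            ])
            model.compile(loss='mse',
                          optimizer=optimizers.Adam(learning_rate=self.learning_rate))
            return model

        def update_target_model(self):
            # Copy the main network's weights into the target network
            self.target_model.set_weights(self.model.get_weights())

        def remember(self, state, action, reward, next_state, done):
            # Store one transition in the replay buffer
            self.memory.append((state, action, reward, next_state, done))

        def act(self, state):
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit
            if np.random.rand() <= self.epsilon:
                return random.randrange(self.action_size)
            q_values = self.model.predict(state, verbose=0)
            return int(np.argmax(q_values[0]))

        def replay(self, batch_size):
            # Train on a random minibatch of stored transitions
            if len(self.memory) < batch_size:
                return
            minibatch = random.sample(self.memory, batch_size)
            for state, action, reward, next_state, done in minibatch:
                target = reward
                if not done:
                    target += self.gamma * np.amax(
                        self.target_model.predict(next_state, verbose=0)[0])
                target_q = self.model.predict(state, verbose=0)
                target_q[0][action] = target
                self.model.fit(state, target_q, epochs=1, verbose=0)
            if self.epsilon > self.epsilon_min:
                self.epsilon *= self.epsilon_decay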
Training the Agent
The training loop is organized as follows:
The agent resets the environment and starts in an initial state.
It selects actions based on the current state using an epsilon-greedy policy.
After each step, the agent stores the state, action, reward, and next state in its memory buffer.
Once a sufficient number of experiences are collected, the agent trains on a random batch of experiences, updating the Q-value approximations.
After each episode, the target network is updated to reflect the new learned values.
    for e in range(episodes):
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        total_reward = 0
        for time in range(env.max_steps):
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            next_state = np.reshape(next_state, [1, state_size])  # keep the batch dimension
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
            if done:
                print(f"Episode: {e+1}/{episodes}, Total Reward: {total_reward:.2f}")
                break
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)      # Train the model on a random minibatch
        agent.update_target_model()           # Update the target network after each episode
Results and Performance
At the end of each episode, the agent's total reward is recorded, representing its profitability. The results are then visualized in a plot that shows the evolution of the agent's performance over multiple episodes.
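The plotting code below assumes a results list that accumulates the per-episode totals. Since that bookkeeping step is not shown above, one simple way to collect it would be:

    results = []            # total reward recorded for each episode

    # ... inside the training loop, once an episode ends:
    results.append(total_reward)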
    import matplotlib.pyplot as plt

    # Plotting the results
    plt.plot(results, label='Total Reward per Episode')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title('Profitability of the Strategy Over Time')
    plt.legend()
    plt.show()
Conclusion
This example demonstrates how deep reinforcement learning techniques like DQN can be applied to financial trading. The agent gradually learns to optimize its strategy, balancing exploration with exploitation. This approach can be extended by incorporating more sophisticated features, such as real market data, advanced neural network architectures (e.g., LSTMs), or alternative reward structures.