During the last few years, the volume of trading activities carried out electronically has increased significantly. According to a recent study, in 2019 around 92% of trading in the Forex market was performed by algorithms, and algorithmic trading is expected to grow at a compound annual growth rate (CAGR) of 11.23% from 2021 to 2026.

The purpose of this study is to develop an algorithmic trading system based on deep reinforcement learning and test it against some already well-known trading strategies.

Reinforcement learning is defined as “the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment”. At the core of a reinforcement learning model lies the interaction of two components:

- Agent: the entity that learns the optimal policy, i.e. the set of optimal actions to take in a particular situation in order to maximize a user-defined reward function;

- Environment: the “world” of the agent, where the learning process takes place. When the agent takes an action, the environment returns a new state and the agent moves into it; this process continues until the end of each so-called episode.
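The interaction between the two components can be sketched as a simple loop. This is only an illustrative toy, not the system used in the study: `ToyEnvironment`, its rewards and the two policies are hypothetical names and values chosen for the example.

```python
import random


class ToyEnvironment:
    """Hypothetical environment: the state is a position on a line of 5 cells."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        """Start a new episode from the first cell."""
        self.state = 0
        return self.state

    def step(self, action):
        """Action 1 moves right, any other action moves left (clipped at the borders)."""
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1  # the episode ends at the last cell
        reward = 1.0 if done else -0.1          # small cost per step, bonus at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the sum of the reward signals."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent picks an action
        state, reward, done = env.step(action)  # the environment returns the new state
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    """Agent that ignores the state and acts at random."""
    return random.choice([0, 1])


def always_right(state):
    """Agent that always moves towards the goal."""
    return 1
```

Calling `run_episode(ToyEnvironment(), always_right)` plays a single episode in which the agent walks straight to the last cell; with `random_policy` the episode usually takes longer and collects a lower total reward.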

When we talk about policy, we refer to “the learning agent’s way of behaving at a given time”. The optimal policy is the one that maximizes the reward function, i.e. a user-defined measure that represents, in a certain sense, the goal of the reinforcement learning problem. At each step of the learning process, the agent receives a reward signal, i.e. the fraction of the total reward gained or lost in that step; at the end of all the episodes, all these reward signals are summed up to obtain the total reward.

When dealing with discrete states (like in a game of chess with a finite set of moves), the agent learns (develops the optimal policy) by using a tabular procedure called Q-Learning, which basically consists in applying iterative updates based on the Bellman equation and storing the results in the so-called Q-Table. On the contrary, continuous states (like when dealing with financial time series, where the next price can be any positive number) imply an infinite number of values, which makes it impossible to tabulate them. Therefore, we adopt a solution called Deep Q-Learning, which, in plain words, consists in estimating the value of the available actions for a given state using a deep neural network.
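As a minimal sketch of the tabular case, the standard Q-Learning update derived from the Bellman equation is Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)]. The learning rate `alpha`, the discount factor `gamma` and the 3-state, 2-action table below are assumptions made for the example, not the values used in the study.

```python
def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one Bellman-based update to q_table[state][action] in place.

    alpha (learning rate) and gamma (discount factor) are illustrative defaults.
    """
    # Value of the best action in the next state (zero if the episode is over)
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Q-Table for a hypothetical problem with 3 states and 2 actions, initialised to zero
q_table = [[0.0, 0.0] for _ in range(3)]
q_learning_update(q_table, state=0, action=1, reward=1.0, next_state=1, done=False)
```

In Deep Q-Learning the lookup `q_table[state]` is replaced by the output of a neural network that takes the (continuous) state as input, so no table has to be stored.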

The trading strategies I used as benchmarks to check how the system performed are:

- Buy & Hold: it simply consists in buying at the beginning of the period and selling at the end;

- Moving Average Crossover: this strategy computes two moving averages, the short-term MA (shorter window) and the long-term MA (longer window). It buys when the short-term MA crosses above the long-term MA, and sells in the opposite situation (the short-term MA crosses below the long-term MA).
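The two benchmarks can be sketched in a few lines of plain Python. The function names, the use of simple (unweighted) moving averages and the window sizes in the usage note are assumptions for illustration; the study does not state which windows were used.

```python
def moving_average(prices, window):
    """Simple moving average; None until `window` observations are available."""
    averages = []
    for i in range(len(prices)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(prices[i + 1 - window:i + 1]) / window)
    return averages


def crossover_signals(prices, short_window, long_window):
    """Return (index, 'buy' or 'sell') for every crossing of the two MAs."""
    if short_window >= long_window:
        raise ValueError("short_window must be smaller than long_window")
    short_ma = moving_average(prices, short_window)
    long_ma = moving_average(prices, long_window)
    signals = []
    for i in range(1, len(prices)):
        if long_ma[i - 1] is None:
            continue  # not enough history for the long-term MA yet
        prev_diff = short_ma[i - 1] - long_ma[i - 1]
        curr_diff = short_ma[i] - long_ma[i]
        if prev_diff <= 0 < curr_diff:
            signals.append((i, "buy"))    # short-term MA crosses above
        elif prev_diff >= 0 > curr_diff:
            signals.append((i, "sell"))   # short-term MA crosses below
    return signals


def buy_and_hold_return(prices):
    """Return of buying at the first price and selling at the last one."""
    return prices[-1] / prices[0] - 1.0
```

For instance, `crossover_signals([10, 9, 8, 9, 10, 11, 10, 8, 6], 2, 3)` yields a buy signal once the price has recovered and a sell signal after it turns down again, while `buy_and_hold_return` on the same series only looks at the first and last price.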

The following setup was used to run the experiments:

- Programming language: Python;

- The Deep Reinforcement Learning agent was trained for 100 episodes using the hourly prices of three cryptocurrencies (BTC, ETH, XRP) in the time interval from 30/10/2019 to 30/10/2020;

- For all three strategies, the initial capital was €1,000,000, divided by 3 (the number of assets), ending up with €333,333 for each cryptocurrency. This was done to prevent a situation in which there is no money left to execute the strategy because everything has already been invested;

- We assumed zero fees;

- In order to analyse how the different models behave in different situations, they were tested in four different scenarios:

- Good: a two-month period where prices mainly go up, from 30/11/2020 to 31/01/2021;

- Bad: a two-month period where prices mainly go down, from 30/11/2021 to 31/01/2022;

- Stationary: a two-month period where prices go both up and down, from 31/05/2021 to 31/07/2021;

- General: a one-year period where prices go both up and down, similar to the stationary one but longer (and therefore more representative of reality), from 30/10/2020 to 30/10/2021. The results obtained in this scenario are not comparable with the others due to the different timeframe; however, I decided to report them anyway in order to show the behaviour of the strategies in the long run.

The following results were obtained:

The results are all more or less in line with each other, but a few points stand out. The buy and hold strategy is not very informative, as it basically just follows the market without being able to make a profit or reduce the losses during bad periods. Our model seems to be the most risk-averse: during good times it achieves a lower profit, but during bad times it is also able to contain the losses. Furthermore, it is the only model able to achieve a profit in the stationary scenario! As I mentioned, for simplicity fees were not included in the simulation, but in reality they exist and are in the order of 0.5% per operation. Normally this would be an insignificant amount, but the graph on the left shows the high number of operations performed by the moving average crossover strategy compared with the others, meaning that fees could significantly erode the profit or further worsen the losses. The graph on the right, instead, shows the returns achieved by each strategy in each scenario.
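To get a feeling for how quickly a 0.5% fee adds up, here is a rough back-of-the-envelope sketch. It rests on a strong simplifying assumption that is not part of the study: every operation trades the whole position and pays the fee on it. The operation counts are hypothetical and are not read from the graph.

```python
def capital_after_fees(capital, n_operations, fee_rate=0.005):
    """Capital left after paying fee_rate on the full position at every operation.

    Assumption for illustration only: each operation trades the whole position.
    """
    return capital * (1.0 - fee_rate) ** n_operations


# Hypothetical operation counts, starting from the capital allocated to one asset
for n in (2, 100, 1000):
    remaining = capital_after_fees(333_333, n)
    print(f"{n:>5} operations: {remaining / 333_333:.1%} of the capital left before P&L")
```

Under this assumption, the two operations of a buy and hold strategy cost about 1% of the capital, whereas a strategy that trades hundreds of times would see a large share of its capital absorbed by fees before any profit or loss is counted.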

[Figure, left: Number of operations per scenario and method. x-axis: scenarios (General, Good, Stationary, Bad) and methods (B&H, MA Crossover, Deep RL); y-axis: number of operations.]

[Figure, right: % Return per scenario and method. x-axis: scenarios (General, Good, Stationary, Bad) and methods (B&H, MA Crossover, Deep RL); y-axis: % return.]

When working with AI, there is a virtually infinite number of parameters involved, and changing just one of them can lead to completely different results. This is a double-edged sword: on the one hand, it shows the excessive variability of these models; on the other hand, there is a good chance that, if we keep changing the parameters, at some point we will come up with a setup able to outperform all the other models and agents. Having said that, apart from parameter tuning, these are some of the most significant improvements that may help improve the model:

- Building a machine learning model that predicts the price of an asset for the next time step and adding that prediction to the state, so that the agent can take it into account when acting;

- Replacing prices with returns, as these are generally stationary and neural networks work better with this kind of data in some situations;

- Integrating a natural language processing solution that performs a so-called sentiment analysis on social networks (Twitter is the most used for these kinds of activities) for a certain asset, and adding the outcome of this analysis (generally a number between 0 and 1) to the state, so that the agent can also take into account how investors feel about investing in that asset at a specific point in time. However, this may pose some problems, such as the difficulty of gathering data (tweets in this case) from 3 years ago for training the model.
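As a minimal sketch of the second and third ideas, prices can be turned into returns and combined with extra features into the state passed to the agent. The function names, the choice of log returns, the `lookback` length and the layout of the state vector are all assumptions made for this example.

```python
import math


def simple_returns(prices):
    """Percentage change between consecutive prices."""
    return [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]


def log_returns(prices):
    """Logarithmic returns, which add up across consecutive time steps."""
    return [math.log(prices[i] / prices[i - 1]) for i in range(1, len(prices))]


def build_state(prices, lookback, sentiment_score):
    """Hypothetical state: the last `lookback` log returns plus a sentiment score.

    sentiment_score is assumed to be a number between 0 and 1.
    """
    if not 0.0 <= sentiment_score <= 1.0:
        raise ValueError("sentiment_score must be between 0 and 1")
    if lookback < 1 or len(prices) < lookback + 1:
        raise ValueError("need at least lookback + 1 prices")
    return log_returns(prices)[-lookback:] + [sentiment_score]
```

For example, `build_state([100, 110, 99, 105], lookback=2, sentiment_score=0.7)` returns a list with the two most recent log returns followed by `0.7`; a price prediction from a separate model could be appended in the same way.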
