Chap3 finished (Paused, not totally finished)

master
Jason Zhu 2020-07-23 17:27:46 +10:00
parent e7112671b6
commit 6782f3c79b
8 changed files with 319 additions and 2 deletions

3
.gitignore vendored
View File

@ -1,2 +1,3 @@
.vscode
**.pdf
**.pdf
**__pycache__**

View File

@ -1,3 +1,7 @@
{
"python.pythonPath": "/home/jason/miniconda3/envs/quants/bin/python"
"python.pythonPath": "/home/jason/miniconda3/envs/quants/bin/python",
"cSpell.words": [
"GOOG",
"stdev"
]
}

View File

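@ -278,3 +278,22 @@ Meaning of MOM
Time series analysis is fairly advanced, so it is skipped here for now.
## Conclusion
* Introduced the basic concepts behind generating trading signals, such as **resistance and support lines**
* Introduced some simple technical indicators
* Introduced some more advanced mathematical methods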
* Autoregressive (AR)
* Moving Average (MA)
* Differentiation (D)
* AutoCorrelation Function (ACF)
* Partial Autocorrelation Function (PACF)
## Further Reading
Concepts worth studying further:
* stochastic processes
* random walks
* martingales
* time series analysis

View File

@ -0,0 +1,199 @@
# Chapter 3. Predicting the Markets with Basic Machine Learning
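## 1. Understanding terminology and notation
* **Supervised Learning Problem vs Unsupervised Learning Problem**:
> Supervised learning is the machine learning task of inferring a function from a labeled training dataset.
>
> Unsupervised learning is a way of training a model that is essentially statistical: it discovers latent structure in unlabeled data.
> It has three main characteristics:
> * Unsupervised learning has no explicit objective
> * Unsupervised learning does not require labeled data
> * The results of unsupervised learning are hard to quantify
In algorithmic trading, the ML approach used most often is supervised learning.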
* **Regression Problem vs Classification Problem**:
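> In supervised learning, if the predicted variable is discrete, the task is called **classification** (e.g. decision trees, support vector machines); if the predicted variable is continuous, it is called **regression**.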
* **Training Model & Testing Model**
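Fitting a model to existing data is called training the model. Estimating the model's parameters in this way is called **statistical inference of these parametric learning models**.
Once training is finished, using the trained model to make predictions is called **testing the model**, and the data used for it is the **test data**.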
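* **Performance Metrics**
Once test results are available, performance metrics are needed to evaluate the model.
* For regression problems, the goal is to reduce the difference between the predicted value and the actual value, the **residual errors** (i.e. minimize residual errors). The error can be measured as:
* the sum of residual errors
* or the sum of squared residual errors (the basis of the $R^2$ score)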
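As a rough illustration of the regression metrics above, the following sketch computes residual errors, their absolute and squared sums, the mean squared error, and the $R^2$ score. The numbers are made up, and the variable names are chosen here for illustration only.

```python
# Toy values, made up purely for illustration.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.0, 5.0]

# Residual error for each sample: actual minus predicted.
residuals = [a - p for a, p in zip(actual, predicted)]

sum_abs_residuals = sum(abs(r) for r in residuals)
sum_sq_residuals = sum(r ** 2 for r in residuals)
mean_squared_error = sum_sq_residuals / len(actual)

# R^2 compares the squared residuals with the spread of the actual values
# around their mean; 1.0 would be a perfect fit.
mean_actual = sum(actual) / len(actual)
total_sq = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1.0 - sum_sq_residuals / total_sq

print(sum_abs_residuals, sum_sq_residuals, mean_squared_error, r_squared)
```

Summing raw (signed) residuals lets positive and negative errors cancel, which is one reason squared or absolute errors are usually preferred.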
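### 1.1 Exploring our financial dataset
ML techniques need a prepared dataset. The preparation can be divided into three steps:
1. Get the data
2. Define what we want to predict (the response variable)
3. Split the dataset into a training set and a test set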
#### Step 1: Getting the data
Here we again use the GOOG stock price.
```
High Low Open Close Volume Adj Close
Date
2004-08-19 51.835709 47.800831 49.813286 49.982655 44871300.0 49.982655
2004-08-20 54.336334 50.062355 50.316402 53.952770 22942800.0 53.952770
2004-08-23 56.528118 54.321388 55.168217 54.495735 18342800.0 54.495735
2004-08-24 55.591629 51.591621 55.412300 52.239193 15319700.0 52.239193
2004-08-25 53.798351 51.746044 52.284027 52.802086 9232100.0 52.802086
... ... ... ... ... ... ...
2017-12-22 1064.199951 1059.439941 1061.109985 1060.119995 755100.0 1060.119995
2017-12-26 1060.119995 1050.199951 1058.069946 1056.739990 760600.0 1056.739990
2017-12-27 1058.369995 1048.050049 1057.390015 1049.369995 1271900.0 1049.369995
2017-12-28 1054.750000 1044.770020 1051.599976 1048.140015 837100.0 1048.140015
2017-12-29 1049.699951 1044.900024 1046.719971 1046.400024 887500.0 1046.400024
[3366 rows x 6 columns]
```
#### Step 2: Creating objectives (trading conditions that we want to predict)
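After getting the data, we need to create the response variable that we want to predict. For the GOOG stock price there are two options:
1. To predict the future price, we predict both the direction (up, down, unchanged) and the magnitude (+10, +3.4, -4); regression can be used
2. To predict only whether the price goes up or down, classification can be used
Handling these two cases requires two trading condition generation methods:
* `create_classification_trading_condition(df)`: if tomorrow's close is higher than today's, the classification response variable is $+1$; otherwise it is $-1$;
* `create_regression_trading_condition(df)`: the regression response variable is tomorrow's close minus today's close, so it is positive if the price rises and negative otherwise;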
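To make the two response variables concrete, here is a minimal sketch on a handful of made-up closing prices (not GOOG data). It uses plain Python lists rather than the pandas `shift(-1)` call used in the repository, and it simply leaves out the last day, which has no "tomorrow".

```python
# Toy closing prices, made up for illustration (not GOOG data).
closes = [10.0, 11.0, 10.5, 10.5, 12.0]

# Pair each day's close with the next day's close. The last day has no
# "tomorrow", so it gets no target here.
pairs = list(zip(closes[:-1], closes[1:]))

# Regression target: tomorrow's close minus today's close.
regression_target = [nxt - cur for cur, nxt in pairs]

# Classification target: +1 if tomorrow's close is higher, otherwise -1.
classification_target = [1 if nxt > cur else -1 for cur, nxt in pairs]

print(regression_target)
print(classification_target)
```

Note that with this labeling rule an unchanged price (the third pair) falls into the $-1$ class, because only a strictly higher close counts as $+1$.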
Classification trading condition created:
```
( Open-Close High-Low
Date
2004-08-19 -0.169369 4.034878
2004-08-20 -3.636368 4.273979
2004-08-23 0.672482 2.206730
2004-08-24 3.173107 4.000008
2004-08-25 -0.518059 2.052307
... ... ...
2017-12-22 0.989990 4.760010
2017-12-26 1.329956 9.920044
2017-12-27 8.020020 10.319946
2017-12-28 3.459961 9.979980
2017-12-29 0.319946 4.799927
[3366 rows x 2 columns], array([ 1, 1, -1, ..., -1, -1, -1]))
```
Regression trading condition created:
```
( Open-Close High-Low
Date
2004-08-19 -0.169369 4.034878
2004-08-20 -3.636368 4.273979
2004-08-23 0.672482 2.206730
2004-08-24 3.173107 4.000008
2004-08-25 -0.518059 2.052307
... ... ...
2017-12-22 0.989990 4.760010
2017-12-26 1.329956 9.920044
2017-12-27 8.020020 10.319946
2017-12-28 3.459961 9.979980
2017-12-29 0.319946 4.799927
[3366 rows x 2 columns], Date
2004-08-19 3.970116
2004-08-20 0.542965
2004-08-23 -2.256542
2004-08-24 0.562893
2004-08-25 0.951431
...
2017-12-22 -3.380005
2017-12-26 -7.369995
2017-12-27 -1.229980
2017-12-28 -1.739990
2017-12-29 NaN
```
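#### Step 3: Splitting the dataset into a training set and a test set
In general, we split the available data into several parts, build the model on one part, and test it on another.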
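For time-ordered price data the split is usually chronological rather than random, so that the model is tested on data that comes after its training period. The sketch below shows the idea with plain slicing; `chronological_split` is a hypothetical helper named here for illustration, not a function from the repository.

```python
def chronological_split(rows, train_ratio=0.8):
    """Split a time-ordered sequence into (train, test) without shuffling."""
    split_at = int(len(rows) * train_ratio)
    return rows[:split_at], rows[split_at:]


# Stand-in for 10 trading days, oldest first.
days = list(range(10))
train, test = chronological_split(days, train_ratio=0.8)

print(train)
print(test)
```

The repository's `create_train_split_group` gets a comparable ordered split by calling scikit-learn's `train_test_split` with `shuffle=False`.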
## 2. Creating predictive models using linear regression methods
Regression: Linear & Non-linear
* Linear regression:
* Ordinary Least Squares (OLS)
* Lasso
* Ridge
* Elastic Net
* Non-linear regression:
* Decision Tree
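### 2.1 Ordinary Least Squares (OLS)
OLS:
* Suppose we have:
* $y$, an $m\times 1$ target variable
* feature values in $m$ rows, each row of size $1\times n$
* OLS aims to find the weights $W$ that best map the feature values to the target
The mathematical problem OLS solves is $\min_W \lVert X\bullet W - y\rVert^2_2$, i.e. finding the closest approximate solution to $X\bullet W=y$:
* $X$ is the $m\times n$ matrix of feature values
* $W$ is an $n\times 1$ matrix/vector
For example, with $m=4$ and $n=2$: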
$$
\min_{W}
\begin{Vmatrix}
\begin{bmatrix}
x_{00} & x_{01}\\
x_{10} & x_{11}\\
x_{20} & x_{21}\\
x_{30} & x_{31}
\end{bmatrix}
\bullet
\begin{bmatrix}
w_0\\
w_1
\end{bmatrix}
-
\begin{bmatrix}
y_0\\
y_1\\
y_2\\
y_3
\end{bmatrix}
\end{Vmatrix}^2_2
$$
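The $m=4$, $n=2$ case can be tried numerically. This is a minimal sketch using NumPy's `np.linalg.lstsq`, which returns the least-squares solution of $X\bullet W=y$. The feature matrix and the "true" weights are made up for illustration, and $y$ is built so that it fits exactly.

```python
import numpy as np

# Made-up data with m = 4 samples and n = 2 features, built so that
# y = 2 * x0 + 3 * x1 holds exactly for every row.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
true_W = np.array([2.0, 3.0])
y = X @ true_W

# np.linalg.lstsq returns the W that minimizes ||X W - y||_2^2.
W, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(W)
```

The formula above has no intercept term, and neither does this sketch. scikit-learn's `LinearRegression` fits an intercept by default, so its coefficients can differ from a plain least-squares solve on the same data.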
### 2.2 Regularization and shrinkage
### 2.3 Decision tree regression
## 3. Creating predictive models using linear classification methods
### 3.1 K-nearest Neighbors
### 3.2 Support Vector Machine
### 3.3 Logistic Regression

View File

@ -0,0 +1,55 @@
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from prepare_financial_data import (
    create_regression_trading_condition,
    create_train_split_group,
    load_financial_data,
)

dir_path = os.path.dirname(os.path.realpath(__file__))
goog_data = load_financial_data(
start_date = '2001-01-01',
end_date = '2018-01-01',
output_file = 'goog_data_large.pkl'
)
X, Y = create_regression_trading_condition(goog_data)
X = X[:-1]
Y = Y[:-1]
goog_data = goog_data.assign(Target=pd.Series(Y, index=goog_data.index))
print(goog_data)
goog_data = goog_data[:-1]
print(goog_data)
pd.plotting.scatter_matrix(goog_data[['Open-Close', 'High-Low', 'Target']], diagonal='kde')
plt.savefig(dir_path + "/scatter_matrix.png")
""" Split 80% of available data into training feature value and target variable; and remaining 20% of dataset into out-sample testing feature value """
X_train, X_test, Y_train, Y_test = create_train_split_group(X,Y,split_ratio=0.8)
from sklearn import linear_model
ols = linear_model.LinearRegression()
ols.fit(X_train, Y_train)
print('Coefficients: ', ols.coef_)
# Performance metrics
from sklearn.metrics import mean_squared_error, r2_score
# Mean squared error on the training set
print("Mean squared error: {}".format(mean_squared_error(Y_train, ols.predict(X_train))))
print(Y_test)
print(X_test)
# Explained variance score: 1 is perfect prediction
# print("Variance score: {}".format(r2_score(Y_test, ols.predict(X_test))))
goog_data['Predicted_Signal'] = ols.predict(X)
goog_data['GOOG_Returns'] = np.log(goog_data['Close'] / goog_data['Close'].shift(1))
def calculate_return(df, split):
    # Body not written in this commit (work paused here).
    raise NotImplementedError

View File

@ -0,0 +1,39 @@
import pandas as pd
import os
from pandas_datareader import data
from sklearn.model_selection import train_test_split
import numpy as np
dir_path = os.path.dirname(os.path.realpath(__file__))
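def load_financial_data(start_date, end_date, output_file):
    """Load GOOG prices from a local pickle, downloading them on first use."""
    try:
        df = pd.read_pickle(output_file)
        print("File data found...reading GOOG data")
    except FileNotFoundError:
        print("File not found...downloading the GOOG data")
        df = data.DataReader('GOOG', 'yahoo', start_date, end_date)
        df.to_pickle(output_file)
    return df


def create_classification_trading_condition(df):
    """Return features X and labels Y (+1 if the next close is higher, else -1)."""
    # Note: the feature columns are added to the caller's DataFrame in place.
    df['Open-Close'] = df.Open - df.Close
    df['High-Low'] = df.High - df.Low
    df = df.dropna()
    X = df[['Open-Close', 'High-Low']]
    # The last row has no next close; comparing with NaN is False, so it gets -1.
    Y = np.where(df['Close'].shift(-1) > df['Close'], 1, -1)
    return (X, Y)


def create_regression_trading_condition(df):
    """Return features X and target Y (next close minus current close)."""
    # Note: the feature columns are added to the caller's DataFrame in place.
    df['Open-Close'] = df.Open - df.Close
    df['High-Low'] = df.High - df.Low
    df = df.dropna()
    print(df)
    X = df[['Open-Close', 'High-Low']]
    # The last row of Y is NaN because there is no next close.
    Y = df['Close'].shift(-1) - df['Close']
    return (X, Y)


def create_train_split_group(X, Y, split_ratio=0.8):
    """Split into train and test sets in order, without shuffling."""
    return train_test_split(X, Y, shuffle=False, train_size=split_ratio)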

Binary file not shown.
