Once you are comfortable with the basics of pandas,
the next step is to learn the techniques used for data preprocessing.
Extracting data¶
In [25]:
titanic_df[['Age','Sex']].head()
Out[25]:
In [26]:
titanic_df[100:104]
Out[26]:
Rows can also be selected by index¶
In [27]:
fx_df = pd.read_csv('data/DAT_ASCII_USDJPY_M1_201710.csv',
                    sep=';',
                    names=('Time','Open','High','Low','Close',''),
                    index_col='Time',
                    parse_dates=True)
fx_df['2017-10-01 17:03:00':'2017-10-01 17:07:00']
Out[27]:
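The slice above relies on label-based slicing of a `DatetimeIndex`, which includes both endpoints. The following is a minimal sketch on synthetic data; `toy_df` and its values are hypothetical stand-ins for the USD/JPY file, not part of the original notebook.

```python
import pandas as pd

# Hypothetical one-minute series standing in for the USD/JPY data
idx = pd.date_range('2017-10-01 17:00:00', periods=10, freq='min')
toy_df = pd.DataFrame({'Close': range(10)}, index=idx)

# Label-based slicing on a DatetimeIndex keeps both the start and end labels
window = toy_df['2017-10-01 17:03:00':'2017-10-01 17:07:00']
print(window)
```

Unlike the positional slice `titanic_df[100:104]`, which excludes row 104, this slice keeps the 17:07 row as well.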
In [28]:
titanic_df = pd.read_csv('data/titanic_train.csv')
# Rows where Age is greater than 70
titanic_df[titanic_df.Age > 70]
Out[28]:
In [29]:
# Rows where Age is greater than 71, keeping only the Age and Sex columns
titanic_df.loc[titanic_df.Age > 71, ['Age','Sex']]
Out[29]:
In [30]:
# Extract the rows where Sex is male
titanic_df[titanic_df['Sex'].isin(['male'])].head(3)
Out[30]:
dataframe[dataframe['column'].str.contains('pattern')]¶
In [31]:
titanic_df[titanic_df['Name'].str.contains('Henry')].head(3)
Out[31]:
Multiple conditions¶
In [32]:
titanic_df[titanic_df['Name'].str.contains('Henry') & titanic_df['Sex'].isin(['male'])].head(3)
Out[32]:
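Boolean masks can be combined with `&` (and), `|` (or) and `~` (not). Comparison expressions need their own parentheses because `&` and `|` bind more tightly than `>` or `==`. A small sketch on a hypothetical `people` frame (the names and ages are made up for illustration):

```python
import pandas as pd

# Hypothetical data, not taken from the Titanic file
people = pd.DataFrame({
    'Name': ['Henry A', 'Mary B', 'Henry C', 'John D'],
    'Sex': ['male', 'female', 'male', 'male'],
    'Age': [30.0, 25.0, 72.0, 80.0],
})

is_henry = people['Name'].str.contains('Henry')

# AND: both conditions must hold (note the parentheses around the comparison)
both = people[is_henry & (people['Age'] > 70)]

# OR: at least one condition holds
either = people[is_henry | (people['Age'] > 70)]

# NOT: invert a mask with ~
not_male = people[~people['Sex'].isin(['male'])]

print(both, either, not_male, sep='\n\n')
```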
In [33]:
# Drop the rows that contain missing values
titanic_df.dropna().count()
Out[33]:
In [34]:
titanic_df.dropna(subset=['Age']).count()
Out[34]:
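The difference between the two `dropna` calls is easier to see on a tiny frame. This is an illustrative sketch; `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Cabin': [np.nan, 'C85', 'E46'],
})

# Count missing values per column before deciding what to drop
missing = toy.isna().sum()

# dropna() removes a row if ANY column is missing
all_cols = toy.dropna()

# subset= restricts the check to the listed columns only
age_only = toy.dropna(subset=['Age'])

print(missing, all_cols, age_only, sep='\n\n')
```

Restricting the check with `subset` keeps rows whose only missing value sits in a column that does not matter for the analysis.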
In [35]:
titanic_df['job'] = 'nojob'
titanic_df.head()
Out[35]:
In [36]:
data = {'a':[0, 1], 'b':[2, 3]}
d_df = pd.DataFrame(data=data)
d_df
Out[36]:
In [37]:
data2 = {'a':[5,6], 'b':[8, 9]}
d_df2 = pd.DataFrame(data=data2)
d_df2
Out[37]:
In [38]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([d_df, d_df2], ignore_index=True)
Out[38]:
In [39]:
titanic_df = pd.read_csv('data/titanic_train.csv')
titanic_df["Pclass"].value_counts()
Out[39]:
In [40]:
titanic_df.sort_values("Age",ascending=True).head(4)
Out[40]:
In [41]:
# Sum of the numeric columns for each Pclass
titanic_df.groupby('Pclass').sum(numeric_only=True)
Out[41]:
In [42]:
# Count of non-missing values for each (Pclass, Sex) pair
titanic_df.groupby(['Pclass','Sex']).count()
Out[42]:
In [43]:
# Mean of Fare and Pclass for each Age
titanic_df.groupby('Age')[['Fare','Pclass']].mean().head(5)
Out[43]:
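When several statistics are needed at once, named aggregation with `agg` returns them in a single table. A minimal sketch, assuming a hypothetical `toy` frame with Titanic-like column names:

```python
import pandas as pd

# Hypothetical passengers; the fares are made up for illustration
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3],
    'Fare': [80.0, 60.0, 20.0, 30.0, 8.0],
})

# Each keyword becomes an output column: name=(source column, function)
summary = toy.groupby('Pclass').agg(
    mean_fare=('Fare', 'mean'),
    passengers=('Fare', 'size'),
)
print(summary)
```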
Resampling¶
In [44]:
dataM1 = pd.read_csv('data/DAT_ASCII_USDJPY_M1_201710.csv',
                     sep=';',
                     names=('Time','Open','High','Low','Close', ''),
                     index_col='Time',
                     parse_dates=True)
dataM1.head(5)
Out[44]:
dataframe.resample('timeframe').ohlc()¶
In [45]:
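import numpy as np
from scipy.signal import lfilter

# Build OHLC (open, high, low, close) bars for the timeframe tf from the data in df
def TF_ohlc(df, tf):
    x = df.resample(tf).ohlc()
    O = x['Open']['open']
    H = x['High']['high']
    L = x['Low']['low']
    C = x['Close']['close']
    ret = pd.DataFrame({'Open': O, 'High': H, 'Low': L, 'Close': C},
                       columns=['Open','High','Low','Close'])
    return ret.dropna()

# Moving average of a Series, returned as a Series on the same index
def MAonSeries(s, ma_period, ma_method):
    return pd.Series(MAonArray(s.values, ma_period, ma_method), index=s.index)

# Moving average of one price column, shifted forward by ma_shift bars
def iEMA(df, ma_period, ma_shift=0, ma_method='EMA', applied_price='Close'):
    return MAonSeries(df[applied_price], ma_period, ma_method).shift(ma_shift)

# Dispatch to the requested moving-average method
def MAonArray(a, ma_period, ma_method):
    if ma_method == 'SMA':
        y = SMAonArray(a, ma_period)
    elif ma_method == 'EMA':
        y = EMAonArray(a, 2/(ma_period+1))
    elif ma_method == 'SMMA':
        y = EMAonArray(a, 1/ma_period)
    elif ma_method == 'LWMA':
        # Linearly decreasing weights that sum to 1
        h = np.arange(ma_period, 0, -1)*2/ma_period/(ma_period+1)
        y = lfilter(h, 1, a)
        y[:ma_period-1] = np.nan
    else:
        raise ValueError('unknown ma_method: %s' % ma_method)
    return y

# Recursive exponential moving average with smoothing factor alpha
def EMAonArray(x, alpha):
    # Replace NaN with 0 on a copy so the caller's data is not modified
    x = np.where(np.isnan(x), 0.0, x)
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1,len(x)):
        y[i] = alpha*x[i] + (1-alpha)*y[i-1]
    return y

# Simple moving average computed with a running sum
def SMAonArray(x, ma_period):
    # Replace NaN with 0 on a copy so the caller's data is not modified
    x = np.where(np.isnan(x), 0.0, x)
    y = np.empty_like(x)
    y[:ma_period-1] = np.nan
    y[ma_period-1] = np.sum(x[:ma_period])
    for i in range(ma_period, len(x)):
        y[i] = y[i-1] + x[i] - x[i-ma_period]
    return y/ma_period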
In [46]:
df_5m = TF_ohlc(dataM1, '5min') # 5-minute bars
df_10m = TF_ohlc(dataM1, '10min') # 10-minute bars
df_15m = TF_ohlc(dataM1, '15min') # 15-minute bars
df_30m = TF_ohlc(dataM1, '30min') # 30-minute bars
df_1H = TF_ohlc(dataM1, '1h') # 1-hour bars
df_4H = TF_ohlc(dataM1, '4h') # 4-hour bars
df_1D = TF_ohlc(dataM1, 'D') # daily bars
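As a possible alternative to selecting columns out of `.ohlc()`, the same kind of bars can be built by giving `agg` one function per column. This is a sketch on synthetic prices, not the notebook's own method; `m1` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical one-minute bars: Close runs 1..10, other columns derived from it
idx = pd.date_range('2017-10-01 17:00', periods=10, freq='min')
close = pd.Series([float(i) for i in range(1, 11)], index=idx)
m1 = pd.DataFrame({
    'Open': close - 0.5,
    'High': close + 1.0,
    'Low': close - 1.0,
    'Close': close,
})

# A coarser bar opens at the first Open, closes at the last Close,
# and spans the highest High and the lowest Low in its window
m5 = m1.resample('5min').agg({
    'Open': 'first',
    'High': 'max',
    'Low': 'min',
    'Close': 'last',
})
print(m5)
```

With this approach the result already has flat column names, so no MultiIndex lookup such as `x['Open']['open']` is needed.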
In [47]:
# Moving averages (EMA)
FastMA_1H = iEMA(df_1H, 5) # short-term moving average
MiddMA_1H = iEMA(df_1H, 10) # medium-term moving average
SlowMA_1H = iEMA(df_1H, 20) # long-term moving average
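The recursion used in `EMAonArray`, with `alpha = 2/(ma_period+1)`, should agree with pandas' built-in `ewm(..., adjust=False)` on data without missing values. The following sketch cross-checks this on a hypothetical price series; `prices` and `period` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices
prices = pd.Series([100.0, 101.0, 103.0, 102.0, 105.0, 107.0])
period = 3
alpha = 2 / (period + 1)

# Same recursion as EMAonArray: y[0] = x[0], y[i] = alpha*x[i] + (1-alpha)*y[i-1]
manual = np.empty(len(prices))
manual[0] = prices.iloc[0]
for i in range(1, len(prices)):
    manual[i] = alpha * prices.iloc[i] + (1 - alpha) * manual[i - 1]

# adjust=False makes pandas use the recursive definition
builtin = prices.ewm(alpha=alpha, adjust=False).mean()

print(np.allclose(manual, builtin.to_numpy()))
```

The two differ once the input contains NaN, since `EMAonArray` replaces NaN with 0 while `ewm` handles missing values according to its own rules.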
In [105]:
df = pd.DataFrame({'Close': df_1H['Close'], 'FastMA': FastMA_1H, 'MiddMA': MiddMA_1H, 'SlowMA': SlowMA_1H})
display_charts(df, chart_type="stock", title="MA cross", figsize=(960,640), grid=True)