pandasの基本的な使い方を覚えたら
次はデータの前処理に使うやりかたを覚えましょう




データの抽出

特定の列を取得

データフレーム名[['カラム名' , 'カラム名']]

In [25]:
titanic_df[['Age','Sex']].head()
Out[25]:
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male

特定の区間の行を抽出

データフレーム名[開始位置 : 終了位置]

In [26]:
titanic_df[100:104]
Out[26]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
100 101 0 3 Petranec, Miss. Matilda female 28.0 0 0 349245 7.8958 NaN S
101 102 0 3 Petroff, Mr. Pastcho ("Pentcho") male NaN 0 0 349215 7.8958 NaN S
102 103 0 1 White, Mr. Richard Frasar male 21.0 0 1 35281 77.2875 D26 S
103 104 0 3 Johansson, Mr. Gustaf Joel male 33.0 0 0 7540 8.6542 NaN S

取得行をインデックスでも指定可能

In [27]:
fx_df = pd.read_csv('data/DAT_ASCII_USDJPY_M1_201710.csv', 
                     sep=';',
                     names=('Time','Open','High','Low','Close',''),
                     index_col='Time', 
                     parse_dates=True)
fx_df['2017-10-01 17:03:00':'2017-10-01 17:07:00']
Out[27]:
Open High Low Close
Time
2017-10-01 17:03:00 112.512 112.532 112.510 112.510 0
2017-10-01 17:04:00 112.513 112.514 112.513 112.514 0
2017-10-01 17:05:00 112.519 112.531 112.519 112.530 0
2017-10-01 17:06:00 112.529 112.529 112.509 112.509 0
2017-10-01 17:07:00 112.502 112.586 112.460 112.553 0
In [ ]:
 

条件を指定して行・列を取得

データフレーム名[データフレーム名.カラム名 条件式 ]

In [28]:
titanic_df = pd.read_csv('data/titanic_train.csv')
# Age が70より上の行
titanic_df[titanic_df.Age > 70]
Out[28]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
In [29]:
# Ageが71より大きい Age Sex カラムのみ取得
titanic_df[['Age','Sex']][titanic_df.Age > 71]
Out[29]:
Age Sex
630 80.0 male
851 74.0 male

値が含まれる行の抽出

データフレーム名.isin([値,値])]

In [30]:
# Sex がmale の行を抽出
titanic_df[titanic_df['Sex'].isin(['male'])].head(3)
Out[30]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q

データフレーム名[データフレーム名['カラム名'].str.contains('条件')]

In [31]:
titanic_df[titanic_df['Name'].str.contains('Henry')].head(3)
Out[31]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
52 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 PC 17572 76.7292 D33 C

複数条件

In [32]:
titanic_df[titanic_df['Name'].str.contains('Henry')  & titanic_df['Sex'].isin(['male'])].head(3)
Out[32]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.050 NaN S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.050 NaN S
62 63 0 1 Harris, Mr. Henry Birkhardt male 45.0 1 0 36973 83.475 C83 S

AND &

OR |

In [ ]:
 

欠損値の削除

欠損がある行の削除

データフレーム名.dropna()

In [33]:
# 欠損値の有る行の削除
titanic_df.dropna().count()
Out[33]:
PassengerId    183
Survived       183
Pclass         183
Name           183
Sex            183
Age            183
SibSp          183
Parch          183
Ticket         183
Fare           183
Cabin          183
Embarked       183
dtype: int64

特定のカラムに欠損値がある行の削除

データフレーム名.dropna(subset=['カラム名'])

In [34]:
titanic_df.dropna(subset=['Age']).count()
Out[34]:
PassengerId    714
Survived       714
Pclass         714
Name           714
Sex            714
Age            714
SibSp          714
Parch          714
Ticket         714
Fare           714
Cabin          185
Embarked       712
dtype: int64

データ追加

カラム追加

データフレーム名['カラム名'] = 値

In [35]:
titanic_df['job'] = 'nojob'
titanic_df.head()
Out[35]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked job
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S nojob
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C nojob
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S nojob
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S nojob
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S nojob

行を追加

データフレーム名.append(データフレーム)

行番号を振りなおすならデータフレーム名.append(データフレーム,ignore_index=True)

In [36]:
data = {'a':[0, 1], 'b':[2, 3]}
d_df = pd.DataFrame(data=data)
d_df
Out[36]:
a b
0 0 2
1 1 3
In [37]:
data2 = {'a':[5,6], 'b':[8, 9]}
d_df2 = pd.DataFrame(data=data2)
d_df2 
Out[37]:
a b
0 5 8
1 6 9
In [38]:
d_df.append(d_df2,ignore_index=True)
Out[38]:
a b
0 0 2
1 1 3
2 5 8
3 6 9

集計

カラムの値でカウント

データフレーム名['カラム名'].value_counts()

In [39]:
titanic_df = pd.read_csv('data/titanic_train.csv')
titanic_df["Pclass"].value_counts()
Out[39]:
3    491
1    216
2    184
Name: Pclass, dtype: int64

並び替え、ソート

カラムの値でソートする

データフレーム名.sort_values('カラム名')

昇順 ascending=True

降順 ascending=False

In [40]:
titanic_df.sort_values("Age",ascending=True).head(4)
Out[40]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
803 804 1 3 Thomas, Master. Assad Alexander male 0.42 0 1 2625 8.5167 NaN C
755 756 1 2 Hamalainen, Master. Viljo male 0.67 1 1 250649 14.5000 NaN S
644 645 1 3 Baclini, Miss. Eugenie female 0.75 2 1 2666 19.2583 NaN C
469 470 1 3 Baclini, Miss. Helene Barbara female 0.75 2 1 2666 19.2583 NaN C

groupby

データフレーム名.groupby(["カラム名" , "カラム名"])

In [41]:
# Pclassごとの合計
titanic_df.groupby('Pclass').sum()
Out[41]:
PassengerId Survived Age SibSp Parch Fare
Pclass
1 99705 136 7111.42 90 77 18177.4125
2 82056 87 5168.83 74 70 3801.8417
3 215625 119 8924.92 302 193 6714.6951
In [42]:
# Pclass , Sex ごとの個数
titanic_df.groupby(['Pclass','Sex']).count()
Out[42]:
PassengerId Survived Name Age SibSp Parch Ticket Fare Cabin Embarked
Pclass Sex
1 female 94 94 94 85 94 94 94 94 81 92
male 122 122 122 101 122 122 122 122 95 122
2 female 76 76 76 74 76 76 76 76 10 76
male 108 108 108 99 108 108 108 108 6 108
3 female 144 144 144 102 144 144 144 144 6 144
male 347 347 347 253 347 347 347 347 6 347
In [43]:
# Ageごとに Fare , Pclass の平均をだす
titanic_df.groupby('Age')[['Fare','Pclass']].mean().head(5)
Out[43]:
Fare Pclass
Age
0.42 8.5167 3.0
0.67 14.5000 2.0
0.75 19.2583 3.0
0.83 23.8750 2.0
0.92 151.5500 1.0
In [ ]:
 

リサンプリング

In [44]:
dataM1 = pd.read_csv('data/DAT_ASCII_USDJPY_M1_201710.csv', 
                     sep=';',
                     names=('Time','Open','High','Low','Close', ''),
                     index_col='Time', 
                     parse_dates=True)
dataM1.head(5)
Out[44]:
Open High Low Close
Time
2017-10-01 17:00:00 112.627 112.658 112.526 112.526 0
2017-10-01 17:01:00 112.536 112.536 112.480 112.510 0
2017-10-01 17:03:00 112.512 112.532 112.510 112.510 0
2017-10-01 17:04:00 112.513 112.514 112.513 112.514 0
2017-10-01 17:05:00 112.519 112.531 112.519 112.530 0

データフレーム名.resample('タイムフレーム').ohlc()

In [45]:
# dfのデータからtfで指定するタイムフレームの4本足データを作成する関数
def TF_ohlc(df, tf):
    x = df.resample(tf).ohlc()
    O = x['Open']['open']
    H = x['High']['high']
    L = x['Low']['low']
    C = x['Close']['close']
    ret = pd.DataFrame({'Open': O, 'High': H, 'Low': L, 'Close': C},
                       columns=['Open','High','Low','Close'])
    return ret.dropna()

def MAonSeries(s, ma_period, ma_method):
    return pd.Series(MAonArray(s.values, ma_period, ma_method), index=s.index)
    
def iEMA(df, ma_period, ma_shift=0, ma_method='EMA', applied_price='Close'):
    return MAonSeries(df[applied_price], ma_period, ma_method).shift(ma_shift)

def MAonArray(a, ma_period, ma_method):
    if ma_method == 'SMA':
        y = SMAonArray(a, ma_period)
    elif ma_method == 'EMA':
        y = EMAonArray(a, 2/(ma_period+1))
    elif ma_method == 'SMMA':
        y = EMAonArray(a, 1/ma_period)
    elif ma_method == 'LWMA':
        h = np.arange(ma_period, 0, -1)*2/ma_period/(ma_period+1)
        y = lfilter(h, 1, a)
        y[:ma_period-1] = np.nan
    return y

def EMAonArray(x, alpha):
    x[np.isnan(x)] = 0
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1,len(x)):
        y[i] = alpha*x[i] + (1-alpha)*y[i-1]
    return y

def SMAonArray(x, ma_period):
    x[np.isnan(x)] = 0
    y = np.empty_like(x)
    y[:ma_period-1] = np.nan
    y[ma_period-1] = np.sum(x[:ma_period])
    for i in range(ma_period, len(x)):
        y[i] = y[i-1] + x[i] - x[i-ma_period]
    return y/ma_period
In [46]:
df_5m  = TF_ohlc(dataM1, '5Min')  # 5分足
df_10m = TF_ohlc(dataM1, '10Min') # 10分足
df_15m = TF_ohlc(dataM1, '15Min') # 15分足
df_30m = TF_ohlc(dataM1, '30Min') # 30分足

df_1H = TF_ohlc(dataM1, '1H') # 1時間足
df_4H = TF_ohlc(dataM1, '4H') # 4時間足
df_1D = TF_ohlc(dataM1, 'D')  #    日足
In [47]:
# 移動平均線(EMA)
FastMA_1H = iEMA(df_1H, 5)  #短期移動平均
MiddMA_1H = iEMA(df_1H, 10) #中期移動平均
SlowMA_1H = iEMA(df_1H, 20) #長期移動平均
In [105]:
df = pd.DataFrame({'Close': df_1H['Close'] , 'FastMA': FastMA_1H, '': MiddMA_1H , 'SlowMA': SlowMA_1H})
display_charts(df, chart_type="stock", title="MA cross", figsize=(960,640), grid=True)