克隆策略

Pandas使用小技巧

本文将介绍一些处理大型数据集的Pandas技巧,适用于数据分析和处理。

Pandas内置函数Ufuncs

Pandas中apply函数应用方便,计算速度只比loop循环稍快。因此,在适当的 数据处理任务中,推荐使用Pandas内置函数ufuncs。Ufuncs是基于Numpy库, 以C语言实现的一类特殊函数,运行高效。

本文将重点介绍以下ufuncs的命令:.diff,.shift,.cumsum, .str commands(以字符串为操作对象),.dt commands(以日期为操作对象)等等。

示例数据-暑期活动

示例数据包含不同人在暑期进行的活动,时间点和费用信息,数据如下:

In [2]:
import pandas as pd
# 借助原始数据构建DataFrame
df = pd.DataFrame(index = [1,6,0,3,2,4,5,7,8,9,10,11,12],
                  columns = ['name','activity','timestamp','money_spent'])
df['name'] = ['Chandler Bing','Harry Kane','John Doe','Joey Tribbiani','Monica Keller','Ross Geller','Harry Kane','John Doe','Joey Tribbiani','Monica Keller','Harry Kane','John Doe','John Doe']
df['activity'] = ['party','football','beach','party','travel','surfing','beach','party','travel','surfing','surfing','party','party']
df['timestamp'] = ['2017-08-04 13:30:00','2017-08-04 13:00:00','2017-08-04 10:00:00','2017-08-04 10:00:00','2017-08-04 07:00:00','2017-08-04 14:10:00','2017-08-04 14:00:00','2017-08-04 14:00:00','2017-08-04 07:00:00','2017-08-04 14:00:00','2017-08-04 07:00:00','2017-08-04 14:10:00','2017-08-04 15:10:00']
df['money_spent'] = [34,98,43,543,35,53,5,75,750,13,788,56,890]
df
Out[2]:
name activity timestamp money_spent
1 Chandler Bing party 2017-08-04 13:30:00 34
6 Harry Kane football 2017-08-04 13:00:00 98
0 John Doe beach 2017-08-04 10:00:00 43
3 Joey Tribbiani party 2017-08-04 10:00:00 543
2 Monica Keller travel 2017-08-04 07:00:00 35
4 Ross Geller surfing 2017-08-04 14:10:00 53
5 Harry Kane beach 2017-08-04 14:00:00 5
7 John Doe party 2017-08-04 14:00:00 75
8 Joey Tribbiani travel 2017-08-04 07:00:00 750
9 Monica Keller surfing 2017-08-04 14:00:00 13
10 Harry Kane surfing 2017-08-04 07:00:00 788
11 John Doe party 2017-08-04 14:10:00 56
12 John Doe party 2017-08-04 15:10:00 890

1.字符串操作命令

在所有的字符串操作命令中,我们强烈推荐使用Pandas的字符串命令(Ufuncs函数)。

例如,你可以将某人的全名分解为两行,使用.str.split命令,设定expand = True。

In [3]:
df[['first name', 'last name']] = df.name.str.split(' ',  expand= True) # 将全名分解为两列
df
Out[3]:
name activity timestamp money_spent first name last name
1 Chandler Bing party 2017-08-04 13:30:00 34 Chandler Bing
6 Harry Kane football 2017-08-04 13:00:00 98 Harry Kane
0 John Doe beach 2017-08-04 10:00:00 43 John Doe
3 Joey Tribbiani party 2017-08-04 10:00:00 543 Joey Tribbiani
2 Monica Keller travel 2017-08-04 07:00:00 35 Monica Keller
4 Ross Geller surfing 2017-08-04 14:10:00 53 Ross Geller
5 Harry Kane beach 2017-08-04 14:00:00 5 Harry Kane
7 John Doe party 2017-08-04 14:00:00 75 John Doe
8 Joey Tribbiani travel 2017-08-04 07:00:00 750 Joey Tribbiani
9 Monica Keller surfing 2017-08-04 14:00:00 13 Monica Keller
10 Harry Kane surfing 2017-08-04 07:00:00 788 Harry Kane
11 John Doe party 2017-08-04 14:10:00 56 John Doe
12 John Doe party 2017-08-04 15:10:00 890 John Doe

2.聚合分组与计数

Groupby是非常有用的Pandas命令,你可以依照某个列进行分组,并对另一列进行计数。 现在,本文利用groupby和value_counts计算每个人参加的活动数量。

In [4]:
df.groupby('name')['activity'].value_counts()
Out[4]:
name            activity
Chandler Bing   party       1
Harry Kane      beach       1
                football    1
                surfing     1
Joey Tribbiani  party       1
                travel      1
John Doe        party       3
                beach       1
Monica Keller   surfing     1
                travel      1
Ross Geller     surfing     1
Name: activity, dtype: int64

这叫做多层索引,允许在dataframe中存在不同层次的索引。 在此例中,'name'是主索引,'activity'是第二层次索引。

3.行列转置

另一种获取该信息的方法是利用unstack命令,Unstack 将行转换为列,数据将会填充该表格。 若无数据,使用fillna命令以'0'填充缺失数据。

In [5]:
df.groupby('activity')['name'].value_counts().unstack().fillna(0)
Out[5]:
name Chandler Bing Harry Kane Joey Tribbiani John Doe Monica Keller Ross Geller
activity
beach 0.0 1.0 0.0 1.0 0.0 0.0
football 0.0 1.0 0.0 0.0 0.0 0.0
party 1.0 0.0 1.0 3.0 0.0 0.0
surfing 0.0 1.0 0.0 0.0 1.0 1.0
travel 0.0 0.0 1.0 0.0 1.0 0.0

快速数据分析:grouby,diff,shift和loc命令

以上我们已经获知每人参加的活动次数,接下来我们看看谁玩的时间最长?

最直接的方式是计算每个人不同时间点之间差值,利用groupby和diff函数。

In [16]:
import datetime
df['timestamp'] = df['timestamp'].map(lambda x: pd.to_datetime(x)) # 将字符型转化为日期时间型
df = df.sort_values(by=['name', 'timestamp'])
df['time_diff'] = df.groupby('name')['timestamp'].diff().fillna(0)
df[['name', 'timestamp', 'time_diff']]
Out[16]:
name timestamp time_diff
1 Chandler Bing 2017-08-04 13:30:00 00:00:00
10 Harry Kane 2017-08-04 07:00:00 00:00:00
6 Harry Kane 2017-08-04 13:00:00 06:00:00
5 Harry Kane 2017-08-04 14:00:00 01:00:00
8 Joey Tribbiani 2017-08-04 07:00:00 00:00:00
3 Joey Tribbiani 2017-08-04 10:00:00 03:00:00
0 John Doe 2017-08-04 10:00:00 00:00:00
7 John Doe 2017-08-04 14:00:00 04:00:00
11 John Doe 2017-08-04 14:10:00 00:10:00
12 John Doe 2017-08-04 15:10:00 01:00:00
2 Monica Keller 2017-08-04 07:00:00 00:00:00
9 Monica Keller 2017-08-04 14:00:00 07:00:00
4 Ross Geller 2017-08-04 14:10:00 00:00:00

若数据集足够大,采用另一种方式可以运行得更快,省略groupby,直接使用diff命令。 对不同人之间的时间差值做归零处理。

In [18]:
df = df.sort_values(by=['name', 'timestamp']) #进行排序
df['time_diff'] = df['timestamp'].diff() #计算时间差值
df.loc[df.name != df.name.shift(), 'time_diff'] = 0 #不同人之间的时间差值记为0
df
Out[18]:
name activity timestamp money_spent first name last name time_diff
1 Chandler Bing party 2017-08-04 13:30:00 34 Chandler Bing 00:00:00
10 Harry Kane surfing 2017-08-04 07:00:00 788 Harry Kane 00:00:00
6 Harry Kane football 2017-08-04 13:00:00 98 Harry Kane 06:00:00
5 Harry Kane beach 2017-08-04 14:00:00 5 Harry Kane 01:00:00
8 Joey Tribbiani travel 2017-08-04 07:00:00 750 Joey Tribbiani 00:00:00
3 Joey Tribbiani party 2017-08-04 10:00:00 543 Joey Tribbiani 03:00:00
0 John Doe beach 2017-08-04 10:00:00 43 John Doe 00:00:00
7 John Doe party 2017-08-04 14:00:00 75 John Doe 04:00:00
11 John Doe party 2017-08-04 14:10:00 56 John Doe 00:10:00
12 John Doe party 2017-08-04 15:10:00 890 John Doe 01:00:00
2 Monica Keller travel 2017-08-04 07:00:00 35 Monica Keller 00:00:00
9 Monica Keller surfing 2017-08-04 14:00:00 13 Monica Keller 07:00:00
4 Ross Geller surfing 2017-08-04 14:10:00 53 Ross Geller 00:00:00

将时间差值单位转为为'秒':

In [19]:
df['time_diff'] = df.time_diff.dt.total_seconds()
df
Out[19]:
name activity timestamp money_spent first name last name time_diff
1 Chandler Bing party 2017-08-04 13:30:00 34 Chandler Bing 0.0
10 Harry Kane surfing 2017-08-04 07:00:00 788 Harry Kane 0.0
6 Harry Kane football 2017-08-04 13:00:00 98 Harry Kane 21600.0
5 Harry Kane beach 2017-08-04 14:00:00 5 Harry Kane 3600.0
8 Joey Tribbiani travel 2017-08-04 07:00:00 750 Joey Tribbiani 0.0
3 Joey Tribbiani party 2017-08-04 10:00:00 543 Joey Tribbiani 10800.0
0 John Doe beach 2017-08-04 10:00:00 43 John Doe 0.0
7 John Doe party 2017-08-04 14:00:00 75 John Doe 14400.0
11 John Doe party 2017-08-04 14:10:00 56 John Doe 600.0
12 John Doe party 2017-08-04 15:10:00 890 John Doe 3600.0
2 Monica Keller travel 2017-08-04 07:00:00 35 Monica Keller 0.0
9 Monica Keller surfing 2017-08-04 14:00:00 13 Monica Keller 25200.0
4 Ross Geller surfing 2017-08-04 14:10:00 53 Ross Geller 0.0

获取每个活动的持续时间

In [20]:
df['activity_duration'] = df.time_diff.shift(-1).fillna(0)
df[['name', 'timestamp', 'activity', 'activity_duration', 'time_diff']]
Out[20]:
name timestamp activity activity_duration time_diff
1 Chandler Bing 2017-08-04 13:30:00 party 0.0 0.0
10 Harry Kane 2017-08-04 07:00:00 surfing 21600.0 0.0
6 Harry Kane 2017-08-04 13:00:00 football 3600.0 21600.0
5 Harry Kane 2017-08-04 14:00:00 beach 0.0 3600.0
8 Joey Tribbiani 2017-08-04 07:00:00 travel 10800.0 0.0
3 Joey Tribbiani 2017-08-04 10:00:00 party 0.0 10800.0
0 John Doe 2017-08-04 10:00:00 beach 14400.0 0.0
7 John Doe 2017-08-04 14:00:00 party 600.0 14400.0
11 John Doe 2017-08-04 14:10:00 party 3600.0 600.0
12 John Doe 2017-08-04 15:10:00 party 0.0 3600.0
2 Monica Keller 2017-08-04 07:00:00 travel 25200.0 0.0
9 Monica Keller 2017-08-04 14:00:00 surfing 0.0 25200.0
4 Ross Geller 2017-08-04 14:10:00 surfing 0.0 0.0

4.累计数和累计和

累计数帮助获取每个人的活动次序。 例如,了解每个人第二个或者第三个活动项目:

In [21]:
df = df.sort_values(by=['name', 'timestamp']) #数据排序
df2 = df[df.groupby('name').cumcount()==1] #获取每个人的第二个活动
df2[['name', 'timestamp', 'activity']]
Out[21]:
name timestamp activity
6 Harry Kane 2017-08-04 13:00:00 football
3 Joey Tribbiani 2017-08-04 10:00:00 party
7 John Doe 2017-08-04 14:00:00 party
9 Monica Keller 2017-08-04 14:00:00 surfing
In [22]:
df3 = df[df.groupby('name').cumcount()==2] #获取每个人的第三个活动
df3[['name', 'timestamp', 'activity']]
Out[22]:
name timestamp activity
5 Harry Kane 2017-08-04 14:00:00 beach
11 John Doe 2017-08-04 14:10:00 party

累计和是数据的累积和。例如,你可以得到不同时间点每个人总花费。

In [23]:
df = df.sort_values(by = ['name', 'timestamp'])
df['money_spent_so_far'] = df.groupby('name')['money_spent'].cumsum()
df[['name', 'timestamp', 'activity', 'money_spent', 'money_spent_so_far']]
Out[23]:
name timestamp activity money_spent money_spent_so_far
1 Chandler Bing 2017-08-04 13:30:00 party 34 34
10 Harry Kane 2017-08-04 07:00:00 surfing 788 788
6 Harry Kane 2017-08-04 13:00:00 football 98 886
5 Harry Kane 2017-08-04 14:00:00 beach 5 891
8 Joey Tribbiani 2017-08-04 07:00:00 travel 750 750
3 Joey Tribbiani 2017-08-04 10:00:00 party 543 1293
0 John Doe 2017-08-04 10:00:00 beach 43 43
7 John Doe 2017-08-04 14:00:00 party 75 118
11 John Doe 2017-08-04 14:10:00 party 56 174
12 John Doe 2017-08-04 15:10:00 party 890 1064
2 Monica Keller 2017-08-04 07:00:00 travel 35 35
9 Monica Keller 2017-08-04 14:00:00 surfing 13 48
4 Ross Geller 2017-08-04 14:10:00 surfing 53 53

衡量活动时间:groupby,max,min命令

现在我们想知道每个人在不同活动上花费的时间。但有时候数据可能对同一个人参同一个活动连续记录两次。 我们使用新的命令排除这种情况的发生。

In [24]:
# 5. groupby, max, min for measuring the duration of activities
df = df.sort_values(by = ['name', 'activity'])
df['activity_change'] = (df.activity != df.activity.shift()) | (df.name != df.name.shift()) # 当人或者活动改变时,change=True
df['activity_num'] = df.groupby('name')['activity_change'].cumsum() # 计算活动个数
activity_duration = df.groupby(['name', 'activity_num', 'activity'])['activity_duration'].sum() # 计算总的活动时间
df
Out[24]:
name activity timestamp money_spent first name last name time_diff activity_duration money_spent_so_far activity_change activity_num
1 Chandler Bing party 2017-08-04 13:30:00 34 Chandler Bing 0.0 0.0 34 True 1.0
5 Harry Kane beach 2017-08-04 14:00:00 5 Harry Kane 3600.0 0.0 891 True 1.0
6 Harry Kane football 2017-08-04 13:00:00 98 Harry Kane 21600.0 3600.0 886 True 2.0
10 Harry Kane surfing 2017-08-04 07:00:00 788 Harry Kane 0.0 21600.0 788 True 3.0
3 Joey Tribbiani party 2017-08-04 10:00:00 543 Joey Tribbiani 10800.0 0.0 1293 True 1.0
8 Joey Tribbiani travel 2017-08-04 07:00:00 750 Joey Tribbiani 0.0 10800.0 750 True 2.0
0 John Doe beach 2017-08-04 10:00:00 43 John Doe 0.0 14400.0 43 True 1.0
7 John Doe party 2017-08-04 14:00:00 75 John Doe 14400.0 600.0 118 True 2.0
11 John Doe party 2017-08-04 14:10:00 56 John Doe 600.0 3600.0 174 False 2.0
12 John Doe party 2017-08-04 15:10:00 890 John Doe 3600.0 0.0 1064 False 2.0
9 Monica Keller surfing 2017-08-04 14:00:00 13 Monica Keller 25200.0 0.0 48 True 1.0
2 Monica Keller travel 2017-08-04 07:00:00 35 Monica Keller 0.0 25200.0 35 True 2.0
4 Ross Geller surfing 2017-08-04 14:10:00 53 Ross Geller 0.0 0.0 53 True 1.0
In [25]:
activity_duration
Out[25]:
name            activity_num  activity
Chandler Bing   1.0           party           0.0
Harry Kane      1.0           beach           0.0
                2.0           football     3600.0
                3.0           surfing     21600.0
Joey Tribbiani  1.0           party           0.0
                2.0           travel      10800.0
John Doe        1.0           beach       14400.0
                2.0           party        4200.0
Monica Keller   1.0           surfing         0.0
                2.0           travel      25200.0
Ross Geller     1.0           surfing         0.0
Name: activity_duration, dtype: float64

利用min,max,median等函数分析每个人持续时间最长的活动

In [26]:
activity_duration.reset_index().groupby('name').max() #取持续时间最大的活动
Out[26]:
activity_num activity activity_duration
name
Chandler Bing 1.0 party 0.0
Harry Kane 3.0 surfing 21600.0
Joey Tribbiani 2.0 travel 10800.0
John Doe 2.0 party 14400.0
Monica Keller 2.0 travel 25200.0
Ross Geller 1.0 surfing 0.0