Reshaping and Pivot Tables¶
Reshaping by pivoting DataFrame objects¶
Data is often stored in so-called "stacked" or "record" format in CSV files or databases:
In [1]: df
Out[1]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
3 2000-01-03 B -1.135632
4 2000-01-04 B 1.212112
5 2000-01-05 B -0.173215
6 2000-01-03 C 0.119209
7 2000-01-04 C -1.044236
8 2000-01-05 C -0.861849
9 2000-01-03 D -2.104569
10 2000-01-04 D -0.494929
11 2000-01-05 D 1.071804
For the curious, here is how the above DataFrame was created:
import pandas.util.testing as tm; tm.N = 3

def unpivot(frame):
    N, K = frame.shape
    data = {'value' : frame.values.ravel('F'),
            'variable' : np.asarray(frame.columns).repeat(N),
            'date' : np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])

df = unpivot(tm.makeTimeDataFrame())
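As an aside, the same wide-to-long reshape can be sketched with pd.melt; this is a hedged illustration, not how the docs build df, and the frame below stands in for tm.makeTimeDataFrame():

```python
import numpy as np
import pandas as pd

# a small wide frame standing in for tm.makeTimeDataFrame() (illustrative data)
wide = pd.DataFrame(np.random.randn(3, 2),
                    index=pd.date_range('2000-01-03', periods=3),
                    columns=['A', 'B'])

# reset_index turns the dates into a column, then melt unpivots the rest
tidy = (pd.melt(wide.reset_index(), id_vars='index',
                var_name='variable', value_name='value')
          .rename(columns={'index': 'date'}))
print(tidy.columns.tolist())  # ['date', 'variable', 'value']
```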
To select out everything for variable A we could do:
In [2]: df[df['variable'] == 'A']
Out[2]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
But suppose we wish to do time series operations with the variables. A better representation would be where the columns are the unique variables and an index of dates identifies individual observations. To reshape the data into this form, we use the pivot function:
In [3]: df.pivot(index='date', columns='variable', values='value')
Out[3]:
variable A B C D
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804
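One caveat worth noting (a sketch with toy data, not the frame above): pivot requires each (index, columns) pair to occur at most once; duplicates raise a ValueError, whereas pivot_table aggregates them instead (mean by default):

```python
import pandas as pd

dup = pd.DataFrame({'date': ['d1', 'd1'],
                    'variable': ['A', 'A'],
                    'value': [1.0, 3.0]})
try:
    dup.pivot(index='date', columns='variable', values='value')
except ValueError:
    print('pivot raised ValueError on duplicate entries')

# pivot_table resolves the duplicates by aggregating (mean by default)
agg = dup.pivot_table(index='date', columns='variable', values='value')
print(agg.loc['d1', 'A'])  # 2.0
```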
If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting "pivoted" DataFrame will have hierarchical columns whose topmost level indicates the respective value column:
In [4]: df['value2'] = df['value'] * 2
In [5]: pivoted = df.pivot('date', 'variable')
In [6]: pivoted
Out[6]:
value value2 \
variable A B C D A B
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929 -0.565727 2.424224
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804 -3.018117 -0.346429
variable C D
date
2000-01-03 0.238417 -4.209138
2000-01-04 -2.088472 -0.989859
2000-01-05 -1.723698 2.143608
You of course can then select subsets from the pivoted DataFrame:
In [7]: pivoted['value2']
Out[7]:
variable A B C D
date
2000-01-03 0.938225 -2.271265 0.238417 -4.209138
2000-01-04 -0.565727 2.424224 -2.088472 -0.989859
2000-01-05 -3.018117 -0.346429 -1.723698 2.143608
Note that this returns a view on the underlying data in the case where the data are homogeneously-typed.
Reshaping by stacking and unstacking¶
Closely related to the pivot function are the stack and unstack functions currently available on Series and DataFrame. These functions are designed to work together with MultiIndex objects (see the section on hierarchical indexing). Here is essentially what these functions do:
- stack: "pivot" a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels.
- unstack: inverse operation from stack: "pivot" a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.
The clearest way to explain is by example. Let's take a prior example data set from the hierarchical indexing section:
In [8]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
...: 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two',
...: 'one', 'two', 'one', 'two']]))
...:
In [9]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [10]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [11]: df2 = df[:4]
In [12]: df2
Out[12]:
A B
first second
bar one 0.721555 -0.706771
two -1.039575 0.271860
baz one -0.424972 0.567020
two 0.276232 -1.087401
The stack function "compresses" a level in the DataFrame's columns to produce either:
- A Series, in the case of a simple column index
- A DataFrame, in the case of a MultiIndex in the columns
If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the columns:
In [13]: stacked = df2.stack()
In [14]: stacked
Out[14]:
first second
bar one A 0.721555
B -0.706771
two A -1.039575
B 0.271860
baz one A -0.424972
B 0.567020
two A 0.276232
B -1.087401
dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex
as the index
), the inverse operation of stack
is unstack
, which by default unstacks the last level:
In [15]: stacked.unstack()
Out[15]:
A B
first second
bar one 0.721555 -0.706771
two -1.039575 0.271860
baz one -0.424972 0.567020
two 0.276232 -1.087401
In [16]: stacked.unstack(1)
Out[16]:
second one two
first
bar A 0.721555 -1.039575
B -0.706771 0.271860
baz A -0.424972 0.276232
B 0.567020 -1.087401
In [17]: stacked.unstack(0)
Out[17]:
first bar baz
second
one A 0.721555 -0.424972
B -0.706771 0.567020
two A -1.039575 0.276232
B 0.271860 -1.087401
If the indexes have names, you can use the level names instead of specifying the level numbers:
In [18]: stacked.unstack('second')
Out[18]:
second one two
first
bar A 0.721555 -1.039575
B -0.706771 0.271860
baz A -0.424972 0.276232
B 0.567020 -1.087401
Notice that the stack and unstack methods implicitly sort the index levels involved. Hence a call to stack and then unstack, or vice versa, will result in a sorted copy of the original DataFrame or Series:
In [19]: index = pd.MultiIndex.from_product([[2,1], ['a', 'b']])
In [20]: df = pd.DataFrame(np.random.randn(4), index=index, columns=['A'])
In [21]: df
Out[21]:
A
2 a -0.370647
b -1.157892
1 a -1.344312
b 0.844885
In [22]: all(df.unstack().stack() == df.sort_index())
Out[22]: True
The above code will raise a TypeError if the call to sort_index is removed.
Multiple Levels¶
You may also stack or unstack more than one level at a time by passing a list of levels, in which case the end result is as if each level in the list were processed individually.
In [23]: columns = pd.MultiIndex.from_tuples([
....: ('A', 'cat', 'long'), ('B', 'cat', 'long'),
....: ('A', 'dog', 'short'), ('B', 'dog', 'short')
....: ],
....: names=['exp', 'animal', 'hair_length']
....: )
....:
In [24]: df = pd.DataFrame(np.random.randn(4, 4), columns=columns)
In [25]: df
Out[25]:
exp A B A B
animal cat cat dog dog
hair_length long long short short
0 1.075770 -0.109050 1.643563 -1.469388
1 0.357021 -0.674600 -1.776904 -0.968914
2 -1.294524 0.413738 0.276662 -0.472035
3 -0.013960 -0.362543 -0.006154 -0.923061
In [26]: df.stack(level=['animal', 'hair_length'])
Out[26]:
exp A B
animal hair_length
0 cat long 1.075770 -0.109050
dog short 1.643563 -1.469388
1 cat long 0.357021 -0.674600
dog short -1.776904 -0.968914
2 cat long -1.294524 0.413738
dog short 0.276662 -0.472035
3 cat long -0.013960 -0.362543
dog short -0.006154 -0.923061
The list of levels can contain either level names or level numbers (but not a mixture of the two).
# df.stack(level=['animal', 'hair_length'])
# from above is equivalent to:
In [27]: df.stack(level=[1, 2])
Out[27]:
exp A B
animal hair_length
0 cat long 1.075770 -0.109050
dog short 1.643563 -1.469388
1 cat long 0.357021 -0.674600
dog short -1.776904 -0.968914
2 cat long -1.294524 0.413738
dog short 0.276662 -0.472035
3 cat long -0.013960 -0.362543
dog short -0.006154 -0.923061
Missing Data¶
These functions are intelligent about handling missing data and do not expect each subgroup within the hierarchical index to have the same set of labels. They also can handle the index being unsorted (but you can make it sorted by calling sort_index, of course). Here is a more complex example:
In [28]: columns = pd.MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'),
....: ('B', 'cat'), ('A', 'dog')],
....: names=['exp', 'animal'])
....:
In [29]: index = pd.MultiIndex.from_product([('bar', 'baz', 'foo', 'qux'),
....: ('one', 'two')],
....: names=['first', 'second'])
....:
In [30]: df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
In [31]: df2 = df.iloc[[0, 1, 2, 4, 5, 7]]
In [32]: df2
Out[32]:
exp A B A
animal cat dog cat dog
first second
bar one 0.895717 0.805244 -1.206412 2.565646
two 1.431256 1.340309 -1.170299 -0.226169
baz one 0.410835 0.813850 0.132003 -0.827317
foo one -1.413681 1.607920 1.024180 0.569605
two 0.875906 -2.211372 0.974466 -2.006747
qux two -1.226825 0.769804 -1.281247 -0.727707
As mentioned above, stack can be called with a level argument to select which level in the columns to stack:
In [33]: df2.stack('exp')
Out[33]:
animal cat dog
first second exp
bar one A 0.895717 2.565646
B -1.206412 0.805244
two A 1.431256 -0.226169
B -1.170299 1.340309
baz one A 0.410835 -0.827317
B 0.132003 0.813850
foo one A -1.413681 0.569605
B 1.024180 1.607920
two A 0.875906 -2.006747
B 0.974466 -2.211372
qux two A -1.226825 -0.727707
B -1.281247 0.769804
In [34]: df2.stack('animal')
Out[34]:
exp A B
first second animal
bar one cat 0.895717 -1.206412
dog 2.565646 0.805244
two cat 1.431256 -1.170299
dog -0.226169 1.340309
baz one cat 0.410835 0.132003
dog -0.827317 0.813850
foo one cat -1.413681 1.024180
dog 0.569605 1.607920
two cat 0.875906 0.974466
dog -2.006747 -2.211372
qux two cat -1.226825 -1.281247
dog -0.727707 0.769804
Unstacking can result in missing values if subgroups do not have the same set of labels. By default, missing values will be replaced with the default fill value for that data type, NaN for float, NaT for datetimelike, etc. For integer types, by default data will be converted to float and missing values will be set to NaN.
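The integer-to-float conversion can be sketched with a tiny Series (the labels below are illustrative):

```python
import numpy as np
import pandas as pd

# integer Series over a MultiIndex missing the ('b', 'y') combination
s = pd.Series([1, 2, 3],
              index=pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x')]))
u = s.unstack()
# ('b', 'y') had no entry, so the result holds NaN and is upcast to float
print(u.loc['b', 'y'])   # nan
print(u['y'].dtype)      # float64
```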
In [35]: df3 = df.iloc[[0, 1, 4, 7], [1, 2]]
In [36]: df3
Out[36]:
exp B
animal dog cat
first second
bar one 0.805244 -1.206412
two 1.340309 -1.170299
foo one 1.607920 1.024180
qux two 0.769804 -1.281247
In [37]: df3.unstack()
Out[37]:
exp B
animal dog cat
second one two one two
first
bar 0.805244 1.340309 -1.206412 -1.170299
foo 1.607920 NaN 1.024180 NaN
qux NaN 0.769804 NaN -1.281247
Alternatively, unstack takes an optional fill_value argument for specifying the value of missing data.
In [38]: df3.unstack(fill_value=-1e9)
Out[38]:
exp B
animal dog cat
second one two one two
first
bar 8.052440e-01 1.340309e+00 -1.206412e+00 -1.170299e+00
foo 1.607920e+00 -1.000000e+09 1.024180e+00 -1.000000e+09
qux -1.000000e+09 7.698036e-01 -1.000000e+09 -1.281247e+00
With a MultiIndex¶
Unstacking when the columns are a MultiIndex is also careful about doing the right thing:
In [39]: df[:3].unstack(0)
Out[39]:
exp A B A \
animal cat dog cat dog
first bar baz bar baz bar baz bar
second
one 0.895717 0.410835 0.805244 0.81385 -1.206412 0.132003 2.565646
two 1.431256 NaN 1.340309 NaN -1.170299 NaN -0.226169
exp
animal
first baz
second
one -0.827317
two NaN
In [40]: df2.unstack(1)
Out[40]:
exp A B A \
animal cat dog cat dog
second one two one two one two one
first
bar 0.895717 1.431256 0.805244 1.340309 -1.206412 -1.170299 2.565646
baz 0.410835 NaN 0.813850 NaN 0.132003 NaN -0.827317
foo -1.413681 0.875906 1.607920 -2.211372 1.024180 0.974466 0.569605
qux NaN -1.226825 NaN 0.769804 NaN -1.281247 NaN
exp
animal
second two
first
bar -0.226169
baz NaN
foo -2.006747
qux -0.727707
Reshaping by Melt¶
The melt() function is useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are "unpivoted" to the row axis, leaving just two non-identifier columns, "variable" and "value". The names of those columns can be customized by supplying the var_name and value_name parameters.
For instance,
In [41]: cheese = pd.DataFrame({'first' : ['John', 'Mary'],
....: 'last' : ['Doe', 'Bo'],
....: 'height' : [5.5, 6.0],
....: 'weight' : [130, 150]})
....:
In [42]: cheese
Out[42]:
first height last weight
0 John 5.5 Doe 130
1 Mary 6.0 Bo 150
In [43]: pd.melt(cheese, id_vars=['first', 'last'])
Out[43]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [44]: pd.melt(cheese, id_vars=['first', 'last'], var_name='quantity')
Out[44]:
first last quantity value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
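The value column can be renamed similarly via value_name; a self-contained sketch that rebuilds the cheese frame from above:

```python
import pandas as pd

cheese = pd.DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})
# rename both non-identifier columns at once
melted = pd.melt(cheese, id_vars=['first', 'last'],
                 var_name='quantity', value_name='amount')
print(melted.columns.tolist())  # ['first', 'last', 'quantity', 'amount']
```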
Another way to transform is to use the wide_to_long panel data convenience function.
In [45]: dft = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
....: "A1980" : {0 : "d", 1 : "e", 2 : "f"},
....: "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
....: "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
....: "X" : dict(zip(range(3), np.random.randn(3)))
....: })
....:
In [46]: dft["id"] = dft.index
In [47]: dft
Out[47]:
A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 -0.121306 0
1 b e 1.2 1.3 -0.097883 1
2 c f 0.7 0.1 0.695775 2
In [48]: pd.wide_to_long(dft, ["A", "B"], i="id", j="year")
Out[48]:
X A B
id year
0 1970 -0.121306 a 2.5
1 1970 -0.097883 b 1.2
2 1970 0.695775 c 0.7
0 1980 -0.121306 d 3.2
1 1980 -0.097883 e 1.3
2 1980 0.695775 f 0.1
Combining with stats and GroupBy¶
Combining pivot / stack / unstack with GroupBy and the basic Series and DataFrame statistical functions can produce some very expressive and fast data manipulations.
In [49]: df
Out[49]:
exp A B A
animal cat dog cat dog
first second
bar one 0.895717 0.805244 -1.206412 2.565646
two 1.431256 1.340309 -1.170299 -0.226169
baz one 0.410835 0.813850 0.132003 -0.827317
two -0.076467 -1.187678 1.130127 -1.436737
foo one -1.413681 1.607920 1.024180 0.569605
two 0.875906 -2.211372 0.974466 -2.006747
qux one -0.410001 -0.078638 0.545952 -1.219217
two -1.226825 0.769804 -1.281247 -0.727707
In [50]: df.stack().mean(1).unstack()
Out[50]:
animal cat dog
first second
bar one -0.155347 1.685445
two 0.130479 0.557070
baz one 0.271419 -0.006733
two 0.526830 -1.312207
foo one -0.194750 1.088763
two 0.925186 -2.109060
qux one 0.067976 -0.648927
two -1.254036 0.021048
# same result, another way
In [51]: df.groupby(level=1, axis=1).mean()
Out[51]:
animal cat dog
first second
bar one -0.155347 1.685445
two 0.130479 0.557070
baz one 0.271419 -0.006733
two 0.526830 -1.312207
foo one -0.194750 1.088763
two 0.925186 -2.109060
qux one 0.067976 -0.648927
two -1.254036 0.021048
In [52]: df.stack().groupby(level=1).mean()
Out[52]:
exp A B
second
one 0.071448 0.455513
two -0.424186 -0.204486
In [53]: df.mean().unstack(0)
Out[53]:
exp A B
animal
cat 0.060843 0.018596
dog -0.413580 0.232430
Pivot tables¶
The function pandas.pivot_table can be used to create spreadsheet-style pivot tables. See the cookbook for some advanced strategies.
It takes a number of arguments:
- data: a DataFrame object
- values: a column or a list of columns to aggregate
- index: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.
- columns: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column. If an array is passed, it is used in the same manner as column values.
- aggfunc: function to use for aggregation, defaulting to numpy.mean
Consider a data set like this:
In [54]: import datetime
In [55]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
....: 'B': ['A', 'B', 'C'] * 8,
....: 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
....: 'D': np.random.randn(24),
....: 'E': np.random.randn(24),
....: 'F': [datetime.datetime(2013, i, 1) for i in range(1, 13)] +
....: [datetime.datetime(2013, i, 15) for i in range(1, 13)]})
....:
In [56]: df
Out[56]:
A B C D E F
0 one A foo 0.341734 -0.317441 2013-01-01
1 one B foo 0.959726 -1.236269 2013-02-01
2 two C foo -1.110336 0.896171 2013-03-01
3 three A bar -0.619976 -0.487602 2013-04-01
4 one B bar 0.149748 -0.082240 2013-05-01
5 one C bar -0.732339 -2.182937 2013-06-01
6 two A foo 0.687738 0.380396 2013-07-01
.. ... .. ... ... ... ...
17 one C bar -0.345352 0.206053 2013-06-15
18 two A foo 1.314232 -0.251905 2013-07-15
19 three B foo 0.690579 -2.213588 2013-08-15
20 one C foo 0.995761 1.063327 2013-09-15
21 one A bar 2.396780 1.266143 2013-10-15
22 two B bar 0.014871 0.299368 2013-11-15
23 three C bar 3.357427 -0.863838 2013-12-15
[24 rows x 6 columns]
我们可以非常容易地从这些数据生成数据透视表:
In [57]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[57]:
C bar foo
A B
one A 1.120915 -0.514058
B -0.338421 0.002759
C -0.538846 0.699535
three A -1.181568 NaN
B NaN 0.433512
C 0.588783 NaN
two A NaN 1.000985
B 0.158248 NaN
C NaN 0.176180
In [58]: pd.pivot_table(df, values='D', index=['B'], columns=['A', 'C'], aggfunc=np.sum)
Out[58]:
A one three two
C bar foo bar foo bar foo
B
A 2.241830 -1.028115 -2.363137 NaN NaN 2.001971
B -0.676843 0.005518 NaN 0.867024 0.316495 NaN
C -1.077692 1.399070 1.177566 NaN NaN 0.352360
In [59]: pd.pivot_table(df, values=['D','E'], index=['B'], columns=['A', 'C'], aggfunc=np.sum)
Out[59]:
D E \
A one three two one
C bar foo bar foo bar foo bar
B
A 2.241830 -1.028115 -2.363137 NaN NaN 2.001971 2.786113
B -0.676843 0.005518 NaN 0.867024 0.316495 NaN 1.368280
C -1.077692 1.399070 1.177566 NaN NaN 0.352360 -1.976883
A three two
C foo bar foo bar foo
B
A -0.043211 1.922577 NaN NaN 0.128491
B -1.103384 NaN -2.128743 -0.194294 NaN
C 1.495717 -0.263660 NaN NaN 0.872482
The result object is a DataFrame having potentially hierarchical indexes on the rows and columns. If the values column name is not given, the pivot table will include all of the data that can be aggregated in an additional hierarchical level in the columns:
In [60]: pd.pivot_table(df, index=['A', 'B'], columns=['C'])
Out[60]:
D E
C bar foo bar foo
A B
one A 1.120915 -0.514058 1.393057 -0.021605
B -0.338421 0.002759 0.684140 -0.551692
C -0.538846 0.699535 -0.988442 0.747859
three A -1.181568 NaN 0.961289 NaN
B NaN 0.433512 NaN -1.064372
C 0.588783 NaN -0.131830 NaN
two A NaN 1.000985 NaN 0.064245
B 0.158248 NaN -0.097147 NaN
C NaN 0.176180 NaN 0.436241
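aggfunc can also be a dict mapping column names to aggregation functions; a minimal sketch with toy data (the frame and names below are illustrative, not from the example above):

```python
import pandas as pd

d = pd.DataFrame({'A': ['x', 'x', 'y'],
                  'D': [1.0, 3.0, 5.0],
                  'E': [2.0, 4.0, 6.0]})
# aggregate D by mean and E by sum in a single pivot table
t = pd.pivot_table(d, index='A', aggfunc={'D': 'mean', 'E': 'sum'})
print(t.loc['x', 'D'], t.loc['x', 'E'])  # 2.0 6.0
```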
Also, you can use Grouper for the index and columns keywords. For detail of Grouper, see Grouping with a Grouper specification.
In [61]: pd.pivot_table(df, values='D', index=pd.Grouper(freq='M', key='F'), columns='C')
Out[61]:
C bar foo
F
2013-01-31 NaN -0.514058
2013-02-28 NaN 0.002759
2013-03-31 NaN 0.176180
2013-04-30 -1.181568 NaN
2013-05-31 -0.338421 NaN
2013-06-30 -0.538846 NaN
2013-07-31 NaN 1.000985
2013-08-31 NaN 0.433512
2013-09-30 NaN 0.699535
2013-10-31 1.120915 NaN
2013-11-30 0.158248 NaN
2013-12-31 0.588783 NaN
You can render a nice output of the table omitting the missing values by calling to_string if you wish:
In [62]: table = pd.pivot_table(df, index=['A', 'B'], columns=['C'])
In [63]: print(table.to_string(na_rep=''))
D E
C bar foo bar foo
A B
one A 1.120915 -0.514058 1.393057 -0.021605
B -0.338421 0.002759 0.684140 -0.551692
C -0.538846 0.699535 -0.988442 0.747859
three A -1.181568 0.961289
B 0.433512 -1.064372
C 0.588783 -0.131830
two A 1.000985 0.064245
B 0.158248 -0.097147
C 0.176180 0.436241
Note that pivot_table is also available as an instance method on DataFrame.
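The equivalence of the function and the instance method can be sketched on a small illustrative frame (names below are not from the example above):

```python
import pandas as pd

d = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                  'C': ['u', 'v', 'u', 'v'],
                  'D': [1.0, 2.0, 3.0, 4.0]})
# top-level function and DataFrame method produce the same table
t1 = pd.pivot_table(d, values='D', index='A', columns='C')
t2 = d.pivot_table(values='D', index='A', columns='C')
print(t1.equals(t2))  # True
```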
Adding margins¶
If you pass margins=True to pivot_table, special All columns and rows will be added with partial group aggregates across the categories on the rows and columns:
In [64]: df.pivot_table(index=['A', 'B'], columns='C', margins=True, aggfunc=np.std)
Out[64]:
D E
C bar foo All bar foo All
A B
one A 1.804346 1.210272 1.569879 0.179483 0.418374 0.858005
B 0.690376 1.353355 0.898998 1.083825 0.968138 1.101401
C 0.273641 0.418926 0.771139 1.689271 0.446140 1.422136
three A 0.794212 NaN 0.794212 2.049040 NaN 2.049040
B NaN 0.363548 0.363548 NaN 1.625237 1.625237
C 3.915454 NaN 3.915454 1.035215 NaN 1.035215
two A NaN 0.442998 0.442998 NaN 0.447104 0.447104
B 0.202765 NaN 0.202765 0.560757 NaN 0.560757
C NaN 1.819408 1.819408 NaN 0.650439 0.650439
All 1.556686 0.952552 1.246608 1.250924 0.899904 1.059389
Cross tabulations¶
Use the crosstab function to compute a cross-tabulation of two (or more) factors. By default crosstab computes a frequency table of the factors unless an array of values and an aggregation function are passed.
It takes a number of arguments:
- index: array-like, values to group by in the rows
- columns: array-like, values to group by in the columns
- values: array-like, optional, array of values to aggregate according to the factors
- aggfunc: function, optional, if no values array is passed, computes a frequency table
- rownames: sequence, default None, must match number of row arrays passed
- colnames: sequence, default None, if passed, must match number of column arrays passed
- margins: boolean, default False, add row/column margins (subtotals)
- normalize: boolean, {'all', 'index', 'columns'}, or {0, 1}, default False. Normalizes by dividing all values by the sum of values.
Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are specified.
For example:
In [65]: foo, bar, dull, shiny, one, two = 'foo', 'bar', 'dull', 'shiny', 'one', 'two'
In [66]: a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)
In [67]: b = np.array([one, one, two, one, two, one], dtype=object)
In [68]: c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)
In [69]: pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
Out[69]:
b one two
c dull shiny dull shiny
a
bar 1 0 0 1
foo 2 1 1 0
If crosstab receives only two Series, it will provide a frequency table.
In [70]: df = pd.DataFrame({'A': [1, 2, 2, 2, 2], 'B': [3, 3, 4, 4, 4],
....: 'C': [1, 1, np.nan, 1, 1]})
....:
In [71]: df
Out[71]:
A B C
0 1 3 1.0
1 2 3 1.0
2 2 4 NaN
3 2 4 1.0
4 2 4 1.0
In [72]: pd.crosstab(df.A, df.B)
Out[72]:
B 3 4
A
1 1 0
2 1 3
Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.
In [73]: foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
In [74]: bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
In [75]: pd.crosstab(foo, bar)
Out[75]:
col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0
Normalization¶
New in version 0.18.1.
Frequency tables can also be normalized to show percentages rather than counts using the normalize argument:
In [76]: pd.crosstab(df.A, df.B, normalize=True)
Out[76]:
B 3 4
A
1 0.2 0.0
2 0.2 0.6
normalize can also normalize values within each row or within each column:
In [77]: pd.crosstab(df.A, df.B, normalize='columns')
Out[77]:
B 3 4
A
1 0.5 0.0
2 0.5 1.0
crosstab can also be passed a third Series and an aggregation function (aggfunc) that will be applied to the values of the third Series within each group defined by the first two Series:
In [78]: pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum)
Out[78]:
B 3 4
A
1 1.0 NaN
2 1.0 2.0
Adding Margins¶
Finally, one can also add margins or normalize this output.
In [79]: pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum, normalize=True,
....: margins=True)
....:
Out[79]:
B 3 4 All
A
1 0.25 0.0 0.25
2 0.25 0.5 0.75
All 0.50 0.5 1.00
Tiling¶
The cut function computes groupings for the values of the input array and is often used to transform continuous variables to discrete or categorical variables:
In [80]: ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
In [81]: pd.cut(ages, bins=3)
Out[81]:
[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60], (43.333, 60]]
Categories (3, object): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60]]
If the bins keyword is an integer, then equal-width bins are formed. Alternatively we can specify custom bin-edges:
In [82]: pd.cut(ages, bins=[0, 18, 35, 70])
Out[82]:
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]
Categories (3, object): [(0, 18] < (18, 35] < (35, 70]]
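cut also accepts a labels argument to name the custom bins; the label names below are illustrative:

```python
import numpy as np
import pandas as pd

ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
# attach human-readable labels to the three custom bins
cats = pd.cut(ages, bins=[0, 18, 35, 70], labels=['child', 'adult', 'senior'])
print(cats[0], cats[8])  # child senior
```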
Computing indicator / dummy variables¶
To convert a categorical variable into a "dummy" or "indicator" DataFrame, for example a column in a DataFrame (a Series) which has k distinct values, we can derive a DataFrame containing k columns of 1s and 0s:
In [83]: df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})
In [84]: pd.get_dummies(df['key'])
Out[84]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
Sometimes it's useful to prefix the column names, for example when merging the result with the original DataFrame:
In [85]: dummies = pd.get_dummies(df['key'], prefix='key')
In [86]: dummies
Out[86]:
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In [87]: df[['data1']].join(dummies)
Out[87]:
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
This function is often used along with discretization functions like cut:
In [88]: values = np.random.randn(10)
In [89]: values
Out[89]:
array([ 0.4082, -1.0481, -0.0257, -0.9884, 0.0941, 1.2627, 1.29 ,
0.0824, -0.0558, 0.5366])
In [90]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
In [91]: pd.get_dummies(pd.cut(values, bins))
Out[91]:
(0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1]
0 0 0 1 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 1 0 0 0 0
8 0 0 0 0 0
9 0 0 1 0 0
New in version 0.15.0.
get_dummies() also accepts a DataFrame. By default all categorical variables (categorical in the statistical sense, those with object or categorical dtype) are encoded as dummy variables.
In [92]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
....: 'C': [1, 2, 3]})
....:
In [93]: pd.get_dummies(df)
Out[93]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
All non-object columns are included untouched in the output.
You can control the columns that are encoded with the columns keyword.
In [94]: pd.get_dummies(df, columns=['A'])
Out[94]:
B C A_a A_b
0 c 1 1 0
1 c 2 0 1
2 b 3 1 0
Notice that the B column is still included in the output, it just hasn't been encoded. You can drop B before calling get_dummies if you don't want to include it in the output.
As with the Series version, you can pass values for prefix and prefix_sep. By default the column name is used as the prefix, and '_' as the prefix separator. You can specify prefix and prefix_sep in 3 ways:
- string: Use the same value for prefix or prefix_sep for each column to be encoded
- list: Must be the same length as the number of columns being encoded.
- dict: Mapping column name to prefix
In [95]: simple = pd.get_dummies(df, prefix='new_prefix')
In [96]: simple
Out[96]:
C new_prefix_a new_prefix_b new_prefix_b new_prefix_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [97]: from_list = pd.get_dummies(df, prefix=['from_A', 'from_B'])
In [98]: from_list
Out[98]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [99]: from_dict = pd.get_dummies(df, prefix={'B': 'from_B', 'A': 'from_A'})
In [100]: from_dict
Out[100]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
New in version 0.18.0.
Sometimes it will be useful to only keep k-1 levels of a categorical variable to avoid collinearity when feeding the result to statistical models. You can switch to this mode by turning on drop_first.
In [101]: s = pd.Series(list('abcaa'))
In [102]: pd.get_dummies(s)
Out[102]:
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
In [103]: pd.get_dummies(s, drop_first=True)
Out[103]:
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
When a column contains only one level, it will be omitted in the result.
In [104]: df = pd.DataFrame({'A':list('aaaaa'),'B':list('ababc')})
In [105]: pd.get_dummies(df)
Out[105]:
A_a B_a B_b B_c
0 1 1 0 0
1 1 0 1 0
2 1 1 0 0
3 1 0 1 0
4 1 0 0 1
In [106]: pd.get_dummies(df, drop_first=True)
Out[106]:
B_b B_c
0 0 0
1 1 0
2 0 0
3 1 0
4 0 1
Factorizing values¶
To encode 1-d values as an enumerated type use factorize:
In [107]: x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
In [108]: x
Out[108]:
0 A
1 A
2 NaN
3 B
4 3.14
5 inf
dtype: object
In [109]: labels, uniques = pd.factorize(x)
In [110]: labels
Out[110]: array([ 0, 0, -1, 1, 2, 3])
In [111]: uniques
Out[111]: Index([u'A', u'B', 3.14, inf], dtype='object')
Note that factorize is similar to numpy.unique, but differs in its handling of NaN:
Note
The following numpy.unique will fail under Python 3 with a TypeError because of an ordering bug. See also here.
In [112]: pd.factorize(x, sort=True)
Out[112]:
(array([ 2, 2, -1, 3, 0, 1]),
Index([3.14, inf, u'A', u'B'], dtype='object'))
In [113]: np.unique(x, return_inverse=True)[::-1]
Out[113]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
Note
If you just want to handle one column as a categorical variable (like R's factor), you can use df["cat_col"] = pd.Categorical(df["col"]) or df["cat_col"] = df["col"].astype("category"). For full docs on Categorical, see the Categorical introduction and the API documentation. This feature was introduced in version 0.15.
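A minimal sketch of the two equivalent conversions mentioned in the note (the frame and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'a']})
df['cat_col'] = df['col'].astype('category')   # via astype
df['cat_col2'] = pd.Categorical(df['col'])     # via the Categorical constructor
print(str(df['cat_col'].dtype))              # category
print(df['cat_col'].equals(df['cat_col2']))  # True
```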