Intro to Data Structures¶

```In [1]: import numpy as np

In [2]: import pandas as pd
```

Series¶

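`Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call: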

```>>> s = pd.Series(data, index=index)
```

Here, `data` can be many different things:

• a Python dict
• an ndarray
• a scalar value (like 5)

```In [3]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [4]: s
Out[4]:
a    0.2735
b    0.6052
c   -0.1692
d    1.8298
e    0.5432
dtype: float64

In [5]: s.index
Out[5]: Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

In [6]: pd.Series(np.random.randn(5))
Out[6]:
0    0.3674
1   -0.8230
2   -1.0295
3   -1.0523
4   -0.8502
dtype: float64
```

```In [7]: d = {'a' : 0., 'b' : 1., 'c' : 2.}

In [8]: pd.Series(d)
Out[8]:
a    0.0
b    1.0
c    2.0
dtype: float64

In [9]: pd.Series(d, index=['b', 'c', 'd', 'a'])
Out[9]:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
```

NaN (not a number) is the standard missing data marker used in pandas.
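As a small illustration of the NaN marker, the sketch below rebuilds a Series from a dict with one label the dict does not contain, then detects the gap. The variable names (`s_missing`, `missing_mask`, `total`) are chosen here for illustration and do not come from the original text.

```python
import pandas as pd

# Build a Series from a dict, requesting a label ('d') the dict does not contain.
d = {'a': 0., 'b': 1., 'c': 2.}
s_missing = pd.Series(d, index=['b', 'c', 'd', 'a'])

# isna() flags the entries that hold the NaN marker.
missing_mask = s_missing.isna()
print(missing_mask)

# Reductions such as sum() skip NaN by default, so only b, c and a contribute.
total = s_missing.sum()
print(total)
```

Only the label `'d'` is flagged as missing; the other three values are carried over from the dict unchanged.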

```In [10]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[10]:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
```

Series is ndarray-like¶

`Series` acts very similarly to an `ndarray` and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index.

```In [11]: s[0]
Out[11]: 0.27348116325673794

In [12]: s[:3]
Out[12]:
a    0.2735
b    0.6052
c   -0.1692
dtype: float64

In [13]: s[s > s.median()]
Out[13]:
b    0.6052
d    1.8298
dtype: float64

In [14]: s[[4, 3, 1]]
Out[14]:
e    0.5432
d    1.8298
b    0.6052
dtype: float64

In [15]: np.exp(s)
Out[15]:
a    1.3145
b    1.8317
c    0.8443
d    6.2327
e    1.7215
dtype: float64
```
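The session above uses plain integers inside `[]` for positional access. Recent pandas releases steer purely positional lookups on a labeled Series toward `.iloc`, so a minimal sketch of the same operations with `.iloc` may be useful. The data and the name `s_demo` are illustrative assumptions, not values from the session above.

```python
import pandas as pd

s_demo = pd.Series([10., 20., 30., 40., 50.], index=['a', 'b', 'c', 'd', 'e'])

first = s_demo.iloc[0]            # scalar at position 0
head = s_demo.iloc[:3]            # positions 0..2; labels 'a'..'c' travel with the values
picked = s_demo.iloc[[4, 3, 1]]   # arbitrary positions, in the order given
above_median = s_demo[s_demo > s_demo.median()]  # boolean mask, as in the session above

print(list(head.index))
print(list(picked.index))
```

In each case the result keeps the labels of the selected positions, which is what "slicing also slices the index" means in practice.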

Series is dict-like¶

```In [16]: s['a']
Out[16]: 0.27348116325673794

In [17]: s['e'] = 12.

In [18]: s
Out[18]:
a     0.2735
b     0.6052
c    -0.1692
d     1.8298
e    12.0000
dtype: float64

In [19]: 'e' in s
Out[19]: True

In [20]: 'f' in s
Out[20]: False
```

```>>> s['f']
KeyError: 'f'
```

```In [21]: s.get('f')

In [22]: s.get('f', np.nan)
Out[22]: nan
```

Vectorized operations and label alignment with Series¶

```In [23]: s + s
Out[23]:
a     0.5470
b     1.2104
c    -0.3385
d     3.6596
e    24.0000
dtype: float64

In [24]: s * 2
Out[24]:
a     0.5470
b     1.2104
c    -0.3385
d     3.6596
e    24.0000
dtype: float64

In [25]: np.exp(s)
Out[25]:
a         1.3145
b         1.8317
c         0.8443
d         6.2327
e    162754.7914
dtype: float64
```

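A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.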

```In [26]: s[1:] + s[:-1]
Out[26]:
a       NaN
b    1.2104
c   -0.3385
d    3.6596
e       NaN
dtype: float64
```
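To make the alignment rule concrete with fixed numbers, here is a minimal sketch using two Series whose labels only partly overlap. The names `left`, `right`, `aligned_sum` and `filled_sum` are illustrative.

```python
import pandas as pd

left = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
right = pd.Series([10., 20., 30.], index=['b', 'c', 'd'])

# Labels are matched before adding; 'a' and 'd' appear in only one operand,
# so the result holds NaN for them.
aligned_sum = left + right
print(aligned_sum)

# dropna() discards the labels that had no partner.
print(aligned_sum.dropna())

# add() with fill_value treats a missing partner as 0 instead of producing NaN.
filled_sum = left.add(right, fill_value=0)
print(filled_sum)
```

The result index is the union of the two input indexes. Whether an unmatched label should become NaN or fall back to a default is a choice made per computation.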

Name attribute¶

```In [27]: s = pd.Series(np.random.randn(5), name='something')

In [28]: s
Out[28]:
0    1.5140
1   -1.2345
2    0.5666
3   -1.0184
4    0.1081
Name: something, dtype: float64

In [29]: s.name
Out[29]: 'something'
```

```In [30]: s2 = s.rename("different")

In [31]: s2.name
Out[31]: 'different'
```

DataFrame¶

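DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input: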

• Dict of 1D ndarrays, lists, dicts, or Series
• 2-D numpy.ndarray
• Structured or record ndarray
• A `Series`
• Another `DataFrame`

From dict of Series or dicts¶

```In [32]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
....:      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:

In [33]: df = pd.DataFrame(d)

In [34]: df
Out[34]:
one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

In [35]: pd.DataFrame(d, index=['d', 'b', 'a'])
Out[35]:
one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

In [36]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[36]:
two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
```

```In [37]: df.index
Out[37]: Index([u'a', u'b', u'c', u'd'], dtype='object')

In [38]: df.columns
Out[38]: Index([u'one', u'two'], dtype='object')
```

From dict of ndarrays / lists¶

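The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be `range(n)`, where `n` is the array length.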

```In [39]: d = {'one' : [1., 2., 3., 4.],
....:      'two' : [4., 3., 2., 1.]}
....:

In [40]: pd.DataFrame(d)
Out[40]:
one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

In [41]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[41]:
one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
```
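The length rule can be checked directly. The following sketch assumes that pandas signals a length mismatch with a `ValueError`; the helper `try_build` is hypothetical and exists only to keep the example short.

```python
import pandas as pd

good = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
bad = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2.]}   # 'two' is one item short

def try_build(data, index=None):
    """Return the DataFrame, or None when pandas rejects the input."""
    try:
        return pd.DataFrame(data, index=index)
    except ValueError as exc:
        print('rejected:', exc)
        return None

ok = try_build(good)                              # default index 0..3
short_column = try_build(bad)                     # columns differ in length
short_index = try_build(good, index=['a', 'b'])   # index shorter than the columns
```

Only the first call yields a DataFrame; the other two are rejected because a column or the index has the wrong length.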

From structured or record array¶

```In [42]: data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])

In [43]: data[:] = [(1,2.,'Hello'), (2,3.,"World")]

In [44]: pd.DataFrame(data)
Out[44]:
A    B      C
0  1  2.0  Hello
1  2  3.0  World

In [45]: pd.DataFrame(data, index=['first', 'second'])
Out[45]:
A    B      C
first   1  2.0  Hello
second  2  3.0  World

In [46]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[46]:
C  A    B
0  Hello  1  2.0
1  World  2  3.0
```

DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

From a list of dicts¶

```In [47]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [48]: pd.DataFrame(data2)
Out[48]:
a   b     c
0  1   2   NaN
1  5  10  20.0

In [49]: pd.DataFrame(data2, index=['first', 'second'])
Out[49]:
a   b     c
first   1   2   NaN
second  5  10  20.0

In [50]: pd.DataFrame(data2, columns=['a', 'b'])
Out[50]:
a   b
0  1   2
1  5  10
```

From a dict of tuples¶

```In [51]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
....:
Out[51]:
a              b
a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
C  3.0  2.0  6.0  7.0   NaN
D  NaN  NaN  NaN  NaN   9.0
```

From a Series¶

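This topic is covered in more detail in the Missing data section. To construct a DataFrame with missing data, use `np.nan` for the missing values. Alternatively, you may pass a `numpy.MaskedArray` as the data argument to the DataFrame constructor, and its masked entries will be considered missing.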
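Both routes to missing values can be sketched briefly. The arrays and column names below are made up for illustration; floating point data is used so that NaN can be stored without a dtype change.

```python
import numpy as np
import numpy.ma as ma
import pandas as pd

# Option 1: write np.nan directly where a value is missing.
df_nan = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

# Option 2: mask the entries that should count as missing.
values = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[False, True], [False, False]])   # hide row 0, column 1
masked = ma.masked_array(values, mask=mask)
df_masked = pd.DataFrame(masked, columns=['a', 'b'])

print(df_nan)
print(df_masked)
```

In `df_masked`, the masked cell appears as NaN while the unmasked cells keep their values.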

Alternate Constructors¶

DataFrame.from_dict

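`DataFrame.from_dict` takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the `DataFrame` constructor except for the `orient` parameter, which is `'columns'` by default but can be set to `'index'` in order to use the dict keys as row labels.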
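A short sketch of the two `orient` settings, using a made-up dict (`data`) for illustration:

```python
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}

# Default orient='columns': dict keys become column labels.
by_columns = pd.DataFrame.from_dict(data)

# orient='index': dict keys become row labels instead.
by_index = pd.DataFrame.from_dict(data, orient='index')

print(by_columns)
print(by_index)
```

With `orient='index'` each dict value becomes one row, so the result here has two rows and three columns.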

DataFrame.from_records

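`DataFrame.from_records` takes a list of tuples or an ndarray with structured dtype. It works analogously to the normal `DataFrame` constructor, except that the index may be a specific field of the structured dtype. For example: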

```In [52]: data
Out[52]:
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [53]: pd.DataFrame.from_records(data, index='C')
Out[53]:
A    B
C
Hello  1  2.0
World  2  3.0
```

DataFrame.from_items

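`DataFrame.from_items` works analogously to the form of the `dict` constructor that takes a sequence of `(key, value)` pairs, where the keys are column names (or row names, in the case of `orient='index'`) and the values are the column values (or row values). This can be useful for constructing a DataFrame with the columns in a particular order without having to pass an explicit list of columns: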

```In [54]: pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])
Out[54]:
A  B
0  1  4
1  2  5
2  3  6
```

```In [55]: pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])],
....:                         orient='index', columns=['one', 'two', 'three'])
....:
Out[55]:
one  two  three
A    1    2      3
B    4    5      6
```
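`DataFrame.from_items` was deprecated in pandas 0.23 and is not present in pandas 1.0 and later. For readers on such a release, the sketch below shows one way to get the same two results with `DataFrame.from_dict`. It assumes a Python version whose `dict` preserves insertion order (3.7+); the names `items`, `as_columns` and `as_rows` are illustrative.

```python
import pandas as pd

items = [('A', [1, 2, 3]), ('B', [4, 5, 6])]

# dict() keeps the pair order, so the column order of the pairs is preserved.
as_columns = pd.DataFrame.from_dict(dict(items))

# Row-oriented variant: keys become row labels and `columns` names the values.
as_rows = pd.DataFrame.from_dict(dict(items), orient='index',
                                 columns=['one', 'two', 'three'])

print(as_columns)
print(as_rows)
```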

Column selection, addition, deletion¶

```In [56]: df['one']
Out[56]:
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [57]: df['three'] = df['one'] * df['two']

In [58]: df['flag'] = df['one'] > 2

In [59]: df
Out[59]:
one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False
```

```In [60]: del df['two']

In [61]: three = df.pop('three')

In [62]: df
Out[62]:
one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
```

```In [63]: df['foo'] = 'bar'

In [64]: df
Out[64]:
one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar
```

```In [65]: df['one_trunc'] = df['one'][:2]

In [66]: df
Out[66]:
one   flag  foo  one_trunc
a  1.0  False  bar        1.0
b  2.0  False  bar        2.0
c  3.0   True  bar        NaN
d  NaN  False  bar        NaN
```

```In [67]: df.insert(1, 'bar', df['one'])

In [68]: df
Out[68]:
one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
c  3.0  3.0   True  bar        NaN
d  NaN  NaN  False  bar        NaN
```

Assigning New Columns in Method Chains¶

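Inspired by dplyr's `mutate` verb, DataFrame has an `assign()` method that allows you to easily create new columns that are potentially derived from existing columns.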

```In [69]: iris = pd.read_csv('data/iris.data')

In [70]: iris.head()
Out[70]:
SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [71]: (iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
....:      .head())
....:
Out[71]:
SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa       0.6863
1          4.9         3.0          1.4         0.2  Iris-setosa       0.6122
2          4.7         3.2          1.3         0.2  Iris-setosa       0.6809
3          4.6         3.1          1.5         0.2  Iris-setosa       0.6739
4          5.0         3.6          1.4         0.2  Iris-setosa       0.7200
```

```In [72]: iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
....:                                      x['SepalLength'])).head()
....:
Out[72]:
SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa       0.6863
1          4.9         3.0          1.4         0.2  Iris-setosa       0.6122
2          4.7         3.2          1.3         0.2  Iris-setosa       0.6809
3          4.6         3.1          1.5         0.2  Iris-setosa       0.6739
4          5.0         3.6          1.4         0.2  Iris-setosa       0.7200
```

`assign` always returns a copy of the data, leaving the original DataFrame untouched.

```In [73]: (iris.query('SepalLength > 5')
....:      .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
....:              PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
....:      .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
....:
Out[73]: <matplotlib.axes._subplots.AxesSubplot at 0x7ff286891b50>
```

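The function signature for `assign` is simply `**kwargs`. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a `Series` or NumPy array), or a function of one argument to be called on the `DataFrame`. A copy of the original DataFrame is returned, with the new values inserted.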

```>>> # Don't do this, bad reference to `C`
>>> df.assign(C = lambda x: x['A'] + x['B'],
...           D = lambda x: x['A'] + x['C'])
>>> # Instead, break it into two assigns
>>> (df.assign(C = lambda x: x['A'] + x['B'])
...    .assign(D = lambda x: x['A'] + x['C']))
```
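The two-step pattern can be run end to end on a small frame. The columns `A` and `B` and the name `frame` are illustrative assumptions, since the `df` used earlier in this section does not have those columns.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Each callable receives the DataFrame as it stands at that point in the chain,
# so the second assign can safely read the 'C' column created by the first.
result = (frame.assign(C=lambda x: x['A'] + x['B'])
               .assign(D=lambda x: x['A'] + x['C']))

print(result)
print(list(frame.columns))   # the original frame still has only A and B
```

Splitting dependent columns across separate `assign` calls avoids relying on the order in which keyword arguments are evaluated within a single call.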

Indexing / Selection¶

```In [75]: df.loc['b']
Out[75]:
one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object

In [76]: df.iloc[2]
Out[76]:
one             3
bar             3
flag         True
foo           bar
one_trunc     NaN
Name: c, dtype: object
```

Data alignment and arithmetic¶

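Data alignment between DataFrame objects automatically aligns on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.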

```In [77]: df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

In [78]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

In [79]: df + df2
Out[79]:
A       B       C   D
0  0.5222  0.3225 -0.7566 NaN
1 -0.8441  0.2334  0.8818 NaN
2 -2.2079 -0.1572 -0.3875 NaN
3  2.8080 -1.0927  1.0432 NaN
4 -1.7511 -2.0812  2.7477 NaN
5 -3.2473 -1.0850  0.7898 NaN
6 -1.7107  0.0661  0.1294 NaN
7     NaN     NaN     NaN NaN
8     NaN     NaN     NaN NaN
9     NaN     NaN     NaN NaN
```
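Because the session above uses random data, the following sketch repeats the idea with small fixed frames so the effect of the union is easy to verify by hand. The names `df_a`, `df_b` and `total` are illustrative.

```python
import pandas as pd

df_a = pd.DataFrame({'A': [1., 2., 3.], 'B': [4., 5., 6.]})   # rows 0..2
df_b = pd.DataFrame({'A': [10., 20.], 'C': [30., 40.]})       # rows 0..1

# Rows and columns are both aligned by label before the addition.
total = df_a + df_b
print(total)
```

The result has columns `A`, `B`, `C` and rows 0 to 2. Only cells present in both operands (column `A`, rows 0 and 1) hold a number; every other cell is NaN.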

```In [80]: df - df.iloc[0]
Out[80]:
A       B       C       D
0  0.0000  0.0000  0.0000  0.0000
1 -2.6396 -1.0702  1.7214 -0.7896
2 -2.7662 -1.6918  2.2776 -2.5401
3  0.8679 -3.5247  1.9365 -0.1331
4 -1.9883 -3.2162  2.0464 -1.0700
5 -3.3932 -4.0976  1.6366 -2.1635
6 -1.3668 -1.9572  1.6523 -0.7191
7 -0.7949 -2.1663  0.9706 -2.6297
8 -0.8383 -1.3630  1.6702 -2.0865
9  0.8588  0.0814  3.7305 -1.3737
```

```In [81]: index = pd.date_range('1/1/2000', periods=8)

In [82]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

In [83]: df
Out[83]:
A       B       C
2000-01-01  0.2731  0.3604 -1.1515
2000-01-02  1.1577  1.4787 -0.6528
2000-01-03 -0.7712  0.2203 -0.5739
2000-01-04 -0.6356 -1.1703 -0.0789
2000-01-05 -1.4687  0.1705 -1.8796
2000-01-06 -1.2037  0.9568 -1.1383
2000-01-07 -0.6540 -0.2169  0.3843
2000-01-08 -2.1639 -0.8145 -1.2475

In [84]: type(df['A'])
Out[84]: pandas.core.series.Series

In [85]: df - df['A']
Out[85]:
2000-01-01 00:00:00  2000-01-02 00:00:00  2000-01-03 00:00:00  \
2000-01-01                  NaN                  NaN                  NaN
2000-01-02                  NaN                  NaN                  NaN
2000-01-03                  NaN                  NaN                  NaN
2000-01-04                  NaN                  NaN                  NaN
2000-01-05                  NaN                  NaN                  NaN
2000-01-06                  NaN                  NaN                  NaN
2000-01-07                  NaN                  NaN                  NaN
2000-01-08                  NaN                  NaN                  NaN

2000-01-04 00:00:00 ...  2000-01-08 00:00:00   A   B   C
2000-01-01                  NaN ...                  NaN NaN NaN NaN
2000-01-02                  NaN ...                  NaN NaN NaN NaN
2000-01-03                  NaN ...                  NaN NaN NaN NaN
2000-01-04                  NaN ...                  NaN NaN NaN NaN
2000-01-05                  NaN ...                  NaN NaN NaN NaN
2000-01-06                  NaN ...                  NaN NaN NaN NaN
2000-01-07                  NaN ...                  NaN NaN NaN NaN
2000-01-08                  NaN ...                  NaN NaN NaN NaN

[8 rows x 11 columns]
```

```df - df['A']
```

```df.sub(df['A'], axis=0)
```
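Of the two spellings above, `df.sub(df['A'], axis=0)` states the column-wise intent explicitly: the Series is matched against the row labels rather than the column labels. A minimal sketch with fixed values follows; `frame`, `by_rows` and `by_columns` are illustrative names.

```python
import pandas as pd

index = pd.date_range('1/1/2000', periods=3)
frame = pd.DataFrame({'A': [1., 2., 3.], 'B': [10., 20., 30.]}, index=index)

# axis=0 matches the Series against the row labels, so column A is
# subtracted from every column, row by row.
by_rows = frame.sub(frame['A'], axis=0)

# The default matches the Series index against the column labels,
# which suits a row such as frame.iloc[0].
by_columns = frame - frame.iloc[0]

print(by_rows)
print(by_columns)
```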

```In [86]: df * 5 + 2
Out[86]:
A       B       C
2000-01-01  3.3655  3.8018 -3.7575
2000-01-02  7.7885  9.3936 -1.2641
2000-01-03 -1.8558  3.1017 -0.8696
2000-01-04 -1.1781 -3.8513  1.6056
2000-01-05 -5.3437  2.8523 -7.3982
2000-01-06 -4.0186  6.7842 -3.6915
2000-01-07 -1.2699  0.9157  3.9217
2000-01-08 -8.8194 -2.0724 -4.2375

In [87]: 1 / df
Out[87]:
A       B        C
2000-01-01  3.6616  2.7751  -0.8684
2000-01-02  0.8638  0.6763  -1.5318
2000-01-03 -1.2967  4.5383  -1.7424
2000-01-04 -1.5733 -0.8545 -12.6759
2000-01-05 -0.6809  5.8662  -0.5320
2000-01-06 -0.8308  1.0451  -0.8785
2000-01-07 -1.5291 -4.6113   2.6019
2000-01-08 -0.4621 -1.2278  -0.8016

In [88]: df ** 4
Out[88]:
A       B           C
2000-01-01   0.0056  0.0169  1.7581e+00
2000-01-02   1.7964  4.7813  1.8162e-01
2000-01-03   0.3537  0.0024  1.0849e-01
2000-01-04   0.1632  1.8755  3.8733e-05
2000-01-05   4.6534  0.0008  1.2482e+01
2000-01-06   2.0995  0.8382  1.6789e+00
2000-01-07   0.1829  0.0022  2.1819e-02
2000-01-08  21.9244  0.4401  2.4219e+00
```

```In [89]: df1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)

In [90]: df2 = pd.DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)

In [91]: df1 & df2
Out[91]:
a      b
0  False  False
1  False   True
2   True  False

In [92]: df1 | df2
Out[92]:
a     b
0  True  True
1  True  True
2  True  True

In [93]: df1 ^ df2
Out[93]:
a      b
0   True   True
1   True  False
2  False   True

In [94]: ~df1
Out[94]:
a      b
0  False   True
1   True  False
2  False  False
```

Transposing¶

```# only show the first 5 rows
In [95]: df[:5].T
Out[95]:
2000-01-01  2000-01-02  2000-01-03  2000-01-04  2000-01-05
A      0.2731      1.1577     -0.7712     -0.6356     -1.4687
B      0.3604      1.4787      0.2203     -1.1703      0.1705
C     -1.1515     -0.6528     -0.5739     -0.0789     -1.8796
```

DataFrame interoperability with NumPy functions¶

Elementwise NumPy ufuncs (log, exp, sqrt, ...) and various other NumPy functions can be used with no issues on DataFrame, assuming the data within are numeric:

```In [96]: np.exp(df)
Out[96]:
A       B       C
2000-01-01  1.3140  1.4338  0.3162
2000-01-02  3.1826  4.3873  0.5206
2000-01-03  0.4625  1.2465  0.5633
2000-01-04  0.5296  0.3103  0.9241
2000-01-05  0.2302  1.1859  0.1526
2000-01-06  0.3001  2.6034  0.3204
2000-01-07  0.5200  0.8050  1.4686
2000-01-08  0.1149  0.4429  0.2872

In [97]: np.asarray(df)
Out[97]:
array([[ 0.2731,  0.3604, -1.1515],
[ 1.1577,  1.4787, -0.6528],
[-0.7712,  0.2203, -0.5739],
[-0.6356, -1.1703, -0.0789],
[-1.4687,  0.1705, -1.8796],
[-1.2037,  0.9568, -1.1383],
[-0.654 , -0.2169,  0.3843],
[-2.1639, -0.8145, -1.2475]])
```

The dot method on DataFrame implements matrix multiplication:

```In [98]: df.T.dot(df)
Out[98]:
A       B       C
A  11.1298  2.8864  6.0015
B   2.8864  5.3895 -1.8913
C   6.0015 -1.8913  8.6204
```

```In [99]: s1 = pd.Series(np.arange(5,10))

In [100]: s1.dot(s1)
Out[100]: 255
```
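For a result that can be checked by hand, the sketch below applies `dot` to a small fixed frame and compares it with the plain NumPy product. The name `m` and its values are illustrative.

```python
import numpy as np
import pandas as pd

m = pd.DataFrame({'A': [1., 2.], 'B': [3., 4.]})

# (2 x 2) result whose rows and columns are labelled by the column names of m.
gram = m.T.dot(m)
print(gram)

# The same numbers computed on the underlying arrays, without labels.
plain = m.values.T @ m.values

# Series.dot gives the inner product, as in the session above.
s1 = pd.Series(np.arange(5, 10))
inner = s1.dot(s1)
```

The labelled and unlabelled products agree numerically; the DataFrame version additionally carries `A` and `B` as row and column labels.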

DataFrame is not intended to be a drop-in replacement for ndarray, as its indexing semantics are quite different in places from a matrix.

Console display¶

```In [101]: baseball = pd.read_csv('data/baseball.csv')

In [102]: print(baseball)
id     player  year  stint  ...   hbp   sh   sf  gidp
0   88641  womacto01  2006      2  ...   0.0  3.0  0.0   0.0
1   88643  schilcu01  2006      1  ...   0.0  0.0  0.0   0.0
..    ...        ...   ...    ...  ...   ...  ...  ...   ...
98  89533   aloumo01  2007      1  ...   2.0  0.0  3.0  13.0
99  89534  alomasa02  2007      1  ...   0.0  0.0  0.0   0.0

[100 rows x 23 columns]

In [103]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
id        100 non-null int64
player    100 non-null object
year      100 non-null int64
stint     100 non-null int64
team      100 non-null object
lg        100 non-null object
g         100 non-null int64
ab        100 non-null int64
r         100 non-null int64
h         100 non-null int64
X2b       100 non-null int64
X3b       100 non-null int64
hr        100 non-null int64
rbi       100 non-null float64
sb        100 non-null float64
cs        100 non-null float64
bb        100 non-null int64
so        100 non-null float64
ibb       100 non-null float64
hbp       100 non-null float64
sh        100 non-null float64
sf        100 non-null float64
gidp      100 non-null float64
dtypes: float64(9), int64(11), object(3)
memory usage: 18.0+ KB
```

```In [104]: print(baseball.iloc[-20:, :12].to_string())
id     player  year  stint team  lg    g   ab   r    h  X2b  X3b
80  89474  finlest01  2007      1  COL  NL   43   94   9   17    3    0
81  89480  embreal01  2007      1  OAK  AL    4    0   0    0    0    0
82  89481  edmonji01  2007      1  SLN  NL  117  365  39   92   15    2
83  89482  easleda01  2007      1  NYN  NL   76  193  24   54    6    0
84  89489  delgaca01  2007      1  NYN  NL  139  538  71  139   30    0
85  89493  cormirh01  2007      1  CIN  NL    6    0   0    0    0    0
86  89494  coninje01  2007      2  NYN  NL   21   41   2    8    2    0
87  89495  coninje01  2007      1  CIN  NL   80  215  23   57   11    1
88  89497  clemero02  2007      1  NYA  AL    2    2   0    1    0    0
89  89498  claytro01  2007      2  BOS  AL    8    6   1    0    0    0
90  89499  claytro01  2007      1  TOR  AL   69  189  23   48   14    0
91  89501  cirilje01  2007      2  ARI  NL   28   40   6    8    4    0
92  89502  cirilje01  2007      1  MIN  AL   50  153  18   40    9    2
93  89521  bondsba01  2007      1  SFN  NL  126  340  75   94   14    0
94  89523  biggicr01  2007      1  HOU  NL  141  517  68  130   31    3
95  89525  benitar01  2007      2  FLO  NL   34    0   0    0    0    0
96  89526  benitar01  2007      1  SFN  NL   19    0   0    0    0    0
97  89530  ausmubr01  2007      1  HOU  NL  117  349  38   82   16    3
98  89533   aloumo01  2007      1  NYN  NL   87  328  51  112   19    1
99  89534  alomasa02  2007      1  NYN  NL    8   22   1    3    1    0
```

```In [105]: pd.DataFrame(np.random.randn(3, 12))
Out[105]:
0         1         2         3         4         5         6   \
0  2.173014  1.273573  0.888325  0.631774  0.206584 -1.745845 -0.505310
1 -1.240418  2.177280 -0.082206  0.827373 -0.700792  0.524540 -1.101396
2  0.269598 -0.453050 -1.821539 -0.126332 -0.153257  0.405483 -0.504557

7         8         9         10        11
0  1.376623  0.741168 -0.509153 -2.012112 -1.204418
1  1.115750  0.294139  0.286939  1.709761 -0.212596
2  1.405148  0.778061 -0.799024 -0.670727  0.086877
```

```In [106]: pd.set_option('display.width', 40) # default is 80

In [107]: pd.DataFrame(np.random.randn(3, 12))
Out[107]:
0         1         2   \
0  1.179465  0.777427 -1.923460
1  0.054928  0.776156  0.372060
2 -0.243404 -1.506557 -1.977226

3         4         5   \
0  0.782432  0.203446  0.250652
1  0.710963 -0.784859  0.168405
2 -0.226582 -0.777971  0.231309

6         7         8   \
0 -2.349580 -0.540814 -0.748939
1  0.159230  0.866492  1.266025
2  1.394479  0.723474 -0.097256

9         10        11
0 -0.994345  1.478624 -0.341991
1  0.555240  0.731803  0.219383
2  0.375274 -0.314401 -2.363136
```

```In [108]: datafile={'filename': ['filename_01','filename_02'],
.....:           'path': ["media/user_name/storage/folder_01/filename_01",
.....:                    "media/user_name/storage/folder_02/filename_02"]}
.....:

In [109]: pd.set_option('display.max_colwidth',30)

In [110]: pd.DataFrame(datafile)
Out[110]:
filename  \
0  filename_01
1  filename_02

path
0  media/user_name/storage/fo...
1  media/user_name/storage/fo...

In [111]: pd.set_option('display.max_colwidth',100)

In [112]: pd.DataFrame(datafile)
Out[112]:
filename  \
0  filename_01
1  filename_02

path
0  media/user_name/storage/folder_01/filename_01
1  media/user_name/storage/folder_02/filename_02
```
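`pd.set_option` changes a display setting for the rest of the session. When a setting is only needed for one printout, `pd.option_context` may be a convenient alternative; the sketch below assumes nothing beyond the `display.width` option already shown, and the names `wide`, `width_inside` and `width_after` are illustrative.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.zeros((3, 12)))

default_width = pd.get_option('display.width')

# Inside the block the display width is 40 characters; the previous
# setting is restored automatically on exit.
with pd.option_context('display.width', 40):
    width_inside = pd.get_option('display.width')
    print(wide)

width_after = pd.get_option('display.width')
print(width_after == default_width)
```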

DataFrame column attribute access and IPython completion¶

```In [113]: df = pd.DataFrame({'foo1' : np.random.randn(5),
.....:                    'foo2' : np.random.randn(5)})
.....:

In [114]: df
Out[114]:
foo1      foo2
0 -0.412237  0.213232
1 -0.237644  1.740139
2  1.272869 -0.241491
3  1.220450 -0.868514
4  1.315172  0.407544

In [115]: df.foo1
Out[115]:
0   -0.412237
1   -0.237644
2    1.272869
3    1.220450
4    1.315172
Name: foo1, dtype: float64
```

```In [5]: df.fo<TAB>
df.foo1  df.foo2
```

Panel¶

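Panel is a somewhat less-used, but still important container for 3-dimensional data. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data and, in particular, econometric analysis of panel data. However, for the strict purposes of slicing and dicing a collection of DataFrame objects, you may find the axis names slightly arbitrary: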

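• items: axis 0, each item corresponds to a DataFrame contained inside
• major_axis: axis 1, it is the index (rows) of each of the DataFrames
• minor_axis: axis 2, it is the columns of each of the DataFrames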
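`Panel` was deprecated in pandas 0.20 and removed in 0.25, so the `pd.Panel` calls in this section only run on older releases. On a release without `Panel`, one possible way to hold the same three axes is a DataFrame with a two-level row index, sketched below. This is an illustrative substitute, not the API this section describes, and the level names `item` and `major` are assumptions chosen to mirror the axis names above.

```python
import numpy as np
import pandas as pd

major = pd.date_range('1/1/2000', periods=3)
minor = ['A', 'B']

# One DataFrame per item, all sharing the same rows (major) and columns (minor).
frames = {
    'Item1': pd.DataFrame(np.arange(6.0).reshape(3, 2), index=major, columns=minor),
    'Item2': pd.DataFrame(np.arange(6.0, 12.0).reshape(3, 2), index=major, columns=minor),
}

# concat on a dict uses the keys as the outer level of the row index.
stacked = pd.concat(frames)
stacked.index.names = ['item', 'major']

item1 = stacked.loc['Item1']                   # one item: an ordinary DataFrame
cross = stacked.xs(major[1], level='major')    # one date across all items

print(stacked)
print(cross)
```

Here `stacked.loc['Item1']` plays the role of selecting one item, and `xs` on the `major` level gives a cross-section over all items for one date.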

From 3D ndarray with optional axis labels¶

```In [116]: wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
.....:               major_axis=pd.date_range('1/1/2000', periods=5),
.....:               minor_axis=['A', 'B', 'C', 'D'])
.....:

In [117]: wp
Out[117]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
```

From dict of DataFrame objects¶

```In [118]: data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
.....:         'Item2' : pd.DataFrame(np.random.randn(4, 2))}
.....:

In [119]: pd.Panel(data)
Out[119]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
```

```In [120]: pd.Panel.from_dict(data, orient='minor')
Out[120]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: 0 to 2
Major_axis axis: 0 to 3
Minor_axis axis: Item1 to Item2
```

Orient is especially useful for mixed-type DataFrames. If you pass a dict of DataFrame objects with mixed-type columns, all of the data will get upcast to `dtype=object` unless you pass `orient='minor'`:

```In [121]: df = pd.DataFrame({'a': ['foo', 'bar', 'baz'],
.....:                    'b': np.random.randn(3)})
.....:

In [122]: df
Out[122]:
a         b
0  foo -1.142863
1  bar -1.015321
2  baz  0.683625

In [123]: data = {'item1': df, 'item2': df}

In [124]: panel = pd.Panel.from_dict(data, orient='minor')

In [125]: panel['a']
Out[125]:
item1 item2
0   foo   foo
1   bar   bar
2   baz   baz

In [126]: panel['b']
Out[126]:
item1     item2
0 -1.142863 -1.142863
1 -1.015321 -1.015321
2  0.683625  0.683625

In [127]: panel['b'].dtypes
Out[127]:
item1    float64
item2    float64
dtype: object
```

From DataFrame using `to_panel` method¶

```In [128]: midx = pd.MultiIndex(levels=[['one', 'two'], ['x','y']], labels=[[1,1,0,0],[1,0,1,0]])

In [129]: df = pd.DataFrame({'A' : [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=midx)

In [130]: df.to_panel()
Out[130]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: A to B
Major_axis axis: one to two
Minor_axis axis: x to y
```

Item selection / addition / deletion¶

```In [131]: wp['Item1']
Out[131]:
A         B         C         D
2000-01-01 -0.729430  0.427693 -0.121325 -0.736418
2000-01-02  0.739037 -0.648805 -0.383057  0.385027
2000-01-03  2.321064 -1.290881  0.105458 -1.097035
2000-01-04  0.158759 -1.261191 -0.081710  1.390506
2000-01-05 -1.962031 -0.505580  0.021253 -0.317071

In [132]: wp['Item3'] = wp['Item1'] / wp['Item2']
```

Transposing¶

```In [133]: wp.transpose(2, 0, 1)
Out[133]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 5 (minor_axis)
Items axis: A to D
Major_axis axis: Item1 to Item3
Minor_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
```

Indexing / Selection¶

```In [134]: wp['Item1']
Out[134]:
A         B         C         D
2000-01-01 -0.729430  0.427693 -0.121325 -0.736418
2000-01-02  0.739037 -0.648805 -0.383057  0.385027
2000-01-03  2.321064 -1.290881  0.105458 -1.097035
2000-01-04  0.158759 -1.261191 -0.081710  1.390506
2000-01-05 -1.962031 -0.505580  0.021253 -0.317071

In [135]: wp.major_xs(wp.major_axis[2])
Out[135]:
Item1     Item2     Item3
A  2.321064 -0.538606 -4.309389
B -1.290881  0.791512 -1.630905
C  0.105458 -0.020302 -5.194337
D -1.097035  0.184430 -5.948253

In [136]: wp.minor_axis
Out[136]: Index([u'A', u'B', u'C', u'D'], dtype='object')

In [137]: wp.minor_xs('C')
Out[137]:
Item1     Item2     Item3
2000-01-01 -0.121325  1.413524 -0.085832
2000-01-02 -0.383057  1.243178 -0.308127
2000-01-03  0.105458 -0.020302 -5.194337
2000-01-04 -0.081710 -1.811565  0.045105
2000-01-05  0.021253 -1.040542 -0.020425
```

Squeezing¶

```In [138]: wp.reindex(items=['Item1']).squeeze()
Out[138]:
A         B         C         D
2000-01-01 -0.729430  0.427693 -0.121325 -0.736418
2000-01-02  0.739037 -0.648805 -0.383057  0.385027
2000-01-03  2.321064 -1.290881  0.105458 -1.097035
2000-01-04  0.158759 -1.261191 -0.081710  1.390506
2000-01-05 -1.962031 -0.505580  0.021253 -0.317071

In [139]: wp.reindex(items=['Item1'], minor=['B']).squeeze()
Out[139]:
2000-01-01    0.427693
2000-01-02   -0.648805
2000-01-03   -1.290881
2000-01-04   -1.261191
2000-01-05   -0.505580
Freq: D, Name: B, dtype: float64
```

Conversion to DataFrame¶

```In [140]: panel = pd.Panel(np.random.randn(3, 5, 4), items=['one', 'two', 'three'],
.....:                  major_axis=pd.date_range('1/1/2000', periods=5),
.....:                  minor_axis=['a', 'b', 'c', 'd'])
.....:

In [141]: panel.to_frame()
Out[141]:
one       two     three
major      minor
2000-01-01 a     -1.876826 -0.383171 -0.117339
b     -1.873827 -0.172217  0.780048
c     -0.251457 -1.674685  2.162047
d      0.027599  0.762474  0.874233
2000-01-02 a      1.235291  0.481666 -0.764147
b      0.850574  1.217546 -0.484495
c     -1.140302  0.577103  0.298570
d      2.149143 -0.076021  0.825136
2000-01-03 a      0.504452  0.720235 -0.388020
b      0.678026  0.202660 -0.339279
c     -0.628443 -0.314950  0.141164
d      1.191156 -0.410852  0.565930
2000-01-04 a     -1.145363  0.542758 -1.749969
b     -0.523153  1.955407 -1.402941
c     -1.299878 -0.940645  0.623222
d     -0.110240  0.076257  0.020129
2000-01-05 a     -0.333712 -0.897159 -2.858463
b      0.416876 -1.265679  0.885765
c     -0.436400 -0.528311  0.158014
d      0.999768 -0.660014 -1.981797
```
