Sparse data structures¶
Note
The SparsePanel class has been removed in 0.19.0.
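We have implemented "sparse" versions of Series and DataFrame. These are not sparse in the typical "mostly 0" sense. Rather, you can view these objects as being "compressed": any data matching a specific value (NaN / missing value by default, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been "sparsified". This will make much more sense with an example. All of the standard pandas data structures have a to_sparse method: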
In [1]: ts = pd.Series(np.random.randn(10))
In [2]: ts[2:-2] = np.nan
In [3]: sts = ts.to_sparse()
In [4]: sts
Out[4]:
0 0.469112
1 -0.282863
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.861849
9 -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
The to_sparse method takes a kind argument (for the sparse index, see below) and a fill_value. So if we had a mostly zero Series, we could convert it to sparse with fill_value=0:
In [5]: ts.fillna(0).to_sparse(fill_value=0)
Out[5]:
0 0.469112
1 -0.282863
2 0.000000
3 0.000000
4 0.000000
5 0.000000
6 0.000000
7 0.000000
8 -0.861849
9 -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
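To make the idea of "omitting values that match fill_value" concrete, here is a minimal pure-Python sketch. The helpers `is_fill` and `sparsify` are hypothetical names introduced only for illustration; this is not how pandas implements to_sparse, just a picture of what gets kept.

```python
import math


def is_fill(value, fill_value):
    """Return True when value matches fill_value (NaN is treated as matching NaN)."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        return isinstance(value, float) and math.isnan(value)
    return value == fill_value


def sparsify(values, fill_value=float("nan")):
    """Keep only the entries that differ from fill_value, plus their positions."""
    positions = [i for i, v in enumerate(values) if not is_fill(v, fill_value)]
    kept = [values[i] for i in positions]
    return positions, kept


nan = float("nan")
with_nan = [0.5, -0.3, nan, nan, nan, nan, nan, nan, -0.9, -2.1]
# Default fill value is NaN: only the four real numbers are stored.
print(sparsify(with_nan))

# Same data with NaN replaced by 0, compressed against fill_value=0.
with_zero = [0.0 if math.isnan(v) else v for v in with_nan]
print(sparsify(with_zero, fill_value=0.0))
```

In both cases the same four positions survive, which mirrors the identical block locations shown in the two outputs above.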
Sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:
In [6]: df = pd.DataFrame(np.random.randn(10000, 4))
In [7]: df.loc[:9998] = np.nan
In [8]: sdf = df.to_sparse()
In [9]: sdf
Out[9]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
... ... ... ... ...
9993 NaN NaN NaN NaN
9994 NaN NaN NaN NaN
9995 NaN NaN NaN NaN
9996 NaN NaN NaN NaN
9997 NaN NaN NaN NaN
9998 NaN NaN NaN NaN
9999 0.280249 -1.648493 1.490865 -0.890819
[10000 rows x 4 columns]
In [10]: sdf.density
Out[10]: 0.0001
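As you can see, the density (the fraction of values that have not been "compressed") is extremely low. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter. Functionally, sparse objects should behave nearly identically to their dense counterparts.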
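The density value can be reproduced by hand. The sketch below assumes the frame described above (10000 rows, 4 columns, only the last row populated); the function name `density` and the 8-bytes-per-float64 estimate are illustrative assumptions, and the byte counts ignore any index overhead.

```python
def density(n_stored, n_total):
    """Fraction of values that are actually stored (not equal to the fill value)."""
    return n_stored / n_total


n_rows, n_cols = 10000, 4
n_total = n_rows * n_cols   # 40000 cells in the dense frame
n_stored = 1 * n_cols       # only the last row holds non-NaN values

print(density(n_stored, n_total))  # 0.0001

# Rough size of the values alone, assuming 8 bytes per float64.
dense_bytes = n_total * 8
stored_bytes = n_stored * 8
print(dense_bytes, stored_bytes)
```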
Any sparse object can be converted back to the standard dense form by calling to_dense:
In [11]: sts.to_dense()
Out[11]:
0 0.469112
1 -0.282863
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.861849
9 -2.104569
dtype: float64
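Conceptually, converting back to dense only needs the total length, the stored positions and values, and the fill value. The following is a small sketch of that reconstruction; `densify` is a hypothetical helper, not the pandas implementation.

```python
def densify(length, positions, kept, fill_value):
    """Rebuild the dense list: start from fill_value everywhere, then restore stored values."""
    dense = [fill_value] * length
    for pos, value in zip(positions, kept):
        dense[pos] = value
    return dense


# Four stored values in a length-10 container, everything else is the fill value.
rebuilt = densify(10, [0, 1, 8, 9], [0.5, -0.3, -0.9, -2.1], fill_value=0.0)
print(rebuilt)
```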
SparseArray¶
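SparseArray is the base layer for all of the sparse indexed data structures. It is a 1-dimensional ndarray-like object storing only the values that differ from the fill_value: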
In [12]: arr = np.random.randn(10)
In [13]: arr[2:5] = np.nan; arr[7:8] = np.nan
In [14]: sparr = pd.SparseArray(arr)
In [15]: sparr
Out[15]:
[-1.95566352972, -1.6588664276, nan, nan, nan, 1.15893288864, 0.145297113733, nan, 0.606027190513, 1.33421134013]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
Like the indexed objects (SparseSeries, SparseDataFrame), a SparseArray can be converted back to a regular ndarray by calling to_dense:
In [16]: sparr.to_dense()
Out[16]:
array([-1.9557, -1.6589, nan, nan, nan, 1.1589, 0.1453,
nan, 0.606 , 1.3342])
SparseIndex objects¶
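Two kinds of SparseIndex are implemented, block and integer. We recommend using block, as it is more memory efficient. The integer format keeps an array of all of the locations where the data is not equal to the fill value. The block format tracks only the locations and sizes of contiguous blocks of data.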
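The difference between the two formats is easy to see on a boolean mask that marks which positions hold stored data. The functions below are an illustrative sketch (the names `integer_index` and `block_index` are made up here); they only show what information each format records.

```python
def integer_index(mask):
    """Integer format: one entry per stored position."""
    return [i for i, keep in enumerate(mask) if keep]


def block_index(mask):
    """Block format: start location and length of each contiguous run of stored positions."""
    locations, lengths = [], []
    run_start = None
    for i, keep in enumerate(mask):
        if keep and run_start is None:
            run_start = i                      # a new block begins
        elif not keep and run_start is not None:
            locations.append(run_start)        # the current block ends
            lengths.append(i - run_start)
            run_start = None
    if run_start is not None:                  # block that runs to the end
        locations.append(run_start)
        lengths.append(len(mask) - run_start)
    return locations, lengths


# Positions 0-1 and 8-9 hold data, as in the first Series example.
mask = [True, True] + [False] * 6 + [True, True]
print(integer_index(mask))  # [0, 1, 8, 9]
print(block_index(mask))    # ([0, 8], [2, 2])
```

When the stored values cluster into a few long runs, the block format needs two numbers per run, while the integer format needs one number per stored value.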
Sparse Dtypes¶
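Sparse data should have the same dtype as its dense representation. Currently, the float64, int64 and bool dtypes are supported. The default fill_value depends on the original dtype:
float64: np.nan
int64: 0
bool: False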
In [17]: s = pd.Series([1, np.nan, np.nan])
In [18]: s
Out[18]:
0 1.0
1 NaN
2 NaN
dtype: float64
In [19]: s.to_sparse()
Out[19]:
0 1.0
1 NaN
2 NaN
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
In [20]: s = pd.Series([1, 0, 0])
In [21]: s
Out[21]:
0 1
1 0
2 0
dtype: int64
In [22]: s.to_sparse()
Out[22]:
0 1
1 0
2 0
dtype: int64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
In [23]: s = pd.Series([True, False, True])
In [24]: s
Out[24]:
0 True
1 False
2 True
dtype: bool
In [25]: s.to_sparse()
Out[25]:
0 True
1 False
2 True
dtype: bool
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 1], dtype=int32)
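The dtype-dependent defaults listed above can be summarized as a small lookup. This is only a sketch: `DEFAULT_FILL` and `default_fill_value` are hypothetical names, and the table covers just the three dtypes the text says are supported.

```python
import math

# Default fill value per supported dtype name, as described in the text above.
DEFAULT_FILL = {
    "float64": float("nan"),
    "int64": 0,
    "bool": False,
}


def default_fill_value(dtype_name):
    """Look up the default fill value; unsupported dtypes are rejected."""
    try:
        return DEFAULT_FILL[dtype_name]
    except KeyError:
        raise TypeError("unsupported sparse dtype: %s" % dtype_name)


for name in ("float64", "int64", "bool"):
    print(name, default_fill_value(name))
```

This explains why the boolean example stores both True entries: False is the fill value, so only positions 0 and 2 are kept.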
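You can change the dtype using .astype(); the result is also sparse. Note that .astype() also converts the fill_value, so that the dense representation stays consistent.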
In [26]: s = pd.Series([1, 0, 0, 0, 0])
In [27]: s
Out[27]:
0 1
1 0
2 0
3 0
4 0
dtype: int64
In [28]: ss = s.to_sparse()
In [29]: ss
Out[29]:
0 1
1 0
2 0
3 0
4 0
dtype: int64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
In [30]: ss.astype(np.float64)
Out[30]:
0 1.0
1 0.0
2 0.0
3 0.0
4 0.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
If any value cannot be coerced to the specified dtype, an error is raised.
In [1]: ss = pd.Series([1, np.nan, np.nan]).to_sparse()
0 1.0
1 NaN
2 NaN
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
In [2]: ss.astype(np.int64)
ValueError: unable to coerce current fill_value nan to int64 dtype
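Both behaviors of .astype() described above (the fill value is cast along with the stored values, and a fill value that cannot be cast causes an error) can be sketched with plain Python casts. `sparse_astype` is a hypothetical helper and its error message is invented for this sketch; it is not the pandas code path.

```python
def sparse_astype(kept, fill_value, caster):
    """Cast the stored values and the fill value with the same caster (e.g. int or float)."""
    try:
        new_fill = caster(fill_value)
    except (ValueError, OverflowError) as exc:
        # int(float("nan")) raises ValueError, int(float("inf")) raises OverflowError.
        raise ValueError("cannot cast fill_value %r" % (fill_value,)) from exc
    return [caster(v) for v in kept], new_fill


# int -> float: the stored 1 becomes 1.0 and the fill value 0 becomes 0.0.
print(sparse_astype([1], 0, float))  # ([1.0], 0.0)

# float with NaN fill -> int: NaN has no integer equivalent, so the cast fails.
try:
    sparse_astype([1.0], float("nan"), int)
except ValueError as err:
    print("failed:", err)
```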
Sparse Calculation¶
You can apply NumPy ufuncs to a SparseArray and get a SparseArray as a result.
In [31]: arr = pd.SparseArray([1., np.nan, np.nan, -2., np.nan])
In [32]: np.abs(arr)
Out[32]:
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
The ufunc is also applied to the fill_value. This is needed to get the correct dense result.
In [33]: arr = pd.SparseArray([1., -1, -1, -2., -1], fill_value=-1)
In [34]: np.abs(arr)
Out[34]:
[1.0, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([0, 3], dtype=int32)
In [35]: np.abs(arr).to_dense()
Out[35]: array([ 1., 1., 1., 2., 1.])
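Why the function must also be applied to the fill value can be checked with a small sketch: applying an elementwise function to the stored values and to the fill value, then densifying, gives the same result as applying it to the dense data. The helper names here are hypothetical, and the reasoning only holds for elementwise functions.

```python
def apply_elementwise(func, kept, fill_value):
    """Apply func to every stored value and to the fill value."""
    return [func(v) for v in kept], func(fill_value)


def densify(length, positions, kept, fill_value):
    """Rebuild the dense list from stored positions, stored values and the fill value."""
    dense = [fill_value] * length
    for pos, value in zip(positions, kept):
        dense[pos] = value
    return dense


# Same data as the example above: [1., -1, -1, -2., -1] with fill_value=-1.
positions, kept, fill = [0, 3], [1.0, -2.0], -1

new_kept, new_fill = apply_elementwise(abs, kept, fill)
sparse_then_dense = densify(5, positions, new_kept, new_fill)

dense_then_abs = [abs(v) for v in densify(5, positions, kept, fill)]

print(new_fill)            # 1
print(sparse_then_dense)   # [1.0, 1, 1, 2.0, 1]
```

If the fill value were left at -1, the densified result would contain -1 in the unstored positions, which is not what np.abs on the dense array returns.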
Interaction with scipy.sparse¶
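An experimental API is available for converting between sparse pandas objects and scipy.sparse structures.
A SparseSeries.to_coo() method is implemented for transforming a SparseSeries indexed by a MultiIndex to a scipy.sparse.coo_matrix.
The method requires a MultiIndex with two or more levels.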
In [36]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
In [37]: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
....: (1, 2, 'a', 1),
....: (1, 1, 'b', 0),
....: (1, 1, 'b', 1),
....: (2, 1, 'b', 0),
....: (2, 1, 'b', 1)],
....: names=['A', 'B', 'C', 'D'])
....:
In [38]: s
Out[38]:
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: float64
# SparseSeries
In [39]: ss = s.to_sparse()
In [40]: ss
Out[40]:
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)
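In the example below, we transform the SparseSeries to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define the labels for the rows, and the third and fourth levels define the labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.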
In [41]: A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
....: column_levels=['C', 'D'],
....: sort_labels=True)
....:
In [42]: A
Out[42]:
<3x4 sparse matrix of type '<type 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [43]: A.todense()
Out[43]:
matrix([[ 0., 0., 1., 3.],
[ 3., 0., 0., 0.],
[ 0., 0., 0., 0.]])
In [44]: rows
Out[44]: [(1, 1), (1, 2), (2, 1)]
In [45]: columns
Out[45]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
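The mapping from MultiIndex levels to matrix coordinates can be sketched in pure Python. The function `to_coo_sketch` below is hypothetical: it takes 0-based level positions instead of level names, always returns labels as tuples, and returns the raw (data, (i, j)) triplet rather than a scipy matrix. It reproduces the rows, columns and cell positions shown above, but it is not the pandas implementation.

```python
import math


def to_coo_sketch(index, values, row_levels, column_levels, sort_labels=False):
    """Split each index tuple into a row label and a column label, then emit COO triplets."""
    def pick(key, levels):
        return tuple(key[i] for i in levels)

    # Collect the distinct row and column labels in order of first appearance.
    rows, columns = [], []
    for key in index:
        r, c = pick(key, row_levels), pick(key, column_levels)
        if r not in rows:
            rows.append(r)
        if c not in columns:
            columns.append(c)
    if sort_labels:
        rows.sort()
        columns.sort()

    # Only non-NaN values become stored entries of the matrix.
    data, i, j = [], [], []
    for key, value in zip(index, values):
        if math.isnan(value):
            continue
        data.append(value)
        i.append(rows.index(pick(key, row_levels)))
        j.append(columns.index(pick(key, column_levels)))
    return (data, (i, j)), rows, columns


nan = float("nan")
index = [(1, 2, 'a', 0), (1, 2, 'a', 1), (1, 1, 'b', 0),
         (1, 1, 'b', 1), (2, 1, 'b', 0), (2, 1, 'b', 1)]
values = [3.0, nan, 1.0, 3.0, nan, nan]

coo, rows, columns = to_coo_sketch(index, values, row_levels=(0, 1),
                                   column_levels=(2, 3), sort_labels=True)
print(coo)      # ([3.0, 1.0, 3.0], ([1, 0, 0], [0, 2, 3]))
print(rows)     # [(1, 1), (1, 2), (2, 1)]
print(columns)  # [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
```

The triplet has the same shape as the arguments accepted by scipy.sparse.coo_matrix((data, (i, j)), shape=...), so the 3x4 matrix above follows directly from it.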
Specifying different row and column labels (and not sorting them) yields a different sparse matrix:
In [46]: A, rows, columns = ss.to_coo(row_levels=['A', 'B', 'C'],
....: column_levels=['D'],
....: sort_labels=False)
....:
In [47]: A
Out[47]:
<3x2 sparse matrix of type '<type 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [48]: A.todense()
Out[48]:
matrix([[ 3., 0.],
[ 1., 3.],
[ 0., 0.]])
In [49]: rows
Out[49]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]
In [50]: columns
Out[50]: [0, 1]
A convenience method SparseSeries.from_coo() is implemented for creating a SparseSeries from a scipy.sparse.coo_matrix.
In [51]: from scipy import sparse
In [52]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
....: shape=(3, 4))
....:
In [53]: A
Out[53]:
<3x4 sparse matrix of type '<type 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [54]: A.todense()
Out[54]:
matrix([[ 0., 0., 1., 2.],
[ 3., 0., 0., 0.],
[ 0., 0., 0., 0.]])
The default behaviour (with dense_index=False) simply returns a SparseSeries containing only the non-null entries.
In [55]: ss = pd.SparseSeries.from_coo(A)
In [56]: ss
Out[56]:
0 2 1.0
3 2.0
1 0 3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)
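Specifying dense_index=True will result in an index that is the Cartesian product of the row and column coordinates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.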
In [57]: ss_dense = pd.SparseSeries.from_coo(A, dense_index=True)
In [58]: ss_dense
Out[58]:
0 0 NaN
1 NaN
2 1.0
3 2.0
1 0 3.0
1 NaN
2 NaN
3 NaN
2 0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
BlockIndex
Block locations: array([2], dtype=int32)
Block lengths: array([3], dtype=int32)
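The effect of dense_index on the size of the index can be sketched without pandas or scipy. `from_coo_sketch` is a hypothetical stand-in that returns a list of ((row, col), value) pairs; sorting the sparse keys is an assumption made here to match the order displayed above.

```python
from itertools import product


def from_coo_sketch(shape, data, row, col, dense_index=False):
    """Return ((row, col), value) pairs, over stored entries only or over every cell."""
    entries = dict(zip(zip(row, col), data))
    if dense_index:
        # Cartesian product of all row and column coordinates.
        keys = list(product(range(shape[0]), range(shape[1])))
    else:
        # Only the coordinates that actually hold a value.
        keys = sorted(entries)
    return [(key, entries.get(key, float("nan"))) for key in keys]


shape = (3, 4)
data, row, col = [3.0, 1.0, 2.0], [1, 0, 0], [0, 2, 3]

compact = from_coo_sketch(shape, data, row, col)
full = from_coo_sketch(shape, data, row, col, dense_index=True)

print(compact)     # [((0, 2), 1.0), ((0, 3), 2.0), ((1, 0), 3.0)]
print(len(full))   # 12
```

The compact index grows with the number of stored entries, while the dense index always has rows times columns entries, which is where the extra memory goes.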