A proposal for adding groupby functionality to NumPy

Author: Travis Oliphant
Contact: oliphant @ enthought com
Date: 2010-04-27

Executive summary

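NumPy provides tools for handling data and doing calculations in much
the same way as relational algebra allows.  However, the common group-by
functionality is not easily handled.  The reduce methods of NumPy's
ufuncs are a natural place to put this groupby behavior.  This NEP
describes two additional methods for ufuncs (reduceby and reducein) and
two additional functions (segment and edges) which can help add this
functionality.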

Example Use Case

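Suppose you have a NumPy structured array containing information about
the number of purchases at several stores over multiple days.  To be
clear, the structured array data-type is: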

('store',i4),('SKU','S6'),('number',i4)]

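Suppose there is a 1-d NumPy array of this data-type and you would like
to compute various statistics (max, min, mean, sum, etc.) on the number
of products sold, by product, by month, by store, etc.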

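Currently, this can be done by using reduce methods on the number
field of the array, coupled with in-place sorting, unique with
return_inverse=True, bincount, etc.  However, for such a common
data-analysis need, it would be nice to have standard and more direct
ways to get the results.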
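
As a rough illustration of the current workaround, the sketch below
groups the number field by store using unique with return_inverse=True,
bincount, and reduceat on sorted data.  The sample records and variable
names are invented for illustration, and only the three fields visible
in the data-type above are used.

```python
import numpy as np

# Assumption: a reduced data-type holding only the visible fields.
dt = np.dtype([('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')])
sales = np.array([(1, b'A00001', 3),
                  (2, b'A00001', 5),
                  (1, b'B00002', 2),
                  (2, b'B00002', 7),
                  (1, b'A00001', 4)], dtype=dt)

# Map every record to a dense group id (0, 1, ...) per distinct store.
stores, inverse = np.unique(sales['store'], return_inverse=True)

# Sum and mean per store via bincount.
totals = np.bincount(inverse, weights=sales['number'])
counts = np.bincount(inverse)
means = totals / counts

# Max per store: sort records by group id, then reduce each contiguous run.
order = np.argsort(inverse, kind='stable')
starts = np.flatnonzero(np.r_[True, np.diff(inverse[order]) != 0])
maxima = np.maximum.reduceat(sales['number'][order], starts)
```

Each statistic needs its own combination of helper calls, which is the
indirection the proposed methods aim to remove.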

Ufunc methods proposed

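It is proposed to add two new reduce-style methods to the ufuncs:
reduceby and reducein.  The reducein method is intended to be a
simpler-to-use version of reduceat, while the reduceby method is
intended to provide group-by capability on reductions.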
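
For comparison, here is one way to get reductions over explicit
(start, stop) pairs from the existing reduceat method.  This is only a
sketch of a common idiom: the starts and stops are interleaved and every
other result is discarded.  The final start is written as 6 on the
assumption that reduceat does not accept negative indices.

```python
import numpy as np

a = np.array([0, 1, 2, 4, 5, 6, 9, 10])

# reduceat reduces a[idx[i]:idx[i+1]] for each consecutive pair of
# indices (and a[idx[-1]:] for the last one), so the results at odd
# positions are unwanted by-products of the interleaving.
idx = [0, 3, 2, 5, 6]
all_results = np.add.reduceat(a, idx)

# Keep only the reductions over a[0:3], a[2:5] and a[6:].
paired = all_results[::2]
```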

reducein:

<ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)

Perform a local reduce with slices specified by pairs of indices.

The reduction occurs along the provided axis, using the provided
data-type to calculate intermediate results, storing the result into
the array out (if provided).

The indices array provides the start and end indices for the
reduction.  If the length of the indices array is odd, then the
final index provides the beginning point for the final reduction
and the ending point is the end of arr.

This generalizes along the given axis, the behavior:

[<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
        for i in range(len(indices)//2)]

This assumes indices is of even length.

Example:
   >>> a = [0,1,2,4,5,6,9,10]
   >>> add.reducein(a,[0,3,2,5,-2])
   [3, 11, 19]

   Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19
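
The semantics described above can be approximated in pure Python.  The
helper below, reducein_sketch, is a hypothetical name and not part of
NumPy; it loops over the index pairs rather than implementing the
reduction inside the ufunc machinery, and it omits the dtype and out
arguments.

```python
import numpy as np

def reducein_sketch(ufunc, arr, indices, axis=0):
    """Reduce arr over slices given by (start, stop) pairs of indices."""
    arr = np.asarray(arr)
    indices = list(indices)
    starts = indices[0::2]
    stops = indices[1::2]
    # An odd number of indices means the last slice runs to the end.
    if len(indices) % 2:
        stops.append(None)
    pieces = []
    for start, stop in zip(starts, stops):
        index = [slice(None)] * arr.ndim
        index[axis] = slice(start, stop)
        pieces.append(ufunc.reduce(arr[tuple(index)], axis=axis))
    # Place the per-slice results along the reduced axis.
    return np.stack(pieces, axis=axis)

a = [0, 1, 2, 4, 5, 6, 9, 10]
result = reducein_sketch(np.add, a, [0, 3, 2, 5, -2])
```

With the inputs from the example above, result holds the same three
sums.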

reduceby:

<ufunc>.reduceby(arr, by, dtype=None, out=None)

Perform a reduction in arr over unique non-negative integers in by.


Let N=arr.ndim and M=by.ndim.  Then, by.shape[:N] == arr.shape.
In addition, let I be an N-length index tuple, then by[I]
contains the location in the output array for the reduction to
be stored.  Notice that if N == M, then by[I] is a non-negative
integer, while if N < M, then by[I] is an array of indices into
the output array.

The reduction is computed on groups specified by unique indices
into the output array. The index is either the single
non-negative integer if N == M or if N < M, the entire
(M-N+1)-length index by[I] considered as a whole.
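
A minimal sketch of the N == M case follows.  The helper name
reduceby_sketch is hypothetical, and two details are assumptions that
the text above does not specify: the output is taken to have length
by.max() + 1, and output slots whose index never occurs in by are filled
with the ufunc's identity.

```python
import numpy as np

def reduceby_sketch(ufunc, arr, by):
    """Group-by reduction for the case where by.shape == arr.shape."""
    arr = np.asarray(arr)
    by = np.asarray(by)
    if by.shape != arr.shape:
        raise ValueError("this sketch only handles by.shape == arr.shape")
    if arr.size == 0:
        raise ValueError("this sketch requires a non-empty input")
    flat_arr = arr.ravel()
    flat_by = by.ravel()
    labels = np.unique(flat_by)
    if labels[0] < 0:
        raise ValueError("by must contain non-negative integers")
    # Assumption: the output has one slot per integer in 0..by.max().
    n_out = int(labels[-1]) + 1
    if ufunc.identity is None and labels.size != n_out:
        raise ValueError("ufunc has no identity for unused output slots")
    fill = 0 if ufunc.identity is None else ufunc.identity
    out = np.full(n_out, fill, dtype=arr.dtype)
    # Reduce all elements that share the same output location.
    for label in labels:
        out[label] = ufunc.reduce(flat_arr[flat_by == label])
    return out

sums = reduceby_sketch(np.add, [1, 2, 3, 4, 5], [0, 1, 0, 1, 2])
maxima = reduceby_sketch(np.maximum, [1, 2, 3, 4, 5], [0, 1, 0, 1, 2])
```

Here elements 1 and 3 share output location 0, elements 2 and 4 share
location 1, and element 5 is alone in location 2.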

Functions proposed

segment:

edges:

.. Local Variables:
.. mode: rst
.. coding: utf-8
.. fill-column: 72
.. End: