Enhancing Performance¶

Cython (Writing C extensions for pandas)¶

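For many use cases, writing pandas in pure python and numpy is sufficient. In some computationally heavy applications, however, sizeable speed-ups can be achieved by offloading work to cython.

This tutorial assumes you have already refactored as much as possible in python, for example by removing for loops and using numpy vectorization. It is always worth optimising in python first.

This tutorial walks through a "typical" process of cythonizing a slow computation. We use an example from the cython documentation, but in the context of pandas. Our final cythonized solution is around 100 times faster than the pure python version.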

Pure python¶

We have a DataFrame to which we want to apply a function row-wise.

In [1]: df = pd.DataFrame({'a': np.random.randn(1000),
   ...:                    'b': np.random.randn(1000),
   ...:                    'N': np.random.randint(100, 1000, (1000)),
   ...:                    'x': 'x'})
   ...: 

In [2]: df
Out[2]: 
       N         a         b  x
0    585  0.469112 -0.218470  x
1    841 -0.282863 -0.061645  x
2    251 -1.509059 -0.723780  x
3    972 -1.135632  0.551225  x
4    181  1.212112 -0.497767  x
5    458 -0.173215  0.837519  x
6    159  0.119209  1.103245  x
..   ...       ...       ... ..
993  190  0.131892  0.290162  x
994  931  0.342097  0.215341  x
995  374 -1.512743  0.874737  x
996  246  0.933753  1.120790  x
997  157 -0.308013  0.198768  x
998  977 -0.079915  1.757555  x
999  770 -1.010589 -1.115680  x

[1000 rows x 4 columns]

Here's the function in pure python:

In [3]: def f(x):
   ...:     return x * (x - 1)
   ...: 

In [4]: def integrate_f(a, b, N):
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f(a + i * dx)
   ...:     return s * dx
   ...: 
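
As a quick sanity check, integrate_f can be compared against the closed-form integral of x * (x - 1). The sketch below restates f and integrate_f so that it runs on its own under Python 3; the helper exact_integral and the chosen interval are illustrative assumptions, not part of the original tutorial.

```python
def f(x):
    return x * (x - 1)


def integrate_f(a, b, N):
    # Left Riemann sum of f over [a, b] using N equal-width slices
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def exact_integral(a, b):
    # Hypothetical helper: antiderivative of x**2 - x is x**3 / 3 - x**2 / 2
    def F(x):
        return x ** 3 / 3 - x ** 2 / 2
    return F(b) - F(a)


approx = integrate_f(0.0, 2.0, 100000)
exact = exact_integral(0.0, 2.0)
error = abs(approx - exact)
print(approx, exact, error)
```

The error shrinks as N grows, which is why the tutorial's N column directly controls how much work each row costs.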

We achieve our result by using apply (row-wise):

In [7]: %timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 174 ms per loop

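But clearly this isn't fast enough for us. Let's take a look and see where the time is spent during this operation (limited to the four most time-consuming calls) using the prun ipython magic function: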

In [5]: %prun -l 4 df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
         671915 function calls (666906 primitive calls) in 0.379 seconds

   Ordered by: internal time
   List reduced from 128 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.193    0.000    0.290    0.000 <ipython-input-4-91e33489f136>:1(integrate_f)
   552423    0.089    0.000    0.089    0.000 <ipython-input-3-bc41a25943f6>:1(f)
     3000    0.011    0.000    0.060    0.000 base.py:2146(get_value)
     1000    0.008    0.000    0.008    0.000 {range}

By far the majority of time is spent inside either integrate_f or f, hence we'll concentrate our efforts on cythonizing these two functions.
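
Outside IPython, a similar report can be produced with the standard library's cProfile and pstats modules. This is a minimal sketch that profiles integrate_f on its own rather than through df.apply; the loop count and the arguments are arbitrary assumptions.

```python
import cProfile
import io
import pstats


def f(x):
    return x * (x - 1)


def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


# Collect profiling data only around the calls of interest
profiler = cProfile.Profile()
profiler.enable()
for _ in range(200):
    integrate_f(0.0, 1.0, 500)
profiler.disable()

# Sort by internal time and keep the top four entries, like %prun -l 4
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats('tottime').print_stats(4)
report = buffer.getvalue()
print(report)
```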

Note

In python 2, replacing range with its lazy counterpart (xrange) would make the range line vanish from the profile. In python 3, range is already lazy.

Plain cython¶

First we’re going to need to import the cython magic function to ipython (for cython versions < 0.21 you can use %load_ext cythonmagic):

In [6]: %load_ext Cython

Now, let's simply copy our functions over to cython as is (the suffix is here to distinguish between function versions):

In [7]: %%cython
   ...: def f_plain(x):
   ...:     return x * (x - 1)
   ...: def integrate_f_plain(a, b, N):
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f_plain(a + i * dx)
   ...:     return s * dx
   ...: 

Note

If you're having trouble pasting the above into your ipython, you may need to use a bleeding-edge ipython for paste to play well with cell magics.

In [4]: %timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 85.5 ms per loop

This has already cut the run time roughly in half (174 ms to 85.5 ms), not too bad for a simple copy and paste.

Adding type¶

We get another huge improvement simply by providing type information:

In [8]: %%cython
   ...: cdef double f_typed(double x) except? -2:
   ...:     return x * (x - 1)
   ...: cpdef double integrate_f_typed(double a, double b, int N):
   ...:     cdef int i
   ...:     cdef double s, dx
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f_typed(a + i * dx)
   ...:     return s * dx
   ...: 

In [4]: %timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 20.3 ms per loop

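Now we're talking! It's now roughly nine times faster than the original python implementation (174 ms versus 20.3 ms), and we haven't really modified the code. Let's have another look at what's eating up time: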

In [9]: %prun -l 4 df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
         118490 function calls (113481 primitive calls) in 0.093 seconds

   Ordered by: internal time
   List reduced from 124 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3000    0.011    0.000    0.064    0.000 base.py:2146(get_value)
     3000    0.006    0.000    0.072    0.000 series.py:600(__getitem__)
     3000    0.005    0.000    0.014    0.000 base.py:1131(_convert_scalar_indexer)
     9024    0.005    0.000    0.012    0.000 {getattr}

Using ndarray¶

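It's calling Series... a lot! It creates a Series from each row and calls get on both the index and the series (three times for each row). Function calls are expensive in python, so maybe we can minimise these by cythonizing the apply part.

Note

We are now passing ndarrays into the cython function; fortunately cython plays very nicely with numpy.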

In [10]: %%cython
   ....: cimport numpy as np
   ....: import numpy as np
   ....: cdef double f_typed(double x) except? -2:
   ....:     return x * (x - 1)
   ....: cpdef double integrate_f_typed(double a, double b, int N):
   ....:     cdef int i
   ....:     cdef double s, dx
   ....:     s = 0
   ....:     dx = (b - a) / N
   ....:     for i in range(N):
   ....:         s += f_typed(a + i * dx)
   ....:     return s * dx
   ....: cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b, np.ndarray col_N):
   ....:     assert (col_a.dtype == np.float64 and col_b.dtype == np.float64 and col_N.dtype == np.int_)
   ....:     cdef Py_ssize_t i, n = len(col_N)
   ....:     assert (len(col_a) == len(col_b) == n)
   ....:     cdef np.ndarray[double] res = np.empty(n)
   ....:     for i in range(len(col_a)):
   ....:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
   ....:     return res
   ....: 

The implementation is simple: it allocates an empty result array, loops over the rows applying integrate_f_typed, and stores each result in that array.
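
The same preallocate-and-fill structure can be sketched in plain Python to confirm that it gives the same answers as the row-wise apply. This version gains none of the Cython speed-up; the small random frame and the int() cast in the apply call are assumptions made so the snippet runs on a numeric-only frame.

```python
import numpy as np
import pandas as pd


def f(x):
    return x * (x - 1)


def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def apply_integrate_f(col_a, col_b, col_N):
    # Same shape as the Cython function: preallocate, then fill by index
    n = len(col_N)
    assert len(col_a) == len(col_b) == n
    res = np.empty(n)
    for i in range(n):
        res[i] = integrate_f(col_a[i], col_b[i], col_N[i])
    return res


rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.standard_normal(50),
                   'b': rng.standard_normal(50),
                   'N': rng.integers(100, 200, size=50)})

# A numeric-only row is upcast to float, so N is cast back to int here
via_apply = df.apply(lambda row: integrate_f(row['a'], row['b'], int(row['N'])), axis=1)
via_arrays = apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
print(np.allclose(via_apply.values, via_arrays))
```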

Warning

In 0.13.0, since Series has internally been refactored to no longer sub-class ndarray but instead subclass NDFrame, you cannot pass a Series directly as an ndarray typed parameter to a cython function. Instead, pass the actual ndarray using the .values attribute of the Series.

Prior to 0.13.0

apply_integrate_f(df['a'], df['b'], df['N'])

Use .values to get the underlying ndarray

apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
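
A short check of the point made in the warning, assuming pandas 0.13.0 or later: a Series is not an ndarray instance, while its .values attribute is.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
arr = s.values

# Series no longer subclasses ndarray, but .values exposes the underlying array
series_is_ndarray = isinstance(s, np.ndarray)
values_is_ndarray = isinstance(arr, np.ndarray)
print(series_is_ndarray, values_is_ndarray, arr.dtype)
```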

Note

Loops like this would be extremely slow in python, but in Cython looping over numpy arrays is fast.

In [4]: %timeit apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
1000 loops, best of 3: 1.25 ms per loop

We've gotten another big improvement. Let's check again where the time is spent:

In [11]: %prun -l 4 apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
         208 function calls in 0.002 seconds

   Ordered by: internal time
   List reduced from 53 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    0.002    0.002 {_cython_magic_40485b2751cb6bc085f3a7be0856f402.apply_integrate_f}
        3    0.000    0.000    0.000    0.000 internals.py:4031(__init__)
        9    0.000    0.000    0.000    0.000 generic.py:2746(__setattr__)
        3    0.000    0.000    0.000    0.000 internals.py:3565(iget)

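As one might expect, the majority of the time is now spent in apply_integrate_f, so if we want to make any further gains we must continue to concentrate our efforts here.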

More advanced techniques¶

There is still hope for improvement. Here's an example of using some more advanced cython techniques:

In [12]: %%cython
   ....: cimport cython
   ....: cimport numpy as np
   ....: import numpy as np
   ....: cdef double f_typed(double x) except? -2:
   ....:     return x * (x - 1)
   ....: cpdef double integrate_f_typed(double a, double b, int N):
   ....:     cdef int i
   ....:     cdef double s, dx
   ....:     s = 0
   ....:     dx = (b - a) / N
   ....:     for i in range(N):
   ....:         s += f_typed(a + i * dx)
   ....:     return s * dx
   ....: @cython.boundscheck(False)
   ....: @cython.wraparound(False)
   ....: cpdef np.ndarray[double] apply_integrate_f_wrap(np.ndarray[double] col_a, np.ndarray[double] col_b, np.ndarray[int] col_N):
   ....:     cdef int i, n = len(col_N)
   ....:     assert len(col_a) == len(col_b) == n
   ....:     cdef np.ndarray[double] res = np.empty(n)
   ....:     for i in range(n):
   ....:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
   ....:     return res
   ....: 

In [4]: %timeit apply_integrate_f_wrap(df['a'].values, df['b'].values, df['N'].values)
1000 loops, best of 3: 987 us per loop

Even faster, with the caveat that a bug in our cython code (an off-by-one error, for example) might cause a segfault because memory access isn't checked.

Using numba¶

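A recent alternative to statically compiling cython code is to use a dynamic jit-compiler, numba.

Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.

Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.

Note

You will need to install numba. This is easy with conda, by using: conda install numba, see installing using miniconda.

Note

As of numba version 0.20, pandas objects cannot be passed directly to numba-compiled functions. Instead, one must pass the numpy array underlying the pandas object to the numba-compiled function as demonstrated below.

Jit¶

Using numba to just-in-time compile your code. We simply take the plain python code from above and annotate it with the @jit decorator.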

import numba

@numba.jit
def f_plain(x):
    return x * (x - 1)

@numba.jit
def integrate_f_numba(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_plain(a + i * dx)
    return s * dx

@numba.jit
def apply_integrate_f_numba(col_a, col_b, col_N):
    n = len(col_N)
    result = np.empty(n, dtype='float64')
    assert len(col_a) == len(col_b) == n
    for i in range(n):
        result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
    return result

def compute_numba(df):
    result = apply_integrate_f_numba(df['a'].values, df['b'].values, df['N'].values)
    return pd.Series(result, index=df.index, name='result')

Note that we directly pass numpy arrays to the numba function. compute_numba is just a wrapper that provides a nicer interface by passing/returning pandas objects.

In [4]: %timeit compute_numba(df)
1000 loops, best of 3: 798 us per loop

Vectorize¶

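numba can also be used to write vectorized functions that do not require the user to explicitly loop over the observations of a vector; a vectorized function will be applied to each row automatically. Consider the following toy example of doubling each observation: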

import numba

def double_every_value_nonumba(x):
    return x*2

@numba.vectorize
def double_every_value_withnumba(x):
    return x*2


# Custom function without numba
In [5]: %timeit df['col1_doubled'] = df.a.apply(double_every_value_nonumba)
1000 loops, best of 3: 797 us per loop

# Standard implementation (faster than a custom function)
In [6]: %timeit df['col1_doubled'] = df.a*2
1000 loops, best of 3: 233 us per loop

# Custom function with numba
In [7]: %timeit df['col1_doubled'] = double_every_value_withnumba(df.a.values)
1000 loops, best of 3: 145 us per loop

Caveats¶

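Note

numba will execute on any function, but can only accelerate certain classes of functions.

numba is best at accelerating functions that apply numerical functions to numpy arrays. When passed a function that only uses operations it knows how to accelerate, it will execute in nopython mode.

If numba is passed a function that includes something it doesn't know how to work with - a category that currently includes sets, lists, dictionaries, or string functions - it will revert to object mode. In object mode, numba will execute but your code will not speed up significantly. If you would prefer that numba throw an error if it cannot compile a function in a way that speeds up your code, pass numba the argument nopython=True (e.g. @numba.jit(nopython=True)). For more on troubleshooting numba modes, see the numba troubleshooting page.

Read more in the numba docs.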
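
A minimal sketch of the nopython=True option described in the caveats. The function sum_of_squares is a hypothetical example, and the try/except fallback is an assumption added so the snippet still runs where numba is not installed (in that case jit becomes a do-nothing decorator and no speed-up occurs).

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Assumed fallback: a no-op stand-in so the example runs without numba
    def jit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@jit(nopython=True)
def sum_of_squares(values):
    # Only scalar arithmetic on a numpy array, which numba can compile
    total = 0.0
    for i in range(len(values)):
        total += values[i] * values[i]
    return total


result = sum_of_squares(np.arange(5, dtype=np.float64))
print(result)
```

With numba installed, a function body that numba cannot compile would raise an error at the first call here instead of silently falling back to object mode.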

Expression Evaluation via `eval()` (Experimental)¶

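New in version 0.13.

The top-level function pandas.eval() implements expression evaluation of Series and DataFrame objects.

Note

To benefit from using eval() you need to install numexpr. See the recommended dependencies section for more details.

The point of using eval() for expression evaluation rather than plain Python is two-fold: 1) large DataFrame objects are evaluated more efficiently and 2) large arithmetic and boolean expressions are evaluated all at once by the underlying engine (by default numexpr is used for evaluation).

Note

You should not use eval() for simple expressions or for expressions involving small DataFrames. In fact, eval() is many orders of magnitude slower for smaller expressions/objects than plain Python. A good rule of thumb is to only use eval() when you have a DataFrame with more than 10,000 rows.

eval() supports all arithmetic expressions supported by the engine, in addition to some extensions available only in pandas.

Note

The larger the frame and the larger the expression, the more speedup you will see from using eval().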

Supported Syntax¶

These operations are supported by pandas.eval():

Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
Comparison operations, including chained comparisons, e.g., 2 < df < df2
Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool
list and tuple literals, e.g., [1, 2] or (1, 2)
Attribute access, e.g., df.a
Subscript expressions, e.g., df[0]
Simple variable evaluation, e.g., pd.eval('df') (this is not very useful)
Math functions, sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.
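
A few of the supported operations can be tried on a tiny frame. This is an illustrative sketch only: the frame and its values are assumptions, and a frame this small is far below the size at which eval() pays off.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]})

# Arithmetic with attribute access
arith = pd.eval('df.a + 2 * df.b')

# Chained comparison
chained = pd.eval('1 < df.a < 4')

# Boolean operation written with the word "and"
both = pd.eval('df.a > 1 and df.b > 1')

print(arith.tolist(), chained.tolist(), both.tolist())
```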

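This Python syntax is not allowed:

Expressions
- Function calls other than math functions.
- is / is not operations
- if expressions
- lambda expressions
- list / set / dict comprehensions
- Literal dict and set expressions
- yield expressions
- Generator expressions
- Boolean expressions consisting of only scalar values
Statements
- Neither simple nor compound statements are allowed. This includes things like for, while, and if.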
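
The sketch below feeds two of the disallowed forms (an if expression and a lambda expression) to pandas.eval() and records that each one is rejected. The exact exception type is not assumed, so the code catches Exception broadly and stores only the type name.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rejected = {}
for expr in ['df.a if True else df.b', 'lambda: df.a']:
    try:
        pd.eval(expr)
    except Exception as exc:
        # Record which kind of error pandas raised for this expression
        rejected[expr] = type(exc).__name__

print(rejected)
```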

`eval()` Examples¶

pandas.eval() works well with expressions containing large arrays.

First, let's create a few decent-sized arrays to play with:

In [13]: nrows, ncols = 20000, 100

In [14]: df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]

Now let's compare adding them together using plain Python versus eval():

In [15]: %timeit df1 + df2 + df3 + df4
10 loops, best of 3: 24.6 ms per loop

In [16]: %timeit pd.eval('df1 + df2 + df3 + df4')
100 loops, best of 3: 8.36 ms per loop

Now let's do the same thing but with comparisons:

In [17]: %timeit (df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)
10 loops, best of 3: 30.9 ms per loop

In [18]: %timeit pd.eval('(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)')
100 loops, best of 3: 16.4 ms per loop

eval() also works with unaligned pandas objects:

In [19]: s = pd.Series(np.random.randn(50))

In [20]: %timeit df1 + df2 + df3 + df4 + s
10 loops, best of 3: 38.4 ms per loop

In [21]: %timeit pd.eval('df1 + df2 + df3 + df4 + s')
100 loops, best of 3: 9.31 ms per loop

Note

Operations such as

1 and 2  # would parse to 1 & 2, but should evaluate to 2
3 or 4  # would parse to 3 | 4, but should evaluate to 3
~1  # this is okay, but slower when using eval

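should be performed in Python. An exception will be raised if you try to perform any boolean/bitwise operations with scalar operands that are not of type bool or np.bool_. Again, you should perform these kinds of operations in plain Python.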
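
The difference the note describes is visible in plain Python: the logical operators return one of their operands, while the bitwise operators combine the bits. The variable names below are illustrative.

```python
# Logical operators short-circuit and return an operand
logical_and = 1 and 2
logical_or = 3 or 4

# Bitwise operators work on the binary representation
bitwise_and = 1 & 2   # 0b01 & 0b10
bitwise_or = 3 | 4    # 0b011 | 0b100

print(logical_and, bitwise_and, logical_or, bitwise_or)
```

This is why rewriting and/or into &/| is only safe for boolean operands.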

The `DataFrame.eval` method (Experimental)¶

New in version 0.13.

In addition to the top-level pandas.eval() function, you can also evaluate an expression in the "context" of a DataFrame.

In [22]: df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])

In [23]: df.eval('a + b')
Out[23]: 
0   -0.246747
1    0.867786
2   -1.626063
3   -1.134978
4   -1.027798
dtype: float64

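Any expression that is a valid pandas.eval() expression is also a valid DataFrame.eval() expression, with the added benefit that you don't have to prefix the name of the DataFrame to the column(s) you're interested in evaluating.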
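
As a small illustration with fixed values (the frame below is an assumption chosen so the result is easy to read), DataFrame.eval() refers to columns by bare name and matches the ordinary column arithmetic:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# Columns are referenced by name, without the df. prefix
via_eval = df.eval('a + b')
via_python = df['a'] + df['b']

print(via_eval.tolist())
```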

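In addition, you can perform assignment of columns within an expression. This allows for formulaic evaluation. The assignment target can be a new column name or an existing column name, and it must be a valid Python identifier.

New in version 0.18.0.

The inplace keyword determines whether this assignment will be performed on the original DataFrame or return a copy with the new column.

Warning

For backwards compatibility, inplace defaults to True if not specified. This will change in a future version of pandas - if your code depends on an inplace assignment you should update it to explicitly set inplace=True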

In [24]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))

In [25]: df.eval('c = a + b', inplace=True)

In [26]: df.eval('d = a + b + c', inplace=True)

In [27]: df.eval('a = 1', inplace=True)

In [28]: df
Out[28]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26

When inplace is set to False, a copy of the DataFrame with the new or modified columns is returned and the original frame is unchanged.

In [29]: df
Out[29]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26

In [30]: df.eval('e = a - c', inplace=False)
Out[30]: 
   a  b   c   d   e
0  1  5   5  10  -4
1  1  6   7  14  -6
2  1  7   9  18  -8
3  1  8  11  22 -10
4  1  9  13  26 -12

In [31]: df
Out[31]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26

New in version 0.18.0.

As a convenience, multiple assignments can be performed by using a multi-line string.

In [32]: df.eval("""
   ....: c = a + b
   ....: d = a + b + c
   ....: a = 1""", inplace=False)
   ....: 
Out[32]: 
   a  b   c   d
0  1  5   6  12
1  1  6   7  14
2  1  7   8  16
3  1  8   9  18
4  1  9  10  20

The equivalent in standard Python would be

In [33]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))

In [34]: df['c'] = df.a + df.b

In [35]: df['d'] = df.a + df.b + df.c

In [36]: df['a'] = 1

In [37]: df
Out[37]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26

New in version 0.18.0.

The query method gained the inplace keyword, which determines whether the query modifies the original frame.

In [38]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))

In [39]: df.query('a > 2')
Out[39]: 
   a  b
3  3  8
4  4  9

In [40]: df.query('a > 2', inplace=True)

In [41]: df
Out[41]: 
   a  b
3  3  8
4  4  9

Warning

Unlike with eval, the default value for inplace for query is False. This is consistent with prior versions of pandas.

Local Variables¶

In pandas version 0.14 the local variable API has changed. In pandas 0.13.x, you could refer to local variables the same way you would in standard Python. For example,

df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
newcol = np.random.randn(len(df))
df.eval('b + newcol')

UndefinedVariableError: name 'newcol' is not defined

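As you can see from the exception generated, this syntax is no longer allowed. You must explicitly reference any local variable that you want to use in an expression by placing the @ character in front of the name. For example,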

In [42]: df = pd.DataFrame(np.random.randn(5, 2), columns=list('ab'))

In [43]: newcol = np.random.randn(len(df))

In [44]: df.eval('b + @newcol')
Out[44]: 
0   -0.173926
1    2.493083
2   -0.881831
3   -0.691045
4    1.334703
dtype: float64

In [45]: df.query('b < @newcol')
Out[45]: 
          a         b
0  0.863987 -0.115998
2 -2.621419 -1.297879

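If you don't prefix the local variable with @, pandas will raise an exception telling you the variable is undefined.

When using DataFrame.eval() and DataFrame.query(), this allows you to have a local variable and a DataFrame column with the same name in an expression.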

In [46]: a = np.random.randn()

In [47]: df.query('@a < a')
Out[47]: 
          a         b
0  0.863987 -0.115998

In [48]: df.loc[a < df.a]  # same as the previous expression
Out[48]: 
          a         b
0  0.863987 -0.115998
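
The same name clash can be reproduced with fixed values, which makes the result easy to verify by eye. The frame and the threshold below are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})
a = 2  # local variable sharing its name with column 'a'

# Bare a is the column, @a is the local variable
result = df.query('a > @a')
same_as_loc = df.loc[df['a'] > a]

print(result)
```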

With pandas.eval() you cannot use the @ prefix at all, because it isn't defined in that context. pandas will let you know this if you try to use @ in a top-level call to pandas.eval(). For example,

In [49]: a, b = 1, 2

In [50]: pd.eval('@a + b')
  File "<string>", line unknown
SyntaxError: The '@' prefix is not allowed in top-level eval calls, 
please refer to your variables by name without the '@' prefix

In this case, you should simply refer to the variables like you would in standard Python.

In [51]: pd.eval('a + b')
Out[51]: 3

`pandas.eval()` Parsers¶

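There are two different parsers and two different engines you can use as the backend.

The default 'pandas' parser allows a more intuitive syntax for expressing query-like operations (comparisons, conjunctions and disjunctions). In particular, the precedence of the & and | operators is made equal to the precedence of the corresponding boolean operations and and or.

For example, the above conjunction can be written without parentheses. Alternatively, you can use the 'python' parser to enforce strict Python semantics.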

In [52]: expr = '(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)'

In [53]: x = pd.eval(expr, parser='python')

In [54]: expr_no_parens = 'df1 > 0 & df2 > 0 & df3 > 0 & df4 > 0'

In [55]: y = pd.eval(expr_no_parens, parser='pandas')

In [56]: np.all(x == y)
Out[56]: True

The same expression can be "anded" together with the word and as well:

In [57]: expr = '(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)'

In [58]: x = pd.eval(expr, parser='python')

In [59]: expr_with_ands = 'df1 > 0 and df2 > 0 and df3 > 0 and df4 > 0'

In [60]: y = pd.eval(expr_with_ands, parser='pandas')

In [61]: np.all(x == y)
Out[61]: True

The and and or operators here have the same precedence that they would in vanilla Python.
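
The three spellings can be compared on a pair of tiny frames with known values. This is a sketch for readability only; the frames are assumptions and are much too small to benefit from eval().

```python
import pandas as pd

df1 = pd.DataFrame({'x': [-1.0, 1.0, 2.0]})
df2 = pd.DataFrame({'x': [1.0, 1.0, -2.0]})

# Strict Python semantics: parentheses are required around each comparison
x = pd.eval('(df1 > 0) & (df2 > 0)', parser='python')

# 'pandas' parser: & binds like "and", so the parentheses can be dropped
y = pd.eval('df1 > 0 & df2 > 0', parser='pandas')

# 'pandas' parser: the word "and" is accepted as well
z = pd.eval('df1 > 0 and df2 > 0', parser='pandas')

print(x['x'].tolist())
```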

`pandas.eval()` Backends¶

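There's also the option to make eval() operate identically to plain Python.

Note

Using the 'python' engine is generally not useful, except for testing other evaluation engines against it. You will achieve no performance benefits using eval() with engine='python' and in fact may incur a performance hit.

You can see this by using pandas.eval() with the 'python' engine. It is a bit slower (not by much) than evaluating the same expression in Python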

In [62]: %timeit df1 + df2 + df3 + df4
10 loops, best of 3: 24.2 ms per loop

In [63]: %timeit pd.eval('df1 + df2 + df3 + df4', engine='python')
10 loops, best of 3: 25.2 ms per loop
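
Because the 'python' engine evaluates the expression the same way plain Python would, it is handy as a reference when checking results. A minimal sketch with two small, made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})
df2 = pd.DataFrame({'x': [10.0, 20.0], 'y': [30.0, 40.0]})

plain = df1 + df2
via_python_engine = pd.eval('df1 + df2', engine='python')

print(plain.equals(via_python_engine))
```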

`pandas.eval()` Performance¶

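eval() is intended to speed up certain kinds of operations. In particular, those operations involving complex expressions with large DataFrame/Series objects should see a significant performance benefit. Here is a plot showing the running time of pandas.eval() as a function of the size of the frame involved in the computation. The two lines are two different engines.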

http://pandas.pydata.org/pandas-docs/version/0.19.2/_images/eval-perf.png

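Note

Operations with smallish objects (around 15k-20k rows) are faster using plain Python.

This plot was created using a DataFrame whose columns each contain floating point values generated using numpy.random.randn().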

Technical Minutia Regarding Expression Evaluation¶

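Expressions that would result in an object dtype or involve datetime operations (because of NaT) must be evaluated in Python space. The main reason for this behavior is to maintain backwards compatibility with older versions of numpy. In those versions of numpy, a call to ndarray.astype(str) will truncate any strings that are more than 60 characters in length. Second, we can't pass object arrays to numexpr, thus string comparisons must be evaluated in Python space.

The upshot is that this only applies to object-dtype'd expressions. So, if you have an expression - for example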

In [64]: df = pd.DataFrame({'strings': np.repeat(list('cba'), 3),
   ....:                    'nums': np.repeat(range(3), 3)})
   ....: 

In [65]: df
Out[65]: 
   nums strings
0     0       c
1     0       c
2     0       c
3     1       b
4     1       b
5     1       b
6     2       a
7     2       a
8     2       a

In [66]: df.query('strings == "a" and nums == 1')
Out[66]: 
Empty DataFrame
Columns: [nums, strings]
Index: []

the numeric part of the comparison (nums == 1) will be evaluated by numexpr.

In general, DataFrame.query()/pandas.eval() will evaluate the subexpressions that can be evaluated by numexpr and those that must be evaluated in Python space transparently to the user. This is done by inferring the result type of an expression from its arguments and operators.
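
For contrast with the empty result shown earlier, the sketch below runs a mixed string/numeric query on the same frame with a condition that does match rows. The choice of nums == 2 is an illustrative assumption; how the work is split between numexpr and Python space is internal and not visible in the output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'strings': np.repeat(list('cba'), 3),
                   'nums': np.repeat(range(3), 3)})

# One string comparison and one numeric comparison in a single expression
matches = df.query('strings == "a" and nums == 2')

print(matches)
```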
