****************************** Python Functions ****************************** ======================= User-defined Functions ======================= ----------------------- Function registration ----------------------- To register Python UDFs, you must install script files in all cluster nodes. After that, you can register your functions by specifying the paths to those script files in ``tajo-site.xml``. Here is an example of the configuration. .. code-block:: xml tajo.function.python.code-dir /path/to/script1.py,/path/to/script2.py Please note that you can specify multiple paths with ``','`` as a delimiter. Each file can contain multiple functions. Here is a typical example of a script file. .. code-block:: python # /path/to/udf1.py @output_type('int4') def return_one(): return 1 @output_type("text") def helloworld(): return 'Hello, World' # No decorator - blob def concat_py(str): return str+str @output_type('int4') def sum_py(a,b): return a+b If the configuration is set properly, every function in the script files are registered when the Tajo cluster starts up. ----------------------- Decorators and types ----------------------- By default, every function has a return type of ``BLOB``. You can use Python decorators to define output types for the script functions. Tajo can figure out return types from the annotations of the Python script. * ``output_type``: Defines the return data type for a script UDF in a format that Tajo can understand. The defined type must be one of the types supported by Tajo. For supported types, please refer to :doc:`/sql_language/data_model`. ----------------------- Query example ----------------------- Once the Python UDFs are successfully registered, you can use them as other built-in functions. .. code-block:: sql default> select concat_py(n_name)::text from nation where sum_py(n_regionkey,1) > 2; ============================================== User-defined Aggregation Functions ============================================== ----------------------- Function registration ----------------------- To define your Python aggregation functions, you should write Python classes for each function. Followings are typical examples of Python UDAFs. .. code-block:: python # /path/to/udaf1.py class AvgPy: sum = 0 cnt = 0 def __init__(self): self.reset() def reset(self): self.sum = 0 self.cnt = 0 # eval at the first stage def eval(self, item): self.sum += item self.cnt += 1 # get intermediate result def get_partial_result(self): return [self.sum, self.cnt] # merge intermediate results def merge(self, list): self.sum += list[0] self.cnt += list[1] # get final result @output_type('float8') def get_final_result(self): return self.sum / float(self.cnt) class CountPy: cnt = 0 def __init__(self): self.reset() def reset(self): self.cnt = 0 # eval at the first stage def eval(self): self.cnt += 1 # get intermediate result def get_partial_result(self): return self.cnt # merge intermediate results def merge(self, cnt): self.cnt += cnt # get final result @output_type('int4') def get_final_result(self): return self.cnt These classes must provide ``reset()``, ``eval()``, ``merge()``, ``get_partial_result()``, and ``get_final_result()`` functions. * ``reset()`` resets the aggregation state. * ``eval()`` evaluates input tuples in the first stage. * ``merge()`` merges intermediate results of the first stage. * ``get_partial_result()`` returns intermediate results of the first stage. Output type must be same with the input type of ``merge()``. * ``get_final_result()`` returns the final aggregation result. ----------------------- Query example ----------------------- Once the Python UDAFs are successfully registered, you can use them as other built-in aggregation functions. .. code-block:: sql default> select avgpy(n_nationkey), countpy() from nation; .. warning:: Currently, Python UDAFs cannot be used as window functions.