Unified Map
###########

.. image:: https://img.shields.io/badge/Python-3-00864b.svg
   :target: https://docs.python.org/3
   :alt: Python 3

|

.. image:: https://img.shields.io/badge/Release-v1.0.0-00864b.svg
   :target: https://pypi.org/project/unified-map/1.0.0
   :alt: Release v1.0.0

|

.. image:: https://img.shields.io/badge/Licence-Apache%202.0-00864b.svg
   :target: https://www.apache.org/licenses/LICENSE-2.0
   :alt: License Apache 2.0

Unified
   *Many implementations, one way of access.*

Map
   *Apply a function to a list of arguments and collect the results -- serial, parallel or distributed.*

Purpose
=======

This package provides reasonably simple syntax for a frequent programming task that is implemented in various places (built-in, standard library, external libraries). Here are three descriptions of this task:

- Map a list of inputs to a list of results via a user-provided function.
- Apply a given function to a list of arguments to get a list of return values in the same order.
- A black-box function is evaluated with different inputs and the outputs are collected.

In general, this is a so-called `"pleasingly parallel problem" <https://en.wikipedia.org/wiki/Embarrassingly_parallel>`_ (also known as "embarrassingly parallel"), because it is straightforward to separate it into independent subtasks. For this reason, and due to its frequent occurrence, it is recognized as a programming idiom ("parallel map", "parallel for loop") in **parallel computing** that can be applied just as simply in **distributed computing**. This package allows you to do so with a focus on simplicity of use.

Use cases
=========

The function can be **univariate** (=takes one argument) or **multivariate** (=takes several arguments).

.. image:: images/functions.svg
   :width: 60%
   :alt: Functions visualization: Image not found
   :align: center

|

The evaluation can be executed in a **serial** (=on one core), **parallel** (=on several cores) or **distributed** (=on several machines) way.

.. image:: images/execution_modes.svg
   :width: 80%
   :alt: Execution modes visualization: Image not found
   :align: center

|

.. note::
   The arrows indicate that **parallel** and **distributed** calculations send *objects* to other processes or machines, namely the *elements of the input list*, the *function* and the *elements of the result list*. This implies that all of these **objects need to be serializable**, which is not always the case:

   - If a library uses Python's `pickle module <https://docs.python.org/3/library/pickle.html>`_ for object serialization, then only `certain objects <https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled>`_ can be "pickled", i.e. transformed into a binary serialization format. For example, a lambda expression will not work as the function, since it can not be pickled (a minimal sketch of this limitation follows after this note).
   - Some libraries use improved serialization functionality from elsewhere, e.g. `dill <https://pypi.org/project/dill/>`_ or `cloudpickle <https://pypi.org/project/cloudpickle/>`_.
   - Some libraries implement their own advanced serialization routines, e.g. `joblib <https://joblib.readthedocs.io>`_ or `dask <https://dask.org>`_.
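The following snippet is a minimal, standard-library-only sketch of the pickling limitation mentioned above. It is purely illustrative and not part of this package:

.. code-block:: python

   import pickle

   def square(x):
       """Module-level function; pickle stores a reference to it by name."""
       return x ** 2

   pickle.dumps(square)  # works: the function can be looked up by its qualified name

   try:
       pickle.dumps(lambda x: x ** 2)  # fails: a lambda has no importable name
   except (pickle.PicklingError, AttributeError) as error:
       print('Can not pickle a lambda:', error)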
Installation
============

- **Complete installation** = with all optional dependencies

  .. code-block:: bash

     $ pip install "unified-map[complete]"

  .. note::
     The *Apache Spark* installation currently can not be automated correctly. If you want to use it, please install it manually as described below.

- **Minimal installation** = with no optional dependencies

  .. code-block:: bash

     $ pip install unified-map

  Functions that rely on third-party libraries will not work. You can, however, install any optional dependency you want to use yourself, see below.

- **Optional dependencies**

  - Joblib installation

    .. code-block:: bash

       $ pip install joblib

  - Dask installation

    .. code-block:: bash

       $ pip install "dask[complete]"

  - `Apache Spark installation <https://spark.apache.org>`_

    Apache Spark is written in Scala and provides APIs for several languages (incl. Python).

    - There is a `"Python packaged version of Spark" available on PyPI <https://pypi.org/project/pyspark/>`_ that could be installed with :code:`pip`. However, it does not include the necessary scripts (e.g. for starting workers), therefore a manual installation is required.
    - The `"pre-packaged Apache Spark" needs to be downloaded <https://spark.apache.org/downloads.html>`_. Extract it to a desired place in your file system (e.g. :code:`$HOME/bin/spark`). Then certain environment variables need to be set (e.g. in :code:`.bashrc`) so that Apache Spark can be found in the shell and in Python. Most importantly, :code:`SPARK_HOME` needs to point at the installation directory (a short sanity check is sketched below, after this list):

      .. code-block:: bash

         export SPARK_HOME="$HOME/bin/spark/spark-2.3.0-bin-hadoop2.7/"
         export PATH="$SPARK_HOME/bin/:$PATH"
         export PYTHONPATH="$SPARK_HOME/python/:$PYTHONPATH"

    - Apache Spark uses `Py4J <https://www.py4j.org>`_ to access Java objects from a Python interpreter:

      .. code-block:: bash

         $ pip install py4j

    - Apache Spark is written in Scala, which runs on a Java Virtual Machine, hence it may be necessary to install or update Java on your machine. The documentation indicates that :code:`Java 8+` is required.

      - On a Linux system you can check your Java version on the command line with :code:`java -version`.
      - An open-source implementation like OpenJDK may work (no guarantee!).
      - If necessary, install the official Java Runtime Environment (JRE) from Oracle.
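To check whether a manual Apache Spark installation can be found from Python, the following small sketch may help. It assumes only that the environment variables above are set and that :code:`py4j` is installed; it is not part of this package:

.. code-block:: python

   import os

   # SPARK_HOME must point at the extracted Spark directory
   print(os.environ.get('SPARK_HOME'))

   # importing pyspark works only if $SPARK_HOME/python is on PYTHONPATH
   import pyspark
   print(pyspark.__version__)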
.. _quickstart:

Quickstart
==========

Univariate function applied to a list of arguments where each element is a single value:

.. literalinclude:: code/quickstart_univariate.py

Multivariate function applied to a list of arguments where each element is a tuple of values:

.. literalinclude:: code/quickstart_multivariate.py

.. note::
   :code:`connect_to_scheduler()` requires that a scheduler and workers were started in other processes, possibly running on remote machines, see :ref:`cluster-setup`.

   Start a scheduler in one terminal:

   .. code-block:: bash

      $ python3

      >>> import unified_map as umap
      >>> umap.cluster_setup.dask.start_scheduler('10.0.0.5:8789')

   Start workers in other terminals:

   .. code-block:: bash

      $ python3

      >>> import unified_map as umap
      >>> umap.cluster_setup.dask.start_worker('10.0.0.5:8789')

More detailed information
=========================

.. toctree::
   :maxdepth: 1

   rst/api/api
   rst/references