Introduction to the Python Scientific Ecosystem, Part 1

Posted on August 03, 2017 in Python

With the rise in popularity of Python as a tool for data analysis, I'd like to spend some time showing you some tools that have recently emerged as new 'standards' for computation in science and engineering.

I'll be dividing this post into two sections.

The first will touch on 'how' Python is used, i.e. the tools that make up the environment in which the code runs: the Anaconda distribution, installing packages, the Python interpreter, Jupyter Notebooks, etc. The second part will talk about 'what' Python is used for, that is, a primer on the standard scientific 'stack' of Python packages and libraries, an open source stack now used in many companies, universities and departments.

Getting Started With Python

As of writing, Python is rapidly becoming the go-to language for almost any programming task, from ten-line scripts to thousand-employee enterprise projects that span hundreds of thousands of lines of code. It is well-loved for:

  • How the code closely resembles 'whiteboard-like' pseudocode
  • The speed and small number of lines of code required to get to the desired result
  • The huge number, maturity and user-friendliness of its community-created packages

The last point should not be taken lightly. Indeed, there exist mature and well-maintained packages for most use cases one can think of. This means the time invested to learn the language to perfect, say, your data manipulation skills, is also time invested in making you a potentially better web programmer, day trading investor, or relativistic hydromagnetodynamics modeler.

We won't be diving into a comprehensive beginner's guide to Python in this notebook; others have already done so remarkably well, and have already shared these beginner guides online.

More in-depth reads and talks:

The Python Scientific Stack - SciPy

The Python language was not originally designed with scientific computing in mind, but its beauty and ease-of-use have inspired the development of a powerful and mature ecosystem of scientific and data-focused computing tools. Over the years, the scientific and engineering community has consolidated its tools into a series of packages with standard components and interfaces, known as SciPy.

From SciPy's website:

SciPy refers to several related but distinct entities:

  • The SciPy Stack, a collection of open source software for scientific computing in Python, and particularly a specified set of core packages.

  • The community of people who use and develop this stack.

  • Several conferences dedicated to scientific computing in Python - SciPy, EuroSciPy and SciPy.in.

  • The SciPy library, one component of the SciPy stack, providing many numerical routines.

The SciPy Stack

Core Packages

  • Python, a general purpose programming language. It is interpreted and dynamically typed and is very suited for interactive work and quick prototyping, while being powerful enough to write large applications in.

  • NumPy, the fundamental package for numerical computation. It defines the numerical array and matrix types and basic operations on them.

  • The SciPy library, a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more.

  • Matplotlib, a mature and popular plotting package, that provides publication-quality 2D plotting as well as rudimentary 3D plotting.

  • pandas, providing high-performance, easy to use data structures.

  • SymPy, for symbolic mathematics and computer algebra.

  • IPython, a rich interactive interface, letting you quickly process data and test ideas. The IPython notebook works in your web browser, allowing you to document your computation in an easily reproducible form.

  • nose, a framework for testing Python code.
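If you are unsure which of these core packages are already available in your environment, a small check along the following lines can help. This is only a sketch: the list of import names is my own transcription of the packages above (note that IPython is imported with a capital "I" and "P"), and the helper name find_installed is made up for illustration. It uses the standard library's importlib.util.find_spec, which locates a package without actually importing it.

```python
import importlib.util

# Import names for the core packages listed above (assumed spellings;
# "IPython" is capitalised when imported).
CORE_PACKAGES = ["numpy", "scipy", "matplotlib", "pandas", "sympy", "IPython", "nose"]


def find_installed(names):
    """Map each import name to True if it can be located, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = find_installed(CORE_PACKAGES)
for name, present in status.items():
    print(f"{name:12s} {'found' if present else 'missing'}")
```

With a full Anaconda install, most or all of these should be reported as found; on a bare Python install, expect most of them to be missing.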

Other packages

There are many, many more packages built on this stack - too many to list here.

Sidenote: Python vs legacy Python

From Wikipedia:

Python 3.0 was released on December 3, 2008. It was designed to rectify certain fundamental design flaws in the language (the changes required could not be implemented while retaining full backwards compatibility with the 2.x series, which necessitated a new major version number). The guiding principle of Python 3 was: "reduce feature duplication by removing old ways of doing things".

Python 3.0 broke backward compatibility, and much Python 2 code does not run unmodified on Python 3. Python's dynamic typing combined with the plans to change the semantics of certain methods of dictionaries, for example, made perfect mechanical translation from Python 2.x to Python 3.0 very difficult.

The Python community went through a long adaptation period, but as of writing those days are mostly over: the majority of Python users and active codebases target Python 3.x and can make full use of its new features and better performance. Although backwards compatibility was broken between Python 2 and 3, no such break is in sight for the foreseeable future. Unfortunately, many books, blog posts, Stack Overflow answers and training videos still contain legacy Python code, and the modern Python user needs to understand a few differences between the two versions to "translate" the material. Here are the three main items to look out for:

  • Print statement vs function:
    • Python 2: print 'a'
    • Python 3: print('a')
  • Integer division:
    • Python 2: 4 / 3 == 1
    • Python 3: 4 / 3 == 1.3333333333333333 (use 4 // 3 to get 1)
  • Dictionary iteration:
    • Python 2: dict.iteritems()
    • Python 3: dict.items()
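To make the three differences concrete, here is a short Python 3 snippet that exercises each of them. The dictionary contents are made up purely for illustration.

```python
# Python 3 spellings of the three differences listed above.

print('a')                      # print is a function, so parentheses are required

true_quotient = 4 / 3           # "/" always performs true (floating point) division
floor_quotient = 4 // 3         # "//" gives the integer result that Python 2's "/" gave

ages = {"ada": 36, "alan": 41}  # made-up data for illustration
pairs = sorted(ages.items())    # items() replaces Python 2's iteritems()

print(true_quotient, floor_quotient, pairs)
```

When translating legacy code, the division change is the one most likely to go unnoticed, since the Python 2 expression still runs on Python 3 but silently returns a different value.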

Apart from that, the main take-home is: don't worry about it, and just use the latest version of Python.

Python distribution - Anaconda

It can be hard to set up a programming environment, especially in an enterprise context. You will need to download and set up Python itself, and then maintain your files in a manner that makes sense when writing code. We'll address the first by using the Anaconda Python distribution, and the second by using git for version control.

Anaconda is an open source product, distributed by the cloud computing company Continuum Analytics. Anaconda bundles 700+ of the most-used open source packages for science and data analysis, as well as a package manager called conda. It is easily installed on all platforms, and it does not require administrator privileges, as it only installs files for the local user, and nothing system-wide.

The conda package manager

The main package manager for Python is called pip, and it is used in almost all Python tutorials, e.g. pip install requests. That being said, Anaconda comes with its own package manager, conda, which aims to facilitate the installation and update of packages that are not easily installable through pip. For example, installing the OpenCV package for computer vision typically requires half-a-dozen steps and compiling files outside of the "main" Python folders. Using conda, this is just one step. It does mean that most packages have to be installed through conda instead of pip, however:

$ conda install requests
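After an install, one way to confirm that the package is visible to the interpreter you are actually running is to ask Python for its version. The sketch below assumes Python 3.8 or later, where importlib.metadata is part of the standard library; the helper name installed_version is made up for illustration, and the lookup relies on the package having shipped its usual metadata.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if requests is installed, otherwise None.
print(installed_version("requests"))
```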

Conda maintains a "best of" list of scientific packages. Often, you'll want to install packages that are not part of Anaconda's main repository, for example the Skyfield package for orbital propagation. In this case, you'll need to search the Anaconda library to see whether a user has already provided it:

$ anaconda search skyfield
Packages:
 Name                      |  Version | Package Types   | Platforms
 ------------------------- |   ------ | --------------- | ---------------
 conda-forge/skyfield      |      1.0 | conda           | linux-64, win-32, win-64, osx-64
 jochym/skyfield           |          | conda           | linux-64
                                      : Elegant astronomy for Python
 pypi/skyfield             |      0.3 | pypi            |
                                      : Elegant astronomy for Python
Found 3 packages

Once found, you then tell conda to install the desired package from that channel:

$ conda install -c conda-forge skyfield

and the packages that ship with the Anaconda distribution can be updated together with:

$ conda update anaconda

IPython REPL

Python is an interpreted language rather than a compiled one. The standard Python distribution comes with its own interpreter program, whose interactive mode is often referred to as a "read, eval, print loop" (REPL). A REPL is a great way to test things as you are writing code, be it to test-run your application, debug it, or make sure your syntax is correct. It can be started from the command line (CMD.exe on Windows), and is quite straightforward to use. You can even spin up a vanilla Python REPL in your browser.
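To illustrate what "read, eval, print" means, here is a toy version of the loop written in Python itself. It is only a sketch of the idea, not how the real interpreter is implemented: the function name tiny_repl is made up, and instead of reading from the keyboard it takes a list of lines so that the example can run unattended.

```python
def tiny_repl(lines):
    """Run each line through a read-eval-print cycle and collect what would be printed."""
    namespace = {}
    printed = []
    for line in lines:                      # read
        try:
            result = eval(line, namespace)  # eval: works for expressions
        except SyntaxError:
            exec(line, namespace)           # statements (e.g. assignments) go through exec
            result = None
        if result is not None:
            printed.append(repr(result))    # print: show the value's repr, as a REPL does
    return printed


session = tiny_repl(["x = 4", "x / 3 > 1", "'py' * 2"])
print(session)
```

As in a real REPL, the assignment produces no output, while each of the two expressions echoes its value.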

Also included in the Anaconda distribution is the IPython REPL, which adds extra goodies to make the interpreter more interactive. These include:

  • Tab autocompletion (on class names, functions, methods, variables)
  • Appending ? to a name (or pressing Shift-Tab in the notebook interface) retrieves the function's documentation (docstring) and a list of its named arguments
  • More explicit and colour-highlighted error messages
  • Better history management
  • Basic shell integration (you can run simple UNIX shell commands such as cp, ls, rm, etc. directly from the IPython command line)
  • Nice integration with many common GUI modules (PyQt, PyGTK, and tkinter)
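The docstring and argument list that IPython surfaces are ordinary properties of Python objects, so you can also retrieve them in plain Python with the standard library's inspect module. The function rescale below is a made-up example, not part of any package.

```python
import inspect


def rescale(values, factor=2.0):
    """Multiply every element of values by factor."""
    return [v * factor for v in values]


signature = str(inspect.signature(rescale))  # the named arguments and their defaults
docstring = inspect.getdoc(rescale)          # the cleaned-up docstring

print(signature)
print(docstring)
```

IPython's help features present this same kind of information with less typing, which is a large part of why exploring an unfamiliar package feels faster there.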

This can dramatically accelerate the pace at which one can discover how to use Python and new packages. Try IPython in your browser here.

Keep on reading

You can read Part 2 of this article here, where we'll talk about actual Python code, how to generate graphs and charts with Matplotlib, writing text in Markdown, and a few other things.