SciPy (Scientific Python): Tips & Snippets

Last updated: November 12, 2019

ℹ️ I wrote these notes for myself as a reference, and published them here in case they are helpful for others.

Initial setup for IPython notebooks

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

This is the minimum needed. For more, see ipython-setup.

You also might want some notebook extensions:

pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

Pandas DataFrames

Assume there is a DataFrame called df.

Importing data

To get this DataFrame, you can import from…

Reading from MySQL

from sqlalchemy import create_engine
engine = create_engine('mysql://root:password@localhost/database')
sql = "select * from tablename"
df = pd.read_sql(sql, engine)
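
pd.read_sql can also take query parameters, which avoids building SQL strings by hand. Here is a minimal sketch that uses an in-memory SQLite database as a stand-in so that it runs without a MySQL server. The table contents and the name df_sql are made up for illustration, and the ? placeholder style is specific to SQLite.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used as a stand-in for a real server
conn = sqlite3.connect(":memory:")
conn.execute("create table tablename (id integer, name text)")
conn.executemany("insert into tablename values (?, ?)", [(1, "a"), (2, "b")])

# Pass values through params instead of formatting them into the query string
df_sql = pd.read_sql("select * from tablename where id >= ?", conn, params=(2,))
conn.close()

print(df_sql)
```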

Empty dataframes

If you want to make an empty dataframe and then fill it in, do this:

df = pd.DataFrame()
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
df = pd.concat([df, pd.DataFrame({
  "column1": [1],
  "column2": [2]
})], ignore_index=True)

That will add one row to df where column1 = 1 and column2 = 2.
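
If many rows are added in a loop, growing a DataFrame one row at a time copies the data at every step. Here is a minimal sketch of an alternative, assuming the rows fit in memory as plain dicts (the values and the name df_rows are made up for illustration):

```python
import pandas as pd

# Collect rows as plain dicts first; appending to a list is cheap
rows = []
for i in range(3):
    rows.append({"column1": i, "column2": i * 2})

# Build the DataFrame once at the end
df_rows = pd.DataFrame(rows, columns=["column1", "column2"])

print(df_rows)
```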

Exporting data

Tips for exporting to Stata

Pandas is fussy about df.to_stata(...). Here are some workarounds for common problems:

Feather

Feather is a “fast on-disk format for data frames.” It is similar to pickle, but designed for data frames.

After pip install feather-format, you can:

# Writing to a Feather file
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)

# Reading from a Feather file
df = feather.read_dataframe(path)

For long-running reads from external sources, this pattern might be helpful:

import hashlib # for md5 checksum
import os.path # for checking if file exists

data_source = "data/source.xlsx"
# Note: this hashes the path string, not the file contents (Python 3 needs bytes)
md5 = hashlib.md5(data_source.encode("utf-8")).hexdigest()
data_source_feather = "data/source-%s.feather" % md5

if os.path.isfile(data_source_feather):
    print("Loading %s..." % data_source_feather)
    df = feather.read_dataframe(data_source_feather)
else:
    df = pd.read_excel(data_source)
    feather.write_dataframe(df, data_source_feather)

The first time this code runs, it will be slow as Pandas reads from the original data source. Subsequent runs should be very fast.
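
The pattern above keys the cache on the path of the data source, so the cached copy is not refreshed when the source file itself changes. One possible variation, sketched here with hypothetical helper names (file_md5 and cache_path_for are not part of any library), is to key the cache on the file contents instead:

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(source_path, suffix=".feather"):
    """Build a cache filename that changes whenever the source contents change."""
    base, _ext = os.path.splitext(source_path)
    return "%s-%s%s" % (base, file_md5(source_path), suffix)
```

With these helpers, cache_path_for(data_source) could replace the hand-built data_source_feather path. The trade-off is that the whole source file is read once per run to compute the checksum, which is usually much cheaper than parsing it.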

DataFrame information

Selecting columns and rows

Selecting rows with a MultiIndex

Here are some examples of how to select specific rows with a two-level index (also known as a hierarchical index or a MultiIndex):

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({("bar", "a"): {"col1": 1}, ("bar", "b"): {"col1": 2}, ("baz", "a"): {"col1": 3}, ("baz", "b"): {"col1": 4}}).T

In [3]: df
Out[3]:
       col1
bar a     1
    b     2
baz a     3
    b     4

# Select a specific row
In [4]: df.loc[("bar", "a"), :]
Out[4]:
col1    1
Name: (bar, a), dtype: int64

# Select all rows with a specific top-level index
In [5]: df.loc["bar", :]
Out[5]:
   col1
a     1
b     2

# Select all rows with a specific second-level index
In [6]: df.loc[(slice(None), "a"), :]
Out[6]:
       col1
bar a     1
baz a     3
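
The slice(None) form can be hard to read. As a sketch, the same second-level selection can be written with pd.IndexSlice or with DataFrame.xs (the names df_mi, second_level_a, and second_level_a_xs are made up for this example):

```python
import pandas as pd

# Same shape as the example above: a two-level index and one column
df_mi = pd.DataFrame(
    {"col1": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([["bar", "baz"], ["a", "b"]]),
)

# Equivalent to df_mi.loc[(slice(None), "a"), :]
idx = pd.IndexSlice
second_level_a = df_mi.loc[idx[:, "a"], :]

# xs selects on one level and, by default, drops that level from the result
second_level_a_xs = df_mi.xs("a", level=1)

print(second_level_a)
print(second_level_a_xs)
```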

Renaming columns and rows

Deleting columns and rows

Recoding data

Subsetting data

Aggregating data

Other data manipulation

Getting the previous row value for long longitudinal data

With long-format data, it is sometimes helpful to copy the value from row i-1 into row i within each id. This would fill in the prev_value column in the table below.

id  obs_num  value  prev_value
1   1        7      NaN
1   2        10     7
1   3        18     10
2   1        3      NaN

To do this:

df = pd.DataFrame([
    {'id': 1, 'obs_num': 1, 'value':  7},
    {'id': 1, 'obs_num': 2, 'value': 10},
    {'id': 1, 'obs_num': 3, 'value': 18},
    {'id': 2, 'obs_num': 1, 'value':  3},
])
df['prev_value'] = df.groupby('id')['value'].shift()

# In : df
# Out:
#    id  obs_num  value  prev_value
# 0   1        1      7         NaN
# 1   1        2     10         7.0
# 2   1        3     18        10.0
# 3   2        1      3         NaN
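
shift() works on row order, so it only gives the previous observation if the rows are already sorted within each id. Here is a minimal sketch, assuming obs_num defines the order; the shuffled input and the change column are made up for illustration:

```python
import pandas as pd

# Same data as above, deliberately out of order
df_long = pd.DataFrame([
    {"id": 1, "obs_num": 2, "value": 10},
    {"id": 2, "obs_num": 1, "value": 3},
    {"id": 1, "obs_num": 1, "value": 7},
    {"id": 1, "obs_num": 3, "value": 18},
])

# Sort first so that "previous row" means "previous observation"
df_long = df_long.sort_values(["id", "obs_num"]).reset_index(drop=True)
df_long["prev_value"] = df_long.groupby("id")["value"].shift()

# Difference from the previous observation within each id
df_long["change"] = df_long["value"] - df_long["prev_value"]

print(df_long)
```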

Missing data

Duplicate data

Merging data

Appending data frames

Recoding variables

Returning a view vs. a copy (avoiding SettingWithCopy warnings)

See this pandas documentation page. This general discussion about names and values in Python may also be helpful.

Miscellaneous

Scipy

Statistics

Graphing

Common graphs with Seaborn

Seaborn is typically imported as import seaborn as sns.

Time series line graph

This produces a line graph with reasonable date formatting for x axis labels:

import seaborn as sns

# sns.tsplot has been removed from seaborn; lineplot handles datetime x values directly.
# This assumes df['time_varname'] has a datetime dtype.
fig, ax = plt.subplots()
sns.lineplot(x='time_varname', y='value_colname', data=df, ax=ax)
fig.autofmt_xdate()

IPython notebooks

General Python stuff that might be useful

Logging

The basic logging tutorial in the official Python docs is a good place to start. Here’s an example from there:

import logging
logging.basicConfig(filename='example.log',level=logging.DEBUG)
logging.debug('This message should go to the log file')
logging.info('So should this')
logging.warning('And this, too')
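
For anything larger than a single script, a named logger with its own handler and format is often easier to manage than the root logger. This is a minimal sketch: the logger name example_module is hypothetical, and an in-memory stream stands in for a log file so that the example is self-contained.

```python
import io
import logging

# In-memory stream used as a stand-in for a log file
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("example_module")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep messages out of the root logger's handlers

logger.debug("Hidden: below the INFO threshold")
logger.info("Shown")
logger.warning("Also shown")

print(log_stream.getvalue())
```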

Testing

Python comes with built-in testing capabilities.

import unittest

class TestSomething(unittest.TestCase):

    def setUp(self):
        # Runs before each test method
        pass

    def tearDown(self):
        # Post-test clean-up, run after each test method
        pass

    def test_a_thing(self):
        self.assertEqual(1, 1)

There are a bunch of built-in assertSomething methods:

assertDictEqual, assertEqual, assertTrue, assertFalse, assertGreater, assertGreaterEqual, assertIn, assertIs, assertIsInstance, assertIsNone, assertIsNot, assertIsNotNone, assertCountEqual (assertItemsEqual in Python 2), assertLess, assertLessEqual, assertListEqual, assertMultiLineEqual, assertNotAlmostEqual, assertNotEqual, assertTupleEqual, assertRaises, assertRaisesRegex (assertRaisesRegexp in Python 2), assertRegex (assertRegexpMatches in Python 2)

Tests can be skipped with the @unittest.skip("message here") decorator.
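
As a sketch of how skipping looks in practice, the example below defines one normal test and one skipped test, then runs them with the standard library's own runner (the class and test names are made up for illustration):

```python
import unittest


class TestSkipping(unittest.TestCase):

    def test_runs(self):
        self.assertIn(2, [1, 2, 3])

    @unittest.skip("not ready yet")
    def test_skipped(self):
        self.fail("never executed")


# Load and run the tests without any third-party test runner
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSkipping)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

The skipped test is reported with its message and does not count as a failure.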

Tests can be run once with nosetests -v (put your tests in tests/, start each filename with test_, and start each test method name with test_).

To get some color in the test output, pip install pinocchio and then run nosetests -v --spec-color --with-spec.

I’m also trying out nose-progressive for colorizing nosetests tracebacks: pip install nose-progressive and then nosetests -v --with-progressive --logging-clear-handlers. This is not compatible with pinocchio, and it seems to crash sniffer.

To automatically run tests when a file changes, pip install sniffer and then run sniffer -x--spec-color -x--with-spec or sniffer -x--with-progressive.

To get a debugger to run from within a test, pip install nose2 and then run nose2. If you pdb.set_trace() in a test, the debugger will open.

Profiling

Profiling can help to determine which parts of a program are running slower than you want.

As a first pass, adding in execution time logging can be helpful (printing out how long it takes a method to run). This can be done with a decorator.
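
A minimal sketch of such a timing decorator, using only the standard library (the names timed and slow_sum are made up for this example):

```python
import functools
import time


def timed(func):
    """Print how long each call to func takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failures are timed too
            elapsed = time.perf_counter() - start
            print("%s took %.4f s" % (func.__name__, elapsed))
    return wrapper


@timed
def slow_sum(n):
    return sum(range(n))


total = slow_sum(100000)
```

functools.wraps keeps the decorated function's name and docstring intact, which matters for the printed label and for debugging.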

There is also a decorator for more detailed profiling. It produces output like:

*** PROFILER RESULTS ***
silly_fibonacci_example (sample.py:6)
function called 109 times

         325 function calls (5 primitive calls) in 0.004 CPU seconds

   Ordered by: internal time, call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    108/2    0.001    0.000    0.004    0.002 profilehooks.py:79(<lambda>)
    108/2    0.001    0.000    0.004    0.002 profilehooks.py:131(__call__)
    109/1    0.001    0.000    0.004    0.004 sample.py:6(silly_fibonacci_example)
        0    0.000             0.000          profile:0(profiler)
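
The output above comes from a third-party decorator. A roughly similar report can be produced with the standard library's cProfile and pstats modules. This is a sketch, not the same tool: the function body below is a made-up stand-in, and the exact numbers will differ from the output shown above.

```python
import cProfile
import io
import pstats


def silly_fibonacci_example(n):
    # Deliberately inefficient recursion, so there is something to profile
    if n < 2:
        return n
    return silly_fibonacci_example(n - 1) + silly_fibonacci_example(n - 2)


profiler = cProfile.Profile()
profiler.enable()
fib_result = silly_fibonacci_example(10)
profiler.disable()

# Sort by internal time and print the top five entries
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("tottime").print_stats(5)
print(report.getvalue())
```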

There is also line-by-line profiling that provides output like this:

Pystone(1.1) time for 50000 passes = 2.48
This machine benchmarks at 20161.3 pystones/second
Wrote profile results to pystone.py.lprof
Timer unit: 1e-06 s

File: pystone.py
Function: Proc2 at line 149
Total time: 0.606656 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   149                                           @profile
   150                                           def Proc2(IntParIO):
   151     50000        82003      1.6     13.5      IntLoc = IntParIO + 10
   152     50000        63162      1.3     10.4      while 1:
   153     50000        69065      1.4     11.4          if Char1Glob == 'A':
   154     50000        66354      1.3     10.9              IntLoc = IntLoc - 1
   155     50000        67263      1.3     11.1              IntParIO = IntLoc - IntGlob
   156     50000        65494      1.3     10.8              EnumLoc = Ident1
   157     50000        68001      1.4     11.2          if EnumLoc == Ident1:
   158     50000        63739      1.3     10.5              break
   159     50000        61575      1.2     10.1      return IntParIO

Tools and libraries of interest

Installing Recent Python on OS X

I used to use these instructions to set up pyenv, but pyenv can cause some problems. Handling Python dependency requirements is difficult enough already that I don’t want any extra complexity on top of pip (the package manager) and virtualenv (separate environments with different installed dependencies).

So here is what I’m doing now:

Resources

Thanks

The following people have helped with this:

Suggestions? Corrections?

Please contact me.