SciPy (Scientific Python): Tips & Snippets
Last updated: January 18, 2021
ℹ️ I wrote these notes for myself as a reference, and published them here in case they are helpful for others.
Table of contents:
- Initial setup for IPython notebooks
- Pandas DataFrames
- Importing data
- Exporting data
- Feather
- DataFrame information
- Selecting columns and rows
- Selecting rows with a MultiIndex
- Renaming columns and rows
- Deleting columns and rows
- Recoding data
- Subsetting data
- Aggregating data
- Other data manipulation
- Getting the previous row value for long longitudinal data
- Missing data
- Duplicate data
- Merging data
- Appending data frames
- Recoding variables
- Returning a view vs. a copy (avoiding SettingWithCopy warnings)
- Miscellaneous
- Scipy
- Statistics
- Graphing
- IPython notebooks
- General Python stuff that might be useful
- Tools and libraries of interest
- Installing Recent Python on OS X
- Resources
- Thanks
- Suggestions? Corrections?
Initial setup for IPython notebooks
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
This is the minimum of what is needed. For more, see ipython-setup.
You also might want some notebook extensions:
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
Pandas DataFrames
Assume there is a DataFrame called df.
Importing data
To get this DataFrame, you can import from…
- CSV:
df = pd.read_csv("path/to/csv.csv")
- Stata:
df = pd.read_stata('path/to/stata_data.dta')
- Excel:
df = pd.read_excel('/path/to/excel.xls')
- You will need to first pip install xlrd and then import xlrd for this to work.
Reading from MySQL
from sqlalchemy import create_engine
engine = create_engine('mysql://root:password@localhost/database')
sql = "select * from tablename"
df = pd.read_sql(sql, engine)
Empty dataframes
If you want to make an empty dataframe and then fill it in, do this:
df = pd.DataFrame()
df = df.append(pd.DataFrame({
"column1": [1],
"column2": [2]
}), ignore_index = True)
That will add one row to df where column1 = 1 and column2 = 2.
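Note: in pandas 1.4 and later, DataFrame.append is deprecated (and removed in 2.0); pd.concat does the same job. A minimal sketch:
import pandas as pd

df = pd.DataFrame()
new_row = pd.DataFrame({"column1": [1], "column2": [2]})
# Same effect as df.append(..., ignore_index=True)
df = pd.concat([df, new_row], ignore_index=True)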
Exporting data
- To CSV:
df.to_csv(file_name, encoding='utf-8')
- To Stata:
df.to_stata("/path/to/stata/file.dta")
Tips for exporting to Stata
Pandas is fussy about df.to_stata(...). Here are some workarounds for common problems:
- Exporting datetime columns:
df.to_stata('f.dta', convert_dates={'datetime_col_name': 'tc'})
- See the docs for a full list of alternatives to tc.
- “Writing general object arrays is not supported” error:
- Figure out which of your columns are of object type: list(df.select_dtypes(include=['object']).columns)
- Convert them to something else: df['col'] = df['col'].astype(str)
This code may help:
# Attempt to auto-convert columns from `object` to a more specific type.
# Note that string columns may still have `dtype=object`, but will still be able
# to export to Stata format.
type_pref = [int, float, str]
for colname in list(df.select_dtypes(include=['object']).columns):
    for t in type_pref:
        try:
            df[colname] = df[colname].astype(t)
            break  # stop at the first type that works
        except (ValueError, TypeError):
            pass
Feather
feather (GitHub) is a “fast on-disk format for data frames.” It is similar to pickle, but for data frames.
After pip install feather-format, you can:
# Writing to a Feather file
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
# Reading from a Feather file
df = feather.read_dataframe(path)
For long-running reads from external sources, this pattern might be helpful:
import hashlib # for md5 checksum
import os.path # for checking if file exists
data_source = "data/source.xlsx"
md5 = hashlib.md5(data_source.encode()).hexdigest()  # hash the path string (md5 needs bytes in Python 3)
data_source_feather = "data/source-%s.feather" % md5
if os.path.isfile(data_source_feather):
print("Loading %s..." % data_source_feather)
df = feather.read_dataframe(data_source_feather)
else:
df = pd.read_excel(data_source)
feather.write_dataframe(df, data_source_feather)
The first time this code runs, it will be slow as Pandas reads from the original data source. Subsequent runs should be very fast.
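Note that recent versions of pandas can also read and write Feather files directly (backed by pyarrow), so the standalone feather package may not be necessary:
# Requires pyarrow: pip install pyarrow
df.to_feather("my_data.feather")
df = pd.read_feather("my_data.feather")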
DataFrame information
- Get dimensions:
df.shape
- Get list of columns:
df.columns
Selecting columns and rows
- Select columns: df[['col1', 'col2', ...]]
- Selecting rows:
  - df.loc[...] is used to select based on labels in the index
  - df.iloc[...] is used to select based on integer positions in the index
  - df.ix[...] tries to use labels, and falls back to positions if the label is not in the index. Only use this when you need to mix labels and positions (like selecting rows based on label and columns based on position). Note that .ix is deprecated in recent versions of Pandas.
- Selecting columns and filtering rows at the same time: df[['col1', 'col2', ...]][df['col1'] == "criteria"]
- If you want to modify cells, you have to use loc, iloc, or ix to avoid this warning: “A value is trying to be set on a copy of a slice from a DataFrame” (see the short sketch after this list).
  - Syntax: df.ix[df['column_name'] < 1, ['column_to_change']] = 2
- Selecting a specific cell by row and column name: df.loc['row name', 'col name']
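A quick sketch of the label vs. position distinction, using a made-up DataFrame:
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30]}, index=["a", "b", "c"])

df.loc["b", "col1"]                    # 20 -- selects by index label
df.iloc[1, 0]                          # 20 -- selects by integer position
df.loc[df["col1"] > 10, "col1"] = 99   # filter and assign in one .loc call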
Selecting rows with a MultiIndex
Here are some examples of how to select specific rows with a two-level index (also known as a hierarchical index or a MultiIndex):
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({("bar", "a"): {"col1": 1}, ("bar", "b"): {"col1": 2}, ("baz", "a"): {"col1": 3}, ("baz", "b"): {"col1": 4}}).T
In [3]: df
Out[3]:
col1
bar a 1
b 2
baz a 3
b 4
# Select a specific row
In [4]: df.loc[("bar", "a"), :]
Out[4]:
col1 1
Name: (bar, a), dtype: int64
# Select all rows with a specific top-level index
In [5]: df.loc["bar", :]
Out[5]:
col1
a 1
b 2
# Select all rows with a specific second-level index
In [6]: df.loc[(slice(None), "a"), :]
Out[6]:
col1
bar a 1
baz a 3
Renaming columns and rows
- Rename columns:
df.columns = ["col 1 name", ...]
- Renaming a single column:
df.rename(columns={'oldname': 'newname'}, inplace=True)
- Renaming rows: df = df.set_index('column name'), which will cause the rows to be labeled with the values of the specified column. You may have to create a new column first to get what you want.
- Removing the label for an index (this is the label that appears above row labels, which I think is confusing): df.index.names = [None]
- Round everything to 1 decimal place:
np.round(df, decimals=1)
Deleting columns and rows
- Delete a column: df = df.drop('column_name', axis=1)
Recoding data
- Based on value:
df['var'] = df['var'].replace('oldval', 'newval')
Subsetting data
- Unique values from a column: np.unique(df['colname']) or df.column.unique()
- Filtering rows using a list of values: df[df['col'].isin(whitelist)] will yield all the rows in df where the value for col appears in whitelist = pd.Series(...)
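A small sketch of isin with a hypothetical whitelist, including the negation:
whitelist = ['a', 'b']
df[df['col'].isin(whitelist)]    # rows where col is in the whitelist
df[~df['col'].isin(whitelist)]   # rows where col is NOT in the whitelist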
Aggregating data
I often find myself wanting to sum a variable by group. For example, I might want the number of records for each day in the dataset.
This can be accomplished easily with the following Stata code:
gen one = 1 collapse (sum) one, by(datevar)
Translated to Python, this is:
collapsed = df.groupby(['date']).size()
If you aggregate by a variable with a datetime type, and you want to access specific values, you can use collapsed[pd.datetime(2016,1,1).date()].
You can aggregate with a custom method like this:
df.groupby('group_by_this_column_name')['aggregate_this_column_name'].agg({'output_column_name': custom_method})
The custom method should have one argument, which is a Pandas Series object containing the values of aggregate_this_column_name for a given value of group_by_this_column_name.
You can aggregate multiple columns with different methods like this:
aggregate = ['aggregate1', 'aggregate2']
df.groupby('group_by_this_column_name')[aggregate].agg({
    'aggregate1': np.sum,
    'aggregate2': (lambda x: x.iloc[0])  # Gets first value
})
If you want to run multiple aggregation methods for the same column or just show more descriptive names, you can do this:
aggregate = ['aggregate1', 'aggregate2']
df.groupby('group_by_this_column_name')[aggregate].agg({
    'aggregate1': {'sum': np.sum},
    'aggregate2': {'sum': np.sum, 'first_value': (lambda x: x.iloc[0])}  # Gets first value
})
Adding a running count column by group (first row for a given groupvar will be 0, second will be 1, etc.):
df['running_count'] = df.groupby(['groupvar']).cumcount()
Adding a total count column by group (if there are 3 rows with a given groupvar, all 3 rows will have total_counts set to 3):
total_counts = df.groupby(['groupvar']).size()
df = df.set_index('groupvar')
df['total_counts'] = total_counts
df = df.reset_index()
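A shorter alternative for the total-count column, assuming a reasonably recent pandas, uses groupby(...).transform and avoids the set_index / reset_index round trip:
import pandas as pd

df = pd.DataFrame({'groupvar': ['a', 'a', 'a', 'b'], 'x': [1, 2, 3, 4]})
df['running_count'] = df.groupby('groupvar').cumcount()              # 0, 1, 2, 0
df['total_counts'] = df.groupby('groupvar')['x'].transform('size')   # 3, 3, 3, 1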
Other data manipulation
- Concatenate two columns (a string and an integer): df['bar'] = df['bar'].str.cat(df['foo'].values.astype(str), sep=' is ') (source)
- Change type of column: df['col'] = df['col'].astype(str)
  - For changing to numeric, this works on recent versions of Pandas: df['col'] = pd.to_numeric(df['col'])
- Changing the case of a string column: df['col'] = df['col'].str.upper() or df['col'] = df['col'].apply(lambda x: x.upper()) if you need to deal with unicode strings. Use lower() to lowercase rather than uppercase.
- Iterating over rows in a DataFrame: for index, row in df.iterrows():
- Sorting: df.sort_values(['col1', 'col2'], ascending=[True, False], inplace=True)
Getting the previous row value for long longitudinal data
Sometimes it’s helpful to get a value from row i-1 in row i with long data. This would fill in the prev_value column in the table below.
id | obs_num | value | prev_value
---|---------|-------|-----------
1  | 1       | 7     | NaN
1  | 2       | 10    | 7
1  | 3       | 18    | 10
2  | 1       | 3     | NaN
To do this:
df = pd.DataFrame([
{'id': 1, 'obs_num': 1, 'value': 7},
{'id': 1, 'obs_num': 2, 'value': 10},
{'id': 1, 'obs_num': 3, 'value': 18},
{'id': 2, 'obs_num': 1, 'value': 3},
])
df['prev_value'] = df.groupby('id')['value'].shift()
# In : df
# Out:
# id obs_num value prev_value
# 0 1 1 7 NaN
# 1 1 2 10 7.0
# 2 1 3 18 10.0
# 3 2 1 3 NaN
Missing data
- Get rows where a numeric column is missing: df[pd.isnull(df['colname'])]
  - pd.notnull can be used in the same way for the opposite effect
- The fillna() method can be used to fill missing data.
  - Fill with 0s: df['col1'] = df['col1'].fillna(0)
  - Fill with previous value: df['col1'] = df['col1'].fillna(method='ffill')
Duplicate data
- Dropping duplicates: df = df.drop_duplicates(['col_with_duplicates'], keep='first')
  - Options for keep= are 'first', 'last', and False (which drops all duplicates)
Counting duplicates of the combination of col1 and col2:
counts = df.groupby(['col1', 'col2']).size().reset_index().rename(columns={0: 'count'})
print(counts[counts['count'] > 1])
Merging data
- The Pandas documentation has a good explanation of this.
- Chris Albon has lots of good examples of merges and the necessary code
- Merging in Pandas uses SQL terminology like left, right, outer, and inner.
Syntax from the Pandas docs:
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
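A minimal left-join sketch with made-up columns (every row of left is kept; unmatched ids get NaN):
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2, 4], "score": [10, 20, 40]})

merged = left.merge(right, how="left", on="id")
# The indicator=True option adds a _merge column showing which rows matched,
# similar to Stata's _merge variable.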
Appending data frames
df_both = df.append(df2)
Recoding variables
df['recoded'] = df['original'].map({"original_value_1": "new1", "original_value_2": "new2"})
Returning a view vs. a copy (avoiding SettingWithCopy warnings)
See this pandas documentation page. This general discussion about names and values in Python may also be helpful.
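The short version, as a sketch with hypothetical column names: chained indexing may assign into a temporary copy, while a single .loc call filters and assigns in one step.
# May raise SettingWithCopyWarning (and may not modify df at all):
df[df['col1'] > 0]['col2'] = 5

# Safe: filter and assign in one .loc call
df.loc[df['col1'] > 0, 'col2'] = 5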
Miscellaneous
df.T will transpose the DataFrame
Scipy
Truncated normal distribution:
from scipy import stats

# Set up a truncated normal distribution
lower = 0.2
upper = 0.8
mu = 0.5
sigma = 1
distribution = stats.truncnorm((lower - mu) / sigma, (upper - mu) / sigma, loc=mu, scale=sigma)

# Draw once from distribution
draw = distribution.rvs(1)[0]
Statistics
- Frequencies table for a column:
df['column_name'].value_counts()
- Descriptive statistics for a column:
df['column_name'].describe()
Two-way tables:
grouped = df.groupby(['var1', 'var2'])
grouped.size()

# Output:
# var1    var2
# value1  value1    3
# value1  value2    1
# value3            1
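If you want an actual two-way layout (one variable as rows, the other as columns), pd.crosstab is probably what you want:
pd.crosstab(df['var1'], df['var2'])                      # counts
pd.crosstab(df['var1'], df['var2'], normalize='index')   # row proportions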
Graphing
- Must-read explanation of matplotlib – read this before you start making graphs.
Basic setup:
import matplotlib.pyplot as plt

# http://matplotlib.org/faq/usage_faq.html
fig = plt.figure(figsize=(6*1.5, 4*1.5))
ax = fig.add_subplot(1, 1, 1)
ax.hist(df['col'], color="k", alpha=0.4)
Axis titles:
plt.ylabel("Y axis title here")
Plot title:
ax.set_title('Plot title here')
Axis range:
ax.set_xlim([min, max])
Setting figure size:
fig = plt.figure(figsize=(6*1.5, 4*1.5))
Histograms with evenly spaced bins for integers:
ax.hist(df['col'], color="k", alpha=0.4, bins=range(0, max(df['col']) + 10, 10))
- 10 is the bin width
- More info is available on StackOverflow, including how to do this with floats
Use GridSpec for easy multi-graph figures, where graphs have different sizes.
For example:
import matplotlib.gridspec as gridspec

# http://bit.ly/1U6YKGm
fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2, height_ratios=[3, 1])
ax = plt.subplot(gs[0])
ax.hist(...)
...
This is for a 2x2 grid of graphs. The top two graphs will be taller than the bottom two.
See this blog post for more information and full example code.
Common axis labels for plots with subplots:
fig.text(0.5, 0.04, 'common xlabel', ha='center', va='center')
fig.text(0.06, 0.5, 'common ylabel', ha='center', va='center', rotation='vertical')
Add more horizontal padding between rows of subplots:
fig.subplots_adjust(hspace=0.35)
Saving a figure to a PDF
from matplotlib.backends.backend_pdf import PdfPages

ppp = PdfPages('/path/to/file.pdf')
ppp.savefig(fig)
ppp.close()
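For a single-page PDF, fig.savefig is simpler; PdfPages is mainly useful when you want several figures in one file:
fig.savefig('/path/to/file.pdf', bbox_inches='tight')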
Keep a plot from automatically showing in a notebook:
plt.close()
Common graphs with Seaborn
Seaborn is typically imported as import seaborn as sns.
Time series line graph
This produces a line graph with reasonable date formatting for x axis labels:
import matplotlib as mpl

fig, ax = plt.subplots()
sns.tsplot(df['value_colname'], time=df['time_varname'], ax=ax)
ax.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, _: pd.to_datetime(x)))
fig.autofmt_xdate()
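Note that sns.tsplot was deprecated and then removed (Seaborn 0.11+); a rough equivalent with sns.lineplot, assuming the same hypothetical column names, is:
fig, ax = plt.subplots()
sns.lineplot(data=df, x='time_varname', y='value_colname', ax=ax)
fig.autofmt_xdate()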
IPython notebooks
- “Line magics” (like %load and %run) – I definitely recommend reading through the official docs or this blog post
  - %pdb will open an interactive prompt when there’s an exception
- Have None be the last line in the cell to avoid junk output like Out[1]: <matplotlib.text.Text at 0x115933850>
- Want to automatically open Chrome when you run jupyter notebook? Add this to /Users/you/.jupyter/jupyter_notebook_config.py: c.NotebookApp.browser = u'/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome %s'. If you don’t have that config file, jupyter notebook --generate-config will generate it.
General Python stuff that might be useful
- List of useful, less-known features of Python 3 – there is some great stuff in here!
- A gentle guide to the Python features that I didn’t know existed or was too afraid to use.
- Explanation of import, etc.
- Static, class, and abstract methods
- Merging dicts
- Magic methods
- Iterating over a dict: for k, v in {'key': 'value'}.iteritems() (in Python 3, use .items())
- Print rounded float as a string with 3 decimals: "%.3f" % float_var
Write a string to a file:
with open("Output.txt", "w") as text_file:
    text_file.write("Purchase Amount: %s" % TotalAmount)
Regular expressions:
- Check for match: if re.compile("^.*$").match(str):
- Quick way to get a match out of a string: re.compile("regexp(.*)").search(str).groups()[0]
Interactive breakpoint (works when running python myprogram.py):
import pdb  # at top of file

pdb.set_trace()  # where you want the breakpoint
- If you want syntax highlighting in the debugger and tab completion, use pdb++, which is a drop-in replacement for pdb. You literally pip install pdbpp and change nothing else.
Colors in stack traces when you run python somefile.py in the Terminal:
- First, pip install colored-traceback
- Then, add this code to your Python file:
try:
    import colored_traceback.auto
except ImportError:
    pass
For debugging, awesome_print can be very helpful.
# First, `pip install awesome_print`
from awesome_print import ap

ap({"test": [1, 2, 3]})
Or, to avoid having to use some variation of print for debugging, you can use pysnooper.
Raising a generic exception: raise Exception("message")
Getting a datetime with the correct local timezone (Python 2):
import datetime as dt
from tzlocal import get_localzone
from pytz import timezone

dt.datetime.now(timezone("UTC")).astimezone(get_localzone()).strftime("%Y-%m-%dT%H:%M:%S %Z")
Zero-padding a number:
'{0:05d}'.format(n) # If n=3, this will produce '00003'
Markdown parsing: there are a bunch of different packages, but it looks like mistletoe is a good choice.
Logging
The basic logging tutorial in the official Python docs is a good place to start. Here’s an example from there:
import logging
logging.basicConfig(filename='example.log',level=logging.DEBUG)
logging.debug('This message should go to the log file')
logging.info('So should this')
logging.warning('And this, too')
Testing
Python comes with built-in testing capabilities.
import unittest

class TestSomething(unittest.TestCase):
    def setUp(self):
        # Runs before each test method
        pass

    def tearDown(self):
        # Post-test clean-up, run after each test method
        pass

    def test_a_thing(self):
        self.assertEqual(1, 1)
There are a bunch of built-in assertSomething methods:
assertDictEqual, assertEqual, assertTrue, assertFalse, assertGreater, assertGreaterEqual, assertIn, assertIs, assertIsInstance, assertIsNone, assertIsNot, assertIsNotNone, assertItemsEqual, assertLess, assertLessEqual, assertListEqual, assertMultiLineEqual, assertNotAlmostEqual, assertNotEqual, assertTupleEqual, assertRaises, assertRaisesRegexp, assertRegexpMatches
Tests can be skipped with the @unittest.skip("message here") decorator.
Tests can automatically be run once with nosetests -v (put your tests in tests/, start each filename with test_, and each test method with test_).
To get some color in the test output, pip install pinocchio and then run nosetests -v --spec-color --with-spec.
I’m also trying out nose-progressive for colorizing nosetests tracebacks: pip install nose-progressive and then nosetests -v --with-progressive --logging-clear-handlers. This is not compatible with pinocchio, and it seems to crash sniffer.
To automatically run tests when a file changes, pip install sniffer and then run sniffer -x--spec-color -x--with-spec or sniffer -x--with-progressive.
To get a debugger to run from within a test, pip install nose2 and then run nose2. If you pdb.set_trace() in a test, the debugger will open.
Profiling
Profiling can help to determine which parts of a program are running slower than you want.
As a first pass, adding in execution time logging can be helpful (printing out how long it takes a method to run). This can be done with a decorator.
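A minimal hand-rolled version of such a timing decorator (the linked decorator is more full-featured); this is just a sketch:
import functools
import time

def log_time(func):
    """Print how long each call to `func` takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print("%s took %.3f seconds" % (func.__name__, time.time() - start))
        return result
    return wrapper

@log_time
def slow_function():
    time.sleep(1)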
There is also a decorator for more detailed profiling. It produces output like:
*** PROFILER RESULTS ***
silly_fibonacci_example (sample.py:6)
function called 109 times
325 function calls (5 primitive calls) in 0.004 CPU seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
108/2 0.001 0.000 0.004 0.002 profilehooks.py:79(<lambda>)
108/2 0.001 0.000 0.004 0.002 profilehooks.py:131(__call__)
109/1 0.001 0.000 0.004 0.004 sample.py:6(silly_fibonacci_example)
0 0.000 0.000 profile:0(profiler)
There is also line-by-line profiling that provides output like this:
Pystone(1.1) time for 50000 passes = 2.48
This machine benchmarks at 20161.3 pystones/second
Wrote profile results to pystone.py.lprof
Timer unit: 1e-06 s
File: pystone.py
Function: Proc2 at line 149
Total time: 0.606656 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
149 @profile
150 def Proc2(IntParIO):
151 50000 82003 1.6 13.5 IntLoc = IntParIO + 10
152 50000 63162 1.3 10.4 while 1:
153 50000 69065 1.4 11.4 if Char1Glob == 'A':
154 50000 66354 1.3 10.9 IntLoc = IntLoc - 1
155 50000 67263 1.3 11.1 IntParIO = IntLoc - IntGlob
156 50000 65494 1.3 10.8 EnumLoc = Ident1
157 50000 68001 1.4 11.2 if EnumLoc == Ident1:
158 50000 63739 1.3 10.5 break
159 50000 61575 1.2 10.1 return IntParIO
Tools and libraries of interest
- Bokeh is a library for interactive visualizations.
- Rodeo is a data science IDE for Python.
- Seaborn is a high level Python visualization library based on matplotlib.
- Plot.ly is a web application for making charts with a Python API.
- Records for accessing data in MySQL
- ftfy for fixing text encoding
Installing Recent Python on OS X
I used to use these instructions to set up pyenv, but pyenv can cause some problems. Handling Python dependency requirements is difficult enough already that I don’t want any extra complexity on top of pip (the package manager) and virtualenv (separate environments with different installed dependencies).
So here is what I’m doing now:
- Install Python 2.x.x from Homebrew to get an up-to-date version: brew install python --enable-framework (the --enable-framework is an attempt to get around this numpy issue, but it doesn’t appear to work)
- Install virtualenv with brew install virtualenv.
- Create a folder called ~/.virtualenvs to store all my virtualenvs (each one is a subfolder).
  - It doesn’t really matter what I call this, but putting it in ~/ (my user’s home folder) is convenient and using a .something folder will hide it in Finder.
- Add the following to my ~/.zshrc file (or ~/.bashrc if I didn’t use zsh):
# pip should only run if there is a virtualenv currently activated
export PIP_REQUIRE_VIRTUALENV=true

gpip(){
    PIP_REQUIRE_VIRTUALENV="" pip "$@"
}

pyenv() {
    source ~/.virtualenvs/$@/bin/activate
}

newpyenv() {
    cd ~/.virtualenvs/ && virtualenv $@ && cd - && source ~/.virtualenvs/$@/bin/activate
}
- This will prevent pip install ... from working if I’m not in a virtualenv, which prevents me from accidentally installing anything globally (I do this all the time).
  - To get around this restriction, use gpip install ...
  - To activate a virtualenv named gorbypuff, run pyenv gorbypuff
  - To create this virtualenv and activate it, run newpyenv gorbypuff
  - Note that I use pyenv and newpyenv for these function names – and this is not the same as the pyenv that I mentioned ditching at the beginning of this section. I just find it easy to remember pyenv so that’s what I use.
- Set up virtualenv using system Python so numpy will work: virtualenv -p /usr/bin/python2.7 ~/.virtualenvs/data-python2.7sys (this doesn’t seem to work either)
- Update pip: pip install --upgrade pip
- Install the scipy stack: pip install numpy scipy matplotlib ipython jupyter pandas sympy nose
- Install some additional packages I often use: pip install nose-progressive pdbpp colored-traceback awesome_print
Resources
- Chris Albon’s website is a great resource
- Example IPython notebooks showing a bunch of data science techniques
- Stanford introduction to scientific Python short course
- PyFormat: documentation for Python’s string formatters
- strftime.org: documentation for Python’s datetime formatter
- SQL style guide (note that DataGrip is the only linter/formatter I’ve found that comes close to matching this style guide’s recommendations)
Thanks
The following people have helped with this:
Suggestions? Corrections?
Please contact me.