SciPy (Scientific Python): Tips & Snippets
Last updated: January 18, 2021
ℹ️ I wrote these notes for myself as a reference, and published them here in case they are helpful for others.
Table of contents:
- Initial setup for IPython notebooks
- Pandas DataFrames
- Importing data
- Exporting data
- Feather
- DataFrame information
- Selecting columns and rows
- Selecting rows with a MultiIndex
- Renaming columns and rows
- Deleting columns and rows
- Recoding data
- Subsetting data
- Aggregating data
- Other data manipulation
- Getting the previous row value for long longitudinal data
- Missing data
- Duplicate data
- Merging data
- Appending data frames
- Recoding variables
- Returning a view vs. a copy (avoiding SettingWithCopy warnings)
- Miscellaneous
- Scipy
- Statistics
- Graphing
- IPython notebooks
- General Python stuff that might be useful
- Tools and libraries of interest
- Installing Recent Python on OS X
- Resources
- Thanks
- Suggestions? Corrections?
Initial setup for IPython notebooks
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
This is the minimum of what is needed. For more, see ipython-setup.
You also might want some notebook extensions:
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
Pandas DataFrames
Assume there is a DataFrame called df.
Importing data
To get this DataFrame, you can import from…
- CSV:
df = pd.read_csv("path/to/csv.csv")
- Stata:
df = pd.read_stata('path/to/stata_data.dta')
- Excel:
df = pd.read_excel('/path/to/excel.xls')
- You will need to first pip install xlrd and then import xlrd for this to work.
Reading from MySQL
from sqlalchemy import create_engine
engine = create_engine('mysql://root:password@localhost/database')
sql = "select * from tablename"
df = pd.read_sql(sql, engine)
Empty dataframes
If you want to make an empty dataframe and then fill it in, do this:
df = pd.DataFrame()
df = df.append(pd.DataFrame({
"column1": [1],
"column2": [2]
}), ignore_index = True)
That will add one row to df where column1 = 1 and column2 = 2.
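Note: in pandas 1.4 and later, DataFrame.append is deprecated (and removed in 2.0); pd.concat does the same job. A minimal sketch:
import pandas as pd

df = pd.DataFrame()
new_row = pd.DataFrame({"column1": [1], "column2": [2]})
# Same effect as df.append(..., ignore_index=True)
df = pd.concat([df, new_row], ignore_index=True)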
Exporting data
- To CSV:
df.to_csv(file_name, encoding='utf-8')
- To Stata:
df.to_stata("/path/to/stata/file.dta")
Tips for exporting to Stata
Pandas is fussy about df.to_stata(...). Here are some workarounds for common problems:
- Exporting datetime columns:
df.to_stata('f.dta', convert_dates={'datetime_col_name': 'tc'})
- See the docs for a full list of alternatives to tc.
- “Writing general object arrays is not supported” error:
- Figure out which of your columns are of object type: list(df.select_dtypes(include=['object']).columns)
- Convert them to something else: df['col'] = df['col'].astype(str)
This code may help:
# Attempt to auto-convert columns from `object` to a more specific type.
# Note that string columns may still have `dtype=object`, but will still be able
# to export to Stata format.
type_pref = [int, float, str]
for colname in list(df.select_dtypes(include=['object']).columns):
    for t in type_pref:
        try:
            df[colname] = df[colname].astype(t)
            break  # stop at the first type that works
        except (ValueError, TypeError):
            pass
Feather
feather (GitHub) is a “fast on-disk format for data frames.” It is similar to pickle, but for data frames.
After pip install feather-format, you can:
# Writing to a Feather file
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
# Reading from a Feather file
df = feather.read_dataframe(path)
For long-running reads from external sources, this pattern might be helpful:
import hashlib # for md5 checksum
import os.path # for checking if file exists
data_source = "data/source.xlsx"
md5 = hashlib.md5(data_source.encode()).hexdigest()  # hash the path string (md5 needs bytes in Python 3)
data_source_feather = "data/source-%s.feather" % md5
if os.path.isfile(data_source_feather):
print("Loading %s..." % data_source_feather)
df = feather.read_dataframe(data_source_feather)
else:
df = pd.read_excel(data_source)
feather.write_dataframe(df, data_source_feather)
The first time this code runs, it will be slow as Pandas reads from the original data source. Subsequent runs should be very fast.
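Note that recent versions of pandas can also read and write Feather files directly (backed by pyarrow), so the standalone feather package may not be necessary:
# Requires pyarrow: pip install pyarrow
df.to_feather("my_data.feather")
df = pd.read_feather("my_data.feather")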
DataFrame information
- Get dimensions:
df.shape
- Get list of columns:
df.columns
Selecting columns and rows
- Select columns: df[['col1', 'col2', ...]]
- Selecting rows:
  - df.loc[...] is used to select based on labels in the index
  - df.iloc[...] is used to select based on integer positions in the index
  - df.ix[...] tries to use labels, and falls back to positions if the label is not in the index. Only use this when you need to mix labels and positions (like selecting rows based on label and columns based on position). Note that .ix is deprecated in recent versions of Pandas.
- Selecting columns and filtering rows at the same time: df[['col1', 'col2', ...]][df['col1'] == "criteria"]
- If you want to modify cells, you have to use loc, iloc, or ix to avoid this warning: “A value is trying to be set on a copy of a slice from a DataFrame” (see the short sketch after this list).
  - Syntax: df.ix[df['column_name'] < 1, ['column_to_change']] = 2
- Selecting a specific cell by row and column name: df.loc['row name', 'col name']
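A quick sketch of the label vs. position distinction, using a made-up DataFrame:
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30]}, index=["a", "b", "c"])

df.loc["b", "col1"]                    # 20 -- selects by index label
df.iloc[1, 0]                          # 20 -- selects by integer position
df.loc[df["col1"] > 10, "col1"] = 99   # filter and assign in one .loc call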
Selecting rows with a MultiIndex
Here are some examples of how to select specific rows with a two-level index (also known as a hierarchical index or a MultiIndex):
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({("bar", "a"): {"col1": 1}, ("bar", "b"): {"col1": 2}, ("baz", "a"): {"col1": 3}, ("baz", "b"): {"col1": 4}}).T
In [3]: df
Out[3]:
col1
bar a 1
b 2
baz a 3
b 4
# Select a specific row
In [4]: df.loc[("bar", "a"), :]
Out[4]:
col1 1
Name: (bar, a), dtype: int64
# Select all rows with a specific top-level index
In [5]: df.loc["bar", :]
Out[5]:
col1
a 1
b 2
# Select all rows with a specific second-level index
In [6]: df.loc[(slice(None), "a"), :]
Out[6]:
col1
bar a 1
baz a 3
Renaming columns and rows
- Rename columns:
df.columns = ["col 1 name", ...]
- Renaming a single column:
df.rename(columns={'oldname': 'newname'}, inplace=True)
- Renaming rows: df = df.set_index('column name'), which will cause the rows to be labeled with the values of the specified column. You may have to create a new column first to get what you want.
- Removing the label for an index (this is the label that appears above row labels, which I think is confusing): df.index.names = [None]
- Round everything to 1 decimal place:
np.round(df, decimals=1)
Deleting columns and rows
- Delete a column: df = df.drop('column_name', axis=1)
Recoding data
- Based on value:
df['var'] = df['var'].replace('oldval', 'newval')
Subsetting data
- Unique values from a column: np.unique(df['colname']) or df.column.unique()
- Filtering rows using a list of values: df[df['col'].isin(whitelist)] will yield all the rows in df where the value for col appears in whitelist = pd.Series(...)
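A small sketch of isin with a hypothetical whitelist, including the negation:
whitelist = ['a', 'b']
df[df['col'].isin(whitelist)]    # rows where col is in the whitelist
df[~df['col'].isin(whitelist)]   # rows where col is NOT in the whitelist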
Aggregating data
I often find myself wanting to sum a variable by group. For example, I might want the number of records for each day in the dataset.
This can be accomplished easily with the following Stata code:
gen one = 1 collapse (sum) one, by(datevar)
Translated to Python, this is:
collapsed = df.groupby(['date']).size()
If you aggregate by a variable with a datetime type, and you want to access specific values, you can use collapsed[pd.datetime(2016,1,1).date()].
You can aggregate with a custom method like this:
df.groupby('group_by_this_column_name')['aggregate_this_column_name'].agg({'output_column_name': custom_method})
The custom method should have one argument, which is a Pandas Series object containing the values of aggregate_this_column_name for a given value of group_by_this_column_name.
You can aggregate multiple columns with different methods like this:
aggregate = ['aggregate1', 'aggregate2']
df.groupby('group_by_this_column_name')[aggregate].agg({
    'aggregate1': np.sum,
    'aggregate2': (lambda x: x.iloc[0])  # Gets first value
})
If you want to run multiple aggregation methods for the same column or just show more descriptive names, you can do this:
aggregate = ['aggregate1', 'aggregate2']
df.groupby('group_by_this_column_name')[aggregate].agg({
    'aggregate1': {'sum': np.sum},
    'aggregate2': {'sum': np.sum, 'first_value': (lambda x: x.iloc[0])}  # Gets first value
})
Adding a running count column by group (first row for a given groupvar will be 0, second will be 1, etc.):
df['running_count'] = df.groupby(['groupvar']).cumcount()
Adding a total count column by group (if there are 3 rows with a given groupvar, all 3 rows will have total_counts set to 3):
total_counts = df.groupby(['groupvar']).size()
df = df.set_index('groupvar')
df['total_counts'] = total_counts
df = df.reset_index()
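A shorter alternative for the total-count column, assuming a reasonably recent pandas, uses groupby(...).transform and avoids the set_index / reset_index round trip:
import pandas as pd

df = pd.DataFrame({'groupvar': ['a', 'a', 'a', 'b'], 'x': [1, 2, 3, 4]})
df['running_count'] = df.groupby('groupvar').cumcount()              # 0, 1, 2, 0
df['total_counts'] = df.groupby('groupvar')['x'].transform('size')   # 3, 3, 3, 1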
Other data manipulation
- Concatenate two columns (a string and an integer): df['bar'] = df['bar'].str.cat(df['foo'].values.astype(str), sep=' is ') (source)
- Change type of column: df['col'] = df['col'].astype(str)
  - For changing to numeric, this works on recent versions of Pandas: df['col'] = pd.to_numeric(df['col'])
- Changing the case of a string column: df['col'] = df['col'].str.upper() or df['col'] = df['col'].apply(lambda x: x.upper()) if you need to deal with unicode strings. Use lower() to lowercase rather than uppercase.
- Iterating over rows in a DataFrame: for index, row in df.iterrows():
- Sorting: df.sort_values(['col1', 'col2'], ascending=[True, False], inplace=True)
Getting the previous row value for long longitudinal data
Sometimes it’s helpful to get a value from row i-1 in row i with long data. This would fill in the prev_value column in the table below.
id | obs_num | value | prev_value
---|---------|-------|-----------
1  | 1       | 7     | NaN
1  | 2       | 10    | 7
1  | 3       | 18    | 10
2  | 1       | 3     | NaN
To do this:
df = pd.DataFrame([
{'id': 1, 'obs_num': 1, 'value': 7},
{'id': 1, 'obs_num': 2, 'value': 10},
{'id': 1, 'obs_num': 3, 'value': 18},
{'id': 2, 'obs_num': 1, 'value': 3},
])
df['prev_value'] = df.groupby('id')['value'].shift()
# In : df
# Out:
# id obs_num value prev_value
# 0 1 1 7 NaN
# 1 1 2 10 7.0
# 2 1 3 18 10.0
# 3 2 1 3 NaN
Missing data
- Get rows where a numeric column is missing: df[pd.isnull(df['colname'])]
  - pd.notnull can be used in the same way for the opposite effect
- The fillna() method can be used to fill missing data.
  - Fill with 0s: df['col1'] = df['col1'].fillna(0)
  - Fill with previous value: df['col1'] = df['col1'].fillna(method='ffill')
Duplicate data
- Dropping duplicates: df = df.drop_duplicates(['col_with_duplicates'], keep='first')
  - Options for keep= are 'first', 'last', and False (which drops all duplicates)
Counting duplicates of the combination of col1 and col2:
counts = df.groupby(['col1', 'col2']).size().reset_index().rename(columns={0: 'count'})
print(counts[counts['count'] > 1])
Merging data
- The Pandas documentation has a good explanation of this.
- Chris Albon has lots of good examples of merges and the necessary code
- Merging in Pandas uses SQL terminology like left, right, outer, and inner.
Syntax from the Pandas docs:
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
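A minimal left-join sketch with made-up columns (every row of left is kept; unmatched ids get NaN):
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2, 4], "score": [10, 20, 40]})

merged = left.merge(right, how="left", on="id")
# The indicator=True option adds a _merge column showing which rows matched,
# similar to Stata's _merge variable.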
Appending data frames
df_both = df.append(df2)
Recoding variables
df['recoded'] = df['original'].map({"original_value_1": "new1", "original_value_2": "new2"})
Returning a view vs. a copy (avoiding SettingWithCopy warnings)
See this pandas documentation page. This general discussion about names and values in Python may also be helpful.
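The short version, as a sketch with hypothetical column names: chained indexing may assign into a temporary copy, while a single .loc call filters and assigns in one step.
# May raise SettingWithCopyWarning (and may not modify df at all):
df[df['col1'] > 0]['col2'] = 5

# Safe: filter and assign in one .loc call
df.loc[df['col1'] > 0, 'col2'] = 5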
Miscellaneous
df.T will transpose the DataFrame
Scipy
Truncated normal distribution:
from scipy import stats

# Set up a truncated normal distribution
lower = 0.2
upper = 0.8
mu = 0.5
sigma = 1
distribution = stats.truncnorm((lower - mu) / sigma, (upper - mu) / sigma, loc=mu, scale=sigma)

# Draw once from distribution
draw = distribution.rvs(1)[0]
Statistics
- Frequencies table for a column:
df['column_name'].value_counts()
- Descriptive statistics for a column:
df['column_name'].describe()
Two-way tables:
grouped = df.groupby(['var1', 'var2'])
grouped.size()

# Output:
# var1    var2
# value1  value1    3
# value1  value2    1
# value3            1
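If you want an actual two-way layout (one variable as rows, the other as columns), pd.crosstab is probably what you want:
pd.crosstab(df['var1'], df['var2'])                      # counts
pd.crosstab(df['var1'], df['var2'], normalize='index')   # row proportions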
Graphing
- Must-read explanation of matplotlib – read this before you start making graphs.
Basic setup:
import matplotlib.pyplot as plt

# http://matplotlib.org/faq/usage_faq.html
fig = plt.figure(figsize=(6*1.5, 4*1.5))
ax = fig.add_subplot(1, 1, 1)
ax.hist(df['col'], color="k", alpha=0.4)
Axis titles:
plt.ylabel("Y axis title here")
Plot title:
ax.set_title('Plot title here')
Axis range:
ax.set_xlim([min, max])
Setting figure size:
fig = plt.figure(figsize=(6*1.5, 4*1.5))
Histograms with evenly spaced bins for integers:
ax.hist(df['col'], color="k", alpha=0.4, bins=range(0, max(df['col']) + 10, 10))
- 10 is the bin width
- More info is available on StackOverflow, including how to do this with floats
Use GridSpec for easy multi-graph figures, where graphs have different sizes.
For example:
import matplotlib.gridspec as gridspec

# http://bit.ly/1U6YKGm
fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2, height_ratios=[3, 1])
ax = plt.subplot(gs[0])
ax.hist(...)
...
This is for a 2x2 grid of graphs. The top two graphs will be taller than the bottom two.
See this blog post for more information and full example code.
Common axis labels for plots with subplots:
fig.text(0.5, 0.04, 'common xlabel', ha='center', va='center')
fig.text(0.06, 0.5, 'common ylabel', ha='center', va='center', rotation='vertical')
Add more horizontal padding between rows of subplots:
fig.subplots_adjust(hspace=0.35)
Saving a figure to a PDF
from matplotlib.backends.backend_pdf import PdfPages

ppp = PdfPages('/path/to/file.pdf')
ppp.savefig(fig)
ppp.close()
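For a single-page PDF, fig.savefig is simpler; PdfPages is mainly useful when you want several figures in one file:
fig.savefig('/path/to/file.pdf', bbox_inches='tight')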
Keep a plot from automatically showing in a notebook:
plt.close()
Common graphs with Seaborn
Seaborn is typically imported as import seaborn as sns.
Time series line graph
This produces a line graph with reasonable date formatting for x axis labels:
import matplotlib as mpl

fig, ax = plt.subplots()
sns.tsplot(df['value_colname'], time=df['time_varname'], ax=ax)
ax.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, _: pd.to_datetime(x)))
fig.autofmt_xdate()
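Note that sns.tsplot was deprecated and then removed (Seaborn 0.11+); a rough equivalent with sns.lineplot, assuming the same hypothetical column names, is:
fig, ax = plt.subplots()
sns.lineplot(data=df, x='time_varname', y='value_colname', ax=ax)
fig.autofmt_xdate()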
IPython notebooks
- “Line magics” (like %load and %run) – I definitely recommend reading through the official docs or this blog post
  - %pdb will open an interactive prompt when there’s an exception
- Have None be the last line in the cell to avoid junk output like Out[1]: <matplotlib.text.Text at 0x115933850>
- Want to automatically open Chrome when you run jupyter notebook? Add this to /Users/you/.jupyter/jupyter_notebook_config.py: c.NotebookApp.browser = u'/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome %s'. If you don’t have that config file, jupyter notebook --generate-config will generate it.
General Python stuff that might be useful
- List of useful, less-known features of Python 3 – there is some great stuff in here!
- A gentle guide to the Python features that I didn’t know existed or was too afraid to use.
- Explanation of import, etc.
- Static, class, and abstract methods
- Merging dicts
- Magic methods
- Iterating over a dict: for k, v in {'key': 'value'}.iteritems() (in Python 3, use .items())
- Print rounded float as a string with 3 decimals: "%.3f" % float_var
Write a string to a file:
with open("Output.txt", "w") as text_file:
    text_file.write("Purchase Amount: %s" % TotalAmount)
Regular expressions:
- Check for match: if re.compile("^.*$").match(str):
- Quick way to get a match out of a string: re.compile("regexp(.*)").search(str).groups()[0]
Interactive breakpoint (works when running python myprogram.py):
import pdb  # at top of file

pdb.set_trace()  # where you want the breakpoint
- If you want syntax highlighting in the debugger and tab completion, use pdb++, which is a drop-in replacement for pdb. You literally pip install pdbpp and change nothing else.
Colors in stack traces when you run python somefile.py in the Terminal:
- First, pip install colored-traceback
- Then, add this code to your Python file:
try:
    import colored_traceback.auto
except ImportError:
    pass
For debugging, awesome_print can be very helpful.
# First, `pip install awesome_print`
from awesome_print import ap

ap({"test": [1, 2, 3]})
Or, to avoid having to use some variation of print for debugging, you can use pysnooper.
Raising a generic exception: raise Exception("message")
Getting a datetime with the correct local timezone (Python 2):
import datetime as dt
from tzlocal import get_localzone
from pytz import timezone

dt.datetime.now(timezone("UTC")).astimezone(get_localzone()).strftime("%Y-%m-%dT%H:%M:%S %Z")
Zero-padding a number:
'{0:05d}'.format(n) # If n=3, this will produce '00003'
Markdown parsing: there are a bunch of different packages, but it looks like mistletoe is a good choice.
Logging
The basic logging tutorial in the official Python docs is a good place to start. Here’s an example from there:
import logging
logging.basicConfig(filename='example.log',level=logging.DEBUG)
logging.debug('This message should go to the log file')
logging.info('So should this')
logging.warning('And this, too')
Testing
Python comes with built-in testing capabilities.
import unittest

class TestSomething(unittest.TestCase):
    def setUp(self):
        # Runs before each test method
        pass

    def tearDown(self):
        # Post-test clean-up, run after each test method
        pass

    def test_a_thing(self):
        self.assertEqual(1, 1)
There are a bunch of built-in assertSomething methods:
assertDictEqual, assertEqual, assertTrue, assertFalse, assertGreater, assertGreaterEqual, assertIn, assertIs, assertIsInstance, assertIsNone, assertIsNot, assertIsNotNone, assertItemsEqual, assertLess, assertLessEqual, assertListEqual, assertMultiLineEqual, assertNotAlmostEqual, assertNotEqual, assertTupleEqual, assertRaises, assertRaisesRegexp, assertRegexpMatches
Tests can be skipped with the @unittest.skip("message here") decorator.
Tests can automatically be run once with nosetests -v (put your tests in tests/, start each filename with test_, and each test method with test_).
To get some color in the test output, pip install pinocchio and then run nosetests -v --spec-color --with-spec.
I’m also trying out nose-progressive for colorizing nosetests tracebacks: pip install nose-progressive and then nosetests -v --with-progressive --logging-clear-handlers. This is not compatible with pinocchio, and it seems to crash sniffer.
To automatically run tests when a file changes, pip install sniffer and then run sniffer -x--spec-color -x--with-spec or sniffer -x--with-progressive.
To get a debugger to run from within a test, pip install nose2 and then run nose2. If you pdb.set_trace() in a test, the debugger will open.
Profiling
Profiling can help to determine which parts of a program are running slower than you want.
As a first pass, adding in execution time logging can be helpful (printing out how long it takes a method to run). This can be done with a decorator.
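A minimal hand-rolled version of such a timing decorator (the linked decorator is more full-featured); this is just a sketch:
import functools
import time

def log_time(func):
    """Print how long each call to `func` takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print("%s took %.3f seconds" % (func.__name__, time.time() - start))
        return result
    return wrapper

@log_time
def slow_function():
    time.sleep(1)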
There is also a decorator for more detailed profiling. It produces output like:
*** PROFILER RESULTS ***
silly_fibonacci_example (sample.py:6)
function called 109 times
325 function calls (5 primitive calls) in 0.004 CPU seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
108/2 0.001 0.000 0.004 0.002 profilehooks.py:79(<lambda>)
108/2 0.001 0.000 0.004 0.002 profilehooks.py:131(__call__)
109/1 0.001 0.000 0.004 0.004 sample.py:6(silly_fibonacci_example)
0 0.000 0.000 profile:0(profiler)
There is also line-by-line profiling that provides output like this:
Pystone(1.1) time for 50000 passes = 2.48
This machine benchmarks at 20161.3 pystones/second
Wrote profile results to pystone.py.lprof
Timer unit: 1e-06 s
File: pystone.py
Function: Proc2 at line 149
Total time: 0.606656 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
149 @profile
150 def Proc2(IntParIO):
151 50000 82003 1.6 13.5 IntLoc = IntParIO + 10
152 50000 63162 1.3 10.4 while 1:
153 50000 69065 1.4 11.4 if Char1Glob == 'A':
154 50000 66354 1.3 10.9 IntLoc = IntLoc - 1
155 50000 67263 1.3 11.1 IntParIO = IntLoc - IntGlob
156 50000 65494 1.3 10.8 EnumLoc = Ident1
157 50000 68001 1.4 11.2 if EnumLoc == Ident1:
158 50000 63739 1.3 10.5 break
159 50000 61575 1.2 10.1 return IntParIO
Tools and libraries of interest
- Bokeh is a library for interactive visualizations.
- Rodeo is a data science IDE for Python.
- Seaborn is a high level Python visualization library based on matplotlib.
- Plot.ly is a web application for making charts with a Python API.
- Records for accessing data in MySQL
- ftfy for fixing text encoding
Installing Recent Python on OS X
I used to use these instructions to set up pyenv, but pyenv can cause some problems. Handling Python dependency requirements is difficult enough already that I don’t want any extra complexity on top of pip (the package manager) and virtualenv (separate environments with different installed dependencies).
So here is what I’m doing now:
- Install Python 2.x.x from Homebrew to get an up-to-date version: brew install python --enable-framework (the --enable-framework is an attempt to get around this numpy issue, but it doesn’t appear to work)
- Install virtualenv with brew install virtualenv.
- Create a folder called ~/.virtualenvs to store all my virtualenvs (each one is a subfolder).
  - It doesn’t really matter what I call this, but putting it in ~/ (my user’s home folder) is convenient and using a .something folder will hide it in Finder.
- Add the following to my ~/.zshrc file (or ~/.bashrc if I didn’t use zsh):
# pip should only run if there is a virtualenv currently activated
export PIP_REQUIRE_VIRTUALENV=true

gpip(){
    PIP_REQUIRE_VIRTUALENV="" pip "$@"
}

pyenv() {
    source ~/.virtualenvs/$@/bin/activate
}

newpyenv() {
    cd ~/.virtualenvs/ && virtualenv $@ && cd - && source ~/.virtualenvs/$@/bin/activate
}
- This will prevent pip install ... from working if I’m not in a virtualenv, which prevents me from accidentally installing anything globally (I do this all the time).
  - To get around this restriction, use gpip install ...
  - To activate a virtualenv named gorbypuff, run pyenv gorbypuff
  - To create this virtualenv and activate it, run newpyenv gorbypuff
  - Note that I use pyenv and newpyenv for these function names – and this is not the same as the pyenv that I mentioned ditching at the beginning of this section. I just find it easy to remember pyenv so that’s what I use.
- Set up virtualenv using system Python so numpy will work: virtualenv -p /usr/bin/python2.7 ~/.virtualenvs/data-python2.7sys (this doesn’t seem to work either)
- Update pip: pip install --upgrade pip
- Install the scipy stack: pip install numpy scipy matplotlib ipython jupyter pandas sympy nose
- Install some additional packages I often use: pip install nose-progressive pdbpp colored-traceback awesome_print
Resources
- Chris Albon’s website is a great resource
- Example IPython notebooks showing a bunch of data science techniques
- Stanford introduction to scientific Python short course
- PyFormat: documentation for Python’s string formatters
- strftime.org: documentation for Python’s datetime formatter
- SQL style guide (note that DataGrip is the only linter/formatter I’ve found that comes close to matching this style guide’s recommendations)
Thanks
The following people have helped with this:
Suggestions? Corrections?
Please contact me.