A VIEW FROM INSIDE THE DEMAND SPHERE: EXPLORATORY DATA ANALYSIS IN PANDAS WITH MASKS

Often at GrowthIntel when we receive a list of prospects from a new client the first thing we do, once it has been matched against our datasets, is some exploratory data analysis.

We’ve found that iPython Notebook (or rather Jupyter Notebook) combined with pandas and Matplotlib is an excellent combination which allows us to slice, transform and query the data with the all the power of Python and pandas and also produce a document with plots and figures that can easily be communicated with the rest of the team.

Pandas Boolean Masks

A common pattern to our work is to compare sets of rows against each other for a certain data field. For example, say we have a pandas DataFrame (df) with 1000 companies where there is a label field describing which set that particular company falls into (usually a stage in the sales pipeline e.g. Customer or Disqualified Prospect) and the remaining fields contain data or “features” related to the company.

Now imagine we want to compare a feature, say the number of import events num_import_events, across the different company sets. Indexing DataFrames with boolean masks allow us to down-select rows according to the values of entries in a field like so:

customers_df = df[ df.label == “customer” ]

We can then do some analysis on this new DataFrame:

customers_df.num_import_events.hist()

If we repeat this for other label values and tell pandas to plot the histograms on the same axis, we can visualise the distribution of this feature across different company sets. If we want to do this for, say, 5 – 10 features, however, our notebooks will end up being littered with boilerplate code.

Lots of boilerplate

Mask Objects

Our solution to this problem is a Python class to wrap boolean masks, Mask, and comparison functions which accept a list of Mask objects as an argument. Mask objects are instantiated with two arguments: a function that returns boolean masks and a name to describe what subset the Mask is attempting to index.

customer_mask = Mask(lambda frame: frame.label == “customer”, “Customers”)

The reason we use a function rather than a ‘static’ mask is that the function can be applied to any DataFrame rather than it being inseparably tied to one.

We use Mask objects to index DataFrames like so:

customer_df = df[customer_mask(df)]

We can call customer_mask as if it is a function because we have overwritten the __call__ method.

Where the real power lies when dealing with Masks however, is in the associated library of functions. One of these is the plot_comparison_hist function which does exactly the task we described above, but if we have already declared our Mask objects, we can achieve this in one line:

plot_comparison_hist(df, ‘num_import_events’, [customer_mask, disq_prospect_mask])

where the first argument is our DataFrame, the second is the name of the column containing the values we want to plot and the last is a list of Mask objects to compare against one another.

Notebook with Masks

The nice thing about these functions is that they handle the labelling of the plots, including chart titles, axis titles and legends and also format the plots (reasonably) nicely. Doing this in Matplotlib generally requires a new line of code for each operation, so the reduction in boilerplate can be quite significant.

Derived Masks and Operator Overloading

Sometimes there is more than one field which we want to use split the data. For example, we may decide that we want to consider Telecommunications companies separately from the rest. We could do this by creating a new Mask object as follows:

telco_customers_mask = Mask(lambda frame: (frame.sector == “telecommunications”) & (frame.label == “customer”), “Telco Customers”)

But this way we’d need to define a new mask for all company subsets. The Mask class, however, allows us to do the following:

# First create a single mask for Telco's
telco_mask = Mask(lambda frame: frame.sector == “telecommunications”, “Telecommications”)

# Then combine with existing masks using &
plot_comparison_hist(df, ‘num_import_events’, [telcos_mask&customer_mask, telcos_mask&disq_prospect_mask])

This is achieved by overwriting the __and__ method, as we did the __call__ method. The derived Mask‘s names are automatically created from the original masks, so when the results of the comparisons are presented it’s clear which subsets are being analysed. Other operators we can also use, with their respective methods, are:

telcos_mask | customer_mask  # select telcos OR customers, __or__

~telcos_mask  # select non-telcos, __invert__

Wrap-up

We’ve used the Mask class extensively in our exploratory analysis, and as a result our iPython Notebooks have become much cleaner and with far fewer lines of unnecessary boilerplate code. Members of the team are constantly growing the library of functions that leverage Masks when a new use case arises.