Comparing python plotting libraries

John Henderson

2017-12-23 13:33

introduction

As mentioned in a previous post, I've been trying to pick up python, coming from mostly using R over the past 5-7 years for data analysis, stats, and a lot of plotting. I found python's plotting landscape quite a bit more confusing than I expected, with tons of options compared to the typical reigning R champions: base, lattice, and ggplot2.

Granted, there are a ton of other R contenders as well, such as plotly and rCharts, but the three above are the most common. I stick almost exclusively to ggplot2, and rarely find something I can't do (admittedly, this is typical science-y stuff, not infographics or anything really complicated). To educate myself, I went through and plotted a pretty standard dataset in various ways to see how one would do it in a bunch of python libraries: ggpy, matplotlib, seaborn, plotly, cufflinks, altair, bokeh, and pygal.

the example

I started really simple. Three types of plots: a simple bar, a dot plot (basically a bar plot, but makes Tufte proud with a higher data:ink ratio), and a scatter plot colored by group.

The bar chart in R:

library(ggplot2)
data(mtcars)

df <- mtcars[1:5, ]
df$car <- rownames(df)

ggplot(df, aes(x = car, y = mpg)) + geom_bar(stat="identity")

A dot plot:

ggplot(df, aes(x = car, y = mpg)) + geom_point() + expand_limits(y=0)

And lastly, we'll use a bit more data and do a colored scatter plot:

df <- mtcars[1:10, ]
df$car <- rownames(df)

ggplot(df, aes(x = wt, y = mpg, colour = car)) + geom_point()

Here are our three plots:
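The python snippets in the rest of this post all use a DataFrame called df, but how it was built is never shown. Here is a minimal sketch, assuming pandas and typing in the car, mpg, and wt values of the first ten mtcars rows by hand. The name mtcars10 is made up for this example, and setting the car names as the index is a guess meant to mirror R's rownames:

```python
import pandas as pd

# First ten rows of R's mtcars as (car, mpg, wt), typed in by hand.
# Only the columns the plots use are included.
rows = [
    ("Mazda RX4", 21.0, 2.620),
    ("Mazda RX4 Wag", 21.0, 2.875),
    ("Datsun 710", 22.8, 2.320),
    ("Hornet 4 Drive", 21.4, 3.215),
    ("Hornet Sportabout", 18.7, 3.440),
    ("Valiant", 18.1, 3.460),
    ("Duster 360", 14.3, 3.570),
    ("Merc 240D", 24.4, 3.190),
    ("Merc 230", 22.8, 3.150),
    ("Merc 280", 19.2, 3.440),
]

# Mirror the R setup: car names as row labels *and* as a 'car' column.
mtcars10 = pd.DataFrame(rows, columns=["car", "mpg", "wt"],
                        index=[r[0] for r in rows])

df = mtcars10.iloc[:5]   # bar and dot plots, like mtcars[1:5, ]
# df = mtcars10          # colored scatter plot, like mtcars[1:10, ]
```

Swapping which of the last two lines is active switches between the five-car and ten-car versions of df.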

ggpy

Porting ggplot2 to python is an amazing concept. The whole idea behind the grammar of graphics is how little the code needs to change for different visualizations. We're only changing some mappings (some aspect of the data leads to some aesthetic) and the geometry being used.

The bar chart is almost identical using the port. I dislike that the aesthetic keyword is weight and not simply y. This surprised someone else, too. It also causes there to be no y-axis label by default; I had to manually add it in.

I also don't love how ggpy handles limits. Note the dot plot from ggplot2 and the nice little padding above the highest dot. In ggpy, this didn't seem to work well out of the box. I could fix it with scale_y_continuous, but why should I have to? It also requires something I consider hokey, like:

... + scale_y_continuous(limits=(0, max(df['mpg'])*1.1))

I could do that, but again, it's extra effort. In my own workflow, I find I'm often visualizing some variables pretty similarly. I'll copy and paste code blocks and switch out variable names. This is one more thing to need to remember, and perhaps the 1.1 multiplier wouldn't work so well if the data range was different by an order or two of magnitude. Smart ranges should "just work" in my opinion.
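One way around the hard-coded multiplier is to pad by a fraction of the data range instead of a fraction of the maximum. This is a rough sketch in plain python; padded_limits is a hypothetical helper, not part of ggpy, and the 5% default is only my understanding of roughly what ggplot2 does for continuous scales:

```python
def padded_limits(values, frac=0.05, include=None):
    """Return (low, high) axis limits padded by `frac` of the data range.

    `include` optionally forces a value (e.g. 0) into the range first.
    """
    vals = list(values)
    if include is not None:
        vals.append(include)
    low, high = min(vals), max(vals)
    span = high - low
    if span == 0:
        # All values identical: fall back to a pad based on magnitude.
        span = abs(high) or 1.0
    pad = span * frac
    return low - pad, high + pad


mpg = [21.0, 21.0, 22.8, 21.4, 18.7]
low, high = padded_limits(mpg, include=0)
```

The resulting pair could then be passed as limits=(low, high). Because the pad is relative to the range, it behaves the same whether the data are in the tens or the millions, and it also copes with negative values, which a multiplier on the maximum does not.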

Another complaint has to do with saving out the files. I'm not showing it below, but these were each generated with:

p = ggplot(...)
p.save('filename.png', width=w, height=h, dpi=300)

For the bar and dot, things were fine using width=9, height=6 (inches), but using this for the scatter seemed not to incorporate the legend in the width. Using the same size, I got this:


To get the legend to fit, I had to bump the size, but that makes the text labels much smaller with respect to the plot. In ggplot2, I could fix this with some theme() options; I didn't get far enough to hunt down the equivalent in ggpy.

from ggplot import *

ggplot(df, aes(x='car', weight='mpg')) + geom_bar(stat='identity') + scale_y_continuous('mpg')

For the dot plot, note the switch back to y=:

ggplot(df, aes(x='car', y='mpg')) + geom_point() + ylim(low=0)

And the scatter plot:

ggplot(df, aes(x='wt', y='mpg', color='car')) + geom_point()

Here's what we get!

matplotlib

From my reading, this looks like the loved and hated "bread and butter" of python plotting. I think the biggest complaint I've seen is the verbosity. I didn't experience much of this, likely due to how simple these examples are.

I ran into the same quirk as with ggpy with respect to expanding the dot plot y-axis limits. I could pass ylim(0) to get zero included, but it would cut off the top dot. I had to do the hokey scaling bit again.

Lastly, the process of getting colors by groups was not awesome. I found I could sort of manually map each group to a color or do it the canonical way and loop through the data, adding a layer of dots for each group. That also required some fiddling with the legend location.

The bar plot:

import matplotlib.pyplot as plt

plt.figure(figsize=(9, 6))
plt.bar(df['car'], df['mpg'])
plt.show()

Dot plot:

plt.figure(figsize=(9, 6))
plt.scatter(df['car'], df['mpg'])
plt.ylim(0, max(df['mpg'])*1.1)
plt.show()

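The colored scatter plot gave me a helluva time when I tried to save it as a file. I originally had to just screenshot the plot from jupyter lab, but ended up getting the answer: my plt.figure() call (now commented out) goofs with the plt.subplots() command. Called after plt.subplots(), it opens a second, empty figure that becomes the current one, so a plt.savefig() would write out the empty figure.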

fig, ax = plt.subplots()
# plt.figure(figsize=(12, 9))
for ind in df.index:
    ax.scatter(df.loc[ind, 'wt'], df.loc[ind, 'mpg'], label=ind)
ax.legend(bbox_to_anchor=(1.05, 1), loc=2)
plt.show()
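The "manually map each group to a color" route mentioned above can be sketched without any plotting library at all. This is only an illustration: color_map and PALETTE are made-up names, the hex colors are arbitrary, and the cyl values are the cylinder counts of the first ten mtcars rows (a column the plots above don't use):

```python
from itertools import cycle

# Hypothetical hex palette; any list of color strings would do.
PALETTE = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a"]


def color_map(groups, palette=PALETTE):
    """Map each distinct group label to a color, in first-seen order.

    Colors repeat once there are more groups than palette entries.
    """
    mapping = {}
    colors = cycle(palette)
    for g in groups:
        if g not in mapping:
            mapping[g] = next(colors)
    return mapping


cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6]   # cyl column, first ten mtcars rows
mapping = color_map(cyl)
colors = [mapping[c] for c in cyl]     # one color string per point
```

A list like colors can be handed to a single ax.scatter(..., c=colors) call, but the legend then has to be built separately, which is part of why looping over groups is the more common route.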

seaborn

Seaborn was pretty straightforward. Not much else to say!

import seaborn as sns

plt.figure(figsize=(9, 6))
sns.barplot(x='car', y='mpg', data=df)
plt.show()

plt.figure(figsize=(9, 6))
sns.stripplot(x='car', y='mpg', data=df)
plt.ylim(0)
plt.show()

sns.lmplot(x='wt', y='mpg', hue='car',
	   data=df, fit_reg=False)
plt.show()

plotly

For these plots, I left in the jupyter lab commands for plotting. To save them out, I clicked the camera icon in the plotly embedded plot. Surprisingly, you can find evidence that people want to save directly, and the solution is not awesome:

plotly.offline.plot(trace, image='png', image_filename='filename')

For me, this opens a new tab and automatically saves the file. It saves it in ~/Downloads, not my current directory, though.

The bar plot was pretty straightforward:

import plotly
import plotly.graph_objs as go

plotly.offline.init_notebook_mode()

trace = [go.Bar(x = df['car'],
		y = df['mpg'])]
plotly.offline.iplot(trace)

Same for the dot plot, though note we have to fiddle with the layout object to expand our y-axis limits.

data = [go.Scatter(x = df['car'],
		    y = df['mpg'],
		    mode='markers')]
layout = go.Layout(yaxis={'range': [0, max(df['mpg'])*1.1]})
fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig)

The colored scatter plot was where things really broke down for me. I admit that I really want to love python and have heard it touted as one of the top data science/analysis languages. Waaayyy back, I took quite a bit of time to research programmatic ways to do analysis, stats, and plotting. I was really just trying to find some alternative to my company's typical option of Minitab for this type of work. I wanted it to work from linux, and ideally be compatible with my beloved orgmode. Typical contenders included R, python, and octave. For better or worse, I went with R and sort of didn't look back.

Now, I'm coming into python for some other work projects and am honestly sort of feeling spoiled coming from R. It's so concise and, well, easy! It has some odd syntax for sure, but it just feels so easy to accomplish what I want.

So, all of this is to say that I was a bit blown away going to find out how to color by group in one of the fancy new plotting libraries (which includes paid options!) and finding things like:

Plotly's example of "scatter with a color dimension", which is also the top google hit for "color by group plotly python."

trace1 = [go.Scatter(
    y = np.random.randn(500),
    mode='markers',
    marker=dict(
	size='16',
	color = np.random.randn(500), #set color equal to a variable
	colorscale='Viridis',
	showscale=True
    )
)]

That's it. The key line in there is color = np.random.randn(500), which only addresses a continuous color scale. I thought looking for color scale information would help, but it didn't: it just shows a bunch of ways to set up continuous color scales, not discrete ones for groups.

A little further up, we have this kludge:

c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, N)]

for i in range(int(N)):
    ...
    trace0= go.Scatter(
    ...
	marker= dict(
	    ...
	    color= c[i]
	), name= y[i],
    l.append(trace0);

Elsewhere, plotly suggests this, which seems silly in and of itself, but especially for a large set:

transforms = [dict(
    type = 'groupby',
    groups = subject,
    styles = [
	dict(target = 'Moe', value = dict(marker = dict(color = 'blue'))),
	dict(target = 'Larry', value = dict(marker = dict(color = 'red'))),
	dict(target = 'Curly', value = dict(marker = dict(color = 'black')))
    ]
  )]

The highest search result from SO just says to use another library (colorlover)!

When I see things like this, my first reaction is honestly that it must not be true; there's no way I'm reading the documentation right. This is after spending a long time with ggplot2, however, where the key point is that visualization is simply about mapping aesthetics to aspects of the data.

The above seems to make me a slave to the data… it already contains distinctions (unique values); why do I need to tell my plotting library how to map these to colors? This strikes me as being like having to create a dict() mapping x and y numeric values to their eventual locations in pixels.

Sort of by accident I stumbled on this doc page that didn't appear to be trying to tell me anything about groups, but inadvertently made it evident a separate list of go.Scatter objects would get me discrete colors for free. So I ended up with this:

# .iloc looks rows up by position, so this works whatever df is indexed by
data = [go.Scatter(x=[df['wt'].iloc[i]],
		   y=[df['mpg'].iloc[i]],
		   mode='markers',
		   name=df['car'].iloc[i],
		   text=df['car'].iloc[i]) for i in range(len(df))]

plotly.offline.iplot(data)
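The one-trace-per-row trick above generalizes to one trace per group, which is what coloring by a real grouping column needs. Here is a minimal sketch using only plain dicts. The cyl column is an illustrative choice, not something the plots above use, and I'm assuming plotly accepts dict trace specs like these in place of go.Scatter objects, which is worth checking against your version:

```python
from collections import defaultdict

# (car, wt, mpg, cyl) for a few mtcars rows, typed in for illustration.
rows = [
    ("Mazda RX4", 2.620, 21.0, 6),
    ("Datsun 710", 2.320, 22.8, 4),
    ("Hornet Sportabout", 3.440, 18.7, 8),
    ("Valiant", 3.460, 18.1, 6),
    ("Merc 240D", 3.190, 24.4, 4),
]

# Collect x, y and hover text per group value.
grouped = defaultdict(lambda: {"x": [], "y": [], "text": []})
for car, wt, mpg, cyl in rows:
    grouped[cyl]["x"].append(wt)
    grouped[cyl]["y"].append(mpg)
    grouped[cyl]["text"].append(car)

# One trace specification per group, sorted by group value so the
# legend order is stable.
traces = [
    dict(type="scatter", mode="markers", name=str(cyl), **cols)
    for cyl, cols in sorted(grouped.items())
]
```

If the assumption holds, traces can be passed to plotly.offline.iplot the same way as the data list above, and each group should pick up its own default color.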

To be fair, I like plotly. I like the hover text, especially for larger datasets where colors actually aren't the best way to tell them apart. Even with these 10 cars, the colors are hard to distinguish. The ability to put in text=foo is super handy. Our finished result:

cufflinks

I don't have much to say here; it's supposed to be a way to sort of layer plotly right onto pd.DataFrame objects. That makes it really succinct, but I also find the documentation lacking.

import cufflinks as cf

df.iplot(kind='bar', x='car', y='mpg')

This does show how nice the grammar of graphics is. Other than needing to specify that we want markers, we're literally just changing the kind of plot with the mappings staying the same. It's just a different way to draw the same thing.

df.iplot(kind='scatter', x='car', y='mpg', mode='markers')

The grouped colors fell apart a bit for me again, perhaps because at the end of the day cufflinks is just plotly. I found this promising walkthrough, but one of the colorscale generation examples failed me. In the cufflinks docs themselves, they basically say this isn't possible and just fall back to plotly syntax:

Plotting multiple column scatter plots isn't as easy with cufflinks. Here is an example with Plotly's native syntax

So, I'll pass since I did this with plotly above already…

altair

I learned about altair from one of the inspirations behind this exercise, which was a talk from Jake VanderPlas on the python visualization landscape. It was a great overview of a bunch of what's out there, and it's even more impressive that he traced their sort of "lineage" and how they relate to one another.

I quite enjoy altair; I feel like it does the grammar of graphics nicely, and it's not too cumbersome. One downside is that I couldn't get the chart size to play along, but it apparently works for other types, maybe just not bars?

import altair as alt
alt.enable_mime_rendering()

alt.Chart(df).mark_bar().encode(x='car', y='mpg')

alt.Chart(df).mark_point().encode(x='car', y='mpg')

Awesomely, the mapping for adding color was perfectly beautiful, and I guessed it without even looking at the syntax!

alt.Chart(df).mark_point().encode(x='wt', y='mpg', color='car')

bokeh

The first two are fairly uninteresting. I was excited to see that they'd put some nice thought into colors! Indeed, that ability to select a colorBrewer palette was quite nice. Even better might be a simple color=var argument to the call, with a global palette=foo, but this isn't that bad as-is.

Now, once the colors were all set… there was no legend! I looked into it, and in my skim of how to futz with legends, I decided I didn't care enough to press on. I left it with the legend plastered over the data so you can examine the default.

For whatever reason, plotting discrete variables required me to tell bokeh what the x_range should be for the figure. I think that's odd.

from bokeh.plotting import figure, output_notebook, show

output_notebook()

p = figure(x_range=list(df['car']), plot_width=600, plot_height=400)
p.vbar(x=df['car'], top=df['mpg'],
       width=0.9, bottom=0)
show(p)

Pretty similar for the dot plot:

p = figure(x_range=list(df['car']), y_range=[0, max(df['mpg'])*1.1], plot_width=600, plot_height=400)
p.scatter(x=df['car'], y=df['mpg'])
show(p)

And the scatter plot attempt:

from bokeh.palettes import brewer
palette = brewer['Set3'][len(df)]

p = figure(plot_width=600, plot_height=400)
for i in range(len(df)):
    p.scatter(x=df['wt'].iloc[i], y=df['mpg'].iloc[i],
	      legend=df['car'].iloc[i], color=palette[i])

show(p)

pygal

pygal was certainly interesting. It's a little different than the others, but was more or less straightforward.

I found I had to pass an empty argument for y_labels_major or I'd get these sort of heavy weighted grid lines which I didn't want. It also looks to suffer from needing the x-axis marks to be labeled. I wish it was just as easy as x=foo, y=bar. Oh well.

import pygal

chart = pygal.Bar(width=800, height=600,
		  explicit_size=True, show_legend=False)
chart.y_labels_major = ['']
chart.x_labels = df['car']
chart.add('', df['mpg'])

chart

pygal has the same issue with limits that the other libraries suffer from. Without telling pygal to increase the range, the top tick mark was much lower than the highest dot. Since I wanted to expand to include y=0 anyway, it wasn't a big deal. Just an observation of something I don't think should be necessary.

chart = pygal.XY(width=600, height=400,
		 explicit_size=True, show_legend=False,
		 stroke=False)#, range=(0, max(df['mpg'])*1.1))
chart.y_labels_major = ['']
chart.x_labels = df['car']
chart.add('', [(i, df['mpg'].iloc[i]) for i in range(len(df))])
chart

Colors came for free with grouping! I did get hung up a bit on the fact that you apparently need the points to be tuples inside of a list. I didn't get that initially and found that chart.add(name, [x, y]) was ignoring the values for y. I'd get a bunch of points on the x-axis instead.

chart = pygal.XY(width=800, height=600,
		 explicit_size=True)
chart.y_labels_major = ['']
for i in range(len(df)):
    chart.add(df['car'].iloc[i], [(df['wt'].iloc[i], df['mpg'].iloc[i])])
chart
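The tuple requirement described above is easy to get right with zip. This is a small sketch in plain python of the two shapes, using the wt and mpg values of the first three mtcars rows; what pygal does with the flat version is only what I observed, not something I've checked in its docs:

```python
wt = [2.620, 2.875, 2.320]
mpg = [21.0, 21.0, 22.8]

# The shape that tripped me up: a flat [x, y] list is not read as one point.
flat = [wt[0], mpg[0]]

# What an XY series expects: a list of (x, y) tuples, one per point.
points = list(zip(wt, mpg))

# A one-point series still needs the outer list around its tuple.
single = [(wt[0], mpg[0])]
```

Either points or single is the kind of value to pass as the second argument of chart.add().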

closing

Hopefully for other noobs like myself that was helpful. I hope to continue this exercise with more complicated examples. I still feel a bit spoiled with ggplot2, but I have my fingers crossed that with practice some of the things that are now intuitive with R become so with at least one python library. I'm an impossible internal-debater and decision-postponer. I really want to know I've landed with the best possible thing, or at least on a decision I can rationally defend.

I'd love to just stick with ggpy, but the commits holding at ~1-2 years ago is not reassuring, and neither are the issues from 2014 asking for features that R's ggplot2 already has (with no response). The only thing I'm sure of is that ggplot2 is pretty awesome and I don't think python is there yet.