With yt we have used matplotlib almost exclusively for 2D plotting of adaptive mesh refinement data. This week, I implemented a plotter inside Chaco a plotting system designed for interactive exploration of data.
But first! An interlude about how we visualize 2D AMR data.
Our data is naturally a patch-based format. What that means is that higher resolution cells are grouped into grid patches -- this ultimately means that the data itself is relatively convenient to look at, and also for some post-processing computation we end up being able to make a lot of guesses at which data to look at or analyze based on its location in grid patches. Additionally, these grid patches overlap at different resolution levels: finer resolution grids are inside coarser grids.
When generating a 2D representation, we typically will take either an orthogonal slice or a "projection" through the data. (The additional spatially-organized 2D representation we have in yt is the oblique slice, but it's something of a special case.) Back when designing the projection and slice algorithms, we decided it was the most efficient to construct a set of "flattened" data: by projecting through the data, we end up with a set of data that represents the finest resolution data available along every sightline through the domain. This is represented by five values: (x, y, dx, dy, z) where x and y are in the plane of the projected domain. For slices, we get something similar, but this time it's only the finest data available at any orthogonal slice.
The alternatively mechanism would be to generate fixed-resolution arrays and fill them in; this gives and takes various things, but ultimately while we provide support for this type of function, and it's the only way to project inline with the simulation (slices can be done adaptively inline), we prefer the "adaptive" projections. An additional advantage, one that appeals to me particularly (as my AMR runs go to 30 or so levels of refinement) is that we only have to project once with the adaptive projections and we can zoom and pan as much as we like. More on zooming and panning later!
In matplotlib, there are two natural ways to present this data: by constructing a set of rectangular patches that this data represents and then filling them with the appropriate value, or by pre-pixelizing, using a bit of our knowledge about how AMR data works, and then displaying an image. The advantage of the first is that we are able to provide only a single set of data to matplotlib, once, and then we can adjust the limits inside the context of matplotlib and describe the data exclusively in the context of the domain. However, there's a lot of overhead involved -- often we end up with projections that, with just the five values alone, run into the hundreds of megs. (For very large simulations, particularly.) Generating a second layer of objects, one for every cell at the finest resolution in the image plane, is probably going to add too much overhead. So instead we run the pixelizer! (Of course, at some stage matplotlib also runs a pixelizer, to turn the multiple patches into a single output image. But it does so with a little less knowledge than we have about the nature of the data, so the generality costs would cost us.)
The pixelizer is relatively simple: for every datapoint in the projection or slice, if it falls within the bounds of our image, deposit it into the appropriate cells with which it overlaps. If it fully overlaps, deposit the entire thing; otherwise, only deposit a fractional amount. This process is relatively fast, but not without cost. The biggest cost is actually in discarding the points that do not fall within the region -- the iteration itself! Because our data is periodic, we have to check multiple times for most datapoints whether or not they are within a given region. This also costs us. Overall, however, the process is straightforward and fast. (Of course, I do realize that the entire projection process: from the initial grid patches, to summing over an axis, to the compositing and the pixelizing, is something OpenGL was designed to do! But, we do it all in C.)
So now we have a mechanism that takes multi-resolution AMR data and turns it into an image, given only the data itself, a resolution, and a 2D bounding box. Typically this is used by yt to generate matplotlib plots. This week, though, I have been working a bit on a tile display wall routine -- dispatching rendering jobs to different nodes and then having them feed back some portion of the render job to a display tool that broadcasts it to multiple monitors. (More on that process in a future blog post.)
However, one thing that was always really lacking was an intuitive mechanism for interacting with a plot: in years past, I'd hacked up a matplotlib and wx-based GUI that had some neat features -- a slider to set the width and re-centering abilities. But it was clunky, and kind of slow, and I hated maintaining it. The book "wxPython in Action" was quite good, and wx is definitely a very capable toolset, but I ended up using the "pubsub" mechanism in wx and I eventually got lost in the events and handlers and couldn't find my way out.
Enthought's Chaco toolkit, though, was designed for interactive use -- whipping up GUIs around structured data (using the "Traits" interface) is easy and straightforward. A couple years ago, they added the ability to get 1D data via functions -- so the data could be recalculated on the fly, as the range needed changed. This week and last, with the help of the head Chaco developer, I was able to extend this to work with image data, as well.
First, a brief description of the UI I was hoping to get: Chaco very easily presents a "google maps" style interface to image data, where you can click and drag to pan and scroll your mouse wheel to zoom in and out. I wanted to have it set up such that I could do this with the AMR data -- as you zoomed in, it would re-pixelize such that it was displaying a smaller portion at higher resolution, much like with normal google maps. Almost all of this was available in the standard Chaco distribution -- the pan and zoom tools, the displaying of image data, and the setup for inserting callbacks between different components. Displaying a single fixed image in Chaco, with the GUI and everything, is on the order of a dozen lines, many of which are boilerplate.
By adding a FunctionImageData class (almost identical to the FunctionDataSource class, except extended to 2D and subclassing Image Data) I was able to mock up the re-generation of data: as the range of data displayed changed, the data handed back was generated. By inserting a callback to update the range that the data mapped to (i.e., not only generating higher resolution data, but mapping it back to a smaller data region, thus ensuring it fully filled the image plot) the entire experience was complete.
The FunctionImageData requires a callback function that receives a "low" and "high" value that describe the new bounds of the visible data, but I also had to update the "index" -- to do all of this together, I wrote this helper class:
class ImagePixelizerHelper(object):
index = None
def __init__(self, panner):
self.panner = panner
def __call__(self, low, high):
b = self.panner.set_low_high(low, high)
if self.index is not None:
num_x_ticks = b.shape[0] + 1
num_y_ticks = b.shape[1] + 1
xs = mgrid[low[0]:high[0]:num_x_ticks*1j]
ys = mgrid[low[1]:high[1]:num_y_ticks*1j]
self.index.set_data( xs, ys )
return b
(Note that even though I sung the praises of Traits earlier, I don't really use them here. That will happen, it just hasn't yet.) The "panner" here is something that updates a pixelized image when the bounds change. What the check for index inside __call__ does is mock up the re-generation of the image indices, if an index is supplied. This also transforms the initialization of the image plot into a two-strep process:
pd = ArrayPlotData()
plot = Plot(pd)
self.pd = pd
helper = ImagePixelizerHelper(self.panner)
fid = FunctionImageData(func = helper)
fid._data = self.panner.buffer
self.fid = fid
bounds = self.panner.bounds
pd.set_data("imagedata", fid)
img_plot = plot.img_plot("imagedata", colormap=jet,
interpolation='nearest',
xbounds=(0.0, 1.0),
ybounds=(0.0, 1.0))[0]
helper.index = img_plot.index
self.helper = helper
fid.data_range = plot.range2d
So here I set up the array plot data, I set up the plot based on it, I create the pixelizer helper, then I add a default value to the pixelizer. This is a crucial step, because it's how the image plot knows what shape the data is. (There are some other things about the image plot I'm not fully sure of just now, and this is one I now sort of understand -- the relationship between mappers, indexers and plots.) I've then set the helper to have access to the image plot's index, and the data range is set to the plot's data range. At this point the two know how to talk together. Below this I add a few tools for zooming and panning, and a colorbar too
At this point, our plot works. And it's simple and easy and, best of all, very interactive. Every time the data range changes the pixelizer updates and passes back a new image: so we've got the best of both worlds. We only have to store in memory a relatively small amount of data, and it goes to the highest resolution possible. Here's an example, where I am zooming into a dataset of mine.
It could go on for a while -- this dataset is really quite deep (this was three scrolls, but I could probably do fifteen or so) but I think this gets the point across. I was able to zoom in just by scrolling with my mouse, and I can click and drag. It's intuitive, it's very very fast, and I only had to write a tiny bit of code to get it to go. This is what I want!
I think that maybe it's time to revisit having a GUI. I'd forgotten how much fun it is to explore my data, and if I can really get one going in such a short amount of code, it might be very effective to do this rather than constantly having to tweak scripts and scp images. And with the vast amounts of data I am now sort of drowning in, this will be crucial. The dataset I originally decided to do this for is just immense; I was struggling at coming up with a way to see it all, to see all the tiny features and structures, because it's at a time in the universe when such things are relatively far apart, but also quite numerous on the length scales we're simulation. Using this kind of an interface will let me discuss individual halos with the other members of the project, and we'll be able to really see and share and converse, rather than just looking at static images over email.
Another huge advantage that only just occurred to me is that the same code I used to write the backend for the tile display wall could be used for remote visualization in this same manner. By repurposing that to feed directly back into Chaco, instead of grabbing the arrays myself, we might be able to avoid any copying of data (even the pre-projected values) onto a GUI-able host. That could be very awesome, too. The entire Chaco/Traits/ETS toolkit is very powerful and I think I see it being an extremely valuable part of our future plans for interactive analysis.
(edit: evidently I messed up the options for code highlighting and for including all the images in a single slideshow, and from the post-edit area I can't figure out how to fix them.. I guess I have a bit more to learn about posterous! Sorry about that.)