1921 Allison Walker, Large Scale Data Techniques for Research in Ecology presentation

It is widely known that many species of bird
travel huge distances each year as part of seasonal movement between breeding and wintering
grounds. Migrating birds share airspace with commercial and military aircraft, wind turbines, and other man-made structures like high-rise buildings.
It’s important for both the birds’ safety and the longevity of these aircraft and structures that we can accurately predict when and where flocks will be migrating. We know that birds migrate using cues from
the sun and stars, the earth’s magnetic field, as well as mental maps. So why do their
migratory routes vary so vastly each year? Why can we not yet easily predict when different
flocks of birds will be traveling through different airspace? Well, that’s where an
exploration of weather patterns becomes so important. While ecologists have traditionally focused
on GPS tracking data when studying migratory patterns, it is becoming increasingly important
to integrate additional data sources, like weather data. To this end, there arose a need
for an efficient architecture allowing ecologists at the University of Amsterdam to access and
study a new stream of meteorological radar data. Given the volume and structure of these data, querying a Postgres relational database just wasn’t going to cut it for this task.

My name is Allison Walker, and I am privileged
to be a participant in the 2019 Summer of HPC program. Over the next five minutes I
will describe the project that I have focused on at SURFsara in Amsterdam: Large Scale Data
Techniques for Research in Ecology.

Radars are complicated, both in terms of their technology and the data that they capture. Radar collects information about objects in
the surrounding area by using radio waves. The returned echoes can be used to identify
aircraft, weather patterns and even flocks of birds. Every time a radar completes a sweep, a significant
amount of metadata is captured and stored. Examples of this metadata are:
• Radar latitude, longitude and height
• Radar elevation angle
• Radar range
• The timestamp
• The data itself: information is captured for up to 16 different metrics measured in each sweep.

All of these data and metadata need to be
considered in order to correctly project the captured data into an accurate visualization.

Further, meteorological radars capture data in a polar format. Typically, data for visualization is collected and stored in a gridded format, i.e. the position in the matrix represents the position of a pixel. In polar data, however, the dimensions are azimuth (degrees around a circle) and range (the distance from the measuring radar station).
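To make that polar layout concrete, here is a minimal NumPy sketch (illustrative, not the project's actual code) that turns an azimuth/range grid into Cartesian east/north offsets from the radar:

```python
import numpy as np

def polar_to_cartesian(azimuths_deg, ranges_m):
    """Convert radar polar coordinates to Cartesian x/y offsets.

    Azimuth follows the radar convention: degrees clockwise from north.
    Returns x (east) and y (north) offsets in metres from the radar.
    """
    az = np.deg2rad(np.asarray(azimuths_deg))
    r = np.asarray(ranges_m)
    # Broadcast to a 2-D grid: one row per azimuth, one column per range bin.
    x = r[np.newaxis, :] * np.sin(az[:, np.newaxis])  # east
    y = r[np.newaxis, :] * np.cos(az[:, np.newaxis])  # north
    return x, y

# A toy sweep: 360 one-degree azimuths, 100 range bins of 500 m each.
x, y = polar_to_cartesian(np.arange(360), np.arange(1, 101) * 500.0)
```

Each cell of the resulting grids can then be mapped to a pixel position, which is essentially what the projection step has to do.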
The complexity and size of radar data mean that it’s no simple task to filter, process and visualize, which is exactly what ecologists studying these data need to do. So in this project we developed a data pipeline that could efficiently provide ecologists with access to visualizations for multiple radars around Europe. All the user has to do is input their filters. How did we do it? Let’s cover that now.

A typical radar might perform a fresh sweep
every 15 minutes, meaning 96 sweeps per day. Data collected in each sweep is stored in its own file, in HDF5 format. HDF5 files are hierarchical, and somewhat complicated to navigate.
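As an illustration of that hierarchy, the sketch below builds a tiny mock sweep file and walks it with h5py. The group and attribute names (`where`, `dataset1/data1/data`, and so on) are loosely modelled on the ODIM_H5 convention used for European weather radar, but are simplified here:

```python
import h5py
import numpy as np

# Build a toy HDF5 sweep file with a nested, ODIM_H5-like layout
# (group and attribute names are illustrative, not the full standard).
with h5py.File("sweep.h5", "w") as f:
    where = f.create_group("where")
    where.attrs["lat"] = 52.37      # radar latitude (illustrative values)
    where.attrs["lon"] = 4.90
    where.attrs["height"] = 50.0
    ds = f.create_group("dataset1")
    ds.create_dataset("data1/data", data=np.zeros((360, 100), dtype="u1"))

# Navigating the hierarchy: list every group/dataset path in the tree,
# then pull the metadata and the sweep array back out.
with h5py.File("sweep.h5", "r") as f:
    paths = []
    f.visit(paths.append)           # walk the whole tree
    lat = f["where"].attrs["lat"]
    sweep = f["dataset1/data1/data"][:]
```

Multiply this nesting by 96 files per radar per day and the need for a flatter, query-friendly format becomes obvious.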
In building an efficient ETL pipeline for visualizing radar data, a critical first step was to condense these individual files into a simpler format. The many complexities of the radar measurement system meant that it was important to build a solution that could easily filter and return only the data that a researcher is interested in for their particular question.

Parquet was the most sensible format for these purposes. Parquet is a column-oriented data storage format that can be filtered efficiently, and it is highly compatible with Spark, the next key tool in this project’s toolbox.
Once the radar data is formatted into Parquet and saved in MinIO, an object-storage service, we designed a number of functions for loading, filtering and visualizing the relevant data. Spark, a general-purpose cluster-computing framework, allowed us to seamlessly distribute the compute for these functions across the nodes in our Kubernetes cluster.
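A configuration sketch of how Spark can be pointed at Parquet files in MinIO through the S3-compatible API; the endpoint, bucket, credentials and column names below are placeholders, not the project's real values:

```python
from pyspark.sql import SparkSession

# Point Spark's S3A connector at a MinIO endpoint (placeholder values).
spark = (SparkSession.builder
         .appName("radar-etl")
         .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
         .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# Spark pushes the filter down into the Parquet scan, so data the user
# did not ask for is skipped, and the work is spread across the
# executors running on the Kubernetes cluster.
sweeps = spark.read.parquet("s3a://radar/sweeps.parquet")
filtered = sweeps.filter((sweeps.radar == "NL51") &
                         (sweeps.elevation == 0.5))
```

This is the shape of the "user inputs their filters" step: the filter expression is the only part the ecologist has to supply.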
Next, we used Python and wradlib, a Python library for weather radar data processing, to easily generate visualizations like this:

To further improve the value of the visualizations, the next step in the pipeline? Georeferencing. Key tools used in this stage were GeoViews, HoloViews and xarray. These are all nifty libraries that let us place the visualized radar data exactly where it belongs on the map. Finally, again with the help of Spark, we have this beautiful georeferenced output.
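The idea behind georeferencing can be sketched with a deliberately simplified flat-earth approximation. Real pipelines use proper projection support, such as wradlib's georeferencing utilities, so treat this purely as an illustration of the concept:

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0

def georeference(x_m, y_m, radar_lat, radar_lon):
    """Attach geographic coordinates to east/north offsets from the radar.

    A simplified local-tangent-plane approximation: one degree of latitude
    is treated as a fixed number of metres, and longitude degrees shrink
    with the cosine of latitude. Libraries like wradlib handle map
    projections and earth curvature properly.
    """
    meters_per_deg_lat = np.pi / 180.0 * EARTH_RADIUS_M
    lat = radar_lat + np.asarray(y_m) / meters_per_deg_lat
    lon = radar_lon + np.asarray(x_m) / (
        meters_per_deg_lat * np.cos(np.deg2rad(radar_lat)))
    return lat, lon
```

Once every pixel carries a latitude and longitude, libraries like GeoViews can drop the sweep onto a map tile in the right place.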
So, there we have it! A functional data pipeline allowing ecologists to query and visualize radar data based on their desired filters.
Of course, the final product can be further optimised, but this solution already creates
huge efficiencies for the ecologists. Importantly, this was also a great learning opportunity
for me to gain experience with data pipelines and data engineering. I have also developed
a new fascination with both birds and radar technology! Thank you to both PRACE and SURFsara
for this awesome opportunity.
