Transforming and Visualizing COVID-19 Data

Patrick Taylor
SnapLogic Engineering
11 min read · Apr 9, 2020


Step by Step Guide

The total number of confirmed COVID-19 cases, worldwide, now exceeds one million.

Many resources are available on the Internet to help us track and understand the growth of this pandemic, including the dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering. This dashboard is backed by a data repository published on GitHub. Although the source data is of high quality, it requires significant structural transformation before it can be visualized effectively. The SnapLogic Intelligent Integration Platform handles these transformations easily.

In this article, I will walk through the construction of a SnapLogic pipeline that reads a subset of this data from the GitHub repository, filters it down to a few countries, and transforms it so that it can be viewed as a line chart using SnapLogic’s DataViz feature. I have included lots of screenshots and detailed explanations so that you can easily follow along. The final output looks like this:

Figure 1: Interactive chart of COVID-19 cases made with SnapLogic

The Source Data

Our source data is JHU’s time_series_covid19_confirmed_global.csv file, which is automatically updated every day, and contains time-series data for total confirmed COVID-19 cases arranged by country/region, as shown below. GitHub’s Raw button in this view is linked to the URL for the raw CSV file.

Figure 2: COVID-19 time series CSV data

The Pipeline

Here’s an overview of our SnapLogic pipeline to read and transform this data:

Figure 3: SnapLogic pipeline that produces interactive chart based on raw CSV data.

Let’s walk through this pipeline, Snap by Snap.

File Reader + CSV Parser

The first two Snaps in this pipeline demonstrate a very common pattern:

  • A File Reader to read the raw data, configured with the raw file’s URL.
  • A Parser Snap specific to the file’s data format, in this case CSV. No additional configuration is needed for this Snap.
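
To make the pattern concrete, here is a minimal sketch of the same read-and-parse step outside SnapLogic, in plain TypeScript. This is for intuition only, not how the Snaps are implemented; the URL is the repository’s raw path at the time of writing (verify it against the JHU CSSE repo), and the CSV handling is deliberately naive:

// Sketch of the File Reader + CSV Parser pattern (for intuition only).
const RAW_URL =
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/" +
  "csse_covid_19_data/csse_covid_19_time_series/" +
  "time_series_covid19_confirmed_global.csv";

async function readRows(): Promise<string[][]> {
  const text = await (await fetch(RAW_URL)).text(); // File Reader: fetch the raw file
  return text.trim().split("\n")                    // CSV Parser: one document per row
    .map(line => line.split(","));                  // naive split; a real parser honors quoted fields
}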

It’s important to understand how much data we’re working with. We can do that by running the pipeline and then viewing the Pipeline Execution Statistics, which shows that the CSV Parser generated 258 output documents, one per row of the CSV file.

Figure 4: Pipeline execution statistics

The SnapLogic platform lets you process data even in preview mode, as you build your pipelines. To work easily with these 258 output documents in preview, open the Settings of the Designer and change the Preview doc count to 500:

Figure 5: SnapLogic Designer settings to change the preview doc count

Now if we validate the pipeline, we can preview all of the CSV Parser’s output documents. The default Table view is the best format to show this data given its tabular structure.

Figure 6: Tabular output from the CSV parser

Filter By Country

Next, let’s filter this data down to the rows (documents) for the four countries we’re focusing on in this example. To do this, we use a Filter Snap, configured with an expression written in SnapLogic’s expression language, which is based on JavaScript but adds features for working with SnapLogic documents. The expression is evaluated for each input document and must produce a boolean result: true or false. If it evaluates to true for a given input document, that document passes through to the Snap’s output; if false, the document is dropped.

Figure 7: Filter Snap configuration

To see the full Filter expression, click the down arrow and then the expand editor icon to open the Expression builder:

Figure 8: Expression builder interface for the Filter Snap

Let’s take a closer look at this expression:

["US", "Italy", "Spain", "United Kingdom"]
.indexOf($['Country/Region']) >= 0
&& $['Province/State'] == ""

This expression uses an array literal to specify the countries of interest and the indexOf method to test whether the Country/Region value is an element of this array. It also tests whether the Province/State value is empty, since we’re not interested in rows with a value for that column.
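
For intuition, here is the same predicate rendered as plain TypeScript (not SnapLogic code; the hypothetical keep function stands in for the Snap’s evaluation of the expression against each document):

// TypeScript rendering of the Filter Snap's predicate (sketch only).
const countries = ["US", "Italy", "Spain", "United Kingdom"];

function keep(row: Record<string, string>): boolean {
  // Keep only the country-level rows for the four countries of interest.
  return countries.indexOf(row["Country/Region"]) >= 0
    && row["Province/State"] === "";
}

keep({ "Province/State": "", "Country/Region": "US" });         // true
keep({ "Province/State": "Hubei", "Country/Region": "China" }); // false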

If we preview the output of this Snap, we see the four rows (documents!) of interest:

Figure 9: Output of the Filter Snap

Prune Fields

Next in the pipeline is a Mapper Snap, by far the most powerful and frequently used Snap in SnapLogic. Here we use the Mapper to implement another typical pattern used in a pipeline of this nature: filtering the columns of the data, which we’ll call fields or keys since each document is represented as a JSON object. Let’s look at the Mapper’s configuration:

Figure 10: Pruning input data fields for a simplified output

Note the following about this configuration:

  • The Pass through setting is enabled, which allows all fields to pass through by default, without explicitly mapping them. This includes the fields corresponding to dates (1/22/20, etc). This means the output will automatically include new data for additional dates the next time we run this pipeline.
  • The first row of the Transformations table maps the value of the Country/Region field to the name country in the output. The / character in the original name makes the shorthand $Country/Region invalid as an expression, so we have to use the longer bracket syntax $['Country/Region']. With this mapping, we’ll be able to refer to this value simply as $country.
  • Any row that has an Expression but no value for Target path deletes that field from the output; such rows are only needed when Pass through is enabled. Note that we must have a row like this for the Country/Region field, otherwise its value would appear twice in the output document: once as Country/Region (because Pass through is enabled) and once as country, per the first mapping. A sketch of the net effect follows this list.
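
Here is that sketch: a plain-TypeScript rendering of the Mapper’s net effect (not SnapLogic code). The columns being dropped, Province/State, Lat, and Long, follow from the source file’s schema:

// Sketch of the Prune Fields Mapper (for intuition only).
function prune(row: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = { ...row }; // Pass through: copy every field
  out.country = row["Country/Region"];            // Country/Region -> country
  delete out["Country/Region"];                   // remove the original key...
  delete out["Province/State"];                   // ...and the unused columns
  delete out["Lat"];
  delete out["Long"];
  return out;                                     // only date fields and country remain
}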

If we preview the output of this Snap, we’ll see just the date fields and the country field, all the way to the right:

Figure 11: Prune fields output

Let’s take a closer look at this output. Each value under a date column in this table is the total number of confirmed cases on that date in the corresponding country in the right-most column. This is the data we want to represent graphically, but before we can do that, we need to flip the structure so that the columns become rows. That will require some more advanced mapping and expressions, over several steps.

Map Columns to Arrays

Next in our pipeline is another Mapper:

Figure 12: Mapping columns to arrays

This has a single expression:

$.entries().filter(pair => pair[0] != 'country')
.map(pair => { Date: pair[0], [$country]: pair[1] })

Let’s break this down.

$.entries()

This sub-expression will produce an array of the key/value pairs in the object, where each pair is a subarray with two elements: the key and the value. You can see this result using the sub-expression feature of the Expression builder:

Figure 13: Subexpression while mapping columns to rows

.filter(pair => pair[0] != 'country')

This sub-expression filters the array of key/value pairs produced by entries, resulting in a new array containing only the pairs for which the callback function given as its argument returns true. In the screenshot above we’re seeing the first few pairs of the document containing the US data: ["1/22/20", "1"], ["1/23/20", "1"], ["1/24/20", "2"], etc. If you look back at the Prune Fields output, you can see the final pairs in this row: ["4/2/20", "243453"], ["country", "US"]. Our filter function will match all but the last pair, where pair[0] == "country".

.map(pair => { Date: pair[0], [$country]: pair[1] })

The input to this map sub-expression is an array of the key/value pairs representing each date and the case count for that date: ["1/22/20", "1"], …, ["4/2/20", "243453"] for the US row.

The map function returns a new array where each pair will be transformed by the expression to the right of the =>, which is an object literal resulting in an object with two properties:

  • Date: pair[0] results in a property named Date containing the first half of each pair (the date value).
  • [$country]: pair[1] results in a property whose name is the value of the expression $country, and whose value is the second half of each pair (the case count).

So for the last pair of the row of US data, the map function will produce an object like this:

{"Date": "4/2/20", "US": "243453"}

This is much closer to the form we need to start plotting points on a chart.
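
If it helps to see this outside of SnapLogic, here is a plain-TypeScript equivalent of the whole expression ($.entries() corresponds to Object.entries(), and [$country] is a computed property name):

// TypeScript equivalent of the Map Cols To Arrays expression (sketch only).
function mapColsToArrays(row: Record<string, string>) {
  return Object.entries(row)                 // [["1/22/20", "1"], ..., ["country", "US"]]
    .filter(pair => pair[0] !== "country")   // drop the country pair
    .map(pair => ({ Date: pair[0], [row.country]: pair[1] }));
}
// For the US row this yields:
// [ { Date: "1/22/20", US: "1" }, ..., { Date: "4/2/20", US: "243453" } ]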

When you change the Preview Type to JSON, here’s the output preview of the Map Cols To Arrays Snap:

Figure 14: JSON preview of the column to array mapper

What you’re seeing here is a partial view of the first output document of this Snap, created by transforming the first input document, which contains the data for Italy. Unlike the input document, which was a flat JSON object of name/value pairs, the output document is a JSON array of JSON objects.

If you switch the Preview Type back to Table and click on a few cells in the last row, representing this same data, it lets you peer into this nested array structure.

Figure 15: Nested array output of the columns to arrays mapper

Split

As shown above, each input document to the next Snap in our pipeline is an array of objects. We want to split up these arrays into their component objects. To do that we can use the JSON Splitter Snap, where we configure the Json Path setting with the path within each input document referencing the array elements that should be mapped as the output documents. Clicking the suggest icon at the right end of the Json Path field helps us do this. Clicking the top node in this tree sets the path to $[*].

Figure 16: Configuration of JSON splitter

Here is the output of the JSON splitter for the rows with values for US:

Figure 17: Output of the JSON splitter for US rows
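
Conceptually, the split is just a one-level flattening. Here is a plain-TypeScript sketch (not SnapLogic code, and with illustrative values) of what $[*] does to a batch of input documents:

// Sketch of the JSON Splitter's effect with Json Path $[*] (for intuition only).
function split<T>(inputDocs: T[][]): T[] {
  return inputDocs.flat(); // one output document per element of each input array
}

// split([[{ Date: "1/22/20", US: "1" }], [{ Date: "1/22/20", Italy: "0" }]])
// => [{ Date: "1/22/20", US: "1" }, { Date: "1/22/20", Italy: "0" }]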

Sort By Date

Next, we will sort the data by Date. This is quite easy with the Sort Snap, configured with $Date as the Sort path:

Figure 18: Configuration for Sort By Date

The output is what you might expect:

Figure 19: Output of the Sort Snap

Note that the date values here are strings, so they are sorted alphanumerically rather than chronologically, which means "3/9/20" ends up at the end of the data for March. We could deal with that if we needed to, but we don’t: this output is perfectly adequate for our remaining steps.
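
To see that ordering concretely, here is what a plain string sort does to a few of these values (TypeScript, mirroring the Snap’s alphanumeric comparison):

// String sort compares character by character, so "3/9/20" lands after "3/31/20".
["3/10/20", "3/9/20", "3/31/20"].sort();
// => ["3/10/20", "3/31/20", "3/9/20"]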

Group By Date

Now that the documents are sorted by Date, we can use the Group By Fields Snap to group documents together by that field. Note this will only work correctly if the data is, in fact, sorted.

Figure 20: Configuration of Group By Fields Snap

This Snap is configured with one or more Fields to group by, in this case just the Date. The Snap collects each group of input documents with the same value for the Date field and outputs a single document containing the data from all of those input documents, nested in an array stored under a field named by the Target field setting: recs. To make that clear, let’s preview the output as JSON. Here is the last page of the JSON output:

Figure 21: Output of Group By Fields Snap
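
For intuition, here is a plain-TypeScript sketch of that grouping (not SnapLogic code); note how it relies on the input already being sorted by Date:

// Sketch of Group By Fields over sorted input (for intuition only).
function groupByDate(docs: { Date: string }[]) {
  const groups: { recs: { Date: string }[] }[] = [];
  for (const doc of docs) {
    const last = groups[groups.length - 1];
    if (last && last.recs[0].Date === doc.Date) {
      last.recs.push(doc);          // same Date as the previous doc: extend the group
    } else {
      groups.push({ recs: [doc] }); // new Date: start a new group
    }
  }
  return groups; // one output document per distinct Date, records under "recs"
}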

The last step in this pipeline is another Mapper.

Reduce

You’ve already seen the map function of the expression language used in our last Mapper. This may ring a bell if you’re familiar with the MapReduce programming model at the heart of “Big Data” processing. Well guess what? The SnapLogic expression language also has a reduce function, which we’ll use in our final Mapper. Here’s our configuration (with some side panels collapsed):

Figure 22: Configuration of the final Mapper Snap

Let’s break down the expression:

$recs.reduce((result, elem) => result.merge(elem), {})

$recs is the array within each input document, which looks like this:

Figure 23: Array within each input document

This contains everything we need for the output document.

.reduce(function, initialValue) will reduce this array to a single value using the given function, starting the iteration with the given initial value. Here, this function is expressed as a JavaScript lambda expression:

(result, elem) => result.merge(elem)

The function’s parameters, result and elem, are named on the left side of the =>. The expression on the right computes the function’s result. The reduce function calls this function once for every element in the array, with the output of each call serving as input to the subsequent call via the result parameter, and the next array element’s value arriving via the elem parameter. The initialValue given to reduce is the value of the result parameter for the first iteration. In this case, we’re using the merge function to merge this array of objects into a single object.

This will be easier to understand if we trace each iteration, as sketched below.

On each iteration of reduce, another element of the array gets merged into the output object. Every element has the same Date value, so that stays the same after each iteration, but we end up with a single object that contains all of the country case counts for that date.
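
Here is the iteration trace as a plain-TypeScript sketch; object spread plays the role of SnapLogic’s merge, and the x1–x4 values are placeholders, not real counts:

// Sketch of the reduce, iteration by iteration (for intuition only).
const recs = [
  { Date: "4/2/20", Italy: "x1" },            // x1..x4 stand in for the real counts
  { Date: "4/2/20", Spain: "x2" },
  { Date: "4/2/20", "United Kingdom": "x3" },
  { Date: "4/2/20", US: "x4" },
];

const merged = recs.reduce<Record<string, string>>(
  (result, elem) => ({ ...result, ...elem }), // merge elem into the running result
  {}                                          // initialValue: result for iteration 1
);

// result after each iteration:
//   {}                                                           (initial value)
//   { Date, Italy: x1 }
//   { Date, Italy: x1, Spain: x2 }
//   { Date, Italy: x1, Spain: x2, "United Kingdom": x3 }
//   { Date, Italy: x1, Spain: x2, "United Kingdom": x3, US: x4 }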

You can see the resulting output documents in the Output preview data panel in the lower-right corner of the Reduce Mapper shown above.

Data Visualization

Let’s preview the output of the Reduce Mapper from the previous step.

Figure 24: Final chart-friendly data

Bingo! Our data finally looks chart-friendly. Let’s open the DataViz panel by clicking on the left-pointing arrow shape just above the DataViz label on the right side of the window. The panel expands to occupy the right half.

Set the Chart Type to “Line Chart — Date as X axis”, X-axis to “Date”, and Key to Visualize to the country names to see the final line chart.

Figure 25: Final visualization of the COVID-19 data

I hope this helped you gain a better understanding of some powerful features of the SnapLogic platform. You can download the pipeline from this post in the SnapLogic community and try it out yourself. There are links throughout this article to relevant sections of the SnapLogic Product Documentation, where you’ll find more information.

I wish the best of luck to you and yours during this difficult time.

Stay healthy, stay safe!

UPDATE (April 2021): When I ran this pipeline again recently, I noticed a new problem with date sorting that didn’t occur with the data when I initially developed the pipeline. We’re sorting dates as formatted strings, so “10/1/20” was ending up before “9/30/20” in the sorted output, for example.

To fix this, use a Mapper to reformat the dates:

Here’s the expression as text:

Date.parse($Date, "MM/dd/yy").toLocaleDateString({"format":"yyyy/MM/dd"})
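
The effect is easiest to see on a couple of values. Here is a plain-TypeScript analogue of the reformatting (a sketch only, assuming two-digit years in the 2000s); with zero-padded, year-first strings, alphanumeric order matches chronological order, so this Mapper belongs upstream of the Sort Snap:

// TypeScript analogue of the date-reformatting fix (sketch only).
function toSortableDate(mdy: string): string {
  const [m, d, y] = mdy.split("/");
  return `20${y}/${m.padStart(2, "0")}/${d.padStart(2, "0")}`;
}

toSortableDate("9/30/20"); // "2020/09/30"
toSortableDate("10/1/20"); // "2020/10/01"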
