Aggregating Data With Bash
Bash scripting is not always the best way to work with data. There are many technologies that might suit your particular project or dataset better. When most people hear about data mining they think of tools like SQL, Python, or R, which are all great, flexible options for working with data. Sometimes, however, you are stuck in a purely Bash/Unix environment and need to use Bash scripting to work with text-based data.
There is a lot that can be done with basic Bash tools and scripts; however, there are some hurdles you will need to overcome to do data analysis through Bash scripting. Bash, of course, is the shell that comes with many Unix operating systems. If you have little or no experience using a terminal for development, I would highly recommend going through some tutorials to familiarize yourself with the basics.
Bash provides many built-in tools that make analyzing raw text data much easier. To really work with data, though, you need tools that let you aggregate and overlap segments of it. Luckily, Bash ships with programs that allow us to build those tools ourselves. In this post I’d like to talk about aggregating data, which is a little easier and more straightforward than overlapping data. Aggregating is also important for getting high-level counts on different cuts of your dataset.
For this post I am going to be using AWK. AWK is a very useful text-processing language, and it will be the main technology allowing us to sum up segmented data. I am choosing AWK because it comes with most Unix/Linux distros and is an easy tool for illustrating how to build a basic data aggregator. If you were building this tool for day-to-day data analysis I would recommend a language like Python, which gives you a little more flexibility. I have a more realistic Python example here if you are interested.
We will want an aggregator that scales well with large amounts of data, so my design expects sorted input so that we never have to hold more than a couple of rows of data in memory at a time. This adds some processing overhead because we will need to sort our input first, but it avoids memory issues when you are working with very large datasets. The basic logic is this: as the program loops through each line of input, it checks that row’s groupBy column; if it is the same as the previous row’s, it adds the sum column to a counter variable. If it is not, it prints the group’s total and starts counting again, with the final group printed at the end of the file. Here is my data aggregator in AWK:
BEGIN{
    # Expect tab delimited data
    FS="\t"; OFS="\t";
    # Initialize variables (no record has been read yet,
    # so there is nothing to put in holder at this point)
    holder=""; counter=0;
}
{
    # Print the running total when the groupBy column changes
    if ($groupBy != holder && NR != 1) {
        print holder,counter;
        holder=$groupBy;
        counter=$sum;
    }
    # Otherwise keep summing the current group
    else {
        holder=$groupBy;
        counter=counter+$sum;
    }
}
END{
    # Output the final group
    print holder,counter;
}
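As a quick illustration of how this is meant to be called, here is a minimal sketch. It assumes the script above is saved as sum.awk (the file name I use in the commands later); the input file name and the column numbers here are just placeholders for a tab-delimited file with a header row:
# Drop the header row, sort on the group column, then aggregate
sed '1d' your_data.txt \
    | sort -t$'\t' -k3,3 \
    | awk -f sum.awk -v groupBy=3 -v sum=20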
This script takes the group-by column and the sum column as AWK variables, and it outputs a total for each group or segment to the terminal. Now that we have our aggregation script in order, we need some data to work with. I am going to use a modified version of the sample Coffee Chain business dataset that comes with Tableau Desktop to work through how to sum and analyze data. Here is a link to the tab-delimited dataset that I am using. You can also get this dataset in a slightly different format from Tableau’s website.
One of the major drawbacks of using Bash is that we have to remember which column holds which field as we perform our analysis. This is not an issue if you are using something like SQL, where the fields in each record are already named. Here are the columns that we will be working with (there is a one-liner after the list for regenerating this mapping yourself):
1 Area Code
2 Date
3 Market
4 Market Size
5 Product
6 Product Line
7 Product Type
8 State
9 Type
10 Budget COGS
11 Budget Margin
12 Budget Profit
13 Budget Sales
14 COGS
15 Inventory
16 Margin
17 Marketing
18 Number of Records
19 Profit
20 Sales
21 Total Expenses
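If you do not want to maintain a mapping like this by hand, you can generate it straight from the file's header row. Here is a quick sketch, assuming the tab-delimited file with a header row described above:
# Print each header field on its own line with its column number
head -1 coffee_chain_data.txt | tr '\t' '\n' | cat -n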
For my examples, I will ignore most of these columns in favor of generating basic high-level cuts and counts of the underlying data. Let’s take a look at Profit, Sales, and Total Expenses cut by Market, State, and Product. We can get all of those totals with a few simple Bash for loops:
# Loop through each dimension: Market, Product, State
for group in 3 5 8
do
    # Print the group header (taken from the file's header row)
    awk -v group=$group 'BEGIN{FS=OFS="\t"} NR==1 {print "\n"$group":"; exit}' coffee_chain_data.txt
    # Loop through the sum columns: Profit, Sales, Total Expenses
    for sumCol in 19 20 21
    do
        # Print the sum column header
        awk -v sumCol=$sumCol 'BEGIN{FS=OFS="\t"} NR==1 {print "\n"$sumCol":"; exit}' coffee_chain_data.txt
        # Drop the header, sort on the group column, and aggregate
        sed '1d' coffee_chain_data.txt \
            | sort -t$'\t' -k$group,$group \
            | awk -f sum.awk -v groupBy=$group -v sum=$sumCol
    done
done
The output of this command will look like this:
Market:
Profit:
Central 96844
East 83083
South 34762
West 110838
Sales:
Central 265045
East 178576
South 103926
West 272264
Total Expenses:
Central 69033
East 50113
South 31002
West 79514
Product:
Profit:
Amaretto 9602
Caffe Latte 11883
Caffe Mocha 31790
...
We could improve efficiency by modifying our aggregation script to sum multiple columns at once, which would let us read through the entire dataset once per grouping column instead of once for every group and sum column combination. If you find yourself aggregating data in Bash quite a bit, a more flexible summing script like the sketch below is essential. However, for the sake of simplicity I will leave the examples above as they are. Feel free to modify a script like this to fit your own specific needs.
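To give an idea of what that might look like, here is a rough sketch of a multi-column variant. This is not the script used above, just an illustration: it assumes the sum columns are passed as a comma-separated list (for example -v sums=19,20,21) and that the input is sorted on the group column, as before.
BEGIN{
    FS="\t"; OFS="\t";
    # sums is a comma separated list of column numbers, e.g. "19,20,21"
    n=split(sums, cols, ",");
    holder="";
}
{
    if ($groupBy != holder && NR != 1) {
        # Group changed: print the totals and restart the counters
        line=holder;
        for (i=1; i<=n; i++) {
            line=line OFS totals[i];
            totals[i]=$(cols[i]);
        }
        print line;
    }
    else {
        # Same group: add each sum column to its running total
        for (i=1; i<=n; i++) totals[i]+=$(cols[i]);
    }
    holder=$groupBy;
}
END{
    # Output the final group
    line=holder;
    for (i=1; i<=n; i++) line=line OFS totals[i];
    print line;
}
You might save this as something like multi_sum.awk (a hypothetical name) and invoke it with awk -f multi_sum.awk -v groupBy=3 -v sums=19,20,21, so each grouping column needs only one pass over the data.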
You can also cut the data by month (or any other dimension, for that matter) by combining that field with your groupBy column. Here is an example of total Market Sales by month:
# Print aggregated data grouped by Date and Market
awk 'BEGIN{FS=OFS="\t"} NR!=1 {$3=$2"-"$3; print}' coffee_chain_data.txt \
    | sort -t$'\t' -k3,3 | awk -f sum.awk -v groupBy=3 -v sum=20
01/01/2012-Central 10346
01/01/2012-East 6780
01/01/2012-South 3976
01/01/2012-West 10453
02/01/2012-Central 10503
02/01/2012-East 6920
02/01/2012-South 4082
02/01/2012-West 10587
...
This would allow you to trend the aggregated Market totals over time. You could use the same method to get totals by Market and Product, by State and Type (sketched below), or by any other aggregation of the dataset that you can think of.
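As a quick sketch of that State and Type cut, the command below follows the exact same pattern: combine the two columns, sort, and aggregate. It assumes the column layout listed earlier, with State in column 8, Type in column 9, and Sales in column 20.
# Combine State and Type into a single groupBy column, then total Sales
awk 'BEGIN{FS=OFS="\t"} NR!=1 {$8=$8"-"$9; print}' coffee_chain_data.txt \
    | sort -t$'\t' -k8,8 | awk -f sum.awk -v groupBy=8 -v sum=20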
This may not be the best way to work with this type of data, but if you find it necessary to use Bash for data analysis, this method can go a long way, and I hope you have found this post useful. I plan to get another post up about overlapping various segments soon, so keep an eye out for that one. If you have any comments or questions, or just want to let me know how you work with data, don’t hesitate to leave a comment below.