Big Data Consultant | Learn | Build | Share https://github.com/ksree

How to analyze global air pollution data on the cloud

According to the WHO, 7 million people die every year from exposure to fine particles in polluted air, which lead to diseases such as stroke, heart disease, lung cancer, chronic obstructive pulmonary disease, and respiratory infections, including pneumonia.

91% of the world’s population lives in places where air pollution exceeds WHO guideline limits.

Treemap of the world's most polluted countries (PM2.5), 2020

Here we analyze global air quality data from OpenAQ:

  • Extract and aggregate global historical air pollution data from the OpenAQ S3 bucket using Apache Spark
  • Find the world's most polluted cities and countries (measured by PM2.5 levels)
  • Calculate monthly and yearly averages for each air quality indicator
  • Visualize using Google…
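
As a rough illustration of the aggregation and ranking steps in the list above, the sketch below averages PM2.5 readings per city and month in plain Python. It is a stand-in for the Spark job, not the article's code: the record layout (city, date, value), the helper names, and the sample readings are all made up for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical readings: (city, measurement date, PM2.5 in µg/m³).
# These values are invented for illustration, not taken from OpenAQ.
readings = [
    ("CityA", date(2020, 1, 5), 80.0),
    ("CityA", date(2020, 1, 20), 100.0),
    ("CityA", date(2020, 2, 3), 60.0),
    ("CityB", date(2020, 1, 7), 10.0),
    ("CityB", date(2020, 1, 9), 20.0),
]


def monthly_averages(rows):
    """Return {(city, year, month): mean PM2.5} for the given rows."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per key
    for city, day, value in rows:
        bucket = totals[(city, day.year, day.month)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def rank_by_average(averages):
    """Return the keys sorted from highest to lowest average."""
    return sorted(averages, key=averages.get, reverse=True)


monthly = monthly_averages(readings)
ranking = rank_by_average(monthly)
for key in ranking:
    print(key, monthly[key])
```

Grouping by (city, year) instead of (city, year, month) would give the yearly averages in the same way. In Spark the same idea would be expressed as a group-by followed by an average, distributed over the full dataset.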


Apache Avro, Java, File Format, Data Serialization

Data serialization with Apache Avro

What is Apache Avro?

Avro is an open-source, language-agnostic data serialization framework. The schema of an Avro file is specified in JSON, making it easy to read and interpret. An Avro data file always stores the schema alongside the data in the same file.
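
To make the JSON schema idea concrete, here is a small sketch of what a record schema can look like, parsed with Python's standard json module. The record name, namespace, and fields are assumptions chosen for illustration; they are not taken from the tutorial's repository.

```python
import json

# Hypothetical schema for illustration; the record and field names are assumptions.
schema_json = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"], "default": null},
    {"name": "favorite_color", "type": ["null", "string"], "default": null}
  ]
}
"""

# Because the schema is plain JSON, any JSON parser can read it.
schema = json.loads(schema_json)
field_names = [field["name"] for field in schema["fields"]]
print(schema["type"], schema["name"], field_names)
```

The two union types `["null", "int"]` and `["null", "string"]` mark optional fields; the default of `null` matches the first branch of each union.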

Avro includes APIs for C, C++, C#, Java, JS, Perl, PHP, Python, and Ruby. Because Avro is language agnostic, files stored using Avro can be passed between programs written in different languages.

You can find the source code for this tutorial here: https://github.com/ksree/apache-avro-demistified

Apache Avro logo. Source: https://www.apache.org/logos/?#avro

Avro File Format

An Avro file consists of a file header, followed by one or more file data blocks.
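
The header layout can be sketched with the standard library alone. The code below follows my reading of the object container file format in the Avro specification: four magic bytes (`Obj` followed by the byte 1), a metadata map holding `avro.schema` and `avro.codec`, and a 16-byte sync marker. The helper names (`encode_long`, `build_header`, `read_header`) and the sample schema are my own assumptions, and this is a teaching sketch rather than a replacement for an Avro library.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the byte 1
SYNC_SIZE = 16      # length of the sync marker in bytes


def encode_long(n):
    """Encode an Avro long: zigzag first, then 7-bit variable-length groups."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(stream):
    """Decode one Avro long from a binary stream."""
    shift = 0
    result = 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zigzag


def encode_bytes(data):
    """Avro bytes and strings are a long length followed by the raw bytes."""
    return encode_long(len(data)) + data


def build_header(schema, sync_marker, codec="null"):
    """Build header bytes: magic, metadata map, sync marker."""
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": codec.encode("utf-8"),
    }
    out = bytearray(MAGIC)
    out += encode_long(len(meta))  # one map block holding every entry
    for key, value in meta.items():
        out += encode_bytes(key.encode("utf-8"))
        out += encode_bytes(value)
    out += encode_long(0)          # a zero count ends the map
    out += sync_marker
    return bytes(out)


def read_header(data):
    """Parse header bytes back into (metadata dict, sync marker)."""
    stream = io.BytesIO(data)
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = decode_long(stream)
        if count == 0:
            break
        if count < 0:              # negative count: a byte size follows
            decode_long(stream)
            count = -count
        for _ in range(count):
            key = stream.read(decode_long(stream)).decode("utf-8")
            meta[key] = stream.read(decode_long(stream))
    return meta, stream.read(SYNC_SIZE)


# Hypothetical one-field schema, used only to exercise the helpers.
schema = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"}]}
sync_marker = os.urandom(SYNC_SIZE)
header = build_header(schema, sync_marker)
meta, sync = read_header(header)
parsed_schema = json.loads(meta["avro.schema"])
```

Each data block that follows the header starts with two longs (the number of objects in the block and their serialized size in bytes), then the serialized objects, and ends with the same 16-byte sync marker, which lets readers detect block boundaries.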


GCP/Dataproc/BigQuery/Data Studio/Apache Spark/Amazon S3/Climatology/Climate Change

Visualize observable changes in global temperature using NOAA’s historical weather data, Apache Spark, BigQuery, and Data Studio

Photo by Karsten Würth on Unsplash

We have all read about the effects of climate change and experience them around us every day. We have seen numbers like: the current global average temperature is 0.85ºC higher than it was in the late 19th century, and each of the past three decades has been warmer than any preceding decade since records began in 1850*.

I got curious about how climatologists determine these numbers. There is a whole lot of research going on in this area. I came across one important weather dataset from NOAA that is widely used in research.

In this blog, I will explain how I…


Azure/GCP/AWS/Terraform/Spark

Create a Multi-Cloud Data Lake using Terraform and run a configuration driven Apache Spark data pipeline on COVID-19 data

Photo by Hunter Harritt on Unsplash

Five years ago, when I started working on enterprise big data platforms, the prevalent data lake architecture was to go with a single public cloud provider or an on-prem platform. These data lakes quickly grew to several terabytes or even petabytes of structured and unstructured data (only 1% of unstructured data is analyzed or used at all). On-prem data lakes hit capacity issues, while single-cloud implementations risked so-called vendor lock-in.

Today, Hybrid Multi-Cloud architectures that use two or more public cloud providers are the preferred strategy. 81% of public cloud users reported using two or more cloud providers.

Kapil Sreedharan