Schema Evolution With Delta

Delta Lake is an open source storage layer that provides ACID transactions. A Spark DataFrame can be saved in Delta format simply by specifying the format as “delta”. Similarly, a saved Delta table can be read back by specifying the same format.
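As a minimal sketch of that round trip: the helper below is hypothetical (its name and the `path` argument are not from the post), and it assumes a SparkSession that already has the Delta Lake package configured.

```python
def save_and_reload(spark, df, path):
    """Write `df` as a Delta table at `path`, then read it back.

    Assumes `spark` is a SparkSession with Delta Lake available and
    `df` is a Spark DataFrame; both are supplied by the caller.
    """
    # Writing: the only Delta-specific part is the format name.
    df.write.format("delta").mode("overwrite").save(path)
    # Reading: the same format name is used on the reader side.
    return spark.read.format("delta").load(path)
```

The `overwrite` mode is a choice made for this sketch so that it can be re-run against the same path; `append` works the same way.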

The longer we use Delta, the more likely it is that we will run into a scenario where the incoming data has a schema that is slightly different from the schema of the target Delta table. As with everything else around us, schemas evolve over time, and this is a very common scenario.
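One way Delta Lake can handle this on write is the `mergeSchema` option, which adds columns present in the incoming data but missing from the table. The excerpt does not say which mechanism the full post uses, so treat this as an illustrative sketch; the function name and arguments are hypothetical.

```python
def append_with_schema_evolution(df, path):
    """Append `df` to the Delta table at `path`, letting new columns through.

    Assumes `df` is a Spark DataFrame whose schema is a superset of
    (or otherwise compatible with) the target table's schema.
    """
    (
        df.write.format("delta")
        .mode("append")
        # Without this option, a schema mismatch fails the write.
        .option("mergeSchema", "true")
        .save(path)
    )
```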

Run a Flask App with WSGI and NGINX on EC2

In the previous post, titled Create a Simple Flask App on EC2, we created a simple Hello World Flask app and deployed it on an EC2 instance. On running our app, we saw a warning message that read:

WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.

This tells us that while Flask's built-in web server serves our purpose for development, it is not suitable for production. In this post, we build on the previous one: we use a WSGI server to talk to our Flask app, and NGINX to handle the traffic between the WSGI server and the web browser.
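To make the WSGI part concrete, here is a small standard-library sketch of the interface a WSGI server uses to talk to an application. A Flask app object is itself a WSGI callable of this shape. The `application` and `call_like_a_server` names are made up for this example, and the second function only imitates what a real server does, without opening a socket.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A bare WSGI callable returning a plain-text response."""
    body = b"Hello World"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_like_a_server(app):
    """Invoke a WSGI app the way a server would, without any networking."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal valid request
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_like_a_server(application)
print(status, body)
```

In a deployment like the one described, a production WSGI server plays the role of `call_like_a_server` for every incoming request, and NGINX sits in front of it.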

Create a Simple Flask App on EC2

Recently, I was faced with a situation where I had to quickly create a simple service in Python that I could invoke to run some real-time tests based on the input data. I spent a few minutes searching for the quickest way to achieve this and came across Flask. It is very easy to develop a Flask app, or to convert an existing Python app to use Flask and turn it into a service. It is not something we would use in production as is, but for development it works very well.

Create Python Virtual Environment on AWS EC2

Creating virtual environments is one of the most common requirements when developing applications in Python. While I used Anaconda for this on my Windows laptop, for development on my AWS EC2 instance I set it up using a few simple commands. Here I show how to set up a virtual environment on a Linux EC2 instance on AWS.
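The excerpt does not list the commands themselves. One standard way to do this, which may or may not be what the full post uses, is Python's built-in `venv` module (`python3 -m venv <directory>` on the command line). The sketch below does the same from Python; `create_env` is a name made up for this example.

```python
import venv
from pathlib import Path


def create_env(target):
    """Create a virtual environment at `target` and return its path."""
    target = Path(target)
    # with_pip=False keeps the sketch fast and avoids relying on
    # ensurepip being installed; pass with_pip=True to bootstrap pip.
    venv.create(target, with_pip=False)
    return target
```

On Linux, the resulting environment is activated from a shell with `source <directory>/bin/activate`.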

Delta Table Vacuum

Once we start appending, overwriting, or merging data into Delta tables, the number of Parquet files in the target location keeps increasing. It is good practice to keep the number of files in check, as it can soon start affecting read performance.

Delta Lake deals with this through the “vacuum” operation. Vacuum accepts a retention period in hours and deletes the files that are no longer referenced by the table and are older than that period. By default, this limit is 7 days, or 168 hours.
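A minimal sketch of running vacuum through Delta's SQL syntax follows. The `vacuum_table` helper and its arguments are hypothetical, and it assumes a SparkSession with Delta Lake configured and a path-based table.

```python
def vacuum_table(spark, path, retention_hours=168):
    """Vacuum the Delta table stored at `path`.

    Removes files that the table no longer references and that are
    older than `retention_hours` (168 hours is the 7-day default).
    """
    return spark.sql(
        f"VACUUM delta.`{path}` RETAIN {retention_hours} HOURS"
    )
```

Note that Delta Lake has a safety check that rejects retention periods shorter than the default unless it is explicitly disabled, so values below 168 will fail out of the box.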

How to create a Spark DataFrame

A DataFrame is a collection of data, organised much like a table in a relational database, with columns and rows. Many methods are available on a DataFrame to help with filtering, selecting, and aggregating the data within it.

There are many ways to create a DataFrame. Below I show some of the common ones that I have used in PySpark.
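As one illustration, and not necessarily one of the methods the full post covers, a DataFrame can be built from an in-memory list of tuples with `createDataFrame`. The function name, sample rows, and column names below are invented for this sketch, and `spark` is assumed to be an existing SparkSession.

```python
def people_dataframe(spark):
    """Build a small two-column DataFrame from Python tuples."""
    rows = [("Alice", 34), ("Bob", 45)]  # sample data for illustration
    # Passing a list of column names lets Spark infer the column types.
    return spark.createDataFrame(rows, ["name", "age"])
```

The returned DataFrame can then be filtered or aggregated as usual, for example with `people_dataframe(spark).filter("age > 40")`.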
