Schema Evolution With Delta

Delta Lake is an open-source storage layer that provides ACID transactions on top of Parquet files. Spark DataFrames can be saved in Delta format simply by specifying the format as “delta”, and a saved Delta table can be read back the same way, by specifying the format as “delta”.
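
As a quick sketch, a write and a subsequent read in PySpark might look like the example below. The path /tmp/delta/users and the sample data are purely illustrative, and the session configuration assumes the delta-spark package is available on the classpath.

```python
from pyspark.sql import SparkSession

# Minimal sketch; assumes the delta-spark package is on the classpath
# and enabled for this session via the standard Delta configs.
spark = (
    SparkSession.builder
    .appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)

# Illustrative sample data.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Save the DataFrame as a Delta table by specifying the format as "delta".
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

# Read the Delta table back, again by specifying the format as "delta".
users = spark.read.format("delta").load("/tmp/delta/users")
users.show()
```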

The longer we use Delta, the more likely it is that we will run into a scenario where the incoming data has a schema that differs slightly from the schema of the target Delta table. Like everything else around us, schemas evolve over time, and this is a very common scenario. A sketch of what that can look like is shown below.
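
As a hedged illustration, the append below reuses the spark session and the /tmp/delta/users table from the earlier sketch and writes rows that carry an extra column the target table does not have yet; Delta’s mergeSchema write option is one way to let the table schema evolve to accommodate it.

```python
# Incoming data with an extra "email" column that the target table lacks.
new_rows = spark.createDataFrame(
    [(3, "carol", "carol@example.com")],
    ["id", "name", "email"],
)

# Without mergeSchema this append would fail with a schema mismatch;
# with it, the "email" column is added to the target table's schema.
(
    new_rows.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/users")
)
```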

Continue reading “Schema Evolution With Delta”

Delta Table Vacuum

Once we start appending, overwriting, or merging data into Delta tables, the number of Parquet files in the target location keeps growing. It is good practice to keep the number of files in check, as it can soon start to affect read performance.

Delta Lake deals with this through the “vacuum” operation. Vacuum accepts a retention period in hours and deletes files that are no longer referenced by the table and are older than that threshold. By default, the retention period is 7 days (168 hours).
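
A sketch of running vacuum through the Python API is below; it reuses the spark session and illustrative table path from the earlier examples, and the 200-hour retention value is just an example.

```python
from delta.tables import DeltaTable

# Bind to the existing Delta table at the illustrative path.
table = DeltaTable.forPath(spark, "/tmp/delta/users")

# Remove unreferenced files older than the default retention of 168 hours.
table.vacuum()

# Or pass an explicit retention period, in hours.
table.vacuum(200)
```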

Continue reading “Delta Table Vacuum”