Schema Evolution With Delta

Delta Lake is an open-source storage layer that provides ACID transactions on top of Parquet files. Spark DataFrames can be saved in Delta format simply by specifying the format as “delta”, and a saved Delta table can be read back the same way, by specifying the format as “delta”.
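
As a quick sketch, a write and a subsequent read in PySpark might look like the example below. The path /tmp/delta/users and the sample data are purely illustrative, and the session configuration assumes the delta-spark package is available on the classpath.

```python
from pyspark.sql import SparkSession

# Minimal sketch; assumes the delta-spark package is on the classpath
# and enabled for this session via the standard Delta configs.
spark = (
    SparkSession.builder
    .appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)

# Illustrative sample data.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Save the DataFrame as a Delta table by specifying the format as "delta".
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

# Read the Delta table back, again by specifying the format as "delta".
users = spark.read.format("delta").load("/tmp/delta/users")
users.show()
```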

The longer we use Delta, the more likely it is that we will run into a scenario where the incoming data has a schema that differs slightly from the schema of the target Delta table. Like everything else around us, schemas evolve over time, and this is a very common scenario. A sketch of what that can look like is shown below.
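
As a hedged illustration, the append below reuses the spark session and the /tmp/delta/users table from the earlier sketch and writes rows that carry an extra column the target table does not have yet; Delta’s mergeSchema write option is one way to let the table schema evolve to accommodate it.

```python
# Incoming data with an extra "email" column that the target table lacks.
new_rows = spark.createDataFrame(
    [(3, "carol", "carol@example.com")],
    ["id", "name", "email"],
)

# Without mergeSchema this append would fail with a schema mismatch;
# with it, the "email" column is added to the target table's schema.
(
    new_rows.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/users")
)
```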

Continue reading “Schema Evolution With Delta”

Delta Table Vacuum

Once we start appending, overwriting, or merging data into Delta tables, the number of Parquet files in the target location keeps growing. It is good practice to keep the number of files in check, as it can soon start to affect read performance.

Delta Lake deals with this through the “vacuum” operation. Vacuum accepts a retention period in hours and deletes files that are no longer referenced by the table and are older than that threshold. By default, the retention period is 7 days (168 hours).
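
A sketch of running vacuum through the Python API is below; it reuses the spark session and illustrative table path from the earlier examples, and the 200-hour retention value is just an example.

```python
from delta.tables import DeltaTable

# Bind to the existing Delta table at the illustrative path.
table = DeltaTable.forPath(spark, "/tmp/delta/users")

# Remove unreferenced files older than the default retention of 168 hours.
table.vacuum()

# Or pass an explicit retention period, in hours.
table.vacuum(200)
```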

Continue reading “Delta Table Vacuum”