A DataFrame is a collection of data organised, much like a table in a relational database, into named columns and rows. A DataFrame exposes many methods for filtering, selecting, and aggregating the data it holds.
There are many ways to create a DataFrame. Below are some of the common ones I have used in PySpark.
Imports
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.getOrCreate()
# the SparkContext is needed later for creating RDDs
sc = spark.sparkContext
Creating an empty DataFrame
emptySchema = StructType([])
# create empty dataframe with empty schema
emptyDF = spark.createDataFrame([], emptySchema)
schema = StructType([
StructField("id", StringType()),
StructField("dt", StringType()),
StructField("value", DoubleType())
])
# create empty dataframe with a defined schema
emptyDFWithSchema = spark.createDataFrame([], schema)
Creating DataFrame with data
# create a dataframe with given data
dfFromData = spark.createDataFrame([['Alex',16,10],['Tom',16,20],['Bob',15,12]])
schema = StructType([
StructField("name", StringType()),
StructField("age", IntegerType()),
StructField("points", IntegerType())
])
# create a dataframe with given data as well as schema
dfFromDataWithSchema = spark.createDataFrame([['Alex',16,10],['Tom',16,20],['Bob',15,12]], schema)
Create DataFrame from an RDD
rdd1 = sc.parallelize(["jan","feb","mar","april","may","jun"],3)
schema = StructType([StructField("month", StringType())])
# create dataframe using the createDataFrame method
rddDF1 = spark.createDataFrame(rdd1.map(lambda x: (x,)))
# create dataframe using the createDataFrame method and a specified schema
rddDF1WithSchema = spark.createDataFrame(rdd1.map(lambda x: (x,)), schema)
# create dataframe using the toDF method
rddDF2 = rdd1.map(lambda x: (x,)).toDF()
# create dataframe using the toDF method and a specified schema
rddDF2WithSchema = rdd1.map(lambda x: (x,)).toDF(schema)
Create DataFrame using a list
l = [('Tim','10','12'),('Tom','5','9'),('Harry','10','5')]
listDF = spark.createDataFrame(l, ['name','val1','val2'])
Create DataFrame from a Pandas DataFrame
l = [('Tim','10'),('Tom','5'),('Harry','15')]
pandasDF = pd.DataFrame(l, columns = ['Name', 'Age'])
sparkDF = spark.createDataFrame(pandasDF)
Create DataFrame using a CSV file
path = '/in-data/testfile.csv'
schema = StructType([
StructField("id", StringType()),
StructField("dt", StringType()),
StructField("value", DoubleType())
])
csvDF = spark.read.csv(path, schema=schema, header=True)