PySpark connection with MS SQL Server

Now this is very easy task but it took me almost 10+ hours to figured it out that how it should be done properly. Now first of all you need to create or get spark session and while creating session you need to specify the driver class as shown below (I was missing this configuration initially). I'm connecting my spark session with MS SQL server So that's why I have included SQL driver JAR file. In case you are dealing with some other database So then you have to specify different driver.

from pyspark.sql.session import SparkSession
spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath","mssql-jdbc-6.4.0.jre8.jar") \
    .appName("Python Spark SQL data source example") \

After creating spark session successfully, Now we'll read some data from our MS SQL server.

jdbcDF ="jdbc") \
    .option("url", "jdbc:sqlserver://localhost:1433;databaseName=DB_NAME") \
    .option("dbtable", "TABLE_NAME") \
    .option("user", "DB_USER") \
    .option("password", "DB_PASSWORD").load()

That's it. Your spark dataframe is created successfully and now you can apply different transform methods on it. As a test you can check number of rows in the dataframe by following method.


And if you want to have a quick glance of what's in your dataframe so you should use .show() method.

I think this is it for this particular example. let me know if you guys face any issue while reading data from SQL server and I try my best to help you out, Thanks and Bye! :)