Creating a Scraper/Crawler using Selenium in Python

Why use Selenium WebDriver?

We use Selenium primarily to fully render a web page, since most sites today are built with modern JavaScript frameworks. It is mostly used for developing crawlers/scrapers that gather data from the different pages of a website.

Selenium is also used for automated testing of web applications. For example, if you want to verify that submitting a form on your website redirects users to the correct page, you can write a test to make sure of that.
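To give you an idea, once Selenium and a browser driver are installed (covered in the next section), such a test could look like the following minimal sketch. The URL, field names, and expected redirect target are hypothetical placeholders, not a real site:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://example.com/signup')  # hypothetical form page

# fill the form and submit it
driver.find_element_by_name('email').send_keys('test@example.com')
driver.find_element_by_name('submit').click()

# assert that the browser was redirected to the expected page
assert driver.current_url == 'http://example.com/welcome'

driver.quit()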

I will use Python to give you a short demo of how to use Selenium WebDriver for scraping data from web pages. The development platform will be Ubuntu 14.04.

Installing Selenium using pip

Before installing Selenium, you need to install one of the following browsers:

  • PhantomJS
  • Chrome
  • Firefox

For this post I'm going to use PhantomJS, as it is widely used for developing crawlers. Open up your terminal and run the following commands to download PhantomJS and copy it into your bin directory.

$ wget https://github.com/Pyppe/phantomjs2.0-ubuntu14.04x64/raw/master/bin/phantomjs
$ sudo cp phantomjs /usr/local/bin/phantomjs
$ sudo chmod +x /usr/local/bin/phantomjs
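
You can verify that PhantomJS is correctly installed and on your PATH by checking its version:

$ phantomjs --version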

Now it's time to install the Selenium bindings using pip. Create a virtual environment, activate it, and run the following commands:

$ virtualenv venv
$ . venv/bin/activate
$ sudo apt-get install libxml2-dev libxslt1-dev python-dev
$ pip install selenium
$ pip install lxml
$ sudo apt-get install python-lxml
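
To confirm that Selenium was installed into your virtual environment, you can print its version:

$ python -c "import selenium; print(selenium.__version__)"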

At that point you'll have everything you need to start writing your first scraper using Selenium.

Writing a small scraper using Selenium

In the following example we will scrape the URL of an Instagram profile picture and then download that picture. In my case I'm going to scrape the profile picture of the official Gotham Instagram page.

Now create a learning-scraping.py file in your project directory and import all the necessary modules.

learning-scraping.py

import urllib
from selenium import webdriver

webdriver will be used to create a PhantomJS instance, and we'll then use that instance to find anything on a web page using XPath queries. Now get the profile URL from Instagram and store it in a variable.

# Official profile url for Gotham page
profile_link = 'https://www.instagram.com/gothamonfox/'

Now create a class that will be used for scraping data from the given URL.

class InstagramScraper(object):

    def __init__(self):
        pass

    def scrape_job_links(self):
        pass

    def scrape(self):
        pass

Now, inside __init__, create an instance of the PhantomJS browser and set its window size.

def __init__(self):
    self.driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
    self.driver.set_window_size(1120, 550)

OK, we now have an instance of a PhantomJS browser, and it's time to give that browser a link/URL. To do that, there is a method called get(url) which takes a URL as an argument. Call it on self.driver.

def scrape_job_links(self):
    """ Scrape the Instagram profile picture link and then download it """
    self.driver.get(profile_link)
    print self.driver.title

Now query using the find_elements_by_xpath() method. It returns a list of all elements that match the XPath query.

xpath_for_img_tag = "//div[@class='i38']/img"

# querying using xpath and then taking the first element of the list
src_url_to_dp = self.driver.find_elements_by_xpath(xpath_for_img_tag)[0]
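
Keep in mind that Instagram's markup (including the i38 class used in the XPath above) can change at any time; if the query matches nothing, indexing [0] raises an IndexError. A slightly more defensive variant might look like this:

matches = self.driver.find_elements_by_xpath(xpath_for_img_tag)
if not matches:
    raise ValueError("No profile image found; the page markup may have changed")
src_url_to_dp = matches[0]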

Now that you have successfully scraped an Instagram profile and fetched the URL of the display picture, it's time to download it. Add the following line to download the image from the given URL.

urllib.urlretrieve(src_url_to_dp.get_attribute("src"), "profile-pic.jpg")
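
Note that urllib.urlretrieve is the Python 2 spelling; if you run the script under Python 3, the same function lives in urllib.request:

# Python 3 equivalent of the line above
from urllib.request import urlretrieve
urlretrieve(src_url_to_dp.get_attribute("src"), "profile-pic.jpg")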

After this line executes, the image will be downloaded to your project's root directory.
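
The walkthrough above filled in __init__ and scrape_job_links but never showed the scrape() method or how the class is driven, so for reference, here is one way the complete learning-scraping.py could be wired together. The delegation from scrape() to scrape_job_links() and the driver.quit() cleanup are my assumptions, not part of the original post:

import urllib
from selenium import webdriver

# Official profile url for Gotham page
profile_link = 'https://www.instagram.com/gothamonfox/'


class InstagramScraper(object):

    def __init__(self):
        # assumes phantomjs was copied to /usr/local/bin as shown earlier
        self.driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
        self.driver.set_window_size(1120, 550)

    def scrape_job_links(self):
        """ Scrape the Instagram profile picture link and then download it """
        self.driver.get(profile_link)
        print self.driver.title

        xpath_for_img_tag = "//div[@class='i38']/img"
        # query using xpath and take the first matching element
        src_url_to_dp = self.driver.find_elements_by_xpath(xpath_for_img_tag)[0]
        # download the image into the project directory
        urllib.urlretrieve(src_url_to_dp.get_attribute("src"), "profile-pic.jpg")

    def scrape(self):
        # run the scraper, then close the headless browser (assumed wiring)
        try:
            self.scrape_job_links()
        finally:
            self.driver.quit()


if __name__ == '__main__':
    InstagramScraper().scrape()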

Congratulations! You have created a very basic script using Selenium that scrapes the URL of an Instagram profile picture. But this is just a small demo of Selenium. You can do a lot more than just pull URLs from a web page: you can interact with page elements, e.g. filling text fields, choosing an option from a drop-down menu, clicking buttons, and almost everything else you can do in an actual web browser.
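
As a quick taste of those interactions, here is a minimal sketch; the page URL and element names below are hypothetical placeholders, not a real site:

from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.PhantomJS()
driver.get('http://example.com/form')  # hypothetical page

# fill a text field
driver.find_element_by_name('username').send_keys('hassan')

# choose an option from a drop-down menu
Select(driver.find_element_by_name('country')).select_by_visible_text('Pakistan')

# click a button
driver.find_element_by_id('submit-btn').click()

driver.quit()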

What's Next?

You have just gotten started, and there is a lot more of Selenium to look into and explore. The official Selenium documentation for Python is a very useful resource on the way to becoming an expert.

Feel free to comment below if you face any problems getting started with Selenium in Python. I will surely try my best to sort them out. Thanks!
