Python Web Scraper

Job Title: Python Web Scraper

Job ID: PWS202204-01

Job description

Roles and Responsibilities

  • As a Web Scraper, you will apply your skills to fetch data from multiple online sources.
  • Optimize the scraping capability so that data is scraped efficiently with minimal use of server bandwidth. Scrape difficult websites by deploying anti-blocking and anti-CAPTCHA tools.
  • Develop highly reliable web crawlers and parsers across various websites
  • Extract structured/unstructured data and store it in SQL/NoSQL data stores.
  • Work closely with Project/Business/Research teams to provide scraped data for analysis.
  • Develop frameworks for automating and maintaining a constant flow of data from multiple sources.
  • Develop a deep understanding of the data sources on the web and know exactly how, when, and which data to scrape, parse, and store.
  • Participate actively in troubleshooting and debugging.
  • Guide and mentor other data engineers.
  • Develop a data ingestion framework for automating and maintaining a constant flow of data from multiple sources to the database.
  • Perform code reviews and suggest design changes.
  • Comply with coding standards and technical design.
  • Increase process efficiency by identifying repeatable jobs and automating them using appropriate tools and techniques
  • Create efficient web crawlers and find more and better ways to crawl relevant information.
  • Familiarity with best practices and design patterns of programming languages

Desired Candidate Profile

  • Strong knowledge of one or more of the available open-source and proprietary scraping frameworks
  • Strong knowledge of scraping frameworks and libraries such as Scrapy, Beautiful Soup, urllib, and Selenium
  • Good knowledge of the Python pandas library for data manipulation
  • Good to have: experience with complex crawling (such as CAPTCHA handling, mobile OTP-based crawling, and proxy bypassing)
  • Experience in various data extraction methods (like data extraction from PDF Files, web pages, etc.)
  • Good understanding of the HTML DOM, CSS, JavaScript, XPath, and RESTful web services
  • Familiarity with AWS and other cloud-based technologies is a plus
  • Working knowledge of various SQL/NoSQL databases, message queues, and RESTful web APIs
  • Experience with Linux-based operating systems (Ubuntu would be a plus)

Qualifications

  • BSCS / MCS / Computer Engineering / Information Technology
  • Minimum 1 year of experience, of which 8 months must be hands-on crawling/scraping using tools such as Scrapy, Beautiful Soup, Selenium, and APIs.
  • Minimum 6 months of Python programming experience is a must.

Send your CV to career@3gca.org / hr@3gca.org. Mention the job title and ID in the subject line.