Job Title: Python Web Scraper
Job ID: PWS202204-01
Roles and Responsibilities
- As a Web Scraper, apply your skills to fetch data from multiple online sources.
- Optimize the scraping capability so that data is scraped efficiently with minimal use of server bandwidth. Scrape difficult websites by deploying anti-blocking and anti-captcha tools.
- Develop highly reliable web crawlers and parsers across various websites
- Extract structured/unstructured data and store it in SQL/NoSQL data stores.
- Work closely with Project/Business/Research teams to provide scraped data for analysis.
- Develop frameworks for automating and maintaining a constant flow of data from multiple sources.
- Develop a deep understanding of the data sources on the web and know exactly how, when, and which data to scrape, parse, and store.
- Participate actively in troubleshooting and debugging.
- Guide and mentor other data engineers.
- Develop a data ingestion framework for automating and maintaining a constant flow of data from multiple sources to the database.
- Perform code reviews and suggest design changes.
- Comply with coding standards and technical design.
- Increase process efficiency by identifying repeatable jobs and automating them using appropriate tools and techniques
- Create efficient web crawlers and find more and better ways to crawl relevant information.
- Stay familiar with best practices and design patterns of programming languages.
Desired Candidate Profile
- Strong knowledge of one or more of the available open-source or proprietary scraping frameworks
- Strong knowledge of scraping frameworks and libraries such as Scrapy, Beautiful Soup, urllib, and Selenium.
- Good knowledge of the Python pandas library for data manipulation.
- Good to have: experience with complex crawling (such as captcha handling, mobile OTP-based crawling, and bypassing proxies)
- Experience with various data extraction methods (for example, extraction from PDF files and web pages)
- Familiarity with AWS and cloud-based technologies is a plus
- Working knowledge of various SQL/NoSQL databases, message queues, and RESTful web APIs
- Experience with Linux-based operating systems (Ubuntu is a plus)
- BSCS, MCS, or a degree in Computer Engineering or Information Technology
- Minimum 1 year of experience, of which 8 months must be hands-on crawling/scraping experience with tools such as Scrapy, Beautiful Soup, Selenium, and APIs.
- Minimum 6 months of Python programming experience is a must.
Send your CV to firstname.lastname@example.org / email@example.com. Mention the job title and ID in the subject line.