Reptile Engineer JD - engineer, induction, JD, reptile

Responsible for: Multi-platform information capture, cleaning and analysis work

Requirements:

Familiar with common open source crawler frameworks, such as scrapy / pyspider?
Understand the principle of cookie-based login, and be familiar with common information extraction techniques , Such as regular expressions, XPath
Familiar with common anti-crawler technologies, and have certain resistance capabilities
Experience in distributed crawler architecture*

Responsible:

Design and develop a distributed web crawler system, capture and analyze multi-platform information, monitor the crawler progress and alert feedback in real time
Web page information and APP data extraction, cleaning, and deduplication, etc. Work

Requirements:

Have a solid algorithm and data structure ability
Familiar with crawler principles, Familiar with common anti-crawler technologies
Master the http protocol, familiar with common data extraction technologies such as html, dom, xpath, etc.
Persons with experience in large-scale data processing, data mining, information extraction, etc. Priority

Responsible: p>

Responsible for the design and development of a distributed web crawler system, for the capture and analysis of multi-platform information
Responsible for the extraction of page content for web search, and the filter weight under the search field ( simhash/minhash), clustering, anti-spam, page analysis, tags, classifiers (Bayes/Bayes/LR/SVM), data mining, etc., to improve the crawling efficiency of the platform
Participate in crawlers Core algorithm and strategy optimization, familiar with the scheduling strategy of the collection system
Real-time monitoring of crawler progress and alert feedback

Requirements:

Familiar with Linux system, mastering languages such as Python
Mastering the principle of web crawling And technology, understand the principle of cookie-based login, and be familiar with web page information extraction technologies based on regular expressions, XPath, CSS, etc.
Familiar with the entire crawler design and implementation process, have experience in large-scale web information extraction and development, and be familiar with Various anti-crawler technologies, with experience in distributed crawler architecture
The ability to link analysis (pagerank, trustrank), feature extraction (page quality, authority, topic, linear/non-linear regression, LDA), etc. is preferred

Responsible:

< ul>

Responsible for designing and developing a general crawler system, extracting and analyzing various platform page contents;

Study various websites and link forms, and discover their characteristics and laws;< /li>

Solve technical problems, including anti-anti-climbing, pressure control, etc., to improve the efficiency and quality of web crawling;

Requirements:< /p>

Proficient in python, computer networks, proficient in multi-threading, familiar with common crawler frameworks such as Scrapy;
Familiar with Linux operations, regular expressions, MySQL, MongoDB and other common databases, understand Various Web front-end technologies;
Can solve the problems of account closure, IP closure, verification code recognition, image recognition, etc.;

Responsible:

Develop a distributed web crawler system to capture information on multiple platforms and Analysis work?
Responsible for web page information and App data extraction and deduplication work?
Cooperate with algorithm posts to complete ETL related tasks

Requirements:

Master the principles and technologies of web crawling, understand the principles of cookie-based login, and be familiar with web information extraction technologies based on regular expressions and XPath?

< li>Familiar with commonly used open source crawler frameworks, such as scrapy / pyspider?

Solid coding ability and algorithm foundation, familiar with Python / Shell development under Linux