Ana Quil
01-19-2022

Web crawling is probably the most common data collection requirement, for example when you want the review data of restaurants. Of course, we should pay attention to copyright issues here, and many websites also have anti-crawling mechanisms.
The most direct way is to write the crawler in Python, which of course assumes you know basic Python syntax. PHP can also be used to write crawlers, but it is not as capable as Python, especially when multi-threaded operations are involved.
A Python crawler basically involves three steps.
Fetch content with requests. The requests library is one of the workhorses of Python crawling: it is Python's HTTP library, and it makes downloading web pages very convenient, saving us a lot of time.
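For example, a minimal sketch of fetching a page with requests might look like this (the URL and headers are placeholders, not a real target site):

```python
# Fetch a page with requests; URL and headers are illustrative placeholders.
import requests

url = "https://example.com/restaurants"  # hypothetical page to crawl
headers = {"User-Agent": "Mozilla/5.0"}  # many sites reject requests without a UA

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()                       # raise an error on 4xx/5xx responses
response.encoding = response.apparent_encoding    # guess encoding for non-UTF-8 pages
html = response.text                              # raw HTML to be parsed in the next step
print(html[:200])
```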
Parse content with XPath. XPath stands for XML Path Language, a language used to locate parts of an XML (or HTML) document. It is often used as a small query language in development, and it can index content by elements and attributes.
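As a rough illustration, here is how XPath expressions could pull fields out of a page using the lxml library (the HTML snippet and paths are made up for this example):

```python
# Parse HTML with XPath via lxml; the snippet and expressions are illustrative.
from lxml import etree

html = """
<html><body>
  <div class="restaurant"><h2>Cafe A</h2><span class="score">4.5</span></div>
  <div class="restaurant"><h2>Cafe B</h2><span class="score">4.2</span></div>
</body></html>
"""

tree = etree.HTML(html)
# select nodes by element and attribute, which is exactly what XPath indexes
names  = tree.xpath('//div[@class="restaurant"]/h2/text()')
scores = tree.xpath('//div[@class="restaurant"]/span[@class="score"]/text()')
print(list(zip(names, scores)))   # [('Cafe A', '4.5'), ('Cafe B', '4.2')]
```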
Save data with pandas. pandas provides high-level data structures that make data analysis easier. We can use it to hold the crawled data and finally write it out to an Excel file or a MySQL database.
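A minimal sketch of this last step, assuming the rows crawled above (the file name, connection string, and table name are placeholders):

```python
# Save crawled rows with pandas; output targets are illustrative placeholders.
import pandas as pd

rows = [
    {"name": "Cafe A", "score": 4.5},
    {"name": "Cafe B", "score": 4.2},
]
df = pd.DataFrame(rows)

# write to an Excel file (needs the openpyxl package installed)
df.to_excel("restaurants.xlsx", index=False)

# or write to MySQL via SQLAlchemy (connection string is an assumption)
# from sqlalchemy import create_engine
# engine = create_engine("mysql+pymysql://user:password@localhost/crawler")
# df.to_sql("restaurants", engine, if_exists="append", index=False)
```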
requests, XPath, and pandas are three powerful tools for Python. Of course, there are many other sharp tools for Python crawlers, such as Selenium, PhantomJS, or Puppeteer's headless mode, which help when pages are rendered by JavaScript.
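For instance, a minimal sketch of rendering a JavaScript-heavy page with headless Chrome through Selenium might look like this (the URL is a placeholder, and chromedriver must be available on the machine):

```python
# Render a dynamic page with headless Chrome via Selenium; URL is a placeholder.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")      # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")   # hypothetical JavaScript-rendered page
    html = driver.page_source           # HTML after the scripts have executed
    print(html[:200])
finally:
    driver.quit()
```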
In addition, we can collect web information without programming at all by using off-the-shelf collection tools.
How to use log collection tools
Sensor collection depends on specific devices, which gather whatever information the hardware is built to capture, so we won't focus on it here.
Let's look at log collection.
Why collect logs? The biggest value of log collection is that, by analyzing user access, we can improve system performance and carrying capacity; it also makes it easier for engineers to optimize based on how users actually access the system.
Log collection is also one of the important tasks of operations and maintenance staff. So what are logs, and how do we collect them?
A log is like a diary: it records the whole process of users visiting the website: who visited, at what time, through what channel (such as a search engine or typing the address directly), what operations they performed, and whether the system produced errors. It can even include the user's IP address, HTTP request time, user agent, and so on. This log data can be written into one log file or split into different files, such as an access log and an error log.
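To make this concrete, here is a small sketch of parsing one line of a web server access log in the widely used "combined" format (the sample line is made up for illustration):

```python
# Parse one access-log line in the common "combined" format; sample line is fabricated.
import re

line = ('203.0.113.7 - - [19/Jan/2022:10:15:32 +0000] "GET /index.html HTTP/1.1" '
        '200 5123 "https://www.google.com/" "Mozilla/5.0"')

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

match = pattern.match(line)
if match:
    # ip, time, request, status, size, referrer and user agent as a dict
    print(match.groupdict())
```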
What are tracking points?
Adding tracking points, sometimes called event tracking, is the key step of log collection. So what is a tracking point?
A tracking point collects the relevant information at the place where it is needed and reports it: for example, a page view together with user and device information, or the user's actions on the page, including how long they stayed. Each tracking point is like a camera: it collects user behavior data, and multi-dimensional cross analysis of that data can truly reconstruct how users use the product and what they need.
So how do we add tracking points?
Adding a tracking point means implanting statistics code wherever you need statistics. The implanted code can be written by yourself or come from a third-party analytics tool. Following the principle of not reinventing the wheel, you generally write your own code only for the core business; for mature monitoring needs such as tracking, third-party tools are recommended, for example Umeng, Google Analytics, or TalkingData. They all embed tracking code in the front end, and you can then see the users' behavior data inside the third-party tool. However, if we want to see deeper user behavior, we need custom tracking points, as in the sketch below.
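As a rough sketch of what a custom tracking point could report, the snippet below sends an event to a collection endpoint of our own; the URL and field names are hypothetical, not a real analytics API:

```python
# Report a custom tracking event; endpoint URL and payload fields are assumptions.
import time
import requests

def report_event(user_id, event, properties=None):
    payload = {
        "user_id": user_id,
        "event": event,                  # e.g. "page_view", "button_click"
        "properties": properties or {},  # extra dimensions such as page or duration
        "timestamp": int(time.time()),
    }
    # send the event to our own collection endpoint (assumed to exist)
    requests.post("https://analytics.example.com/collect", json=payload, timeout=5)

report_event("u123", "page_view", {"page": "/pricing", "stay_seconds": 42})
```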
To sum up, log collection helps us understand users' behavior data and suits scenarios such as operations monitoring, security auditing, and business data analysis. Web servers generally have logging built in, and Flume can be used to collect, aggregate, and transmit large amounts of log data from different server clusters. Of course, we can also use third-party analytics tools or custom tracking points to get the statistics we want.
Summary
Data collection is the key to data analysis. We often think first of Python web crawlers, but in fact there are many collection methods and channels, and some data can be taken straight from open data sources. For example, if you want the historical price and transaction data of Bitcoin, you can download it directly from Kaggle without crawling anything.
On the other hand, the data to collect differs with our needs. In the transportation industry, collection revolves around cameras and tachometers; for operations and maintenance staff, log collection and analysis are the key. So we need to choose the right collection tools for the specific business scenario. Finally, one important point is to use proxies when crawling: proxies are a key step in raising crawling efficiency. The residential proxies provided by Roxlabs can effectively prevent websites from blocking the crawler and keep the business running normally.
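A minimal sketch of routing requests through a proxy looks like this (the proxy address and credentials are placeholders, not real Roxlabs endpoints):

```python
# Send requests through a proxy; the proxy address and credentials are placeholders.
import requests

proxies = {
    "http":  "http://user:password@proxy.example.com:8000",
    "https": "http://user:password@proxy.example.com:8000",
}

response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)
```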