Crawler performance analysis and optimization

A couple of days ago we wrote a single-task version of the crawler to collect user information from Zhenai.com. How does it perform?

Looking at network utilization in the Task Manager's performance window, the download rate hovers around 200 kbps, which is quite slow.

[Image: network utilization of the single-task crawler, around 200 kbps]

Let's look at the design of the single-task version of the crawler through an analysis diagram:

[Image: architecture of the single-task crawler]

From the figure above: the engine takes a request off the task queue, sends it to the Fetcher to fetch the resource, and waits for the data to come back; it then hands the returned data to the Parser and waits again; finally, the requests the Parser returns are appended to the task queue, and the items are printed out.
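For contrast, that serial loop looks roughly like the sketch below. This is a minimal sketch, not the post's original code: SimpleEngine is a stand-in name, and it assumes the same Request, ParserResult, and fetcher.Fetch that appear in the concurrent code later in this post.

func (e SimpleEngine) Run(seeds ...Request) {
	requests := append([]Request{}, seeds...)

	for len(requests) > 0 {
		// Take one request off the queue
		r := requests[0]
		requests = requests[1:]

		// Fetch, then parse, strictly one page at a time
		body, err := fetcher.Fetch(r.Url)
		if err != nil {
			continue
		}
		result := r.ParserFunc(body)

		// New requests go back on the queue; items are printed
		requests = append(requests, result.Requests...)
		for _, item := range result.Items {
			log.Printf("Get Items: %v", item)
		}
	}
}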

The crawler is slow because network resources are not fully utilized. We can run multiple Fetchers and Parsers at the same time, and do other work while waiting for them to return. This is easy to achieve with Go's concurrency syntactic sugar.
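As a toy illustration of that sugar (not part of the crawler itself): three simulated fetches are started with the go keyword and their results collected over a channel, so the waits overlap instead of adding up.

package main

import (
	"fmt"
	"time"
)

func main() {
	out := make(chan string)

	// Start three fetch-like tasks concurrently
	for i := 0; i < 3; i++ {
		go func(id int) {
			time.Sleep(100 * time.Millisecond) // stand-in for a network fetch
			out <- fmt.Sprintf("worker %d done", id)
		}(i)
	}

	// Collect results as they arrive; the three sleeps overlap,
	// so this takes ~100ms instead of ~300ms
	for i := 0; i < 3; i++ {
		fmt.Println(<-out)
	}
}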

[Image: architecture of the concurrent crawler]

In the picture above, a Worker is the merger of the Fetcher and the Parser. The Scheduler distributes Requests across the Workers; each Worker returns Requests and Items to the Engine, which prints the Items and submits the Requests back to the Scheduler.

Based on this design, let's implement it in code:

Engine:

package engine

import (
	"log"
)

type ConcurrentEngine struct {
	Scheduler   Scheduler
	WorkerCount int
}

type Scheduler interface {
	Submit(Request)
	ConfigureMasterWorkerChan(chan Request)
}

func (e *ConcurrentEngine) Run(seeds ...Request) {

	in := make(chan Request)
	out := make(chan ParserResult)

	e.Scheduler.ConfigureMasterWorkerChan(in)

	// Create the workers
	for i := 0; i < e.WorkerCount; i++ {
		createWorker(in, out)
	}

	// Distribute the seed tasks to the workers
	for _, r := range seeds {
		e.Scheduler.Submit(r)
	}

	for {
		// Print out the items
		result := <-out
		for _, item := range result.Items {
			log.Printf("Get Items: %v", item)
		}

		// Send the requests found on this page back to the Scheduler
		for _, r := range result.Requests {
			e.Scheduler.Submit(r)
		}
	}
}

// createWorker starts a goroutine running the worker loop: receive a Request
// from in, fetch and parse it, and send the ParserResult to out.
func createWorker(in chan Request, out chan ParserResult) {
	go func() {
		for {
			request := <-in

			parserResult, err := worker(request)

			// On error, skip this request and move on to the next one
			if err != nil {
				continue
			}

			// Send the parserResult back to the engine
			out <- parserResult
		}
	}()
}
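The engine code above relies on Request and ParserResult types that the post doesn't show; a minimal sketch consistent with how they are used would be:

package engine

// Request is one crawl task: a url plus the parser for that page's contents.
type Request struct {
	Url        string
	ParserFunc func([]byte) ParserResult
}

// ParserResult is what parsing one page produces: follow-up
// requests to crawl and the items extracted from the page.
type ParserResult struct {
	Requests []Request
	Items    []interface{}
}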

Scheduler:

package scheduler

import "crawler/engine"

// SimpleScheduler: one work channel shared by all the workers
type SimpleScheduler struct {
	workChan chan engine.Request
}

func (s *SimpleScheduler) ConfigureMasterWorkerChan(r chan engine.Request) {
	s.workChan = r
}

// Submit hands a request to the workers without ever blocking the caller
func (s *SimpleScheduler) Submit(r engine.Request) {
	go func() { s.workChan <- r }()
}
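A note on why Submit sends from a new goroutine: workChan and the engine's out channel are both unbuffered. If Submit blocked while every worker was busy, the engine's Run loop would be stuck inside Submit, unable to drain out, and the whole pipeline would deadlock. Spawning a goroutine per pending request keeps the engine loop moving, at the cost of an unbounded number of parked goroutines.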

Worker:

func worker(r Request) (ParserResult, error) {

	log.Printf("fetching url: %s", r.Url)

	// Fetch the page contents
	body, err := fetcher.Fetch(r.Url)
	if err != nil {
		log.Printf("fetch url: %s; err: %v", r.Url, err)
		// Return the error so the caller can move on to the next url
		return ParserResult{}, err
	}

	// Parse the fetched contents
	return r.ParserFunc(body), nil
}
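fetcher.Fetch isn't shown in this post either; a minimal sketch of it follows. A production crawler would likely also need rate limiting and charset conversion, both omitted here.

package fetcher

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

// Fetch downloads the page at url and returns its raw body.
func Fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("wrong status code: %d", resp.StatusCode)
	}
	return ioutil.ReadAll(resp.Body)
}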

Main function:

package main

import (
	"crawler/engine"
	"crawler/scheduler"
	"crawler/zhenai/parser"
)

func main() {

	e := &engine.ConcurrentEngine{
		Scheduler:   &scheduler.SimpleScheduler{},
		WorkerCount: 100,
	}

	e.Run(engine.Request{
		Url:        "http://www.zhenai.com/zhenghun",
		ParserFunc: parser.ParseCityList,
	})
}
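parser.ParseCityList isn't included in the post. Below is a sketch of what such a ParserFunc could look like; the regular expression and the ParseCity follow-up parser are illustrative assumptions, not the author's actual code.

package parser

import (
	"regexp"

	"crawler/engine"
)

// Illustrative pattern for city links on the list page (an assumption)
var cityListRe = regexp.MustCompile(
	`<a href="(http://www.zhenai.com/zhenghun/[0-9a-z]+)"[^>]*>([^<]+)</a>`)

// ParseCityList turns the city-list page into items (city names)
// and follow-up requests (one per city page).
func ParseCityList(contents []byte) engine.ParserResult {
	result := engine.ParserResult{}
	for _, m := range cityListRe.FindAllSubmatch(contents, -1) {
		result.Items = append(result.Items, "City: "+string(m[2]))
		result.Requests = append(result.Requests, engine.Request{
			Url:        string(m[1]),
			ParserFunc: ParseCity, // hypothetical next-level parser
		})
	}
	return result
}

// ParseCity is a stub standing in for the parser of a single city page.
func ParseCity(contents []byte) engine.ParserResult {
	return engine.ParserResult{}
}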

With 100 workers started, running the crawler again and checking network utilization shows it climbing above 3 Mbps.

[Image: network utilization of the concurrent crawler, above 3 Mbps]

Because of its length, the full code is not included here; readers who want it can follow the official account and reply "go crawler" to get it.
