Python distributed crawling with Scrapy: building a search engine

What era are we heading into? The data age. Data analysis services, internet finance, data modeling, natural language processing, medical case analysis — more and more work is built on data, and crawlers are the most important way to acquire data quickly. Compared with other languages, Python makes crawler development simpler and more efficient. This material suits readers who are interested in crawlers, who want to do big-data work but cannot find data, who want to build a stable and reliable distributed crawler, or who want to build up search-engine know-how but don't know where to start. It assumes some basic crawler experience plus familiarity with front-end pages, object-oriented concepts, computer network protocols, and databases. The code is as follows:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

type Result string
type Search func(query string) Result

var (
	Web   = fakeSearch("web")
	Image = fakeSearch("image")
	Video = fakeSearch("video")
)

// fakeSearch simulates a backend: it sleeps for a random interval
// (up to 100ms) and returns a canned result for the given kind.
func fakeSearch(kind string) Search {
	return func(query string) Result {
		time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
		return Result(fmt.Sprintf("%s result for %q\n", kind, query))
	}
}

// Google 1.0: runs the three searches one after another, so total
// time is the sum of the three latencies.
func Google(query string) (results []Result) {
	results = append(results, Web(query))
	results = append(results, Image(query))
	results = append(results, Video(query))
	return
}

func main() {
	rand.Seed(time.Now().UnixNano())
	start := time.Now()
	results := Google("golang")
	elapsed := time.Since(start)
	fmt.Println(results)
	fmt.Println(elapsed)
}

[web result for "golang" image result for "golang" video result for "golang"] 153.365484ms

Google Search 2.0

Run web, image, and video searches at the same time and wait for results. No locks, no condition variables, no callbacks.

The code is as follows, focusing on the Google function.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

type Result string
type Search func(query string) Result

var (
	Web   = fakeSearch("web")
	Image = fakeSearch("image")
	Video = fakeSearch("video")
)

func fakeSearch(kind string) Search {
	return func(query string) Result {
		time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
		return Result(fmt.Sprintf("%s result for %q\n", kind, query))
	}
}

// Google 2.0: launches the three searches in their own goroutines and
// fans the results into one channel; the loop collects all three.
func Google(query string) (results []Result) {
	c := make(chan Result)
	go func() { c <- Web(query) }()
	go func() { c <- Image(query) }()
	go func() { c <- Video(query) }()
	for i := 0; i < 3; i++ {
		result := <-c
		results = append(results, result)
	}
	return
}

func main() {
	rand.Seed(time.Now().UnixNano())
	start := time.Now()
	results := Google("golang")
	elapsed := time.Since(start)
	fmt.Println(results)
	fmt.Println(elapsed)
}

Google Search 2.1

Don't wait for slow servers. No locks, no condition variables, no callbacks. When the timeout fires, the select completes and we return whatever has arrived so far. Note that the timeout channel returned by time.After must be created outside the for loop, so that all iterations share a single deadline.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

type Result string
type Search func(query string) Result

var (
	Web   = fakeSearch("web")
	Image = fakeSearch("image")
	Video = fakeSearch("video")
)

func fakeSearch(kind string) Search {
	return func(query string) Result {
		time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
		return Result(fmt.Sprintf("%s result for %q\n", kind, query))
	}
}

// Google 2.1: same fan-in as 2.0, but gives the whole collection loop
// an 80ms budget via a single timeout channel created before the loop.
func Google(query string) (results []Result) {
	c := make(chan Result)
	go func() { c <- Web(query) }()
	go func() { c <- Image(query) }()
	go func() { c <- Video(query) }()

	timeout := time.After(80 * time.Millisecond)
	for i := 0; i < 3; i++ {
		select {
		case result := <-c:
			results = append(results, result)
		case <-timeout:
			fmt.Println("timed out")
			return
		}
	}
	return
}

func main() {
	rand.Seed(time.Now().UnixNano())
	start := time.Now()
	results := Google("golang")
	elapsed := time.Since(start)
	fmt.Println(results)
	fmt.Println(elapsed)
}