• Petting cats with code! This post is an entry in the [Cat Essay Campaign].

Recently, the Nuggets community launched its "Petting Cats with Code" campaign, built around two soul-searching questions:

  • Do you have a cat?
  • Do you envy people who have cats?

My answers: I don't have a cat, and I don't envy people who do. The campaign asks us to pet cats with code. Browsing the entries, I found that many fellow developers are cat lovers, and friends in my chat groups are always showing off photos of their cats, which made me curious about cats too.

To pet a cat you first have to buy one, and to buy one you should at least know something about cats, right? As someone who knows nothing about them, I'll use this campaign as an excuse to get properly acquainted with the various breeds of pet cats.

I found a website dedicated to cat trading, Maomi Jiaoyi: www.maomijiaoyi.com/

The cat-breeds section lists every type of pet cat. We can scrape some of this data and learn the characteristics of the various breeds.

The end results of this post are shown below:

Data collection

First we crawl the list of breed-detail links reachable from the home page:

 from lxml import etree
 import requests
 ​
 headers = {
     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
 }
 url_base = "http://www.maomijiaoyi.com"
 session = requests.Session()
 ​
 # Visit the cat breed entry page for links to each breed detail page
 url = url_base+"/index.php?/pinzhongdaquan_5.html"
 res = session.get(url, headers=headers)
 html = etree.HTML(res.text)
 main_data = []
 for a_tag in html.xpath("//div[@class='pinzhong_left']/a"):
     url = url_base+a_tag.xpath("./@href")[0]
     pet_name, pet_price = None, None
     pet_name_tag = a_tag.xpath("./div[@class='pet_name']/text()")
     if pet_name_tag:
         pet_name = pet_name_tag[0].strip()
     pet_price_tag = a_tag.xpath("./div[@class='pet_price']/span/text()")
     if pet_price_tag:
         pet_price = pet_price_tag[0].strip()
     print(pet_name, pet_price, url)
     main_data.append((pet_name, pet_price, url))
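Since the live page can change, the XPath pattern above can be sanity-checked offline. A minimal sketch, using invented HTML that mirrors the structure the crawler assumes (`pinzhong_left`, `pet_name`, `pet_price` class names):

```python
from lxml import etree

# Invented HTML mirroring the structure the crawler assumes.
snippet = """
<div class="pinzhong_left">
  <a href="/index.php?/pinzhongdaquan_detail_1.html">
    <div class="pet_name">Ragdoll</div>
    <div class="pet_price">Price: <span>8000-20000</span></div>
  </a>
</div>
"""

html = etree.HTML(snippet)
a_tag = html.xpath("//div[@class='pinzhong_left']/a")[0]
href = a_tag.xpath("./@href")[0]
name = a_tag.xpath("./div[@class='pet_name']/text()")[0].strip()
price = a_tag.xpath("./div[@class='pet_price']/span/text()")[0].strip()
print(name, price, href)
```

If the site ever renames its CSS classes, only the XPath strings need updating.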

The print result is as follows:

To learn more about the cats, we need to open each detail page and look at the full set of attributes:

The detail page has three sections of data to parse; let's test the parsing on the first link:

 pet_name, pet_price, url = main_data[0]
 res = session.get(url, headers=headers)
 html = etree.HTML(res.text)
 row = {}
 key = None
 # Parse the base properties
 for text in html.xpath("//div[@class='details']//text()"):
     text = text.strip()
     if not text:
         continue
     if text.endswith(":"):
         key = text[:-1]
     else:
         row[key] = text
 row["Reference price"] = pet_price
 # Parse the appearance properties
 for shuxing in html.xpath("//div[@class='shuxing']/div"):
     name, v = shuxing.xpath("./div/text()")
     row[name.strip()] = v.strip()
 row["Link"] = url
 # parse the details
 titles = html.xpath(
     "//div[@class='content']/div[@class='property_title']/div/text()")
 property_tags = html.xpath(
     "//div[@class='content']/div[@class='property_list']/div")
 for title, property_tag in zip(titles, property_tags):
     p_texts = []
     for p_tag in property_tag.xpath(".//p|.//div"):
         p_text = "".join([t.strip()
                           for t in p_tag.xpath(".//text()") if t.strip()])
         if p_text:
             p_texts.append(p_text)
     text = "\n".join(p_texts)
     row[title] = text
 row
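The base-property loop above relies on a simple convention: a stripped text node ending in ":" becomes the current key, and the next non-empty node becomes its value. A toy sketch of just that accumulation logic, with a hand-written token list standing in for the live page:

```python
# Toy illustration of the key/value accumulation: a token ending in ":"
# becomes the current key; the next non-empty token becomes its value.
tokens = ["Chinese name:", "Ragdoll", "Origin:", "United States", "  ", "Coat:", "Long"]

row = {}
key = None
for text in tokens:
    text = text.strip()
    if not text:
        continue
    if text.endswith(":"):
        key = text[:-1]
    elif key is not None:
        row[key] = text

print(row)
```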

You can see that the first two parts of the data are parsed smoothly:

The third section parses successfully as well:

Besides the text descriptions, we also want to save the pictures. Below we parse the image URLs and download them:

 img_urls = [
     url_base+url for url in html.xpath("//div[@class='big_img']/img/@src") if url]
 row["Picture address"] = img_urls
 ​
 import os
 os.makedirs("imgs", exist_ok=True)  # make sure the output directory exists
 for i, img_url in enumerate(img_urls, 1):
     with requests.get(img_url) as res:
         imgbytes = res.content
     with open(f"imgs/{pet_name}{i}.jpg", "wb") as f:
         f.write(imgbytes)

You can see that several images have been downloaded successfully:

Now we can tidy everything into a single script that saves the text data to Excel and the images to disk:

 import pandas as pd
 from lxml import etree
 import requests
 ​
 headers = {
     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
 }
 url_base = "http://www.maomijiaoyi.com"
 session = requests.Session()
 ​
 # Visit the cat breed entry page for links to each breed detail page
 url = url_base+"/index.php?/pinzhongdaquan_5.html"
 res = session.get(url, headers=headers)
 html = etree.HTML(res.text)
 main_data = []
 for a_tag in html.xpath("//div[@class='pinzhong_left']/a"):
     url = url_base+a_tag.xpath("./@href")[0]
     pet_name, pet_price = None, None
     pet_name_tag = a_tag.xpath("./div[@class='pet_name']/text()")
     if pet_name_tag:
         pet_name = pet_name_tag[0].strip()
     pet_price_tag = a_tag.xpath("./div[@class='pet_price']/span/text()")
     if pet_price_tag:
         pet_price = pet_price_tag[0].strip()
     main_data.append((pet_name, pet_price, url))
 data = []
 for pet_name, pet_price, url in main_data:
     res = session.get(url, headers=headers)
     html = etree.HTML(res.text)
     row = {}
     key = None
     # Parse the base properties
     for text in html.xpath("//div[@class='details']//text()"):
         text = text.strip()
         if not text:
             continue
         if text.endswith(":"):
             key = text[:-1]
         else:
             row[key] = text
     row["Reference price"] = pet_price
     # Parse the appearance properties
     for shuxing in html.xpath("//div[@class='shuxing']/div"):
         name, v = shuxing.xpath("./div/text()")
         row[name.strip()] = v.strip()
     row["Link"] = url
     # parse the details
     titles = html.xpath(
         "//div[@class='content']/div[@class='property_title']/div/text()")
     property_tags = html.xpath(
         "//div[@class='content']/div[@class='property_list']/div")
     for title, property_tag in zip(titles, property_tags):
         p_texts = []
         for p_tag in property_tag.xpath(".//p|.//div"):
             p_text = "".join([t.strip()
                               for t in p_tag.xpath(".//text()") if t.strip()])
             if p_text:
                 p_texts.append(p_text)
         text = "\n".join(p_texts)
         row[title] = text
     img_urls = [
         url_base+url for url in html.xpath("//div[@class='big_img']/img/@src") if url]
     row["Picture address"] = img_urls
     data.append(row)
     for i, img_url in enumerate(img_urls, 1):
         with requests.get(img_url) as res:
             imgbytes = res.content
         with open(f"imgs/{pet_name}{i}.jpg", "wb") as f:
             f.write(imgbytes)
 df = pd.DataFrame(data)
 df.to_excel("cat.xlsx", index=False)

The first few columns of the crawl result are as follows:

Pictures of the various cats were downloaded:

With the above Excel data, we can analyze and process it:

Data analysis

First read Excel data:

 import pandas as pd
 ​
 df = pd.read_excel("cat.xlsx")

Looking at the data, we find that many pet cats have multiple aliases. We can draw a relationship graph showing the aliases of each cat:

 from pyecharts import options as opts
 from pyecharts.charts import Graph
 ​
 links = []
 nodes = []
 nodes.append({"name": "Cat", "symbolSize": 10})
 ​
 for name, alias in df[["Chinese scientific name", "Alias"]].values:
     nodes.append({"name": name, "symbolSize": 10})
     links.append({"source": "Cat", "target": name})
     for dest in alias.split(","):
         if name == dest:
             continue
         nodes.append({"name": dest, "symbolSize": 10})
         links.append({"source": name, "target": dest})
 c = (
     Graph(init_opts=opts.InitOpts(width="800px", height="800px"))
     .add("", nodes, links, repulsion=250,
          linestyle_opts=opts.LineStyleOpts(width=0.5, curve=0.3, opacity=0.7))
     .set_global_opts(title_opts=opts.TitleOpts(title="Pet cat breeds"))
 )
 c.render_notebook()

Hover the mouse over a central node to see the breed's primary name:

Distribution of origin of pet cats:

 from pyecharts.charts import Bar

 data = df["Origin"].value_counts()
 c = (
     Bar()
     .add_xaxis(data.index.to_list())
     .add_yaxis("", data.values.tolist())
     .set_global_opts(
         xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=15)),
         title_opts=opts.TitleOpts(title="Distribution of origin of pet Cats")
     )
 )
 c.render_notebook()

You can see that pet cats mainly originate from Britain, the United States, and Scotland.
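The tally behind the bar chart is just a frequency count over the origin column; a stdlib-only sketch with made-up origin values (not the real scraped column):

```python
from collections import Counter

# Made-up stand-in for df["Origin"]; the real values come from the scrape.
origins = ["Britain", "United States", "Britain", "Scotland", "Britain", "United States"]

counts = Counter(origins)
print(counts.most_common())  # most frequent origin first
```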

Next, let's draw a treemap showing the breeds that come from each country:

 from pyecharts.charts import TreeMap

 data = []
 tmp = df.groupby("Origin", as_index=False).agg(
     varieties=("Chinese scientific name", ",".join),
     variety_count=("Chinese scientific name", "count"))
 for src, dest in tmp.values[:, :2]:
     dests = dest.split(",")
     children = []
     data.append({"value": len(dests), "name": src, "children": children})
     for dest in dests:
         children.append({"name": dest, "value": 1})
 c = (
     TreeMap(init_opts=opts.InitOpts(width='1280px', height='560px'))
     .add("", data,
          levels=[
              opts.TreeMapLevelsOpts(
                  treemap_itemstyle_opts=opts.TreeMapItemStyleOpts(
                       border_color="#555", border_width=1, gap_width=1
                  )
              ),
              opts.TreeMapLevelsOpts(
                 color_saturation=[0.3, 0.6],
                  treemap_itemstyle_opts=opts.TreeMapItemStyleOpts(
                      border_color_saturation=0.7, gap_width=5, border_width=10
                  ),
                  upper_label_opts=opts.LabelOpts(
                      is_show=True, position='insideTopLeft', vertical_align='top'
                  )
              ),
              opts.TreeMapLevelsOpts(
                 color_saturation=[0.3, 0.5],
                  treemap_itemstyle_opts=opts.TreeMapItemStyleOpts(
                      border_color_saturation=0.6, gap_width=1
                  ),
              ),
             opts.TreeMapLevelsOpts(color_saturation=[0.3, 0.5]),
          ])
     .set_global_opts(title_opts=opts.TitleOpts(title="Distribution of origin of pet Cats"))
 )
 c.render_notebook()

Next, take a look at the proportion of each body size:

 from pyecharts.charts import Pie
 ​
 c = (
     Pie()
     .add(
         "Body type",
         df["Body type"].value_counts().reset_index().values.tolist(),
         radius=["40%", "55%"],
         label_opts=opts.LabelOpts(
             position="outside",
             formatter="{a|{a}}{abg|}\n{hr|}\n {b|{b}: }{c} {per|{d}%} ",
             background_color="#eee",
             border_color="#aaa",
             border_width=1,
             border_radius=4,
             rich={
                 "a": {"color": "#999", "lineHeight": 22, "align": "center"},
                 "abg": {
                     "backgroundColor": "#e3e3e3",
                     "width": "100%",
                     "align": "right",
                     "height": 22,
                     "borderRadius": [4, 4, 0, 0],
                 },
                 "hr": {
                     "borderColor": "#aaa",
                     "width": "100%",
                     "borderWidth": 0.5,
                     "height": 0,
                 },
                 "b": {"fontSize": 16, "lineHeight": 33},
                 "per": {
                     "color": "#eee",
                     "backgroundColor": "#334455",
                     "padding": [2, 4],
                     "borderRadius": 2,
                 },
             },
         ),
     )
     .set_global_opts(
         title_opts=opts.TitleOpts(title="Proportion of breed size"),
     )
 )
 c.render_notebook()

You can see there is only one large-sized breed: the Ragdoll. Next, let's find the cheapest and most expensive cats, treating the breed with the lowest minimum price as the cheapest and the breed with the highest maximum price as the most expensive:

 tmp = df["Reference price"].str.split("-", expand=True)
 tmp.columns = ["Lowest price", "Highest price"]
 tmp.dropna(inplace=True)
 tmp = tmp.astype("int")
 cheap_cat = df.loc[tmp.index[tmp["Lowest price"] == tmp["Lowest price"].min()],
                    "Chinese scientific name"].to_list()
 costly_cat = df.loc[tmp.index[tmp["Highest price"] == tmp["Highest price"].max()],
                     "Chinese scientific name"].to_list()
 print("The cheapest breeds are:", cheap_cat)
 print("The most expensive breeds are:", costly_cat)
The cheapest breeds are: ['Garfield', 'Golden Shaded', 'Silver Shaded', 'Orange cat']
The most expensive breeds are: ['Ragdoll', 'Maine Coon', 'Sphynx (hairless cat)']
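The price logic can be sketched without pandas. A minimal stand-in with invented "low-high" price strings (the real figures come from the scraped "Reference price" column):

```python
# Invented prices standing in for the "Reference price" column ("low-high" strings).
prices = {"Orange cat": "500-1000", "Ragdoll": "8000-20000", "Maine Coon": "5000-20000"}

# Split each range into integer (low, high) bounds.
parsed = {name: tuple(map(int, p.split("-"))) for name, p in prices.items()}

low = min(lo for lo, hi in parsed.values())
high = max(hi for lo, hi in parsed.values())
cheapest = [n for n, (lo, hi) in parsed.items() if lo == low]
costliest = [n for n, (lo, hi) in parsed.items() if hi == high]
print(cheapest, costliest)
```

Note that several breeds can tie for the extreme, which is why both results are lists.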

The columns ['Overall', 'Coat', 'Color', 'Head', 'Eyes', 'Ears', 'Nose', 'Tail', 'Chest', 'Neck', 'Forequarters', 'Hindquarters'] are all textual descriptions of the cats, so we can combine them into one body of text and build a word cloud for our feline friends:

 import stylecloud
 from IPython.display import Image
 text = ""
 for row in df[['Overall', 'Coat', 'Color', 'Head', 'Eyes', 'Ears', 'Nose', 'Tail', 'Chest', 'Neck', 'Forequarters', 'Hindquarters']].values:
     for v in row:
         if pd.isna(v):
             continue
         text += v
 stylecloud.gen_stylecloud(text,
                           collocations=False,
                           font_path=r'C:\Windows\Fonts\msyhbd.ttc',
                           icon_name='fas fa-cat',
                           output_name='tmp.png')
 Image(filename='tmp.png')

Next we build word clouds for personality traits and living habits.

Word cloud of personality traits:

 import jieba
 import stylecloud
 from IPython.display import Image
 ​
 stopwords = ["Master", "They", "Kitty", "Don't", "Personality traits", "Cats"]
 words = df["Personality traits"].astype("str").apply(jieba.lcut).explode()
 words = words[words.apply(len) > 1]
 words = [word for word in words if word not in stopwords]
 stylecloud.gen_stylecloud(" ".join(words),
                           collocations=False,
                           font_path=r'C:\Windows\Fonts\msyhbd.ttc',
                           icon_name='fas fa-square',
                           output_name='tmp.png')
 Image(filename='tmp.png')

Word cloud of living habits:

 import jieba
 import stylecloud
 from IPython.display import Image
 ​
 stopwords = ["Master", "They", "Kitty", "Don't", "Personality traits", "Cats"]
 words = df["Living habits"].astype("str").apply(jieba.lcut).explode()
 words = words[words.apply(len) > 1]
 words = [word for word in words if word not in stopwords]
 stylecloud.gen_stylecloud(" ".join(words),
                           collocations=False,
                           font_path=r'C:\Windows\Fonts\msyhbd.ttc',
                           icon_name='fas fa-square',
                           output_name='tmp.png')
 Image(filename='tmp.png')

The generated cat-shaped word cloud:

After the analysis above we now have a basic understanding of cats. Finally, let's make a chart covering the various breeds.

What kind of chart? After some thought, I settled on a mind map.

First generate classification text:

 for a, bs in df["Chinese scientific name"].groupby(df["Body type"]):
     print(a)
     for b in bs.values:
         print(f"\t{b}")
Medium: Garfield, Golden Fold, British Shorthair (blue), British Shorthair (blue and white), British Shorthair, American Shorthair, Scottish Fold, Silver Fold, Exotic Shorthair, Bombay, Siamese, Leopard cat, Bengal
Large: Ragdoll
Small: Maine Coon, Chinchilla, Sphynx (hairless), Highland Fold, Munchkin, Short-legged cat, Persian, Orange cat, Abyssinian, German Rex
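The grouping that produces this text is an ordinary bucket-by-key operation; a stdlib sketch of the same idea, with a few invented rows in place of the DataFrame:

```python
from collections import defaultdict

# Invented (breed, size) rows standing in for the DataFrame columns.
cats = [("Ragdoll", "Large"), ("Garfield", "Medium"), ("Siamese", "Medium")]

# Bucket breed names by body size, mimicking df.groupby("Body type").
groups = defaultdict(list)
for name, size in cats:
    groups[size].append(name)

for size in sorted(groups):
    print(size)
    for name in groups[size]:
        print(f"\t{name}")
```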

Finally, I pasted the classification text into a mind-mapping tool and, after a round of editing, ended up with the finished mind map: