I have written many web scraping case studies, in both R and Python, that rely on concepts such as XML, HTML, CSS, Ajax, and JSON. I never walked through those concepts in detail, though, which left many readers confused.

The basic web scraping tutorial series recently came to an end, so starting today I will go over some common Web concepts (explained from a beginner's perspective, of course; please forgive anything that is inaccurate). Getting these concepts straight matters a great deal for developing an overall approach to web scraping.

Three core concepts will be introduced in the following days:

  • xml
  • html
  • json

XML stands for Extensible Markup Language and is mainly used for data transfer, while HTML stands for HyperText Markup Language and is mainly used for displaying web pages.

Syntactically, XML and HTML can be grouped together: they share the same tag-based syntax, but differ in their roles on the Web and in their tag names.

<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>George</to>
<from>John</from>
<heading>Reminder</heading>
<body>Don't forget the meeting!</body>
</note>

A typical XML document is shown above. The first line is the XML declaration, which gives the XML version and the character encoding. The remaining lines are the body of the document. The information in this XML file is wrapped in tag pairs: each value sits between a start tag (<label>) and an end tag (</label>). Tags may be nested to form a hierarchy.

So the content logic of the above document is:

note
--to     =George
--from   =John
--heading=Reminder
--body   =Don't forget the meeting!

You can also think of it as key-value pairs, except that each key name appears twice, symmetrically, as a start tag and an end tag. Every <label> is called a tag, or element, and in <label>text</label> the text between the tags is the content, or value, of that element. XML has no predefined set of tags, and tag naming is very liberal: users define their own tags according to their business or project needs, as long as no reserved names are used and the document is well-formed.
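The key-value reading of the note document can be checked with a minimal sketch using Python's standard xml.etree.ElementTree module. The variable names (xml_bytes, pairs) are illustrative only.

```python
import xml.etree.ElementTree as ET

# The note document from above, passed as bytes so the parser
# honors the encoding named in the XML declaration.
xml_bytes = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
  <to>George</to>
  <from>John</from>
  <heading>Reminder</heading>
  <body>Don't forget the meeting!</body>
</note>"""

root = ET.fromstring(xml_bytes)

# Each child element behaves like one key-value pair:
# the tag name is the key, the enclosed text is the value.
pairs = {child.tag: child.text for child in root}

print(root.tag)
for key, value in pairs.items():
    print(f"--{key} = {value}")
```

The root element (note) acts as the container, and iterating over it yields its direct children in document order.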

The main difference between HTML and XML is that HTML has a conventionally fixed document structure, with a predefined set of fixed tags.

<!DOCTYPE html>
<html>
<head>
<title>My first HTML page</title>
</head>
<body>
<p>The contents of the body element will be displayed in the browser.</p>
<p>The contents of the title element are displayed in the browser's title bar.</p>
</body>
</html>

A typical HTML document is shown above. As in XML, the first line is a header declaration, here the document type declaration, which tells the browser which HTML version is used. The fixed format of HTML shows in the fact that every HTML document consists of a head and a body. The head holds the title, the character encoding, and references to external files, while the body holds the content to be displayed in the browser. In addition, because an HTML document is ultimately rendered by the browser to give people a friendly reading experience, HTML has a large number of predefined tags for forms, lists, blocks, interactive menus, and so on. The full list of HTML tags can be found in the W3School reference manual:
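To see the head/body split in practice, here is a small sketch that uses Python's standard html.parser module to pull the title out of the head and the paragraphs out of the body. The class name TitleAndParagraphs is made up for this illustration.

```python
from html.parser import HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.current = None      # tag we are currently inside, if tracked
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self.current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current == "title":
            self.title += data
        elif self.current == "p":
            self.paragraphs[-1] += data


html_text = """<!DOCTYPE html>
<html>
<head>
<title>My first HTML page</title>
</head>
<body>
<p>The contents of the body element will be displayed in the browser.</p>
<p>The contents of the title element are displayed in the browser's title bar.</p>
</body>
</html>"""

parser = TitleAndParagraphs()
parser.feed(html_text)
parser.close()

title = parser.title.strip()
paragraphs = [p.strip() for p in parser.paragraphs]
print(title)
print(paragraphs)
```

This event-based parser is tolerant of sloppy markup, which is one reason HTML is usually not handled with a strict XML parser.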

www.w3school.com.cn/html/index….

Pure HTML is just static text; the browser renders it based on the style attributes defined on elements at each level of the document (<label style='...'>) and on external CSS style sheets. A CSS style sheet is the styling rulebook for building complex web pages: it defines the appearance, weight, color, background color, spacing, and so on of all the fonts, lines, blocks, forms, controls, menus, and backgrounds in an HTML document. CSS usually lives in an external stand-alone file that is referenced from a <link> tag inside the HTML head (<head>).

<link media="all" type="text/css" rel="stylesheet" href="https://edu.hellobi.com/libs/formvalidation/css/formValidation.min.css">

The browser loads the HTML document together with the referenced CSS style sheets and renders the entire page accordingly, resulting in a good-looking web page.

So although HTML shares its syntax with XML, it plays a special role. Its structure follows a fixed template, it has a large number of commonly used predefined tags, and it also needs to embed CSS style sheets and reference JS scripts, so the whole structure looks quite large. XML, on the other hand, is fairly compact and well suited to pure data storage and transmission.

That’s the general difference between XML and HTML (you still need to go to the W3C tutorial for some in-depth differences or concepts).

json

JSON (JavaScript Object Notation) is a lightweight data interchange format. It originated from JavaScript's object syntax and has since become a popular data exchange standard on the Web.

JSON syntax, with its very obvious key-value pair structure, is easy to understand:

The above XML document would look like this if it were written using JSON.

{
    "note": {
        "to": "George",
        "from": "John",
        "heading": "Reminder",
        "body": "Don't forget the meeting!"
    }
}

JSON syntax consists of very clear key-value pairs. Keys should not be repeated and must be enclosed in double quotation marks. A value can be a string (enclosed in double quotation marks; single quotation marks are not valid JSON), a number, a Boolean (true/false), an array ([1,2,3,5]), or null. Elements at the same level are separated by commas (,). The content enclosed in curly braces is called an object, and the value of a key-value pair can itself be an object.
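The value types listed above can be observed directly with Python's standard json module. The sample record below (name, age, city, and so on) is invented for illustration. The sketch also checks one detail that is easy to get wrong: strict JSON parsers reject single-quoted strings.

```python
import json

# Hypothetical record covering every JSON value type.
text = (
    '{"name": "George", "age": 30, "member": true, '
    '"scores": [1, 2, 3, 5], "nickname": null, '
    '"address": {"city": "Beijing"}}'
)

data = json.loads(text)

# Map each key to the Python type its JSON value was converted to.
types = {key: type(value).__name__ for key, value in data.items()}
print(types)

# Single quotes are valid in JavaScript source, but not in JSON text.
try:
    json.loads("{'name': 'George'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
print(single_quotes_ok)
```

JSON objects become dict, arrays become list, true/false become bool, and null becomes None.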

JSON is somewhat like XML in that it has only one set of syntax rules and no fixed document templates or predefined tags (or keys), so both XML and JSON can be used to write custom data objects.

At this point, you can see the difference between XML and JSON. Both can be used as data storage formats and both have a key-value-like syntax. XML, however, relies on a symmetric tag structure, while JSON uses only punctuation marks such as "{", "}", "[", "]", ":", and "," to mark the hierarchy and the start and end of each item. JSON therefore avoids a lot of redundant characters, which is one of the points of contention on the Web about the relative merits of XML and JSON.
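The redundancy argument can be made concrete with a rough sketch: convert the note document to a dict, serialize it as compact JSON, and compare character counts. This is only an illustration on one tiny document with whitespace stripped from both sides, not a general benchmark.

```python
import json
import xml.etree.ElementTree as ET

# The note document with all optional whitespace removed.
xml_text = (
    "<note><to>George</to><from>John</from>"
    "<heading>Reminder</heading>"
    "<body>Don't forget the meeting!</body></note>"
)

root = ET.fromstring(xml_text)

# One level of nesting is enough for this document:
# root tag -> {child tag: child text}.
as_dict = {root.tag: {child.tag: child.text for child in root}}

# separators=(",", ":") drops the spaces json.dumps adds by default.
json_text = json.dumps(as_dict, separators=(",", ":"))

print(json_text)
print(len(xml_text), len(json_text))
```

The saving comes from the end tags: XML repeats every key name, while JSON closes a level with a single brace.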

Let’s take a look at XML and JSON in action in a desktop environment from an application perspective.

In current desktop and Web applications, XML is mainly used to write configuration files, and JSON is used to submit HTTP request parameters or return data in Web scenarios. To emphasize the point, JSON can also be used for desktop software configuration files, and XML can be used for network file transfer and data exchange.

On the familiar Office platform, all template files and color theme files are written in XML.

<?xml version="1.0" encoding="utf-8"?>
<a:clrScheme xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" name="Aspect">
  <a:dk1> <a:sysClr val="windowText" lastClr="000000"/> </a:dk1>
  <a:lt1> <a:sysClr val="window" lastClr="FFFFFF"/> </a:lt1>
  <a:dk2> <a:srgbClr val="323232"/> </a:dk2>
  <a:lt2> <a:srgbClr val="E3DED1"/> </a:lt2>
  <a:accent1> <a:srgbClr val="F07F09"/> </a:accent1>
  <a:accent2> <a:srgbClr val="9F2936"/> </a:accent2>
  <a:accent3> <a:srgbClr val="1B587C"/> </a:accent3>
  <a:accent4> <a:srgbClr val="4E8542"/> </a:accent4>
  <a:accent5> <a:srgbClr val="604878"/> </a:accent5>
  <a:accent6> <a:srgbClr val="C19859"/> </a:accent6>
  <a:hlink> <a:srgbClr val="6B9F25"/> </a:hlink>
  <a:folHlink> <a:srgbClr val="B26B02"/> </a:folHlink>
</a:clrScheme>

This is a typical Office color theme file. (Note that many configuration files on the Office platform are shared by multiple programs; color theme files are shared by Word, Excel, and PowerPoint.)

In PowerBI, Microsoft's newer BI tool, however, the color theme files are written in JSON syntax.

{
    "name": "Economists", 
    "dataColors": [
        "#D5695D", 
        "#C6332C", 
        "#5D8CA8", 
        "#016392", 
        "#65A479", 
        "#098154", 
        "#e9f3ea", 
        "#f8f2e4"
    ], 
    "background": "#e9f3ea", 
    "foreground": "#f8f2e4", 
    "tableAccent": "#568410"
}

This is a color theme file for PowerBI, written in JSON. It is identified only by its file extension (.json) and has no document header (unlike XML). As you can see, the file defines five key-value pairs: the name of the color theme, an array of eight data colors, the background color, the foreground color, and the table accent color.

Tableau, a BI tool that excels at business visualization, also uses XML as the language for its color palette templates:
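Reading such a theme file takes only a few lines with Python's standard json module. In this sketch the file content is inlined as a string so the example is self-contained; with a real file on disk, json.load() on an open file handle would do the same job.

```python
import json

# The PowerBI theme from above, inlined for a self-contained example.
theme_text = """
{
    "name": "Economists",
    "dataColors": [
        "#D5695D", "#C6332C", "#5D8CA8", "#016392",
        "#65A479", "#098154", "#e9f3ea", "#f8f2e4"
    ],
    "background": "#e9f3ea",
    "foreground": "#f8f2e4",
    "tableAccent": "#568410"
}
"""

theme = json.loads(theme_text)

keys = list(theme)                    # top-level keys, in file order
n_colors = len(theme["dataColors"])   # size of the color array

print(keys)
print(n_colors, theme["tableAccent"])
```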

<?xml version="1.0" encoding="utf-8"?>
<workbook>
  <preferences>
    <color-palette name="ECO-01" type="regular">
      <color>#00516C</color>
      <color>#5D91A7</color>
      <color>#00A4DC</color>
      <color>#6BCFF6</color>
      <color>#008982</color>
      <color>#6DBBBF</color>
      <color>#7A250F</color>
      <color>#EA8F74</color>
      <color>#A8A9AD</color>
    </color-palette>
    <color-palette name="ECO-02" type="regular">
      <color>#adadad</color>
      <color>#7bd3f6</color>
      <color>#7c260b</color>
      <color>#ee8f71</color>
      <color>#76c0c1</color>
      <color>#a18376</color>
      <color>#c3d6df</color>
      <color>#c9c9c9</color>
      <color>#c9c9c9</color>
    </color-palette>
  </preferences>
</workbook>

This color table defines two sets of color palettes. The main body is XML syntax, but after formatting, it is very easy to understand.
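For comparison with the JSON case, here is a sketch that reads a Tableau-style palette file with xml.etree.ElementTree and turns it into a dict of color lists. To keep the example short, each palette is trimmed to its first three colors; the structure is otherwise the same as the file above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the palette file (first three colors of each palette).
xml_bytes = b"""<?xml version="1.0" encoding="utf-8"?>
<workbook>
  <preferences>
    <color-palette name="ECO-01" type="regular">
      <color>#00516C</color>
      <color>#5D91A7</color>
      <color>#00A4DC</color>
    </color-palette>
    <color-palette name="ECO-02" type="regular">
      <color>#adadad</color>
      <color>#7bd3f6</color>
      <color>#7c260b</color>
    </color-palette>
  </preferences>
</workbook>"""

root = ET.fromstring(xml_bytes)

# Attributes are read with .get(); element text is read with .text.
palettes = {
    palette.get("name"): [color.text for color in palette.findall("color")]
    for palette in root.iter("color-palette")
}

for name, colors in palettes.items():
    print(name, colors)
```

Note the extra step compared with JSON: XML distinguishes attributes (name, type) from element content, so the code has to decide which of the two becomes the key.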

I picked three software configuration files at random; two are written in XML and one in JSON. As for the current trend, the XML standard was defined earlier, which gives it a first-mover advantage, while JSON is lightweight and carries less redundant information, and its application scenarios are gradually expanding.

The above three scenarios are all on the desktop, so let’s take a look at the Web scenario:

The course information on NetEase Cloud Classroom is loaded asynchronously, and both the submitted request parameters and the response data are in JSON format.

Zhihu Live course information likewise uses JSON for both parameter submission and response.

The video list on Bilibili (B station) also returns its data in JSON format.

In other cases, the data returned is in HTML format (which can be grouped with XML, because the syntax and parsing tools are the same).

As you can see, most mainstream sites that use asynchronous loading choose JSON as the data exchange format, while static sites, or sites that do not want to expose their APIs, still mostly use HTML/XML. However, as Ajax spreads further on the Web, the JSON standard will probably be used even more widely.

Above, I have shown examples of XML/HTML and JSON on the desktop and on the Web (picked at random and not meant to be representative).

Having said that, what is the relationship between XML and JSON and the web data fetching that we want to learn more about?

XML and JSON, to some extent, almost determine the technical solution and process you use when writing a data grabber.

We know that successfully constructing a request is the first step in fetching data, and I have already said a lot about request construction: whether it is a GET or a POST request, and whether it passes query parameters or submits a form.

XML/HTML and JSON come into play in the second step of web scraping: parsing pages and data.

XML/HTML is a markup language. It has a key-value form to some extent, but because of its nested tag pairs, neither R nor Python can convert it directly into a relational table. The requested XML/HTML therefore has to be extracted with XPath or CSS expressions, both of which are covered in previous chapters.
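As a reminder of what request construction looks like, here is a sketch that builds a GET request with query parameters and a POST request with a JSON body using only the standard library. Nothing is sent over the network. The URL and the parameter names (keyword, page, pageIndex, pageSize) are hypothetical placeholders, not the API of any site mentioned here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only to illustrate the two request shapes.
BASE_URL = "https://example.com/api/courses"

# GET: parameters are encoded into the query string.
get_req = Request(BASE_URL + "?" + urlencode({"keyword": "excel", "page": 1}))

# POST: parameters travel in the request body, here serialized as JSON.
payload = {"keyword": "excel", "pageIndex": 1, "pageSize": 50}
post_req = Request(
    BASE_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing either object to urllib.request.urlopen() would actually send it; which shape a site expects has to be read from the browser's developer tools.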
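A minimal sketch of XPath-style extraction, using the limited XPath subset built into xml.etree.ElementTree. The markup below is invented and deliberately well-formed; real pages are often not well-formed, so in practice a dedicated HTML parsing library is normally used instead, as in the chapters listed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (an assumption for this sketch).
page = """<html><body>
<ul class="courses">
  <li><a href="/course/1">Excel basics</a></li>
  <li><a href="/course/2">R visualization</a></li>
</ul>
</body></html>"""

root = ET.fromstring(page)

# ".//" searches at any depth; [@class='courses'] filters on an attribute.
links = root.findall(".//ul[@class='courses']/li/a")

# Flatten the nested tags into rows, i.e. a relational-style table.
rows = [{"title": a.text, "href": a.get("href")} for a in links]

for row in rows:
    print(row)
```

The path expression does the work that the nested tag structure prevents a direct conversion from doing: it names exactly which nodes become rows.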

Python Series 16 — XPath and Web page parsing libraries

Python Series 17 — CSS expressions and Web page parsing

R language data capture actual combat — RCurl+XML combination and XPath parsing

Left hand with R right hand Python series — simulation login educational administration system

Python network data fetching: Xpath parsing

Python – CSS web page parsing

XML and HTML syntax are the same, so the parsing tools used are the same.

JSON itself is semi-structured data. Because it is a popular, common format for data exchange, both R and Python have ready-made interface tools for it, as well as native containers that match its semi-structured shape. Another application of JSON is as the storage structure of NoSQL databases such as MongoDB, although MongoDB extends the JSON standard to BSON to improve performance and compatibility as a container. Both R and Python have interface tools for connecting to MongoDB.

The jsonlite package in R has a ready-made fromJSON() function that converts a JSON response directly to a list or a data.frame (depending on whether the internal structure of the JSON fits a tabular layout). Python's json package provides json.loads() to load JSON data and convert it to a dict.
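On the Python side, the step from a JSON response to a table can be sketched as follows. The response text and its field names (result, list, productName, price) are hypothetical; real APIs have their own layout.

```python
import json

# Hypothetical API response: a list of uniform records nested in an object.
response_text = (
    '{"result": {"list": ['
    '{"productName": "Excel basics", "price": 99.0}, '
    '{"productName": "R visualization", "price": 0.0}'
    ']}}'
)

data = json.loads(response_text)      # JSON object -> dict
records = data["result"]["list"]      # JSON array  -> list of dicts

# Records with identical keys map cleanly onto a table:
# one column per key, one row per record.
columns = list(records[0])
table = [[record[col] for col in columns] for record in records]

print(columns)
for row in table:
    print(row)
```

This is the same condition that decides whether fromJSON() in R returns a data.frame: the records must share a regular, tabular structure.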

NetEase cloud classroom Excel course crawler ideas

Left hand with R right hand Python series: fun live classes grab actual combat

Python data capture and visualization practice — NetEase Cloud classroom artificial intelligence and big data course practice

R language network data capture another difficult problem, finally broken!

R language crawler actual combat — NetEase cloud classroom data analysis course plate data crawling

R language crawler actual combat — Zhihu Live course data crawling actual combat

This article provides a brief overview of XML and JSON: their concepts, syntax, application scenarios, and the common conversion functions in R and Python. A good command of XML and its parsing tools determines how efficiently you can parse HTML pages, and a good command of JSON determines how efficiently you can call server APIs and process the returned values. XML and JSON are therefore very important in web data retrieval.

Fortunately, I have already given a brief explanation of the tools behind both technologies. Of course, each of these topics is large enough on its own, and a deeper understanding requires professional books or the W3C's online documentation. But because ready-made function wrappers exist, we can complete web page parsing and data cleaning with almost no knowledge of the underlying details.

Common XML/JSON/HTML formatting tools:

Tool.oschina.net/codeformat/…

www.json.cn/#

www.bejson.com/

Tool.chinaz.com/Tools/Unico…

tool.oschina.net/encode

For online courses, please click on the link below:

Hellobi Live | September 12: R language data visualization applications in business scenarios

For past cases, please visit my GitHub: github.com/ljtyduyu/Da…