A super serious paragraph about serialized sets, subsets, and supersets of samples

I’m a developer. I read code, I write code, I write code that writes code, and I write code that writes code for other code to read. It all sounds like Martian, but it has its own beauty. In the end, though, writing code that writes code for other code to read can quickly become more convoluted than this paragraph. There are many ways to go about it. One of the less complicated approaches, and a favorite of the developer community, is data serialization. For those who don’t know the buzzword I just threw at you, data serialization is the process of taking some information from one system, converting it into a format that other systems can read, and then passing it along to those systems.
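As a minimal sketch of that definition, assume Python is the sending system and its standard `json` module supplies the format. The `record` dictionary is made-up illustration data, not something from a real system. Serialization and its reverse then look roughly like this:

```python
import json

# Hypothetical in-memory data held by the "sending" system.
record = {"id": "bk101", "title": "XML Developer's Guide", "price": 44.95}

# Serialize: convert the data into text that another system can read.
payload = json.dumps(record)

# Deserialize: the "receiving" system rebuilds an equivalent structure.
restored = json.loads(payload)

print(payload)
print(restored == record)
```

The only thing that crosses the boundary between the two systems is `payload`, a plain string, which is why the choice of format matters so much.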

While there are enough data serialization formats to bury the Burj Khalifa, they mostly fall into two categories:

  • Easy for humans to read and write,
  • Easy to read and write by machine.

It’s hard to have it both ways, because humans like loose typing and flexible formatting standards that let us be more expressive, while machines like to be told exactly what everything is, without ambiguity or missing detail, and consider “strict specifications” their favorite flavor.

Because I’m a Web developer, and because we’re an organization that creates websites, we’ll stick with the formats that web systems can understand, or can be made to understand without much effort, and that are especially good for human readability: XML, JSON, TOML, CSON, and YAML. Each has its own pros, cons, and appropriate use cases.

The facts first

Back in the early days of the Internet, some very smart people decided to put together a standard language that every system could read and creatively called it the Standard Generalized Markup Language (SGML). SGML was very flexible and well defined by its publishers. It became the father of languages like XML, SVG, and HTML. All three fall under the SGML specification, but they are subsets with stricter rules and less flexibility.

Eventually, people began to see the benefits of very small, compact, readable, and easily generated data that could be shared programmatically between systems with little overhead. Around that time, JSON was born and met all of those requirements. Since then, other languages have emerged to handle more specialized use cases, such as CSON, TOML, and YAML.

XML: No

Originally, XML was very flexible and easy to write, but it suffers from being verbose, hard for humans to read, hard for computers to read, and full of syntax that is not entirely necessary to convey information.

Today, its use for data serialization on the Web has all but disappeared. Unless you’re writing HTML or SVG, you aren’t likely to see XML in many other places. Some outdated systems still use it today, but it’s often too heavy for passing data around.

I can already hear the XML greybeards chiseling into their stone tablets why XML is awesome, so I’ll offer a small addendum: XML can be easy for systems and people to read and write. However, it is really, and I mean ridiculously, hard to create a system that can read it to the specification. Here is a simple and beautiful XML example:

<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <price>44.95</price>
  <publish_date>2000-10-01</publish_date>
  <description>An in-depth look at creating applications with XML.</description>
</book>

That’s great. It’s easy to read, understand, and write, and easy to code a system that can read and write it. But consider this example:

<!DOCTYPE r [ <!ENTITY y "a]>b"> ]>
<r>
<a b="&y;>" />
<![CDATA[[a>b <a>b <a]]>
<?x <a> <!-- <b> ?> c --> d
</r>

Everything above is 100% valid XML. It’s almost impossible to read, understand, or reason about. Writing code that can consume and understand it would cost at least 36 heads of hair and 248 pounds of coffee grounds. We don’t have that much time or coffee, and most of us older programmers are bald by now. So let it live on only in our memory, alongside CSS hacks, IE 6, and vacuum tubes.
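For the simple book record, at least, reading the data takes little effort. The sketch below uses Python’s standard `xml.etree.ElementTree` module. The element names such as `genre` and `publish_date` are an assumption, chosen to mirror the keys used in the other examples in this article.

```python
import xml.etree.ElementTree as ET

# The simple book record; element names are assumed for illustration.
BOOK_XML = """
<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <price>44.95</price>
  <publish_date>2000-10-01</publish_date>
  <description>An in-depth look at creating applications with XML.</description>
</book>
"""

book = ET.fromstring(BOOK_XML.strip())

# Attributes are read with get(); child elements are looked up by tag name.
book_id = book.get("id")
title = book.findtext("title")

# The markup carries no type information here, so numbers arrive as text
# and have to be converted by the application.
price = float(book.findtext("price"))

print(book_id, title, price)
```

Note that the application has to know that `price` is a number. Nothing in this markup says so, which is one of the things the later formats try to improve on.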

JSON: Side-by-side party

Well, we can all agree that XML = bad. So what are the good alternatives? JavaScript Object Notation, or JSON for short (pronounced like “Jason”), grew out of the object literal syntax of JavaScript, the language created by Brendan Eich, and was specified and popularized by the great and powerful JavaScript opinion leader Douglas Crockford. It is now used almost everywhere. The format is easy for both humans and machines to write, fairly easy to parse thanks to the strict rules of its specification, and flexible: it allows deep nesting of data, supports the basic data types (strings, numbers, Booleans, and null), and represents collections as arrays or objects. JSON has become the de facto standard for transferring data from one system to another. Almost every language has built-in functionality for reading and writing it.

JSON syntax is simple. Square brackets denote arrays, curly braces denote objects, and each pair of values separated by a colon denotes a property: the “key” on the left and its value on the right. All keys must be enclosed in double quotes:

  {
    "books": [
      {
        "id": "bk102",
        "author": "Crockford, Douglas",
        "title": "JavaScript: The Good Parts",
        "genre": "Computer",
        "price": 29.99,
        "publish_date": "2008-05-01",
        "description": "Unearthing the Excellence in JavaScript"
      }
    ]
  }

It should make complete sense to you. It’s clean and concise, and it strips out a lot of XML’s extra cruft while conveying the same amount of information. JSON is king right now, and the rest of this article covers other formats that are essentially pared-down versions of JSON, with very similar structures, that try to be more concise or more human-readable.
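The strictness of the specification is easy to see in practice. As a small sketch using Python’s standard `json` module (the two sample strings are shortened, made-up variants of the book record), a document with double-quoted keys parses, while the same data written with single quotes is rejected:

```python
import json

# Well-formed JSON: keys and strings use double quotes.
valid = '{"books": [{"id": "bk102", "price": 29.99}]}'

# Same data with single quotes: fine for a human, but not JSON.
invalid = "{'books': [{'id': 'bk102', 'price': 29.99}]}"

data = json.loads(valid)
first_price = data["books"][0]["price"]

try:
    json.loads(invalid)
    rejected = False
except json.JSONDecodeError:
    # The parser refuses input that does not follow the specification.
    rejected = True

print(first_price, rejected)
```

That rigidity is what makes the format easy for machines: a parser never has to guess what the author meant.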

TOML: Cut it down to outright altruism

TOML (Tom’s Obvious, Minimal Language) allows you to define deeply nested data structures quickly and concisely. The Tom in the name refers to its inventor, Tom Preston-Werner, a creator and software developer active in our industry. The syntax is a bit awkward compared to JSON and looks more like an INI file. It’s not a bad syntax, but it takes some getting used to.

[[books]]
id = 'bk101'
author = 'Crockford, Douglas'
title = 'JavaScript: The Good Parts'
genre = 'Computer'
price = 29.99
publish_date = 2008-05-01T00:00:00+00:00
description = 'Unearthing the Excellence in JavaScript'

TOML has some great features built in, such as multi-line strings, automatic escaping of reserved characters, data types for dates, times, integers, floats, and scientific notation, and “table expansion”. That last one is special, and it is what makes TOML so succinct:

[a.b.c]
d = 'Hello'
e = 'World'

The above expands to the following:

{
  "a": {
    "b": {
      "c": {
        "d": "Hello",
        "e": "World"
      }
    }
  }
}

With TOML, you can definitely save a lot of time and file length. Few systems use it, or anything very similar, for configuration, and that is its biggest drawback. There simply aren’t many languages or libraries written to interpret TOML.
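The expansion of a dotted table header can be pictured with a few lines of Python. This is only an illustrative sketch of the idea, not a TOML parser: `expand_table` is a hypothetical helper, and it assumes plain dot-separated keys with no quoting.

```python
import json


def expand_table(header: str, values: dict) -> dict:
    """Nest `values` under each dot-separated key in `header`.

    Hypothetical helper for illustration only; it does not handle
    quoted keys or any other part of the TOML grammar.
    """
    nested = dict(values)
    # Wrap from the innermost key outward: c, then b, then a.
    for key in reversed(header.split(".")):
        nested = {key: nested}
    return nested


expanded = expand_table("a.b.c", {"d": "Hello", "e": "World"})
print(json.dumps(expanded, indent=2))
```

One short header line stands in for three levels of nesting, which is where the savings in file length come from.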

CSON: A simple sample of what a particular system contains

First, there are two CSON specifications. One stands for CoffeeScript Object Notation and the other for Cursive Script Object Notation. The latter is not often used, so we won’t pay attention to it. We’ll just focus on CoffeeScript.

CSON needs a little introduction. First, let’s talk about CoffeeScript. CoffeeScript is a language that is run through a compiler to generate JavaScript. It lets you write JavaScript in a much cleaner syntax and have it translated into actual JavaScript, which you can then use in your Web applications. CoffeeScript makes writing JavaScript easier by removing much of the extra syntax that JavaScript requires. One big thing CoffeeScript gets rid of is curly braces: you don’t need them. In the same way, CSON is JSON without the curly braces. It relies on indentation to determine the hierarchy of the data. CSON is very easy to read and write, and it generally takes fewer lines than JSON because there are no braces.

CSON also offers some extra niceties that JSON does not. Multi-line strings are easy to write, you can add comments by starting a line with the # symbol, and you don’t need to separate key-value pairs with commas.

books: [
  id: 'bk102'
  author: 'Crockford, Douglas'
  title: 'JavaScript: The Good Parts'
  genre: 'Computer'
  price: 29.99
  publish_date: '2008-05-01'
  description: 'Unearthing the Excellence in JavaScript'
]

Here is the big problem with CSON: it is CoffeeScript Object Notation. That means CoffeeScript is what you use to parse, tokenize, lex, transpile, or otherwise do anything with CSON. CoffeeScript is the system that reads the data. If the purpose of data serialization is to allow data to be passed from one system to another, then here we have a data serialization format that can be read by only a single system, which makes it about as useful as fireproof matches, a waterproof sponge, or the annoyingly flimsy fork part of a spork.

If other systems adopted this format, it could be very useful in the developer world. So far that basically hasn’t happened, so using it from other languages like PHP or Java isn’t going to work.

YAML: The cry of youth

Developers rejoice, because YAML comes to us from a Python contributor. YAML has the same feature set and a similar syntax to CSON, a range of new features, and parsers available in almost every Web programming language. It has additional features such as circular references, flexible wrapping, multi-line keys, typecast tags, binary data, object merging, and set maps. It’s very readable and writable, and it’s a superset of JSON, so you can use fully valid JSON syntax in YAML and everything works just fine. You hardly ever need quotes, and it can interpret most basic data types (strings, integers, floats, Booleans, etc.).

books:
  - id: bk102
    author: Crockford, Douglas
    title: 'JavaScript: The Good Parts'
    genre: Computer
    price: 29.99
    publish_date: !!str 2008-05-01
    description: Unearthing the Excellence in JavaScript

Younger people in the industry are rapidly adopting YAML as their preferred format for data serialization and system configuration, and they are smart to do so. YAML has all the terseness of CSON and all the data type interpretation of JSON. YAML is as easy to read as Canadians are to get along with.

YAML has two problems, and for me the first one is the big one. As of this writing, YAML parsers are not yet built into very many languages, so you will need a third-party library or extension to parse .yaml files in your language of choice. That isn’t a big deal in itself, but it seems that most developers who create YAML parsers have chosen to throw “additional features” into them at random. Some allow tokenization, some allow chain references, and some even allow inline calculations. This is all well and good (up to a point), except that none of these features are part of the specification, so they are hard to find in other parsers for other languages. The result is system lock-in, and you end up with the same problem as CSON: if you use a feature found in only one parser, other parsers will not be able to interpret the input. Most of these features are nonsense that belongs not in a dataset but in your application logic, so it’s best to simply ignore them and write your YAML to the specification.

The second problem is that few parsers fully implement the specification. All the basics are there, but it can be hard to find the more complex and newer features, like flexible wrapping, document markers, and circular references, in your language of choice. I haven’t needed any of these yet, so hopefully their absence won’t disappoint you too much either. Given the above, I prefer to stick to the more mature feature set presented in the 1.1 specification and avoid the newer things found in the 1.2 specification. However, programming is an ever-evolving monster, so by the time you finish reading this article, you may well be able to use the 1.2 specification.

Final philosophy

The final word is that each serialization language should be evaluated on a case-by-case basis. Some are the bee’s knees when it comes to machine readability. When it comes to human readability, some are the cat’s meow, and some are simply gilded turds. Here’s the final breakdown: if you are writing code for other code to read, use YAML. If you are writing code that writes code for other code to read, use JSON. Finally, if you are writing code that transpiles code into code that other code will read, reconsider your life choices.


Via: www.zionandzion.com/json-vs-xml…

GraveAccent by Tim Anderson, Lujun9972

This article is originally compiled by LCTT and released in Linux China