As a Java developer, why do I recommend that you avoid using Java serialization in your daily development?

Most back-end services today are built on a microservice architecture. Splitting services along business boundaries decouples them, but it also introduces new problems: different services now have to communicate through remote interface calls. To share a data object between two services, the object must be converted into a binary stream, transmitted over the network to the other service, and then converted back into an object before the service method can be invoked. This encoding and decoding process is called serialization and deserialization.

In a high-concurrency system, serialization speed affects request response time, and a large serialized payload reduces network throughput. An excellent serialization framework can therefore improve the overall performance of the system.

We all know that Java provides the RMI framework for exposing and invoking interfaces between services, and RMI serializes data objects with Java serialization. Yet mainstream frameworks rarely use Java serialization today: Spring Cloud uses JSON serialization, and Dubbo, although compatible with Java serialization, defaults to Hessian serialization.

Java serialization

First, let’s look at what Java serialization is and how it works. Java provides a serialization mechanism that serializes an object into binary form for writing to disk or sending over the network, and deserializes a byte array read from the network or disk back into an object for use in the program.

ObjectInputStream and ObjectOutputStream, the two stream classes the JDK provides for this, can serialize and deserialize only objects whose classes implement the Serializable interface.

By default, ObjectOutputStream serializes only the non-transient instance variables of an object; transient instance variables and static variables are not serialized.
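
A small sketch makes these rules visible. The `Credentials` class below is a hypothetical example: after a round trip through serialization, the regular field survives while the transient field comes back as `null` (statics are never written to the stream at all).

```java
import java.io.*;

// Hypothetical Credentials class: userName is serialized,
// the transient password and the static region field are not.
class Credentials implements Serializable {
    // Explicit version number, checked when the class is matched during deserialization
    private static final long serialVersionUID = 1L;
    static String region = "cn";   // static: never part of the serialized form
    String userName;
    transient String password;     // transient: skipped by serialization

    Credentials(String u, String p) { userName = u; password = p; }

    // Serialize to a byte array and immediately deserialize it back
    static Credentials roundTrip(Credentials c) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(c);
        return (Credentials) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        Credentials copy = roundTrip(new Credentials("test", "secret"));
        System.out.println(copy.userName + " / " + copy.password); // test / null
    }
}
```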

What is the serialVersionUID version number in a class that implements the Serializable interface for? During deserialization it verifies that the serialized object was produced by the same version of the class being loaded. If a class with the same name has a different version number, deserialization fails with an InvalidClassException and the object cannot be recovered.

Serialization is driven by the writeObject and readObject methods, which usually use the default implementation. You can also define them in a class that implements the Serializable interface to customize your own serialization and deserialization logic.

Java serialization also recognizes two other hook methods: writeReplace(), which substitutes a replacement object before serialization, and readResolve(), which substitutes the object returned after deserialization.
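
A classic use of readResolve() is preserving a singleton across deserialization: without it, readObject() would create a second instance. Below is a minimal sketch with a hypothetical `AppConfig` singleton; readResolve() hands the canonical instance back to the caller instead of the freshly deserialized copy.

```java
import java.io.*;

// Hypothetical serializable singleton
class AppConfig implements Serializable {
    private static final long serialVersionUID = 1L;
    static final AppConfig INSTANCE = new AppConfig();
    private AppConfig() {}

    // Called after deserialization; its return value replaces the deserialized object
    private Object readResolve() {
        return INSTANCE;
    }

    static AppConfig roundTrip() throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(INSTANCE);
        return (AppConfig) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        // Without readResolve() this would print false
        System.out.println(roundTrip() == INSTANCE); // true
    }
}
```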

Java serialization defects

Serialization provided by the JDK is rarely found in the RPC communication frameworks we use, mainly because the JDK’s default serialization has the following shortcomings: it cannot cross languages, it is vulnerable to attack, its serialized streams are too large, and its serialization performance is poor.

#### 1. Unable to cross languages

Many complex systems today are written in multiple languages, while Java serialization is implemented only by Java frameworks; most other languages neither use Java serialization nor implement its protocol. As a result, when two applications written in different languages need to communicate, Java serialization cannot serialize and deserialize the objects transferred between them.

#### 2. Vulnerable to attack

The Java secure coding guidelines state that “deserialization of untrusted data is inherently dangerous and should be avoided”. Clearly, Java serialization is not safe.

We know that objects are deserialized by calling readObject() on ObjectInputStream, which acts as a magic constructor: it can instantiate almost any object on the classpath that implements the Serializable interface. This means the method can execute arbitrary code while deserializing the byte stream, which is very dangerous.

An attacker can even mount an attack without executing any code, simply by crafting an object that takes a very long time to deserialize. For example, an attacker can build a deeply nested graph of objects and pass its serialized form to the program: deserialization then invokes hashCode an exponential number of times, effectively hanging the deserializing thread. The following case illustrates this well.


```java
// Deserialization bomb: 100 levels of nested HashSets whose hash codes
// are recomputed recursively, so deserializing it takes ~2^100 operations.
Set<Object> root = new HashSet<>();
Set<Object> s1 = root;
Set<Object> s2 = new HashSet<>();
for (int i = 0; i < 100; i++) {
    Set<Object> t1 = new HashSet<>();
    Set<Object> t2 = new HashSet<>();
    t1.add("test"); // make t1 differ from t2
    s1.add(t1);
    s1.add(t2);
    s2.add(t1);
    s2.add(t2);
    s1 = t1;
    s2 = t2;
}
// serializing `root` and feeding it to readObject() hangs the receiver
```

An earlier paper by the FoxGlove Security team showed that, through Apache Commons Collections, the Java deserialization vulnerability could be used to attack the latest versions of WebLogic, WebSphere, JBoss, Jenkins, and OpenNMS; none of the major Java web servers was spared.

Apache Commons Collections is a third-party library that extends the collection framework of the Java standard library; it provides many powerful data structures and implements a variety of collection utility classes.

The attack works as follows: Apache Commons Collections allows arbitrary class methods to be chained through reflection, so an attacker can upload attack code to the server through any endpoint that accepts the Java serialization protocol; the chain is triggered by TransformedMap in Apache Commons Collections.

How can this vulnerability be mitigated?

Many serialization protocols define their own data structures for storing and retrieving objects. JSON serialization and Protocol Buffers, for example, support only basic types and array types, which avoids creating unexpected instances during deserialization. Their designs are simple, yet sufficient for the data transfer requirements of most current systems. With Java serialization we can also restrict deserialization to a whitelist of classes by overriding the resolveClass method in an ObjectInputStream subclass and validating the class name there. The code looks like this:

```java
@Override
protected Class<?> resolveClass(ObjectStreamClass desc)
        throws IOException, ClassNotFoundException {
    // Whitelist: only allow the Bicycle class to be deserialized
    if (!desc.getName().equals(Bicycle.class.getName())) {
        throw new InvalidClassException(
            "Unauthorized deserialization attempt", desc.getName());
    }
    return super.resolveClass(desc);
}
```
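
Since Java 9, the JDK also ships a built-in filtering mechanism (JEP 290, `java.io.ObjectInputFilter`) for the same purpose. The sketch below, a minimal assumption-laden example rather than a production setup, installs a whitelist pattern that accepts only `java.lang` classes and rejects everything else.

```java
import java.io.*;

public class FilterDemo {
    // Deserialize with a class whitelist given as a JEP 290 filter pattern,
    // e.g. "java.lang.*;!*" = allow java.lang classes, reject all others.
    static Object readWithFilter(byte[] data, String pattern) throws Exception {
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data));
        in.setObjectInputFilter(ObjectInputFilter.Config.createFilter(pattern));
        return in.readObject();
    }

    public static void main(String[] args) throws Exception {
        // An Integer is in java.lang, so it passes the filter
        ByteArrayOutputStream ok = new ByteArrayOutputStream();
        new ObjectOutputStream(ok).writeObject(42);
        System.out.println(readWithFilter(ok.toByteArray(), "java.lang.*;!*"));

        // A HashMap is not whitelisted: the filter rejects it
        ByteArrayOutputStream bad = new ByteArrayOutputStream();
        new ObjectOutputStream(bad).writeObject(new java.util.HashMap<String, String>());
        try {
            readWithFilter(bad.toByteArray(), "java.lang.*;!*");
        } catch (InvalidClassException e) {
            System.out.println("rejected");
        }
    }
}
```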

#### 3. The serialized stream is too large

The size of the serialized binary stream is one indicator of serialization performance. The larger the serialized byte array, the more storage it occupies and the higher the storage cost; during network transmission it also consumes more bandwidth, which lowers system throughput.

Java serialization uses ObjectOutputStream to encode an object into binary. How does the size of the byte array it produces compare with an encoding built by hand with ByteBuffer from NIO?

We can verify this with a simple example:

```java
User user = new User();
user.setUserName("test");
user.setPassword("test");

ByteArrayOutputStream os = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(os);
out.writeObject(user);
byte[] testByte = os.toByteArray();
System.out.print("ObjectOutputStream byte length: " + testByte.length + "\n");
```

```java
ByteBuffer byteBuffer = ByteBuffer.allocate(2048);

byte[] userName = user.getUserName().getBytes();
byte[] password = user.getPassword().getBytes();
byteBuffer.putInt(userName.length);
byteBuffer.put(userName);
byteBuffer.putInt(password.length);
byteBuffer.put(password);
byteBuffer.flip();
byte[] bytes = new byte[byteBuffer.remaining()];
byteBuffer.get(bytes);
System.out.print("ByteBuffer encoding length: " + bytes.length + "\n");
```

Running results:

```
ObjectOutputStream byte length: 99
ByteBuffer encoding length: 16
```

It is clear that the byte array produced by Java serialization is several times larger than the one produced by ByteBuffer. Java serialization therefore inflates the stream, which ultimately hurts the throughput of the system.

#### 4. Poor serialization performance

Serialization speed is also an important measure of serialization performance. If serialization is slow, network communication becomes less efficient and the system’s response time increases. Let’s compare the performance of Java serialization with NIO ByteBuffer encoding:

```java
User user = new User();
user.setUserName("test");
user.setPassword("test");

long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000; i++) {
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(os);
    out.writeObject(user);
    out.flush();
    out.close();
    byte[] testByte = os.toByteArray();
    os.close();
}
long endTime = System.currentTimeMillis();
System.out.print("ObjectOutputStream serialization time: " + (endTime - startTime) + "\n");
```

```java
long startTime1 = System.currentTimeMillis();
for (int i = 0; i < 1000; i++) {
    ByteBuffer byteBuffer = ByteBuffer.allocate(2048);
    byte[] userName = user.getUserName().getBytes();
    byte[] password = user.getPassword().getBytes();
    byteBuffer.putInt(userName.length);
    byteBuffer.put(userName);
    byteBuffer.putInt(password.length);
    byteBuffer.put(password);
    byteBuffer.flip();
    byte[] bytes = new byte[byteBuffer.remaining()];
    byteBuffer.get(bytes);
}
long endTime1 = System.currentTimeMillis();
System.out.print("ByteBuffer serialization time: " + (endTime1 - startTime1) + "\n");
```

Running results:

```
ObjectOutputStream serialization time: 29
ByteBuffer serialization time: 6
```

This example makes it clear that Java serialization takes much longer to encode than ByteBuffer.

Many serialization frameworks can replace Java serialization, and most of them avoid Java’s default mechanism, such as FastJson, Kryo, Protobuf, and Hessian. Here is a brief introduction to the Protobuf serialization framework.

Protobuf is a multi-language serialization framework from Google. In the serialization performance comparisons published on mainstream websites, Protobuf currently ranks first in both encoding/decoding time and compressed binary stream size.

Protobuf is based on a .proto file that describes fields and their types, and its tooling can generate data structures for different languages. When serializing a data object, Protobuf produces the Protocol Buffers wire encoding according to the .proto file description.
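
As a sketch, a hypothetical .proto file describing the User object from the earlier examples might look like this (field names and tag numbers are illustrative assumptions):

```protobuf
// user.proto — hypothetical schema for the User object above
syntax = "proto3";

message User {
  string user_name = 1;  // field tag 1
  string password  = 2;  // field tag 2
}
```

Running the protoc compiler over this file generates the serializable data classes for each target language.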

So what is the Protocol Buffers storage format?

Protocol Buffers is a portable and efficient structured data storage format. It stores data in a T-L-V (Tag-Length-Value) layout. T is the field’s tag: Protocol Buffers assigns each field in the object a sequence number, the mapping between tags and fields is guaranteed by the generated code, and using an integer tag instead of the field name during serialization drastically reduces traffic. L is the length of the value in bytes, usually a single byte. V is the encoded field value. This format needs no separators or whitespace and carries no redundant field names.

Protobuf defines its own encoding types, mapping almost all the basic data types of languages such as Java and Python; different encoding types correspond to different data types and use different storage formats.

For Varint-encoded data, the length of the value can be determined from the encoding itself, so there is no need to store a byte length: Protocol Buffers actually stores such fields as T-V, saving yet another byte.

Protobuf’s Varint is a variable-length encoding. The highest bit of each byte is a flag bit (MSB): 0 indicates that the current byte is the last byte of the number, and 1 indicates that another byte follows.

An int32 number normally needs four bytes, but with Varint a small int32 value can be represented in a single byte. Since most integers in practice are small, this compresses the data well.
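
A minimal sketch of the Varint scheme, 7 payload bits per byte with the MSB as the continuation flag, shows the saving directly: the value 1 fits in one byte instead of four, and 300 fits in two.

```java
import java.io.ByteArrayOutputStream;

public class VarintDemo {
    // Encode a non-negative value as a Protocol Buffers style varint:
    // 7 payload bits per byte, MSB set to 1 while more bytes follow.
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // MSB=1: more bytes follow
            value >>>= 7;
        }
        out.write((int) value);                        // MSB=0: last byte
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeVarint(1).length);    // 1 byte instead of 4
        System.out.println(encodeVarint(300).length);  // 2 bytes: 0xAC 0x02
    }
}
```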

We know that an int32 can be negative, and normally the highest bit serves as the sign bit. Since Varint repurposes each byte’s highest bit as the flag bit, how are negative numbers represented? Varint uses ZigZag encoding to convert negative numbers into unsigned numbers, and the sint32/sint64 types use it so that negative numbers also take very few encoded bytes.
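
ZigZag interleaves signed values onto the unsigned number line (0→0, -1→1, 1→2, -2→3, …), so small negative numbers stay small and still fit in a short varint. A minimal sketch of the 32-bit mapping:

```java
public class ZigZagDemo {
    // Map a signed int onto the unsigned number line: 0→0, -1→1, 1→2, -2→3, ...
    static int zigZagEncode(int n) {
        return (n << 1) ^ (n >> 31);   // arithmetic shift spreads the sign bit
    }

    // Inverse mapping back to the signed value
    static int zigZagDecode(int n) {
        return (n >>> 1) ^ -(n & 1);
    }

    public static void main(String[] args) {
        System.out.println(zigZagEncode(-1)); // 1
        System.out.println(zigZagEncode(1));  // 2
        System.out.println(zigZagDecode(zigZagEncode(-64))); // -64
    }
}
```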

This storage format gives Protobuf not only compact data but also very efficient encoding and decoding: the .proto file format and the Protocol Buffers wire encoding require only simple arithmetic and bit-shift operations, so the overall performance of Protobuf is excellent.

Conclusion

  1. Java’s default serialization is implemented through the Serializable interface: once a class implements the interface (and a default version number is generated for it), serialization and deserialization work automatically without manual setup.
  2. Java’s default serialization, while easy to use, suffers from security vulnerabilities, lack of cross-language support, and poor performance, so I strongly recommend avoiding it.
  3. Among the mainstream serialization frameworks, FastJson, Protobuf, and Kryo are distinctive, and their performance and security are recognized by the industry; choose the framework that suits your business to optimize the serialization performance of your system.

For more original technical articles and learning materials, please follow the public account "hometown learning Java". I hope to make progress together with you, thank you!