“This is the 13th day of my participation in the November Gwen Challenge. Check out the event details: The Last Gwen Challenge in 2021”

Hello everyone, I am Huaijin Shake Yu, a big-data newbie. At home I have two little gold-swallowing beasts, Jia and Jia, and I am an all-round dad who can both code and teach.

If you like my articles, you can [follow ⭐] + [like 👍] + [comment 📃]; your support is my motivation, and I look forward to growing together with you~


1. The cause

When we receive data, we usually need to run ETL on it, but the raw data is also best stored in the database first. As a result, we end up using the same data twice.

By default, Spark works like this: every time you run an operator on an RDD, Spark recomputes it from the source, derives the RDD again, and then applies the operator to it. The performance of this approach is poor.

So if you save the raw data and then filter it, the source data is reloaded from scratch, which wastes time.

2. Optimization begins

Persist RDDs that are used multiple times. Spark then saves the data in the RDD to memory or disk according to your persistence policy. Every subsequent operator applied to this RDD fetches the persisted data directly from memory or disk and operates on it, instead of recomputing the RDD from its source.

To persist an RDD, simply call cache() or persist() on it.

The cache() method persists all of the RDD's data in memory in deserialized form.

The persist() method lets you manually select a persistence level and persist the data in the specified manner.

```python
df = spark.sql(sql)
df1 = df.persist()
df1.createOrReplaceTempView(temp_table_name)
subdf = spark.sql(f"select * from {temp_table_name}")
```

With this in place, the RDD is not reloaded from the source.

For the persist() method, we can choose different levels of persistence for different business scenarios.

| Persistence level | Meaning |
| --- | --- |
| MEMORY_ONLY | Store the data in memory as deserialized Java objects. If memory cannot hold all the data, some of it may not be persisted, and the next time an operator runs on this RDD the unpersisted data is recomputed from the source. This is the default strategy, and it is what cache() actually uses. |
| MEMORY_AND_DISK | Store the data as deserialized Java objects, keeping as much as possible in memory. Data that does not fit is written to disk files and read back from disk the next time an operator runs on this RDD. |
| MEMORY_ONLY_SER | Same as MEMORY_ONLY, except that the RDD's data is serialized, each partition into a single byte array. This is more memory-efficient and prevents persisted data from taking up so much memory that it causes frequent GC. |
| MEMORY_AND_DISK_SER | Same as MEMORY_AND_DISK, except that the RDD's data is serialized, each partition into a single byte array. This is more memory-efficient and prevents persisted data from taking up so much memory that it causes frequent GC. |
| DISK_ONLY | Write all the data to disk files. |
| MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | For any of the levels above, the _2 suffix means each piece of persisted data is replicated, with the replica saved on another node. This replication is mainly for fault tolerance: if a node fails and its persisted data in memory or on disk is lost, subsequent RDD computations can use the replica on another node; without a replica, the data has to be recomputed from the source. |

How to choose the most appropriate persistence strategy

  • By default, the highest-performance level is MEMORY_ONLY, but only if your memory is large enough to hold all the data in the entire RDD. Because no serialization or deserialization is performed, that part of the performance overhead is avoided. Subsequent operators on this RDD work on pure in-memory data, do not need to read from disk files, and perform well. There is also no need to make a copy of the data and transfer it to another node. However, note that in a real production environment there are only a limited number of scenarios where this strategy can be used directly: if the RDD contains a large amount of data (say, billions of records), persisting at this level will cause the JVM to throw OOM (OutOfMemory) exceptions.

  • If a memory overflow occurs at the MEMORY_ONLY level, try MEMORY_ONLY_SER. This level serializes the RDD's data and keeps it in memory, where each partition is just one byte array, greatly reducing the number of objects and the memory footprint. The extra overhead of this level over MEMORY_ONLY is mainly serialization and deserialization; subsequent operators still work on pure in-memory data, so overall performance remains relatively high. If the RDD holds too much data, an OOM can still occur.

  • If none of the pure-memory levels fits, use the MEMORY_AND_DISK_SER policy rather than MEMORY_AND_DISK. Reaching this point means the RDD is very large and memory cannot hold all of it; serialized data takes less space, which saves both memory and disk. Data is still cached in memory first and written to disk only when the in-memory cache is insufficient.

  • DISK_ONLY and the _2-suffixed levels are generally not recommended. Reading and writing purely through disk files causes a dramatic drop in performance; sometimes it is faster to simply recompute the RDD. For the _2-suffixed levels, all data must be replicated and sent to other nodes; the replication and network transfer carry a high performance cost, so they are not worthwhile unless the job requires high availability.


Conclusion

If you like my articles, you can [follow ⭐] + [like 👍] + [comment 📃]; your support is my motivation, and I look forward to growing together with you~

You can also follow the official account "Huaijin Shake Yu Jia and Jia" to get resource downloads.