Highlights

  • 3.3 explain
  • 3.5 Object size
  • 3.6.1 Querying specific fields + covering index
  • 4 Additional optimization: TLAB

1. The background

When a user queries the meeting records of the past three months, the average response time is as high as three to four seconds.

1.1 Optimization objectives

Reduce interface latency and ensure that the response time is within 200ms

Optimization has three dimensions: throughput, latency, and system capacity.

  • Throughput: how many operations the system can complete per unit of time
  • Latency: the response time of a single operation, e.g. the results of a search must be shown to the user within 200ms
  • System capacity: the additional constraints placed on the hardware environment while the throughput and latency targets are met

1.2 Optimization points involved

Network IO, SQL optimization, object size, TLAB

1.3 Analysis Tools or Commands Involved

Arthas, top, jstack, explain

1.4 Tuning Steps

When analyzing, I like to start at the system level (CPU, memory, network, and disk) and work top-down from there to locate the problematic code.

1.5 Preparing stress test data

Simulate the heaviest online user: 200 meeting records over three months, each meeting with an average of ten participants.

The data relationship is as follows

2. Look for interface bottlenecks (loop HTTP requests)

2.1 top

Run the top command to check CPU usage during the stress test

On a quad-core machine our application uses only 120% CPU, which is unreasonably low, so we can use the jstack command to see what the threads in the Java process are doing

2.2 jstack

We can use top -Hp PID to display the status of all threads in the process, convert a thread's PID to hexadecimal, and then use jstack to check that thread's state, but doing this one thread at a time is tedious

I prefer to run jstack -l <Java process PID> > stack.txt and pull the file down for analysis.

Running jstack only once may not be representative. jstack prints a thread snapshot, and a particular snapshot may happen to look normal, so take several snapshots and analyze them together
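The same snapshot idea can also be sketched in-process with the JDK's own ThreadMXBean. This is just an illustrative sketch of "sample several times, not once", not a replacement for jstack:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class ThreadStateSampler {
    // One snapshot: count how many threads sit in each state right now
    static Map<Thread.State, Integer> sampleOnce() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new HashMap<>();
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Several snapshots, mirroring "run jstack more than once"
        for (int i = 0; i < 3; i++) {
            System.out.println("snapshot " + i + ": " + sampleOnce());
            Thread.sleep(200);
        }
    }
}
```

If the same threads show up BLOCKED in snapshot after snapshot, that is the signal worth digging into.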

The results show that many threads are blocked in java.net.SocketInputStream.socketRead0 (Native Method). Following the stack information, the culprit is a third-party interface call inside a loop body:

for (MeetingRecord meetingRecord : meetingRecords) {
    // Calls another application over HTTP to fetch extra information
    getVirtualRoomById(meetingRecord.getId());
}

And just to see how long this takes, I'll trace it with Arthas

2.3 arthas

trace class-pattern method-pattern displays the internal invocation path of a method and outputs the time spent at each node along the path

In the loop body of this request, a single call takes 7ms at minimum and 1069ms at maximum, and the whole loop adds up to 22 seconds.

2.4 The fix

Now that the cause is known, I changed the loop to a single batch query and traced again with Arthas
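The shape of the change can be sketched roughly like this. The method names (getVirtualRoomById and a batch counterpart) and types are simplified stand-ins for the real code, and the sketch just counts simulated round trips instead of making real HTTP calls:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class BatchFetchSketch {
    static int remoteCalls = 0; // counts simulated HTTP round trips

    // Before: one HTTP call per meeting record, N round trips in total
    static List<String> fetchOneByOne(List<Long> meetingIds) {
        List<String> rooms = new ArrayList<>();
        for (Long id : meetingIds) {
            remoteCalls++;               // each iteration goes to the other app
            rooms.add("room-" + id);     // stand-in for getVirtualRoomById(id)
        }
        return rooms;
    }

    // After: collect the ids first, then one batch call, one round trip
    static List<String> fetchInBatch(List<Long> meetingIds) {
        remoteCalls++;                   // a single batch HTTP call
        return meetingIds.stream()       // stand-in for a batch getVirtualRooms call
                .map(id -> "room-" + id)
                .collect(Collectors.toList());
    }
}
```

With 200 records, the one-by-one version pays 200 network round trips while the batch version pays one, which is exactly where the 22 seconds went.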

Okay, that's a big improvement.

However, the latency still isn't up to standard. The average has dropped to 300ms, but that isn't within 200ms, so we keep optimizing


3. Looking for system bottlenecks (SQL optimization)

As usual, let’s look at the CPU usage using the top command

3.1 top

OK, the CPU usage now looks normal, so continue with jstack

3.2 jstack

Continuing with jstack -l <Java process PID> > stack.txt, I found that quite a few threads are still blocked in java.net.SocketInputStream.socketRead0 (Native Method), but this time the blocking is caused by two SQL statements. My first reaction was slow queries, so I quickly pulled the two statements out and analyzed them with explain

3.3 explain

SQL1: select * from meeting_record where staff_id = XXX and created_date between XXXX and XXXX, which retrieves meeting records by employee ID and time range

SQL2: select * from meeting_member mm left join meeting m on mm.meeting_id = m.id where mm.meeting_id = XXXX, which fetches the participants (foreign key, left join)

3.3.1 SQL1 analysis

Analyze SQL1 using Explain

select * from meeting_record where staff_id = XXX and created_date between XXXX and XXXX

+----+-------------+----------------+------------+------+-------------------------------+-------------------------------+---------+-------------+------+----------+-----------------------+
| id | select_type | table          | partitions | type | possible_keys                 | key                           | key_len | ref         | rows | filtered | Extra                 |
+----+-------------+----------------+------------+------+-------------------------------+-------------------------------+---------+-------------+------+----------+-----------------------+
|  1 | SIMPLE      | meeting_record | NULL       | ref  | idx_staff_id_and_created_date | idx_staff_id_and_created_date | 22      | const,const |    1 |   100.00 | Using index condition |
+----+-------------+----------------+------------+------+-------------------------------+-------------------------------+---------+-------------+------+----------+-----------------------+
  • type: ref, meaning a non-unique index is compared against a value, so multiple matching rows are possible
  • possible_keys: the indexes that might be used
  • key: idx_staff_id_and_created_date, the index actually used
  • ref: const, meaning the values matched against the index columns are constants
  • Extra: Using index condition. The index here is (staff_id, created_date), but the type is ref, which suggests only staff_id is compared against a value and created_date is not used for the lookup. Using index condition means rows matching the staff_id condition are filtered by created_date within the index first, and only then are the required fields fetched from the table

Using index condition indicates index condition pushdown (ICP), available since MySQL 5.6: rows matched by the index are further filtered inside the storage engine using the remaining WHERE conditions on indexed columns before the table row is fetched. The benefit is that many back-to-table lookups are saved, reducing random IO

3.3.2 SQL2 analysis

select * from meeting_member mm left join meeting m on mm.meeting_id = m.id where mm.meeting_id = XXXX

+----+-------------+-------+------------+--------+---------------+---------+---------+--------------------------------------+------+----------+-------+
| id | select_type | table | partitions | type   | possible_keys | key     | key_len | ref                                  | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+--------------------------------------+------+----------+-------+
|  1 | SIMPLE      | mm    | NULL       | ref    | idx_mno       | idx_mno | 131     | const                                |    1 |   100.00 | NULL  |
|  1 | SIMPLE      | m     | NULL       | eq_ref | PRIMARY       | PRIMARY | 130     | mindlinker_meeting_dev.mm.meeting_no |    1 |   100.00 | NULL  |
+----+-------------+-------+------------+--------+---------------+---------+---------+--------------------------------------+------+----------+-------+

There's nothing much to discuss here; the only problem is the foreign key. This is old code, and I don't know why foreign keys were used in the first place

Neither statement is a slow query, and both use indexes

3.4 jstat

Since SQL is fine and CPU utilization is reasonable, let’s look at memory

Running jstat -gc PID 1000 prints GC statistics every second. Something really is wrong: two or three YGCs per second and one FGC roughly every ten seconds

3.5 Object Size

Given the above, my guess is that the objects produced by those two SQL statements are too large, causing the frequent YGC. Let's do the math

  • Mark word: 8 bytes
  • Class pointer: 4 bytes (with class pointer compression enabled, -XX:+UseCompressedClassPointers)
  • Ordinary object pointer: 4 bytes (with oop compression enabled, -XX:+UseCompressedOops)
  • Alignment padding: the JVM reads and writes memory in 8-byte units, so an object whose size is not a multiple of 8 is padded
// Mark word + class pointer + field size + alignment padding
// 8 + 4 + 80 + 4 = 96 bytes
public class MeetingRecord {
    private long id;          // 8 bytes
    private String meetingNo; // 4 bytes (compressed reference)
    private String staffId;   // 4 bytes
    private Date createdDate; // 4 bytes
    // ... other fields, 80 bytes in total
}
// Mark word + class pointer + field size + alignment padding
// 8 + 4 + 124 + 0 = 136 bytes
public class Meeting {
    private String id;
    // ... all fields total 124 bytes
}
// Mark word + class pointer + field size + alignment padding
// 8 + 4 + 72 + 4 = 88 bytes
public class MeetingMember {
    private long id;
    private String meetingNo;
    // ... all fields total 72 bytes
}

According to the entity relationship, MeetingRecord: MeetingMember: Meeting = 1:10:1

However, there is a catch: MeetingMember joins Meeting via the foreign key, so each of the ten member rows also materializes a Meeting object, and the object ratio in this scenario is actually

MeetingRecord : MeetingMember : Meeting = 1 : 10 : (1 + 10)

I measured with 200 meetings and 10 participants per meeting, so the total size of objects produced by one request through these two SQL statements is approximately

  MeetingRecord * 200 + MeetingMember * 200 * 10 + Meeting * 200 * (10 + 1)
= 96 * 200 + 88 * 200 * 10 + 136 * 200 * 11
= 494400 B = 482 KB
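The arithmetic above can be double-checked with a few lines of Java; the header figures (8-byte mark word, 4-byte compressed class pointer) and the field totals are the ones from the sketches above:

```java
public class ObjectSizeMath {
    // The JVM reads/writes in 8-byte units, so sizes are padded up to a multiple of 8
    static long padTo8(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    // header = 8-byte mark word + 4-byte compressed class pointer, then the fields
    static long objectSize(long fieldBytes) {
        return padTo8(8 + 4 + fieldBytes);
    }

    public static void main(String[] args) {
        long meetingRecord = objectSize(80);   // 96 B
        long meeting       = objectSize(124);  // 136 B
        long meetingMember = objectSize(72);   // 88 B
        // 200 records, 10 members each, Meeting materialized (1 + 10) times per record
        long total = meetingRecord * 200 + meetingMember * 200 * 10 + meeting * 200 * 11;
        System.out.println(total + " B = " + total / 1024 + " KB"); // 494400 B = 482 KB
    }
}
```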

Good grief, two SQL statements hand me a whole 482 KB per request. Time to optimize the SQL.

Pointer compression: because of the alignment padding mechanism, Java object sizes are always multiples of 8 bytes. The JVM can therefore treat heap memory as 8-byte units, similar to the operating system's memory paging. A compressed pointer then stores the mapped unit number instead of the object's real 64-bit address

3.6 Optimizing the SQL

3.6.1 Querying specific fields + covering index

Looking at the business requirements, the meeting record query only needs the meeting number (meeting_no) of each Meeting

// Before
select * from meeting_record where staff_id = XXX and created_date between XXXX and XXXX
// After: query only the field we need
select meeting_no from meeting_record where staff_id = XXX and created_date between XXXX and XXXX

Since only meeting_no is needed here, we can also use a covering index to eliminate the random I/O caused by back-to-table lookups

// Extra is Using index condition
select meeting_no from meeting_record where staff_id = XXX and created_date between XXXX and XXXX

// Covering index: change the index to (staff_id, created_date, meeting_no)
select meeting_no from meeting_record where staff_id = XXX and created_date between XXXX and XXXX

Let’s use Explain to see what happens after optimization

+----+-------------+----------------+------------+-------+------------------------------+------------------------------+---------+------+------+----------+--------------------------+
| id | select_type | table          | partitions | type  | possible_keys                | key                          | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+----------------+------------+-------+------------------------------+------------------------------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | meeting_record | NULL       | range | idx_staff_created_meeting_no | idx_staff_created_meeting_no | 22      | NULL |    1 |   100.00 | Using where; Using index |
+----+-------------+----------------+------------+-------+------------------------------+------------------------------+---------+------+------+----------+--------------------------+
  • type: range, meaning an index range scan is used
  • Extra: Using where; Using index, meaning meeting_no is read directly from the covering index (staff_id, created_date, meeting_no) without touching the table

After optimization

3.6.2 Removing the foreign key

// Before: foreign-key left join
select * from meeting_member mm left join meeting m on mm.meeting_id = m.id where mm.meeting_id = XXXX
// Drop the join
select * from meeting_member where meeting_no = XXX
// And query only the fields we need
select name,avatar,status... from meeting_member where meeting_no = XXX

3.7 Situation after optimization

3.7.1 Object Size

MeetingRecord is gone from the total because the first query now returns only the meeting number

Meeting : MeetingMember = 1 : 10

  Meeting * 200 + MeetingMember * 200 * 10
= 136 * 200 + 88 * 200 * 10
= 203200 B = 198 KB

284 KB less than before optimization

3.7.2 GC conditions

The GC now looks much more normal, and at this point the latency finally met the target

4. Additional optimization -TLAB

Since a single request to this interface still generates on the order of a hundred-plus KB of objects, how will these objects be allocated in Eden?

Eden is shared by all threads, so when multiple threads allocate objects in Eden concurrently there are thread-safety issues, and allocation must be synchronized. To improve allocation efficiency, each thread gets its own thread-local allocation buffer (TLAB), which avoids or reduces lock use and enables fast allocation
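A toy model of the idea (not HotSpot's actual implementation): Eden is one shared bump pointer that needs synchronization, modeled here with a CAS, while allocation inside a thread's own TLAB is a plain unsynchronized bump:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model: each thread grabs a whole TLAB chunk from shared "Eden" at once,
// then bump-allocates inside that chunk with no synchronization at all.
public class TlabSketch {
    static final AtomicLong edenTop = new AtomicLong(0);   // shared Eden pointer
    static final long TLAB_SIZE = 100 * 1024;              // per-thread buffer size

    static class Tlab {
        long start, top, end;
        Tlab() { refill(); }
        void refill() {                    // one contended operation per TLAB, not per object
            start = edenTop.getAndAdd(TLAB_SIZE);
            top = start;
            end = start + TLAB_SIZE;
        }
        long allocate(long size) {         // fast path: a private bump, no lock
            if (top + size > end) refill(); // simplified: real TLABs may retire or slow-allocate
            long addr = top;
            top += size;
            return addr;
        }
    }
}
```

The point of the model: contention on Eden happens once per TLAB refill instead of once per object, which is why refill counts matter in the sections below.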

4.1 Process of object allocation

The refill threshold can be adjusted with TLABRefillWasteFraction. The default is 64, meaning the space allowed to be wasted when refilling is 1/64 of the TLAB size; for a 100 KB TLAB that is about 100/64 = 1.5 KB.
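A quick check of that arithmetic, assuming the limit is simply the TLAB size divided by TLABRefillWasteFraction as described above:

```java
public class TlabRefillWasteMath {
    // refill waste limit = tlab size / TLABRefillWasteFraction (default 64)
    static long refillWasteLimit(long tlabSizeBytes, long fraction) {
        return tlabSizeBytes / fraction;
    }

    public static void main(String[] args) {
        long limit = refillWasteLimit(100 * 1024, 64);  // 100 KB TLAB, default fraction
        System.out.println(limit + " B"); // 1600 B, i.e. roughly 1.5 KB may be discarded
    }
}
```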

I'm only going to focus on the green part of the flowchart here; there are branches I haven't drawn completely, such as what happens when allocating a new TLAB fails, or when the slow allocation fails

4.2 Usage of TLAB

Use -XX:+PrintTLAB to check TLAB usage

Focus on this

TLAB totals: thrds: 189  refills: 8959 max: 199 slow allocs: 676 max 34 waste:  1.1% gc: 3474216B max: 58648B slow: 2512B max: 344B fast: 282288B max: 5560B
  • thrds: total number of threads using TLABs (189)
  • refills: number of new TLABs allocated between the last GC and this one (8959)
  • max: maximum number of TLAB refills in a single thread (199)
  • slow allocs: number of slow (non-TLAB) object allocations across all threads (676)
  • max: maximum number of slow allocations in a single thread (34)
  • waste: percentage of TLAB memory wasted across all threads (1.1%)
  • gc: TLAB space still unused by its thread when GC occurs
  • slow: space wasted in old TLABs when new ones are allocated
  • fast: space wasted in old TLABs when new ones are allocated in C1-compiled code

As you can see, between the last GC and this one, TLABs were refilled nearly 9000 times and there were 676 slow allocations.

4.3 Process of applying for TLAB and slow allocation of objects

Since applying for a new TLAB and slow-allocating an object both operate on the shared Eden area, concurrency has to be considered here.

Flowchart for TLAB partition application and slow allocation of objects:

4.4 optimize TLAB

4.4.1 TLAB situation before optimization

The machines I tested on have a configuration similar to the online ones: 2 GB of RAM with 1280 MB of heap space. With the default JVM configuration, the young generation is 426 MB and the Eden region is 340 MB

By default, the total TLAB size is 1% of Eden, i.e. about 3481 KB for all TLABs combined. With the thread count above, the average TLAB per thread is about 18 KB, smaller than the object size I calculated for the two SQL statements earlier.

4.4.2 Adjusting TLAB size

Assuming each thread should get about 200 KB of TLAB, 200 threads need 39 MB, which means the total TLAB size would be about 8% of Eden
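Mirroring the arithmetic in this section (per-thread TLAB roughly equals total TLAB space divided by thread count, which is a simplification of what HotSpot actually does):

```java
public class TlabSizingMath {
    // Default case: total TLAB space is TLABWasteTargetPercent (1%) of Eden,
    // shared among the allocating threads
    static long defaultPerThreadKb(long edenKb, long threads) {
        return edenKb / 100 / threads;
    }

    public static void main(String[] args) {
        long edenKb = 340 * 1024;                      // Eden ≈ 340 MB
        long totalTlabKb = edenKb / 100;               // 3481 KB at the default 1%
        long perThreadKb = defaultPerThreadKb(edenKb, 189); // 189 threads from the PrintTLAB log
        System.out.println(totalTlabKb + " KB total, " + perThreadKb + " KB per thread");
        // → 3481 KB total, 18 KB per thread

        long desiredTotalMb = 200L * 200 / 1024;       // 200 threads * 200 KB each
        System.out.println(desiredTotalMb + " MB of TLAB space wanted"); // → 39 MB
    }
}
```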

You can adjust this with the JVM parameter -XX:TLABWasteTargetPercent=8

See the effect

TLAB totals: thrds: 121  refills: 500 max: 11 slow allocs: 5 max: 1 waste: 12.2% gc: 43304504B max: 8907064B slow: 16000B max: 15536B fast: 356856B max: 54400B

Much better: refills dropped from about 9000 to 500 and slow allocations from 676 to 5, but waste rose from 1.1% to 12.2%

Not every thread needs a TLAB as large as 200 KB; some threads may use only tens of KB between two GCs, so the leftover TLAB space is wasted at GC time and the waste figure rises accordingly. Tuning TLABWasteTargetPercent is therefore a trade-off: allocation gets more efficient, but memory fragmentation and wasted space become more visible

5. Supplements

  • Garbage collector

I chose CMS because it is a collector that aims for the shortest collection pauses, which matches my optimization goal. So why not G1? Mainly because memory is so small: G1 is officially recommended for heaps of 6 GB or more, and I only have 2 GB

  • Business logic optimization

There is actually plenty to optimize at the business layer, such as paging with limit to avoid querying too much data at once. However, those changes require cooperation from the client, so I left them alone

  • Why not cache

Mainly because this part is old code with no unit-test coverage, so I didn't dare touch it much; considering time and the scope of impact, I decided on the minimal change

All right, that's a wrap.


reference

  • GC Reference Manual -Java Edition
  • In-depth understanding of the Java virtual machine
  • JVM G1 source code analysis and tuning