System background and requirements

  • API gateway

Traditional financial systems depend heavily on integration with other systems, especially external ones, typically in the following ways:

A. File transfer, over SFTP or a similar secure and robust transfer protocol.

B. Sending attachments, such as Excel files, by email.

C. Browsing web pages and downloading files.

D. SWIFT – a proprietary information exchange protocol between banks

To varying degrees, these methods share the same problems: data exchange is not timely, or the protocol is complex and expensive. As a key part of the digital transformation of financial systems, the API gateway has become the cornerstone for changing all of this. As the bridge between users and backend services, it shoulders responsibilities such as security authentication and authorization, and request forwarding.
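To make the request-forwarding responsibility concrete, here is a minimal sketch of a route definition; it assumes the gateway is built on Spring Cloud Gateway, and the route name, path, and target service are hypothetical placeholders rather than our actual configuration.

```java
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class GatewayRouteConfig {

    @Bean
    public RouteLocator routes(RouteLocatorBuilder builder) {
        return builder.routes()
                // Forward /api/accounts/** to a (hypothetical) downstream account service,
                // stripping the gateway prefix before forwarding.
                .route("account-service", r -> r.path("/api/accounts/**")
                        .filters(f -> f.stripPrefix(1))
                        .uri("lb://account-service"))
                .build();
    }
}
```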

  • 7 * 24 availability

The transaction systems of financial institutions are critical, especially for a Global 500 multinational bank with business in most countries around the world. Users may access the system at any time, so building a highly available API gateway is naturally one of the requirements.

  • Zero-downtime deployment and continuous delivery

To get the features we develop into users' hands faster and meet their needs sooner, the business has an increasingly strong demand for continuous delivery. In the past, user requirements were developed in a waterfall model, which often took several months, or more than a year for large projects, and the R&D organization was split strictly into development, test, and operations teams; the operations team performing a release was often unfamiliar with a system that had been under development for months or even a year. Our R&D process has now been fully transformed to agile, and the team no longer distinguishes between development, test, and operations: our developers have full-stack capability and can release to and operate the production environment themselves.

Difficulties and Challenges

  • Three-tier network isolation: eDMZ, iDMZ, DRN

Due to the company's requirements for this class of application system, the API gateway must be deployed in a three-tier network isolation mode spanning the external DMZ (eDMZ), internal DMZ (iDMZ), and DRN, with each tier protected by an enterprise-grade network firewall. A user request entering the eDMZ from the Internet is handled by a reverse proxy server, which forwards it to the iDMZ; the reverse proxy in the iDMZ must verify the validity of the request and forward only requests that pass permission verification to the core service APIs in the DRN. During continuous delivery we must therefore handle the requests coming from the reverse proxy carefully, so that releases have zero impact on user requests.
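The following sketch illustrates the kind of permission check performed before a request is forwarded onward, assuming a Spring Cloud Gateway GlobalFilter; the header name and validation logic are simplified placeholders, not our actual verification rules.

```java
import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class PermissionCheckFilter implements GlobalFilter {

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        // Hypothetical header carrying the caller's access token.
        String token = exchange.getRequest().getHeaders().getFirst("Authorization");

        if (token == null || !isValid(token)) {
            // Reject at this layer; the request never reaches the core service APIs.
            exchange.getResponse().setStatusCode(HttpStatus.UNAUTHORIZED);
            return exchange.getResponse().setComplete();
        }
        // Only verified requests are forwarded onward.
        return chain.filter(exchange);
    }

    private boolean isValid(String token) {
        // Placeholder for the real authorization check (signature, scope, expiry, ...).
        return token.startsWith("Bearer ");
    }
}
```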

  • Long-distance active-active deployment

Due to the company's requirements for application systems at this criticality level, the API gateway must also be deployed in a remote data center for disaster recovery. To make full use of our servers, we also require the data centers in both locations to handle user requests simultaneously. As a result, we need to take the dual-data-center deployment into account as we move forward with continuous delivery.

  • Microservices Architecture

I’m sure many of you are already familiar with the comparison between traditional monolithic applications and microservices architectures, so I won’t go into too much detail here. The API gateway is a distributed, microservices-based system built with Spring Boot and Spring Cloud, comprising more than 10 microservices, including an authorization and authentication service, a routing service, and others. During continuous delivery we must ensure that these microservices can be upgraded safely without affecting users.

  • Graceful shutdown and rate limiting

During an application release, we inevitably need to throttle incoming user requests and shut down application services gracefully, so that requests already being processed are completed with zero impact.
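As an illustration, here is a minimal sketch of one way to drain an instance during a release using a plain Servlet filter: new requests are rejected so the load balancer routes them elsewhere, while in-flight requests are allowed to finish. The class and hook names are hypothetical, and the real release scripts may work differently.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Rejects new requests once draining starts, while letting in-flight requests finish.
public class GracefulDrainFilter implements Filter {

    private final AtomicBoolean draining = new AtomicBoolean(false);
    private final AtomicInteger inFlight = new AtomicInteger(0);

    /** Called by a (hypothetical) deployment hook before the instance leaves rotation. */
    public void startDraining() {
        draining.set(true);
    }

    /** True once every in-flight request has completed and the process can stop. */
    public boolean isIdle() {
        return draining.get() && inFlight.get() == 0;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (draining.get()) {
            // New traffic is turned away; the reverse proxy or load balancer retries elsewhere.
            ((HttpServletResponse) res).setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        inFlight.incrementAndGet();
        try {
            chain.doFilter(req, res);
        } finally {
            inFlight.decrementAndGet();
        }
    }

    @Override
    public void init(FilterConfig filterConfig) {
        // No initialization needed for this sketch.
    }

    @Override
    public void destroy() {
        // Nothing to clean up.
    }
}
```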

Continuous delivery in DevOps

  • Infrastructure as Code: one-click deployment with Ansible

Given the frequency of releases and the growing demand for correct, repeatable deployments, our requirement for deployment is full automation. We maintain an Ansible deployment script for each service release. The scripts precisely control the release order and use blue-green deployment, so applications are upgraded without users noticing.

  • A unified development and operations team

Our R&D team is based mainly in Mainland China and Hong Kong, with some members in Poland. At its peak, the global R&D team had more than 20 people, but no dedicated operations staff: every developer on the team was involved in writing code, testing, and deploying to the production environment.

  • Automated functional regression testing

Because our goal was to deliver continuously without compromising software quality, it was important to implement fully automated functional regression testing, so that before each release we had sufficient confidence that existing functionality would not be affected. Automated regression testing also greatly reduces the labor cost of testing, shortens the test cycle, avoids human error, and improves the team's skills, letting developers put their effort into meaningful work instead of repeating the same manual steps before every release. As a result, the team's developers stay motivated.
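As a rough example of what such a regression test can look like, the sketch below uses JUnit 5 and the JDK HTTP client to verify a gateway endpoint still responds as before; the URL and endpoint are placeholders, not our real environment.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.junit.jupiter.api.Test;

// A regression check against a deployed gateway endpoint; the URL is a placeholder.
class HealthEndpointRegressionTest {

    private static final String GATEWAY_URL = "https://gateway.example.internal/actuator/health";

    @Test
    void healthEndpointStillRespondsOk() throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(GATEWAY_URL)).GET().build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Existing behaviour must not change between releases.
        assertEquals(200, response.statusCode());
    }
}
```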

  • Automated performance testing

As part of our software quality assurance, we must also run performance tests before each production upgrade to ensure that new features do not degrade the software's current performance. Because we deliver continuously, this testing has to be automated: we use JMeter to run performance tests against the API gateway's core flows, produce a performance report, and compare and analyze it against previous reports. In this way we can easily tell whether a change degrades the performance of the existing software.

  • Automated builds with Jenkins

To standardize and speed up the build process, we use Jenkins to automate the build of each API gateway microservice. During the build we compile with the specified JDK and run the JUnit unit tests, ensuring that all unit tests pass before the application is packaged, so the build meets our quality requirements. To guarantee software quality we require unit test coverage of at least 75%; in fact, some of our core services, such as the authorization verification service, have coverage above 90%.
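For illustration, a unit test of the kind the build runs might look like the sketch below; TokenFormatValidator is a hypothetical helper defined inline for the example, not an actual gateway class.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class TokenFormatValidatorTest {

    /** Hypothetical helper standing in for real gateway validation logic. */
    static class TokenFormatValidator {
        boolean isWellFormed(String token) {
            return token != null && token.startsWith("Bearer ") && token.length() > 7;
        }
    }

    private final TokenFormatValidator validator = new TokenFormatValidator();

    @Test
    void acceptsWellFormedBearerToken() {
        assertTrue(validator.isWellFormed("Bearer abc123"));
    }

    @Test
    void rejectsMissingOrMalformedToken() {
        assertFalse(validator.isWellFormed(null));
        assertFalse(validator.isWellFormed("abc123"));
    }
}
```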

  • SonarQube quality checks

As part of the automated build, we use SonarQube to perform code style and quality checks. We require that the software we release contains no Critical or Major issues. The remaining low- and medium-severity issues are analyzed and triaged, and corrected where necessary, to ensure the code quality meets our expectations.

  • Static code security scanning with Checkmarx

As Internet security draws more and more attention, and as the gateway system of a large bank, our requirements for code security are extremely high. During the build, Checkmarx is invoked to perform static analysis of the code and detect security vulnerabilities. We require every application that goes live to be free of Critical and Major vulnerabilities.

  • Full link monitoring

In a complex microservices architecture, almost every API request forms a complex distributed service invocation chain. So when a failure occurs, how do you locate the faulty service, and how do you check for performance issues along the invocation path? We use Zipkin for full-link monitoring of the system.

Zipkin

Call chain trace analysis

Spans sharing the same TraceID are collected and sorted by time; chaining them together by ParentID reconstructs the call stack.

When an exception or timeout occurs, the TraceID is printed in the log, and we query the invocation chain by that TraceID to locate the problem.

Call process tracing

  • A global TraceID is generated when a request arrives, and the entire invocation chain can be connected by that TraceID; one TraceID represents one request.
  • In addition to the TraceID, a SpanID is needed to record the parent-child relationship of calls. Each service records its parent SpanID and its own SpanID, and from these the parent-child relationships can be assembled into a complete call chain.
  • A span without a ParentID is the root span and can be regarded as the entry point of the call chain. All of these IDs can be represented as globally unique 64-bit integers.
  • The TraceID and SpanID are passed along with every request throughout the call.
  • Each service records the TraceID that arrived with the request, the incoming SpanID as its ParentID, and the new SpanID it generates itself.
  • To view a complete call, retrieve all call records by TraceID and organize them into the parent-child structure using ParentID and SpanID (a minimal sketch of these rules follows this list).
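The sketch below illustrates these propagation rules using Zipkin's B3 header names; in practice an instrumentation library such as Spring Cloud Sleuth/Brave handles this automatically, so this hand-rolled version is only for illustration, and the downstream URL is a placeholder.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.concurrent.ThreadLocalRandom;

// Illustrates the TraceID/SpanID rules above using Zipkin's B3 propagation headers.
public class B3Propagation {

    /** Globally unique 64-bit ID rendered as 16 hex characters. */
    static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    public static void main(String[] args) {
        // At the entry point there is no incoming ParentID, so this is the root span.
        String traceId = newId();   // one TraceID per request, shared by every hop
        String rootSpanId = newId();

        // Calling a downstream service: our SpanID becomes the child's ParentSpanId,
        // and the child call gets a fresh SpanID of its own.
        String childSpanId = newId();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://downstream.example/api"))
                .header("X-B3-TraceId", traceId)
                .header("X-B3-ParentSpanId", rootSpanId)
                .header("X-B3-SpanId", childSpanId)
                .GET()
                .build();

        // Sending is omitted here; each service records (traceId, parentId, spanId)
        // so Zipkin can reassemble the full call chain by TraceID.
        System.out.println(request.headers().map());
    }
}
```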

Zipkin architecture diagram

The next step…

Google Cloud Migration

Why

  • Make the relational database active-active across regions and move off Oracle

At present, our services are deployed in the group's self-built data centers, with Oracle as the backend relational database. Each data center has two Oracle instances configured with VCS for primary/standby hot backup, and DataGuard synchronizes data between the data centers. With this architecture, the database is effectively only hot-warm and cannot achieve the ideal active-active setup.

We switch databases fairly often: every year we run failover tests on the application services to verify that the backup data center can carry the same load as the primary and that the switch can be completed within the required time, and we go through the same switch whenever the primary database fails. First, we cannot complete all of the switchover work in a very short time; second, the backup data center lags behind by a certain replication delay.

For these reasons, we are considering replacing the relational database with Google Cloud SQL or Spanner to achieve cross-region active-active operation of the relational database, reduce the work involved in database switchover, and keep data synchronized across data centers.
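As a rough sketch of what a Spanner-backed read could look like with the Google Cloud Java client, assuming placeholder project, instance, and database names:

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.spanner.Statement;

// Minimal read against Cloud Spanner; project, instance, and database IDs are placeholders.
public class SpannerReadSketch {

    public static void main(String[] args) {
        SpannerOptions options = SpannerOptions.newBuilder().setProjectId("my-project").build();
        Spanner spanner = options.getService();
        try {
            DatabaseClient client = spanner.getDatabaseClient(
                    DatabaseId.of("my-project", "gateway-instance", "gateway-db"));

            // Spanner handles multi-region replication itself, so reads and writes can be
            // served from either region without a manual primary/standby switchover.
            try (ResultSet rs = client.singleUse()
                    .executeQuery(Statement.of("SELECT CURRENT_TIMESTAMP()"))) {
                while (rs.next()) {
                    System.out.println(rs.getTimestamp(0));
                }
            }
        } finally {
            spanner.close();
        }
    }
}
```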

  • Elastic server scaling

Due to business growth, our gateway is likely to see a surge in requests over the next one to two years, so elastic scalability is very important to us. In today's self-built data center environment, applying for and configuring a production server takes several months or even a year, while customer requests may surge the very next second, so we must find a solution to this.

  • Comprehensive Infrastructure as Code, with servers rebuilt on a schedule

To support elastic scaling of servers, our build and deployment code will also cover server configuration, such as the JDK and environment variables, so that we can bring a usable server online directly from scripts.

  • Unified infrastructure and application monitoring

At present, we use a variety of monitoring platforms in our work:

  • We use the Plexus monitoring service for servers.
  • We use Splunk for distributed application logging.
  • For service availability monitoring, we also run a self-built BEATS monitoring service.
  • For server alerting, we use xMatters. In short, monitoring is scattered across many different tools.

We therefore hope to adopt a unified monitoring platform on which servers, applications, and alerts can all be configured in one place, simplifying the work and unifying the operating environment.