Author: Niu Zhiheng, development engineer, Tencent Interactive Entertainment

For commercial reprints, please contact Tencent WeTest for authorization; for non-commercial reprints, please credit the source.

Original link: Wetest.qq.com/lab/view/42…


WeTest introduction

This article gives a detailed introduction to XSS vulnerability attack and defense, covering vulnerability basics, XSS basics, encoding basics, XSS payloads, and XSS attack defense.


Part One: Basic Knowledge of Vulnerability Attack and Defense



XSS is part of the field of vulnerability attack and defense. To study it we need to understand some of the field's jargon so that we share a common vocabulary. I have also built a simple attack model to support the study of XSS vulnerabilities.


1. Vulnerability terminology

It’s good to know some simple terms.


VUL

A vulnerability is a bug that can be exploited to damage or compromise a system.


POC

Proof of Concept. It can be a text description with screenshots that proves the vulnerability exists, but more commonly it is code that proves the vulnerability exists. A PoC generally does not damage the vulnerable system.


EXP

Exploit. Code that uses a vulnerability to attack a system.


Payload

The payload is the attack code that you include in an exploit.


PWN

Hacker slang for breaking into, or taking control of, a device or system.


0day vulnerability and 0day attack

A zero-day (0day) vulnerability is typically a security vulnerability that has not yet been patched.

A zero-day (0day) attack is an attack that exploits such a vulnerability.

Zero-day vulnerabilities are not only a favorite of hackers but also an important measure of a hacker's technical level.


CVE vulnerability number

Common Vulnerabilities and Exposures (CVE) gives a common name to widely recognized information security vulnerabilities and to weaknesses that have been publicly exposed.

You can look up a vulnerability by its CVE number at https://cve.mitre.org/. You can also look it up at http://www.scap.org.cn/



2. Vulnerability attack model



A simple attack model is shown above. An attack is the process of delivering a payload from an injection point to an execution point. If that path runs unobstructed, the vulnerability can be exploited.



Part Two: XSS Basics



With the general groundwork done, we can now look at the basics of XSS. Without a solid grasp of these basics, further study is not worthwhile, because we would lack a common language.



1. What is XSS?

XSS stands for Cross-Site Scripting. An attacker uses an injection point on a website to inject a malicious payload that the client can parse. When a victim visits the site, the payload runs at an execution point on the client and achieves the attacker's goal, such as obtaining the user's privileges, spreading malicious content, or phishing.


2. Classification of XSS

It is hard to learn XSS well without understanding how it is classified. There are many misunderstandings about XSS classification, and many articles explain it incorrectly. Here I offer a classification that I consider relatively sound.



2.1 Classification by Payload Source



Stored XSS

The payload is stored permanently on the server, so this type is also called persistent XSS. When the browser requests data, the payload is returned from the server and executed.


The process is shown as follows:



Stored XSS example:

Publish a post containing the payload -> the post is saved to the database -> a victim visits the page containing the post -> the payload is executed


Reflected XSS

Also called non-persistent XSS. In the first case, the payload comes from the client and is executed directly on the client. In the second case, temporary data sent from the client to the server is echoed straight back to the client and executed.


The process is shown as follows:



Reflected XSS examples:

1. Spread a link that contains the payload -> a victim opens the link -> the payload is executed.

2. Enter content containing the payload in the search box on the client -> the server responds that the search content was not found and echoes it back -> the payload is executed.
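
The second reflected case can be sketched in a few lines. This is a minimal illustration rather than the article's own code: `renderNoResultsUnsafe`, `renderNoResultsSafe`, and `escapeHtml` are hypothetical names, and the "server" is reduced to a function that builds an HTML string.

```javascript
// Hypothetical server-side template for a "no results" page.
// Vulnerable version: the search term is echoed into the HTML unescaped.
function renderNoResultsUnsafe(query) {
  return '<p>No results found for: ' + query + '</p>';
}

// HTML-escape the five characters that matter for markup injection.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Safer version: the reflected value is escaped before it is echoed.
function renderNoResultsSafe(query) {
  return '<p>No results found for: ' + escapeHtml(query) + '</p>';
}

const payload = "<img src=x onerror=alert('xss')>";
console.log(renderNoResultsUnsafe(payload)); // payload reaches the browser as markup
console.log(renderNoResultsSafe(payload));   // payload reaches the browser as text
```

In the unsafe version the browser would parse the echoed value as an img element and run its onerror handler; in the safe version it only displays the text.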


2.2 Classification by Payload Location



DOM-based XSS

Client-side JavaScript manipulates the DOM or BOM in a way that causes the payload to execute. It is called DOM-based XSS because the payload is mainly triggered through DOM manipulation; since BOM manipulation can trigger it as well, the name is slightly inaccurate.

With DOM-based XSS the payload is not in the HTML code, which makes automated vulnerability detection difficult.


The process is shown as follows:



Example of reflected DOM-based XSS:

Enter content containing the payload in the search box on the client -> a message is displayed saying the search content was not found -> the payload is executed.


Example of stored DOM-based XSS:

Publish a post containing the payload -> the post is saved to the database -> a victim visits the page containing the post -> JavaScript executes the payload by manipulating the DOM or BOM
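
A rough sketch of the DOM-based case follows. It assumes a hypothetical page that reads a `q` parameter from the URL and shows it in a result element; `showKeywordUnsafe` and `showKeywordSafe` are made-up names, and the element and query string are passed in as arguments so the sketch also runs outside a browser.

```javascript
// `container` stands in for a DOM element and `search` for location.search.
function showKeywordUnsafe(container, search) {
  const keyword = new URLSearchParams(search).get('q') || '';
  // Execution point: the string is parsed as HTML, so markup in `q` becomes live DOM.
  container.innerHTML = 'No results for ' + keyword;
}

function showKeywordSafe(container, search) {
  const keyword = new URLSearchParams(search).get('q') || '';
  // textContent treats the value as plain text; nothing is parsed as HTML.
  container.textContent = 'No results for ' + keyword;
}
```

In a browser the call would look like `showKeywordSafe(document.getElementById('result'), location.search)`, where the `result` id is an assumption. Note that the payload never appears in the HTML the server sent, which is exactly why this class is hard to detect by scanning responses.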


HTML-based XSS

The payload is included in the HTML returned by the server and is executed when the browser parses that HTML. Automated vulnerability detection is easy here because the payload is in the HTML. HTML-based XSS also comes in reflected and stored forms.


The process is shown as follows:



Example of reflected HTML-based XSS:

Enter content containing the payload in the search box on the client -> the server responds that the search content was not found -> the payload contained in the returned HTML is executed.

Example of stored HTML-based XSS:

Publish a post containing the payload -> the payload is saved to the database -> a victim visits the page containing the post -> the payload is executed as part of the HTML page


3. Attack Purpose and Harm of XSS

Many people who write insecure code do not clearly understand the harm that vulnerabilities cause. The following is the OWASP Top 10 list of web security threats for 2017:



As you can see, XSS holds a prominent place among these threats.


3.1 Purpose

1. Hijack cookies

2. Tamper with web pages for phishing or to spread malicious content

3. Redirect the website

4. Obtain user information


3.2 Harm

1. Harm through propagation

2. Threats to system security



Part Three: XSS Attack Payloads



In this part we analyze the payload in the attack model. To understand payloads we must understand encoding, and to learn JS well we must also know encoding well. Encoding is the most basic skill for anyone who really wants to do network security well.


1. Encoding Basics

The encoding part is the most important. It is dry, but it has to be learned, because many payload variants are built on encoding. Here we use a hexadecimal viewer to learn encoding thoroughly.


1.1 Encoding Tools

Hexadecimal viewer: makes it easy to view a file's bytes in hexadecimal

Mac: Hex Fiend

Windows: HxD


Editor: Sublime Text, which can save a file in different encodings



1.2 ASCII

Definition: the American Standard Code for Information Interchange, a computer encoding system based on the Latin alphabet, used mainly for displaying modern English and other Western European languages.


Encoding: ASCII is a single-byte encoding. It specifies 128 characters in total, using only the low 7 bits of a byte; the highest bit is always 0. The 33 characters at 0 to 31 and 127 are control or communication characters. The 95 characters from 32 to 126 are printable characters (32 is the space).


1.3 ISO-8859-1 (Latin1)

Definition: Latin1 is the alias of ISO-8859-1. In addition to the ASCII characters, it contains the extra characters needed by Western European languages. The euro symbol appeared relatively late and is not included in ISO-8859-1.


Encoding: ISO-8859-1 is a single-byte encoding that is backward compatible with ASCII. Its range is 0x00-0xFF: 0x00-0x7F is identical to ASCII, 0x80-0x9F holds control characters, and 0xA0-0xFF holds text symbols.


Note: ISO-8859-1 covers a narrow range of characters and cannot represent Chinese characters. However, because it is a single-byte encoding, matching the computer's most basic unit of representation, ISO-8859-1 is still often used to carry data in other encodings. For example, the two characters “中文” have no ISO-8859-1 encoding. In GB2312 they are the two characters “D6D0 CEC4”; read as ISO-8859-1 they are split into four bytes, “D6 D0 CE C4” (storage is in fact processed byte by byte as well). This is why latin1 in MySQL can hold characters of any encoding.


Relationship between Latin1 and ASCII: Latin1 is fully compatible with ASCII.


1.4 Unicode Encoding (UCS-2)

Code point: the numeric representation of a character. A character set can generally be represented as one or more two-dimensional tables of rows and columns. Each intersection of a row and a column is called a code point, and each code point is assigned a unique number, called the code point value or code point number.


BOM (Byte Order Mark): a marker at the head of a file that indicates the byte order. If the most significant byte comes first, the order is big-endian; if the least significant byte comes first, it is little-endian.


The Unicode character set contains a character called “ZERO WIDTH NO-BREAK SPACE” whose code point is U+FEFF. U+FFFE is not a valid character in Unicode, so it should never appear in actual transmission. Before transmitting a byte stream, we can send “ZERO WIDTH NO-BREAK SPACE” first to indicate the byte order: a receiver that reads FE FF knows the stream is big-endian, and one that reads FF FE knows it is little-endian. This is why the character is also called the BOM.


The BOM can also indicate the encoding of a text file. Windows uses the BOM to mark the encoding of text files; on Mac, files may come with or without a BOM.


For example, in \u00FF, 00 is the first byte and FF is the second byte. Code point notation matches big-endian order.


Unicode character set: the collection of all the world's characters, with each character assigned a unique number called a code point, written as U+ followed by a hexadecimal number. All characters are divided into 17 planes (numbered 0 to 16): the Basic Multilingual Plane and the supplementary planes. The Basic Multilingual Plane, also called plane 0, holds the most widely used characters, with code points from U+0000 to U+FFFF. Each plane has 2^16 = 65536 code points.


Unicode encodings: characters in the Unicode character set can be encoded in many ways, such as UTF-8, UTF-16, UTF-32, and compressed forms. What is commonly called “Unicode encoding” is UCS-2, which maps the character number (the Unicode code point) directly to the encoded value, with no special conversion algorithm. It is a fixed-length two-byte encoding, because UCS-2 covers only the Basic Multilingual Plane (U+0000 to U+FFFF).


BOM of UCS-2: big-endian: FEFF; little-endian: FFFE.


A file saved as “UTF-16 BE with BOM” is equivalent to big-endian UCS-2 and starts with FEFF in hexadecimal.
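
The byte layouts described above can be checked with a short sketch. `toUcs2Bytes` and `toHex` are hypothetical helper names, and the function assumes its input contains only Basic Multilingual Plane characters, as UCS-2 requires.

```javascript
// Encode a BMP-only string as UCS-2 bytes, preceded by the matching BOM.
function toUcs2Bytes(text, littleEndian) {
  const bytes = littleEndian ? [0xff, 0xfe] : [0xfe, 0xff]; // BOM
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i); // one 16-bit code unit
    const high = unit >> 8;
    const low = unit & 0xff;
    if (littleEndian) {
      bytes.push(low, high);
    } else {
      bytes.push(high, low);
    }
  }
  return bytes;
}

// Format bytes the way a hexadecimal viewer shows them.
function toHex(bytes) {
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');
}

console.log(toHex(toUcs2Bytes('A\u00FF', false))); // FE FF 00 41 00 FF
console.log(toHex(toUcs2Bytes('A\u00FF', true)));  // FF FE 41 00 FF 00
```

Comparing the two outputs shows that the same code points produce swapped byte pairs, and that the first two bytes are enough to tell which order is in use.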



The relationship between Latin1 and Unicode encoding: Latin1 corresponds to the first 256 code points of Unicode.


1.5 UTF-16

Definition and encoding: UTF-16 is one of the encoding forms of Unicode. It stores every character defined in the Basic Multilingual Plane in 2 bytes, whether it is a Latin letter, a Chinese character, or any other character or symbol. Characters defined in the supplementary planes are stored as surrogate pairs, that is, two 2-byte values. Each character therefore takes either 2 or 4 bytes.



Relationship between UTF-16 and UCS-2: UTF-16 can be regarded as a superset of UCS-2. When no surrogate code points are involved, UTF-16 and UCS-2 mean the same thing; once supplementary-plane characters are introduced, only the name UTF-16 applies. Today, software that claims to support UCS-2 is implying that it cannot handle the UTF-16 characters that need more than 2 bytes. For UCS codes below 0x10000, the UTF-16 code equals the UCS code.



UTF-16 BOM: big-endian: FEFF; little-endian: FFFE.
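
As a worked example of surrogate pairs, the sketch below splits a supplementary-plane code point into its two 16-bit values using the standard UTF-16 arithmetic. `toSurrogatePair` is a hypothetical helper name; U+1F600 is simply a convenient code point outside the Basic Multilingual Plane.

```javascript
// Split a supplementary-plane code point into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('code point is not in a supplementary plane');
  }
  const offset = codePoint - 0x10000;     // 20 significant bits
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1f600)); // true
```

This is also why a JS string holding one such character reports a length of 2: the length counts 16-bit code units, not code points.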


1.6 UTF-8

Definition and encoding: UTF-8 is the most widely used implementation of Unicode on the Internet. It is an encoding designed for transmission, and it frees text from national boundaries so that characters from all the world's cultures can be displayed. Its most notable feature is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, depending on the symbol. A character in the ASCII range is represented in one byte, identical to its ASCII encoding. Note that a Chinese character takes two bytes in UCS-2 but three bytes in UTF-8. Converting a Unicode code point to UTF-8 is not a direct mapping; it requires a conversion algorithm with specific rules.


Unicode code point range (hexadecimal) | UTF-8 encoding (binary)
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


UTF-8 BOM: EFBBBF. UTF-8 has no byte-order issue, but the BOM can be used to mark a file as UTF-8. A file saved as “UTF-8 with BOM” starts with EFBBBF in hexadecimal.
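
The four ranges of the UTF-8 scheme translate directly into code. The following is a minimal sketch, with `encodeUtf8` as a hypothetical name; it assumes a valid code point and, for brevity, does not reject surrogate values.

```javascript
// Encode one Unicode code point as UTF-8 bytes, one branch per range.
function encodeUtf8(cp) {
  if (cp <= 0x7f) {
    return [cp];                                   // 0xxxxxxx
  }
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    return [                                       // 1110xxxx 10xxxxxx 10xxxxxx
      0xe0 | (cp >> 12),
      0x80 | ((cp >> 6) & 0x3f),
      0x80 | (cp & 0x3f),
    ];
  }
  return [                                         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

const hex = (bytes) => bytes.map((b) => b.toString(16).toUpperCase()).join(' ');
console.log(hex(encodeUtf8(0x41)));    // 41
console.log(hex(encodeUtf8(0xff)));    // C3 BF
console.log(hex(encodeUtf8(0x725b)));  // E7 89 9B
console.log(hex(encodeUtf8(0x1f600))); // F0 9F 98 80
```

The third line shows a Chinese character (U+725B) taking three bytes, and the second shows that a Latin1 character above 0x7F already needs two.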



1.7 GBK/GB2312

Definition and encoding: GB2312 is the earliest Chinese character encoding standard. It contains only 6763 Chinese characters, supports only simplified Chinese, and is incomplete, which is clearly not enough. GBK extends GB2312 and is fully compatible with it; it supports both simplified and traditional Chinese characters. GBK uses a mixed single-byte and double-byte scheme: the single-byte area (0x00-0x7F) is consistent with ASCII, and the double-byte area holds the Chinese characters, with a code range of 8140-FEFE excluding the xx7F code points, for a total of 23940 code points.


Relationship between GBK and Latin1: GBK's single-byte area (0x00-0x7F) matches the ASCII range, which is the lower half of Latin1. Bytes 0x81-0xFE are lead bytes of double-byte characters in GBK.


GBK and Unicode: GBK and the Unicode character set assign different codes to the same character, but the two can be converted to each other. For example, the Unicode value of “汉” differs from its GBK value; supposing the Unicode value is A040 and the GBK value is B030, each can be mapped to the other. Unicode range for Chinese characters: U+4E00 to U+9FA5.


GBK and UTF-8: GBK encodes a Chinese character in two bytes, fewer than the three bytes UTF-8 needs, but UTF-8 is more universal. Conversion between GBK and UTF-8 goes through Unicode: GBK -> Unicode -> UTF-8


2. Encoding in the Front End

With the encoding basics in place, we can look at the encodings used in the front end, which is what makes it possible to really understand payloads. I believe the summary here is the most comprehensive one available.


2.1 Base64

Base64 encodes binary byte sequences as text made up of ASCII characters. When it is used, base64 is specified as the transfer encoding. The alphabet consists of the 26 uppercase and 26 lowercase Latin letters, the 10 digits, the plus sign (+), and the slash (/), 64 characters in all, plus the equals sign (=) as a padding suffix, for 65 characters in total.


Put 3 bytes of data into a 24-bit buffer, with the first byte occupying the highest position. If fewer than 3 bytes remain, the rest of the buffer is filled with zeros. Then take 6 bits at a time and output the corresponding Base64 character. If the length of the original data is not a multiple of 3 and one byte is left over, append two = characters to the result; if two bytes are left over, append one =. As a result, Base64-encoded data is about four-thirds the size of the original data.


Standard Base64 is not suitable for direct use in URLs, because a URL encoder converts the / and + characters of standard Base64 into %XX form, and these % signs then need further conversion when stored in a database, since % is already used as a wildcard in ANSI SQL. To solve this problem, a URL-safe variant of Base64 can be used. It omits the = padding at the end and replaces the + and / of standard Base64 with - and _ respectively. This avoids the conversions otherwise needed for URL encoding and decoding and for database storage, avoids the growth in length that those conversions cause, and keeps object identifiers in a uniform format across databases, forms, and so on.
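
The buffer-and-padding procedure and the URL-safe variant can both be sketched in a few lines. This is an illustrative implementation, not a replacement for built-in encoders; `base64Encode`, `toUrlSafe`, and `bytesOf` are hypothetical names, and the input is assumed to be an array of byte values (0 to 255).

```javascript
const STANDARD = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

function base64Encode(bytes) {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = bytes.length - i;
    // Pack up to 3 bytes into a 24-bit buffer, first byte in the highest position;
    // missing bytes are filled with zeros.
    const buffer = (bytes[i] << 16) | ((bytes[i + 1] || 0) << 8) | (bytes[i + 2] || 0);
    // Take 6 bits at a time; pad with '=' where no input byte contributed.
    out += STANDARD[(buffer >> 18) & 0x3f];
    out += STANDARD[(buffer >> 12) & 0x3f];
    out += remaining > 1 ? STANDARD[(buffer >> 6) & 0x3f] : '=';
    out += remaining > 2 ? STANDARD[buffer & 0x3f] : '=';
  }
  return out;
}

// URL-safe variant: '+' becomes '-', '/' becomes '_', trailing '=' is dropped.
function toUrlSafe(b64) {
  return b64.replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

const bytesOf = (s) => Array.from(s, (ch) => ch.charCodeAt(0)); // ASCII input only

console.log(base64Encode(bytesOf('Man'))); // TWFu
console.log(base64Encode(bytesOf('Ma')));  // TWE=
console.log(base64Encode(bytesOf('M')));   // TQ==
console.log(toUrlSafe(base64Encode([0xfb, 0xff]))); // -_8
```

Three input bytes always become four output characters, which is where the four-thirds size ratio comes from.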


window.btoa / window.atob: Base64 encoding (binary to ASCII) and decoding. Both support only the Latin1 character set.


2.2 JS Escape Characters

JS strings can contain special escape sequences that begin with a backslash. They represent non-printing characters and characters reserved for other purposes, and they can also represent any Unicode or Latin1 character.


Escape sequence | Meaning
\' | Single quote
\" | Double quote
\& | Ampersand
\\ | Backslash
\n | Newline
\r | Carriage return
\t | Tab
\b | Backspace
\f | Form feed
\n … \nnn (n: 0 to 7) | A Latin-1 character specified by a one- to three-digit octal number (1 to 377)
\xnn | A Latin1 character specified by the hexadecimal number nn (n: 0 to F); \x41 represents the character A
\unnnn | A Unicode character specified by the hexadecimal number nnnn (n: 0 to F); covers only code points in the range \u0000 to \uFFFF
\u{n} … \u{nnnnnn} | A Unicode character specified by its code point value

Special attention:

1. A newline character \n placed in innerHTML is shown only as a single space; it does not break the line.

2. Any Unicode or Latin1 character can be written with the octal, \u, or \x escapes. This makes it possible to obfuscate JS, which can be used either to protect JS code or to disguise an attack.


Example:

    function toUnicode(theString) {
      // Convert theString to a string of \uXXXX escapes. The result is meant
      // to be copied into source code; it cannot be executed directly.
      var unicodeString = '';
      for (var i = 0; i < theString.length; i++) {
        var theUnicode = theString.charCodeAt(i).toString(16).toUpperCase();
        while (theUnicode.length < 4) {
          theUnicode = '0' + theUnicode;
        }
        theUnicode = '\\u' + theUnicode;
        unicodeString += theUnicode;
      }
      return unicodeString;
    }

    var xssStr = "alert('xss')";
    var xssStrUnicode = toUnicode(xssStr);
    // output: "\u0061\u006C\u0065\u0072\u0074\u0028\u0027\u0078\u0073\u0073\u0027\u0029"
    eval("\u0061\u006C\u0065\u0072\u0074\u0028\u0027\u0078\u0073\u0073\u0027\u0029"); // pops up the xss alert
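
A quick way to convince yourself that the escape forms are interchangeable is to compare them directly. The snippet below is a small illustrative check; the variable name `hidden` is arbitrary, and octal escapes are left out because they are rejected in strict mode.

```javascript
// The same character written four ways.
console.log('\x41' === 'A');   // true
console.log('\u0041' === 'A'); // true
console.log('\u{41}' === 'A'); // true

// A code point above \uFFFF needs \u{...} or an explicit surrogate pair.
console.log('\u{1F600}' === '\uD83D\uDE00'); // true
console.log('\u{1F600}'.length);             // 2

// Escapes are resolved when the literal is parsed, so the source text
// never contains the word it spells out.
const hidden = '\x61\x6c\x65\x72\x74';
console.log(hidden); // alert
```

This is what makes escape-based obfuscation effective against naive keyword filters that inspect source text rather than evaluated strings.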


2.3 URL Encoding

RFC 1738 states that “only alphanumerics [0-9a-zA-Z], the special characters $-_.+!*'(), [not including the double quotes], and reserved characters used for their reserved purposes may be used unencoded within a URL.” Therefore, when a link contains Chinese characters or other characters outside this set, they must be encoded. Because there are many browser vendors, URL encoding has taken various forms; unless encoding is handled uniformly, it causes real trouble in development and leads to garbled text.


URL encoding rule: convert each character that needs encoding to its UTF-8 bytes, then prefix each byte with %.


Such as:

    '牛' -> UTF-8 encoding E7899B -> URL encoding %E7%89%9B


JS provides three URL encoding functions for strings: escape, encodeURI, and encodeURIComponent.


escape: deprecated; do not use it.


encodeURI: leaves 82 characters unencoded: ! # $ & ' ( ) * + , / : ; = ? @ - . _ ~ 0-9 a-z A-Z. Because the reserved characters of a URL are left alone, it is suitable for encoding a whole URL.


encodeURIComponent: the most useful encoding function for us. It leaves 71 characters unencoded: ! ' ( ) * - . _ ~ 0-9 a-z A-Z.

As you can see, it does encode the reserved characters of a URL, so when a parameter value contains reserved characters such as @, &, or =, it can be encoded with this function and then transmitted.


The three corresponding decoding functions: unescape, decodeURI, decodeURIComponent
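
The difference between the two recommended functions is easiest to see side by side. In this sketch the URL `http://example.com/search` and the parameter names are placeholders chosen for illustration.

```javascript
// encodeURI: for a whole URL. Reserved characters such as : / ? & = survive.
const fullUrl = 'http://example.com/search?q=牛&page=1';
console.log(encodeURI(fullUrl)); // http://example.com/search?q=%E7%89%9B&page=1

// encodeURIComponent: for a single parameter value. Reserved characters are encoded,
// so the value cannot break out of its parameter.
const value = 'a&b=c@d';
console.log(encodeURIComponent(value)); // a%26b%3Dc%40d

const safeLink = 'http://example.com/search?q=' + encodeURIComponent(value);
console.log(safeLink); // http://example.com/search?q=a%26b%3Dc%40d

// Decoding reverses the UTF-8 percent-encoding.
console.log(decodeURIComponent('%E7%89%9B')); // 牛
```

Using encodeURI on the value instead would leave & and = intact, and the server would see an extra parameter named b.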

2.4 HTML Character Entities

Reserved characters in HTML must be replaced with character entities so that they are displayed as characters; otherwise they are parsed as HTML.



Character entity encoding rule: a character entity is written either as &# + character code + ; or as & + entity name + ;


Character entity conversions that must be applied to strings to defend against XSS:




Conversion method:


    function encodeHTML(a) {
      return String(a)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
    }


2.5 Page Encoding



Page encoding settings:

<meta charset="UTF-8">

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />



Note: For JS to work under both UTF-8 and GBK pages, escape every string that contains Chinese characters.


Example:

    alert("网络错误"); // displays: 网络错误
    alert("\u7f51\u7edc\u9519\u8bef"); // also displays: 网络错误


3. Payload Types

With encoding covered, we can now classify payloads. I believe this classification gives a clear picture of what a payload is, and it also helps map payloads to execution points.


3.1 Atomic Payload

The lowest-level payload.


JavaScript code snippets

They can be executed in eval, setTimeout, and setInterval. They can also be combined with HTML to form higher-order payloads.


javascript: pseudo-protocol

Structure: javascript: + JS code. It executes when an a tag whose href carries it is clicked, or when it is assigned to window.location.href.


Data URI scheme

Data URI structure: data:[<mediatype>][;base64],<data>. A data URI becomes an executable payload in the src attribute of an iframe and in the data attribute of an object.
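
To make the structure concrete, here is a small sketch that assembles a text/html data URI in both its plain and base64 forms. `toDataUri` is a hypothetical helper; the base64 branch relies on the global btoa, which handles only Latin1 input, so the sketch assumes an ASCII payload.

```javascript
// Build a data URI for an HTML document.
function toDataUri(html, useBase64) {
  if (useBase64) {
    // btoa accepts only Latin1 input; non-Latin1 text would need UTF-8 encoding first.
    return 'data:text/html;base64,' + btoa(html);
  }
  // Without base64, the data part is percent-encoded.
  return 'data:text/html,' + encodeURIComponent(html);
}

const html = "<script>alert('xss');</script>";
console.log(toDataUri(html, false));
// data:text/html,%3Cscript%3Ealert('xss')%3B%3C%2Fscript%3E
```

The base64 form hides the angle brackets and the word script entirely, which is one reason filters that only look for tags in attribute values miss it.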


JavaScript code snippets obfuscated with string escapes

The string is written with Unicode or Latin-1 escapes.

    eval("\u0061\u006C\u0065\u0072\u0074\u0028\u0027\u0078\u0073\u0073\u0027\u0029"); // executable JS


3.2 Pure HTML Payload

This type of payload contains no executable JS, but it carries a propagation risk: it can inject other sites' content into the attacked site.


HTML fragment containing a link jump

The main harm is propagation.

    <a href="http://ha.ck">…</a>


3.3 HTML Fragment Payload Containing an Atomic Payload

Script tag fragment

A script tag fragment can import external JS or contain a script that runs directly. Assigning this payload to innerHTML normally does not execute it, although it does in IE. It can be executed through document.write.

Example:

    // A script-tag payload such as <script>alert('xss');</script>
    var inputStr = "<script>alert('xss');<\/script>";
    document.write(inputStr);


HTML fragment containing an event handler

For example, an HTML fragment containing img onerror, svg onload, or input onfocus can be turned into an executable payload.


    // Any one of the following works as inputStr.
    var inputStr = "<img src=x onerror=alert('xss')>";
    var inputStr = "<svg/onload=alert('xss')>";
    var inputStr = "<input autofocus onfocus=alert('xss')>";
    xssDom.innerHTML = inputStr;


HTML fragment containing an attribute that executes JS

1. javascript: pseudo-protocol

    xssLink.setAttribute("href", "javascript:alert('xss')"); // runs when the link is clicked
    var inputStr = "javascript:alert('xss')";
    window.location.href = inputStr; // runs on assignment


2. Data URI

Example:

    // Payload: data:text/html,<script>alert('xss');</script>
    var inputStr = '<iframe src="data:text/html,<script>alert(\'xss\');<\/script>"></iframe>';
    // var inputStr = '<object data="data:text/html,<script>alert(\'xss\');<\/script>"></object>';
    xssDom.innerHTML = inputStr; // pops up alert("xss")


These are only the main payloads; there are many less common ones as well.



Part Four: XSS Attack Model Analysis



In this part we analyze the execution points and injection points of XSS according to the vulnerability attack model. Analyzing these two kinds of points is in effect the process of finding vulnerabilities.


1. XSS Vulnerability Execution Points

1. Pages rendered directly by the server (the DOM output as-is)

2. Client-side redirects: location.href / location.replace() / location.assign()

3. Writing values into the page: innerHTML, document.write, and their variants. This essentially writes an HTML fragment that carries an executable payload.

4. Dynamic script execution: eval, setTimeout(), setInterval()

5. Setting unsafe attributes with setAttribute. Unsafe attributes: href on the a tag, src on iframe, data on object

6. HTML5 postMessage data from untrusted domains.

7. Flawed third-party libraries.


2. XSS Vulnerability Injection Points

Let's look at where a payload can be injected.


1. Data returned by the server

2. Data entered by the user

3. URL data: href, search

4. Client-side storage: cookie, localStorage, and sessionStorage

5. Cross-domain calls: postMessage data, Referer, window.name


This covers essentially all execution points and injection points, and knowing them is a great help in XSS attack and defense.



Part Five: XSS Attack Defense Strategy



1. Tencent's Internal Common Security Defenses and Emergency Response

1. Integrate the common DOM XSS defense JS

2. Scan with the internal vulnerability scanning system

3. Tencent Security Emergency Response Center: security researchers can submit Tencent-related vulnerabilities through this platform and receive rewards according to the vulnerability rating.

4. Emergency response system for major incidents.


2. Secure Coding



2.1 Execution Point Defense Methods

Execution point | Defense
Server-rendered page | Server-side XSS filtering
Client-side redirect | Domain whitelist (for example, only the qq.com domain is allowed) plus XSS filtering of the link address
Writing values into the page | Client-side XSS filtering
Dynamic script execution | Ensure the JS string being executed comes from a trusted source
Setting unsafe attributes | XSS filtering of the content, including links and client-side redirect links
HTML5 postMessage | Restrict the source by checking origin
Flawed third-party library | Do not use
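
Two of these defenses, the domain whitelist for client-side redirects and the origin check for postMessage, can be sketched as follows. The function names, the `TRUSTED_HOST` and `TRUSTED_ORIGIN` constants, and the choice of `https://wetest.qq.com` as the trusted origin are assumptions made for illustration; only the qq.com whitelist example comes from the text above.

```javascript
const TRUSTED_HOST = 'qq.com';

// Allow a redirect only to http(s) URLs on the trusted domain or its subdomains.
function isAllowedRedirect(target) {
  let url;
  try {
    url = new URL(target);
  } catch (e) {
    return false; // not an absolute URL
  }
  // Rejects javascript:, data:, and any other non-web scheme.
  if (url.protocol !== 'https:' && url.protocol !== 'http:') {
    return false;
  }
  return url.hostname === TRUSTED_HOST || url.hostname.endsWith('.' + TRUSTED_HOST);
}

const TRUSTED_ORIGIN = 'https://wetest.qq.com';

// postMessage handler core: ignore messages whose origin is not the trusted one.
function readTrustedMessage(event) {
  if (event.origin !== TRUSTED_ORIGIN) {
    return null;
  }
  return event.data;
}

console.log(isAllowedRedirect('https://wetest.qq.com/lab'));   // true
console.log(isAllowedRedirect("javascript:alert('xss')"));     // false
console.log(isAllowedRedirect('https://qq.com.evil.example')); // false
```

Parsing with URL and comparing the hostname avoids the classic mistake of checking whether the raw string merely contains the trusted domain. In a browser, `readTrustedMessage` would be called from a `message` event listener.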

2.2 Other Security Defenses

1. Set the HttpOnly flag on cookies

2. Use Content Security Policy in the HTTP header
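
As a rough illustration of what these two measures look like on the wire, the sketch below builds the relevant response headers. `buildSecurityHeaders`, the cookie name `sid`, and the specific policy string are assumptions; a real policy has to be tuned to the resources a site actually loads.

```javascript
// Build response headers that apply both defenses.
function buildSecurityHeaders(sessionId) {
  return {
    // HttpOnly keeps the cookie out of reach of document.cookie,
    // so an injected script cannot read the session id.
    'Set-Cookie': 'sid=' + encodeURIComponent(sessionId) + '; HttpOnly; Secure; Path=/',
    // Only same-origin scripts may run; inline scripts and plugins are blocked.
    'Content-Security-Policy': "default-src 'self'; script-src 'self'; object-src 'none'",
  };
}

console.log(buildSecurityHeaders('abc123'));
```

With Node's built-in http module, the returned object could be passed to `res.writeHead(200, buildSecurityHeaders(id))`. Neither header removes an XSS flaw; both limit what a payload can do once it runs.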


3. Code Review

Compile an XSS checklist for self-testing and reviewing code.


4. Tools for Automated Detection of XSS Vulnerabilities

Detecting XSS vulnerabilities by hand takes a lot of time. Can we write a set of tools to detect XSS automatically? Once the injection points, execution points, and payloads are understood, automating the process is entirely feasible.


The difficulty of automated XSS detection lies in DOM XSS. Front-end JS is highly complex, so neither static code analysis nor dynamic execution analysis is easy.
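
For the HTML-based case, the core of such a tool can be as simple as sending a unique marker through an injection point and checking whether it comes back unescaped. The sketch below shows only that idea under simplifying assumptions: `reflectsUnescaped` is a hypothetical name, and `render` stands in for "send the request and return the response HTML".

```javascript
// Probe one injection point: does a tag-like marker come back verbatim?
function reflectsUnescaped(render) {
  // A random marker avoids false positives from text already on the page.
  const marker = 'xss' + Math.random().toString(36).slice(2);
  const probe = '<' + marker + '>';
  return render(probe).includes(probe);
}

// Two stand-in pages: one echoes input raw, the other escapes angle brackets.
const vulnerablePage = (q) => '<p>No results for ' + q + '</p>';
const escapedPage = (q) =>
  '<p>No results for ' + q.replace(/</g, '&lt;').replace(/>/g, '&gt;') + '</p>';

console.log(reflectsUnescaped(vulnerablePage)); // true
console.log(reflectsUnescaped(escapedPage));    // false
```

A check like this sees only what the server returns. It cannot find DOM-based XSS, where the payload never appears in the response HTML, which is the hard part described above.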



Part Six: Summary



That was a lot of text, and reading it is tiring, so here it is in one sentence: security outweighs everything, so never leave it to chance. I hope the content above is helpful. It reflects only my personal understanding; if you see it differently, discussion is welcome.


