SARIF in Practice: Implementing Deeper Requirements

Abstract: To reduce the cost and complexity of aggregating the results of various analysis tools into a common workflow, the industry has begun adopting the Static Analysis Results Interchange Format (SARIF).

This article is shared from the Huawei Cloud Community article “A Bridge Between DevSecOps Tools and Platforms: SARIF Advanced”; the original author is uncle_Tom.

1. Introduction

DevSecOps has become an important model for building security into enterprise R&D. Integrating static scanning tools into the DevSecOps development process plays an important role in raising the overall security level of a product. To maximize the coverage of security checks, development teams often introduce several security scanning tools, but this also creates more problems for developers and platforms. To reduce the cost and complexity of aggregating the results of various analysis tools into a common workflow, the industry has begun adopting the Static Analysis Results Interchange Format (SARIF). This article is the advanced part of an introduction to applying SARIF; it covers how deeper requirements are met when SARIF is put to use. For a basic introduction, see “A Bridge Between DevSecOps Tools and Platforms: Getting Started with SARIF”.

2. Advanced SARIF

Last time we looked at some basic uses of SARIF. Here we look at its use in more complex scenarios, so that static scanning tools can be given a complete reporting solution.

Among the new features in version 2021.03 of Coverity, an industry-leading static analysis tool, is the ability to display Coverity scan results in SARIF format in a GitHub repository. So Coverity, too, has adopted the SARIF format.

2.1. Use of Metadata

To keep the scan report from growing too large, information that is used repeatedly should be extracted as metadata: for example, the rules, the rule messages, and the scanned content.

In the following example, the rule and its message are defined in tool.driver.rules. Each result refers to the rule by its rule number, ruleId, and obtains the alarm text through message.id. This avoids repeating the same message in every result that raises the same alarm, which reduces the size of the report.

The corresponding SARIF log, which VS Code can display, is as follows:

{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {
                  "text": "This is the message text. It might be very long."
                }
              }
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CS0001",
          "ruleIndex": 0,
          "message": {
            "id": "default"
          }
        }
      ]
    }
  ]
}
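
The lookup described above can be sketched from the consumer's side. The following Python is a minimal illustration, not part of any SARIF SDK: the function name resolve_message_text is hypothetical, and the log is a compact copy of the example above.

```python
import json

# Compact copy of the example log above.
SARIF_LOG = json.loads("""
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {"text": "This is the message text. It might be very long."}
              }
            }
          ]
        }
      },
      "results": [
        {"ruleId": "CS0001", "ruleIndex": 0, "message": {"id": "default"}}
      ]
    }
  ]
}
""")


def resolve_message_text(run, result):
    """Return a result's message text, following message.id into the rule metadata."""
    message = result["message"]
    if "text" in message:
        # The text is stored inline, so no metadata lookup is needed.
        return message["text"]
    rules = run["tool"]["driver"].get("rules", [])
    index = result.get("ruleIndex", -1)
    if 0 <= index < len(rules):
        rule = rules[index]
    else:
        # No usable ruleIndex: fall back to searching by ruleId.
        rule = next(r for r in rules if r["id"] == result["ruleId"])
    return rule["messageStrings"][message["id"]]["text"]


run = SARIF_LOG["runs"][0]
texts = [resolve_message_text(run, result) for result in run["results"]]
print(texts[0])
```

However many results share rule CS0001, the long message text is stored only once, in the rule metadata.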

2.2. Use of message parameters

The alarms in a scan result often need to name the specific variable or function involved in the code problem, so that the user can understand the issue. For this, variable defect messages can be built with message parameters.

In the following example, the rule's message is a template containing the placeholder “{0}”, and the result supplies the values through the arguments array. VS Code displays the following log:

{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {
                  "text": "Variable '{0}' was used without being initialized."
                }
              }
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CS0001",
          "ruleIndex": 0,
          "message": {
            "id": "default",
            "arguments": [
              "x"
            ]
          }
        }
      ]
    }
  ]
}
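
How a viewer might fill in the placeholders can be sketched as follows. This is a simplified illustration: the function name format_message is hypothetical, and the sketch handles only numbered placeholders such as {0}, not escaped literal braces.

```python
import re

# Matches numbered placeholders such as {0} or {12}.
PLACEHOLDER = re.compile(r"\{(\d+)\}")


def format_message(template, arguments):
    """Replace each {n} in the template with arguments[n]."""
    def substitute(match):
        return arguments[int(match.group(1))]
    return PLACEHOLDER.sub(substitute, template)


template = "Variable '{0}' was used without being initialized."
formatted = format_message(template, ["x"])
print(formatted)
```

One template can thus serve every result of the rule; only the short arguments array differs from result to result.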

2.3. Use of associated information in messages

In some cases, explaining the cause of an alarm requires giving the user additional reference information, such as the location where a variable is defined, the entry point of a taint source, or other auxiliary information.

In the following example, the relatedLocations property of the result gives the location where the taint source was introduced. In VS Code, when the user clicks “here”, the tool jumps to where the variable expr was introduced.

{
  "ruleId": "PY2335",
  "message": {
    "text": "Use of tainted variable 'expr' (which entered the system [here](1)) in the insecure function 'eval'."
  },
  "locations": [
    {
      "physicalLocation": {
        "artifactLocation": {
          "uri": "3-Beyond-basics/bad-eval.py"
        },
        "region": {
          "startLine": 4
        }
      }
    }
  ],
  "relatedLocations": [
    {
      "id": 1,
      "message": {
        "text": "The tainted data entered the system here."
      },
      "physicalLocation": {
        "artifactLocation": {
          "uri": "3-Beyond-basics/bad-eval.py"
        },
        "region": {
          "startLine": 3
        }
      }
    }
  ]
}
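
The link in the message text, [here](1), refers to the related location whose id is 1. A viewer could resolve it roughly as in the Python sketch below. The function name resolve_embedded_links is hypothetical, and the sketch assumes link targets are integer ids only.

```python
import re

# The result from the example above, trimmed to the parts the sketch needs.
result = {
    "ruleId": "PY2335",
    "message": {
        "text": "Use of tainted variable 'expr' (which entered the system "
                "[here](1)) in the insecure function 'eval'."
    },
    "relatedLocations": [
        {
            "id": 1,
            "message": {"text": "The tainted data entered the system here."},
            "physicalLocation": {
                "artifactLocation": {"uri": "3-Beyond-basics/bad-eval.py"},
                "region": {"startLine": 3},
            },
        }
    ],
}

# Matches embedded links of the form [link text](integer id).
EMBEDDED_LINK = re.compile(r"\[([^\]]+)\]\((\d+)\)")


def resolve_embedded_links(result):
    """Return (link text, uri, start line) for each embedded link in the message."""
    by_id = {loc["id"]: loc for loc in result.get("relatedLocations", []) if "id" in loc}
    links = []
    for match in EMBEDDED_LINK.finditer(result["message"]["text"]):
        label, target_id = match.group(1), int(match.group(2))
        physical = by_id[target_id]["physicalLocation"]
        links.append((label, physical["artifactLocation"]["uri"], physical["region"]["startLine"]))
    return links


links = resolve_embedded_links(result)
print(links)
```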

2.4. Use of defect classification information

Defect classification matters both to the tools and to the analysis of scan results. A tool can manage its rules by defect category, which makes it easier for users to select the rules they need. And when users read an analysis report, they can filter the results quickly by category. SARIF lets a tool refer to an industry standard, such as the Common Weakness Enumeration (CWE), or define categories of its own.

Example of defect classification:

{
  "version": "2.1.0",
  "runs": [
    {
      "taxonomies": [
        {
          "name": "CWE",
          "version": "3.2",
          "releaseDateUtc": "2019-01-03",
          "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82",
          "informationUri": "https://cwe.mitre.org/data/published/cwe_v3.2.pdf/",
          "downloadUri": "https://cwe.mitre.org/data/xml/cwec_v3.2.xml.zip",
          "organization": "MITRE",
          "shortDescription": {
            "text": "The MITRE Common Weakness Enumeration"
          },
          "taxa": [
            {
              "id": "401",
              "guid": "10F28368-3A92-4396-A318-75B9743282F6",
              "name": "Memory Leak",
              "shortDescription": {
                "text": "Missing Release of Memory After Effective Lifetime"
              },
              "defaultConfiguration": {
                "level": "warning"
              }
            }
          ],
          "isComprehensive": false
        }
      ],
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "supportedTaxonomies": [
            {
              "name": "CWE",
              "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
            }
          ],
          "rules": [
            {
              "id": "CA2101",
              "shortDescription": {
                "text": "Failed to release dynamic memory."
              },
              "relationships": [
                {
                  "target": {
                    "id": "401",
                    "guid": "10F28368-3A92-4396-A318-75B9743282F6",
                    "toolComponent": {
                      "name": "CWE",
                      "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
                    }
                  },
                  "kinds": [
                    "superset"
                  ]
                }
              ]
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CA2101",
          "message": {
            "text": "Memory allocated in variable 'p' was not released."
          },
          "taxa": [
            {
              "id": "401",
              "guid": "10F28368-3A92-4396-A318-75B9743282F6",
              "toolComponent": {
                "name": "CWE",
                "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
              }
            }
          ]
        }
      ]
    }
  ]
}

2.4.1. Introduction of industry classification standards (runs.taxonomies)

The definition of taxonomies

 "taxonomies": {
    "description": "An array of toolComponent objects relevant to a taxonomy in which results are categorized.",
    "type": "array",
    "minItems": 0,
    "uniqueItems": true,
    "default": [],
    "items": {
      "$ref": "#/definitions/toolComponent"
    }
  },

The taxonomies node is an array, so multiple classification standards can be defined. Its items refer to the toolComponent definition under the schema's definitions node, the same definition used for the tool's scanning engine (tool.driver) and the tool extensions that we saw earlier. The reason for this design is the strong correlation between the engine and the results; sharing the definition keeps their properties consistent.

In the example, the industry classification standard CWE is declared through the runs.taxonomies node. Within a taxonomies element, the standard is described by property nodes. Only a sample is listed below; for details, refer to the SARIF specification:
name: the name of the standard;
version: the version;
releaseDateUtc: the release date;
guid: a unique identifier, so that the standard can be referenced elsewhere;
informationUri: the address of the documentation for the standard;
downloadUri: the download address;
organization: the organization that publishes the standard;
shortDescription: a short description of the standard.

2.4.2. Introduction of Custom Classification (runs.taxonomies.taxa)

taxa is an array node. To keep the report small, it is not necessary to list every classification under the taxa node; listing the classifications relevant to this scan is enough. This is why the isComprehensive node defaults to false.

The example introduces one taxon that the tool needs, the CWE-401 memory leak, and identifies it uniquely with a guid and an id so that the tool can refer to it later from a rule or a defect.

2.4.3. Associating the Tool with the Industry Classification Standard (tool.driver.supportedTaxonomies)

The tool object is associated with the industry classification standard through the tool.driver.supportedTaxonomies node. Each element of the supportedTaxonomies array is a toolComponentReference object, because taxonomies are themselves toolComponent objects. The toolComponentReference.guid property matches the guid property of the taxonomy object defined in run.taxonomies[].

In the example, supportedTaxonomies[0].name is CWE, which states that the tool supports the CWE taxonomy, and its guid refers to the guid of taxonomies[0], A9282C88-F1FE-4A01-8137-E8D2A037AB82, associating the tool with the industry classification standard CWE.

2.5. Rule and Defect Classification Associations

Rules are defined under the tool.driver.rules node, an array whose elements are reportingDescriptor objects.
The relationships property of each rule (reportingDescriptor) is an array whose elements are reportingDescriptorRelationship objects. Each establishes a relationship from the rule to another reportingDescriptor object. The target of a relationship can be a taxon in a taxonomy (as in this example) or a rule in another tool component;
The target property of a reportingDescriptorRelationship identifies the target of the relationship. Its value is a reportingDescriptorReference object, which refers to a reportingDescriptor within a toolComponent;
The toolComponent property of the reportingDescriptorReference is a toolComponentReference object, which points to the taxonomy declared in the tool's supportedTaxonomies.

In the example, the association runs from rule CA2101 through relationships[0].target (id 401) and its toolComponent reference (CWE) to the taxon CWE-401 defined in runs[0].taxonomies[0].taxa.

2.5.1. Classification in Scan Results (result.taxa)

In the scan results (run.results), each result has a taxa property, an array whose elements are reportingDescriptorReference objects that specify the classification of the defect. This is the same way a rule is mapped to a classification. It also follows that taxa can be omitted from a result, since the classification of the defect can be obtained through its rule.
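
The two routes to a classification, directly through result.taxa or indirectly through the rule's relationships, can be sketched in Python as follows. This is an illustration under stated assumptions: the function names find_taxon and classify are hypothetical, the run is a trimmed version of the example, and references are matched by taxonomy name and taxon id only (guid matching is left out for brevity).

```python
# Trimmed version of the example run; the result deliberately omits taxa.
run = {
    "taxonomies": [
        {
            "name": "CWE",
            "taxa": [{"id": "401", "name": "Memory Leak"}],
        }
    ],
    "tool": {
        "driver": {
            "name": "CodeScanner",
            "rules": [
                {
                    "id": "CA2101",
                    "relationships": [
                        {
                            "target": {"id": "401", "toolComponent": {"name": "CWE"}},
                            "kinds": ["superset"],
                        }
                    ],
                }
            ],
        }
    },
    "results": [
        {
            "ruleId": "CA2101",
            "message": {"text": "Memory allocated in variable 'p' was not released."},
        }
    ],
}


def find_taxon(run, reference):
    """Follow a reportingDescriptorReference to (taxonomy name, taxon), or None."""
    component_name = reference["toolComponent"]["name"]
    for taxonomy in run.get("taxonomies", []):
        if taxonomy["name"] != component_name:
            continue
        for taxon in taxonomy.get("taxa", []):
            if taxon["id"] == reference["id"]:
                return taxonomy["name"], taxon
    return None


def classify(run, result):
    """Return labels such as 'CWE-401 (Memory Leak)' for one result."""
    references = list(result.get("taxa", []))
    if not references:
        # result.taxa is omitted: fall back to the relationships of the result's rule.
        for rule in run["tool"]["driver"].get("rules", []):
            if rule["id"] == result["ruleId"]:
                references = [rel["target"] for rel in rule.get("relationships", [])]
    labels = []
    for reference in references:
        found = find_taxon(run, reference)
        if found:
            taxonomy_name, taxon = found
            labels.append(f"{taxonomy_name}-{taxon['id']} ({taxon['name']})")
    return labels


labels = classify(run, run["results"][0])
print(labels)
```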

2.6. Code Flow

Some tools detect problems by simulating program execution, sometimes across multiple threads of execution. SARIF represents this execution as a sequence of locations called a code flow. A code flow contains one or more thread flows, each of which lists, in chronological order, the code locations visited on a single thread of execution.

2.6.1. Defect Code Flow Group (result.codeFlows)

Since a defect may have more than one code flow, the optional result.codeFlows property is an array of codeFlow objects.

"result": {
  "description": "A result produced by an analysis tool.",
  "additionalProperties": false,
  "type": "object",
  "properties": {
    ...
    "codeFlows": {
      "description": "An array of 'codeFlow' objects relevant to the result.",
      "type": "array",
      "minItems": 0,
      "uniqueItems": false,
      "default": [],
      "items": {
        "$ref": "#/definitions/codeFlow"
      }
    },
  }
}

2.6.2. Code Flow Thread Flow Group (codeFlow.threadFlows)

As the definition of codeFlow shows, each code flow consists of a group of thread flows (threadFlows), which is required and must contain at least one element.

 "codeFlow": {
      "description": "A set of threadFlows which together describe a pattern of code execution relevant to detecting a result.",
      "additionalProperties": false,
      "type": "object",
      "properties": {

        "message": {
          "description": "A message relevant to the code flow.",
          "$ref": "#/definitions/message"
        },

        "threadFlows": {
          "description": "An array of one or more unique threadFlow objects, each of which describes the progress of a program through a thread of execution.",
          "type": "array",
          "minItems": 1,
          "uniqueItems": false,
          "items": {
            "$ref": "#/definitions/threadFlow"
          }
        },
      },

      "required": [ "threadFlows" ]
    },

2.6.3. ThreadFlow and ThreadFlowLocation

Within each threadFlow, an array of locations describes the process of code analysis by the tool.

Definition of ThreadFlow:

 "threadFlow": {
      "description": "Describes a sequence of code locations that specify a path through a single thread of execution such as an operating system or fiber.",
      "type": "object",
      "additionalProperties": false,
      "properties": {

        "id": {
        ...

        "message": {
        ...  

        "initialState": {
        ...

        "immutableState": {
        ...

        "locations": {
          "description": "A temporally ordered array of 'threadFlowLocation' objects, each of which describes a location visited by the tool while producing the result.",
          "type": "array",
          "minItems": 1,
          "uniqueItems": false,
          "items": {
            "$ref": "#/definitions/threadFlowLocation"
          }
        },

        "properties": {
        ...
      },

      "required": [ "locations" ]
    },
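
The two minItems constraints above (at least one thread flow per code flow, at least one location per thread flow) can be checked by a consumer before it renders a result. A minimal Python sketch follows, using hand-written checks rather than a JSON Schema library; the function name check_code_flows is hypothetical.

```python
def check_code_flows(result):
    """Return a list of problems found in result.codeFlows (empty if none)."""
    problems = []
    for i, code_flow in enumerate(result.get("codeFlows", [])):
        thread_flows = code_flow.get("threadFlows", [])
        if not thread_flows:
            # codeFlow.threadFlows is required and has minItems 1.
            problems.append(f"codeFlows[{i}]: threadFlows must contain at least one item")
        for j, thread_flow in enumerate(thread_flows):
            if not thread_flow.get("locations"):
                # threadFlow.locations is required and has minItems 1.
                problems.append(
                    f"codeFlows[{i}].threadFlows[{j}]: locations must contain at least one item"
                )
    return problems


good = {"codeFlows": [{"threadFlows": [{"locations": [{"location": {}}]}]}]}
bad = {"codeFlows": [{"threadFlows": []}, {"threadFlows": [{"locations": []}]}]}

good_problems = check_code_flows(good)
bad_problems = check_code_flows(bad)
print(good_problems)
print(bad_problems)
```

A result without codeFlows passes the check, because the property itself is optional.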

threadFlowLocation: each element of the locations array is a threadFlowLocation, which represents one code location visited by the tool. The location being analyzed is given by its location property, of type location. Because a location can carry both physical and logical location information, a code flow can also represent the analysis flow of a binary.

The threadFlowLocation also has a state property, which can store variables, expression values, or symbol table information, or can be used to represent a state machine.

"threadFlowLocation": {
  "description": "A location visited by an analysis tool while simulating or monitoring the execution of a program.",
  "additionalProperties": false,
  "type": "object",
  "properties": {
    "index": {
      "description": "The index within the run threadFlowLocations array.",
      ...
    "location": {
      "description": "The code location.",
      "$ref": "#/definitions/location"
    },
    "state": {
      "description": "A dictionary, each of whose keys specifies a variable or expression, the associated value of which represents the variable or expression value. For an annotation of kind 'continuation', for example, this dictionary might hold the current assumed values of a set of global variables.",
      "type": "object",
      "additionalProperties": {
        "$ref": "#/definitions/multiformatMessageString"
      }
    },
    ...
  }
},

2.6.4. Code flow sample

Reference code

1. # 3-Beyond-basics/bad-eval-with-code-flow.py
2.
3. print("Hello, world!")
4. expr = input("Expression> ")
5. use_input(expr)
6.
7. def use_input(raw_input):
8.     print(eval(raw_input))

The code above is an example of code injection in Python.

In line 4, user input is assigned to the variable expr;
In line 5, expr is passed to use_input as the function's first argument;
In line 8, print outputs the result, but the input parameter is first processed by eval(). Because the parameter is never checked after it is read, passing it straight to eval() may introduce a code injection vulnerability.

This analysis process can be represented by the following scan results, so that users can understand the process of the problem.

Scan results

{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "PythonScanner"
        }
      },
      "results": [
        {
          "ruleId": "PY2335",
          "message": {
            "text": "Use of tainted variable 'raw_input' in the insecure function 'eval'."
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                },
                "region": {
                  "startLine": 8
                }
              }
            }
          ],
          "codeFlows": [
            {
              "message": {
                "text": "Tracing the path from user input to insecure usage."
              },
              "threadFlows": [
                {
                  "locations": [
                    {
                      "message": {
                        "text": "The tainted data enters the system here."
                      },
                      "location": {
                        "physicalLocation": {
                          "artifactLocation": {
                            "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                          },
                          "region": {
                            "startLine": 4
                          }
                        }
                      },
                      "state": {
                        "expr": {
                          "text": "42"
                        }
                      },
                      "nestingLevel": 0
                    },
                    {
                      "message": {
                        "text": "The tainted data is used insecurely here."
                      },
                      "location": {
                        "physicalLocation": {
                          "artifactLocation": {
                            "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                          },
                          "region": {
                            "startLine": 8
                          }
                        }
                      },
                      "state": {
                        "raw_input": {
                          "text": "42"
                        }
                      },
                      "nestingLevel": 1
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

This is just a simple example. With SARIF's code flows, more complex analysis processes can be represented, so that users can better understand the problem and judge and fix it quickly.
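
To show how a consumer might present such a code flow, the Python sketch below turns the thread flow locations of the example into an indented trace, using nestingLevel for indentation and state for variable values. The helper names step and describe_code_flows are hypothetical, and the data is a trimmed copy of the scan result above.

```python
URI = "3-Beyond-basics/bad-eval-with-code-flow.py"


def step(line, text, state, nesting_level):
    """Build one threadFlowLocation shaped like those in the example."""
    return {
        "message": {"text": text},
        "location": {
            "physicalLocation": {
                "artifactLocation": {"uri": URI},
                "region": {"startLine": line},
            }
        },
        "state": {name: {"text": value} for name, value in state.items()},
        "nestingLevel": nesting_level,
    }


result = {
    "codeFlows": [
        {
            "message": {"text": "Tracing the path from user input to insecure usage."},
            "threadFlows": [
                {
                    "locations": [
                        step(4, "The tainted data enters the system here.", {"expr": "42"}, 0),
                        step(8, "The tainted data is used insecurely here.", {"raw_input": "42"}, 1),
                    ]
                }
            ]
        }
    ]
}


def describe_code_flows(result):
    """Render each code flow as a heading followed by one indented line per step."""
    lines = []
    for code_flow in result.get("codeFlows", []):
        lines.append(code_flow.get("message", {}).get("text", "(code flow)"))
        for thread_flow in code_flow["threadFlows"]:
            for location in thread_flow["locations"]:
                physical = location["location"]["physicalLocation"]
                indent = "  " * (location.get("nestingLevel", 0) + 1)
                state = ", ".join(
                    f"{name}={value['text']}" for name, value in location.get("state", {}).items()
                )
                lines.append(
                    f"{indent}{physical['artifactLocation']['uri']}:"
                    f"{physical['region']['startLine']} {location['message']['text']} ({state})"
                )
    return lines


lines = describe_code_flows(result)
print("\n".join(lines))
```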

2.7. Defect Fingerprints

In large software projects, analysis tools can produce thousands of results at a time. To handle that many results, defect management needs to record the existing defects, establish a scan baseline, and then work through the existing problems. In later scans, the new results must be compared with the baseline to tell whether new problems have been introduced. To decide whether a result from a later run is logically the same as one in the baseline, an algorithm must construct a stable identifier from the distinguishing information contained in the defect result. We call this identifier the defect fingerprint: it captures the characteristics of a defect and distinguishes it from other defects.

The defect fingerprint should be built from relatively stable defect information:

The name of the tool that produced the result;
The rule number;
The file system path of the analysis target. The path should be relative to the project itself and should not include the project's location, because each machine may store the project in a different place;
The defect feature values (partialFingerprints).

Each result in SARIF provides a group of such properties for storing defect fingerprints, so that a defect management system can use them to identify a defect uniquely.

"result": {
  "description": "A result produced by an analysis tool.",
  "additionalProperties": false,
  "type": "object",
  "properties": {
    ...
    "guid": {
      "description": "A stable, unique identifier for the result in the form of a GUID.",
      "type": "string",
      "pattern": "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
    },
    "correlationGuid": {
      "description": "A stable, unique identifier for the equivalence class of logically identical results to which this result belongs, in the form of a GUID.",
      "type": "string",
      "pattern": "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
    },
    "occurrenceCount": {
      "description": "A positive integer specifying the number of times this logically unique result was observed in this run.",
      "type": "integer",
      "minimum": 1
    },
    "partialFingerprints": {
      "description": "A set of strings that contribute to the stable, unique identity of the result.",
      "type": "object",
      "additionalProperties": {
        "type": "string"
      }
    },
    "fingerprints": {
      "description": "A set of strings each of which individually defines a stable, unique identity for the result.",
      "type": "object",
      "additionalProperties": {
        "type": "string"
      }
    },
    ...
  }
}

In some cases, the inherent information of a defect is not enough to identify the result uniquely. We then need to add property values that are strongly related to the defect as extra input to the fingerprint calculation, so that the resulting fingerprint is unique. This is a little like the salt used in cryptographic hashing, except that here the added value must be reproducible: the next scan must obtain the same input for the same defect and therefore the same fingerprint as last time. For example, when a tool checks a document for sensitive words, the warning message is “XXX should not be used in the document.” Here the word itself can serve as a feature value for the defect.

SARIF provides the partialFingerprints property to hold such feature values, so that analysis tools and other components of the SARIF ecosystem can use this information. A defect management system can add it to the fingerprint it constructs for each result. In the previous example, the tool would set a property in the partialFingerprints object to the prohibited word, and the defect management system should include this information in its fingerprint calculation.

Only properties that are strongly related to the defect, and whose values are relatively stable, should be added to partialFingerprints. The line number of a defect, for example, is not suitable as an input to the fingerprint calculation, because it changes often: by the next scan a developer may have added or removed lines above the problem, so the same problem is reported at a different line, the calculated fingerprint changes, and the comparison reports a difference.
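
One way a defect management system might combine these inputs is sketched below in Python. SARIF does not prescribe a fingerprint algorithm; hashing with SHA-256 is an assumption made for this sketch, and the function name compute_fingerprint, the tool name DocScanner, the rule DOC0001, the path docs/guide.md, and the key prohibitedWord/v1 are all hypothetical.

```python
import hashlib


def compute_fingerprint(tool_name, rule_id, relative_path, partial_fingerprints):
    """Hash the stable properties of a defect into one identifier.

    The line number is deliberately not an input, so the fingerprint
    survives edits that only move the defect within the file.
    """
    parts = [tool_name, rule_id, relative_path]
    # Sort the keys so that dictionary ordering cannot change the hash.
    parts += [f"{key}={partial_fingerprints[key]}" for key in sorted(partial_fingerprints)]
    # Join with the ASCII unit separator so adjacent fields cannot run together.
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8"))
    return digest.hexdigest()


# The same prohibited word reported by two scans, regardless of its line number.
first_scan = compute_fingerprint(
    "DocScanner", "DOC0001", "docs/guide.md", {"prohibitedWord/v1": "XXX"}
)
later_scan = compute_fingerprint(
    "DocScanner", "DOC0001", "docs/guide.md", {"prohibitedWord/v1": "XXX"}
)
# A different prohibited word in the same file.
other_word = compute_fingerprint(
    "DocScanner", "DOC0001", "docs/guide.md", {"prohibitedWord/v1": "YYY"}
)

print(first_scan == later_scan)
print(first_scan == other_word)
```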

Even when we look for a uniquely identifying feature for each defect and add such feature properties, it is hard to design an algorithm that yields a truly stable fingerprint. For example, if a sensitive word appears several times in the same file, we cannot give each warning a unique identifier. The function name can be added as a factor in the fingerprint calculation, since function names are relatively stable within a program. It helps to distinguish occurrences of the same problem in different parts of a file, but several identical defects can still occur within one function. So however hard we try to tell alarms apart, identical defect fingerprints will still occur in real scans.

Fortunately, fingerprints do not have to be absolutely stable for practical purposes. They only need to be stable enough to keep the number of results wrongly reported as “new” low enough that the development team can handle them without much effort.
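
Because identical fingerprints can occur, a baseline comparison is better done on counts than on sets. The Python sketch below illustrates this with collections.Counter; the function name compare_with_baseline, the category labels, and the fingerprint strings are hypothetical.

```python
from collections import Counter


def compare_with_baseline(baseline_fingerprints, current_fingerprints):
    """Split fingerprints into new, absent, and unchanged, respecting duplicates."""
    baseline = Counter(baseline_fingerprints)
    current = Counter(current_fingerprints)
    return {
        # Occurrences in the current scan beyond those in the baseline.
        "new": sorted((current - baseline).elements()),
        # Occurrences in the baseline that the current scan no longer reports.
        "absent": sorted((baseline - current).elements()),
        # Occurrences present in both scans.
        "unchanged": sorted((current & baseline).elements()),
    }


baseline = ["fp-a", "fp-a", "fp-b"]
current = ["fp-a", "fp-a", "fp-a", "fp-c"]
comparison = compare_with_baseline(baseline, current)
print(comparison)
```

With a plain set comparison, the third occurrence of fp-a would go unnoticed; counting occurrences reports it as new even though the sketch cannot say which of the three it is.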

3. Summary

SARIF provides a general, standard output format for static scanning tools that can meet their various report output requirements;
As static scanning tools are integrated into DevSecOps platforms, SARIF will reduce the cost and complexity of aggregating scan results into a common workflow;
SARIF also makes it possible for an IDE to integrate the results of various scans and provide a unified defect handling module that displays defects, supports fixes, and so on. This lets tool developers focus on finding problems and reduces the work of adapting to various IDEs;
SARIF has become an OASIS standard and is supported in their tools by Microsoft, GrammaTech, and other important static scanning tool vendors. The U.S. DHS and U.S. NIST also require SARIF as the scan report format in some static analysis tool evaluations and competitions;
SARIF was designed primarily for static scan results, but thanks to its generality, some dynamic analysis tool vendors have also used it successfully.

4. References

Industry leaders collaborate to define SARIF interoperability standard for detecting software defects and vulnerabilities
OASIS Awards 2018 Open Standards Cup to KMIP for Key Management Security and SARIF for Static Analysis Tools
OASIS Static Analysis Results Interchange Format (SARIF) Technical Committee
SARIF Specification
SARIF Tutorials
Vscode Extension: Sarif Viewer
SARIF-SDK
Fortify FPR to SARIF
GrammaTech SARIF integration for GitHub
Static Analysis Results: A Format and a Protocol: SARIF & SASP
On Language Server & LSIF & Sarif & Babelfish & Semantic & Tree-sitter & Kythe & Glean etc
