Preaching to the Converted – Breaching AWS using PDF4ML

Application security is, now more than ever, paramount to an organisation’s security posture. Verizon’s 2020 Data Breach Investigations Report found that 43% of security breaches originated from web applications, double the figure for 2019.

Serious application security flaws that often expose sensitive data and underlying infrastructure, such as SQL injection, code execution and command injection, are (or should be) well known to security engineers, developers, and management personnel. However, subtler and less publicised flaws, such as those in HTML to PDF document conversion functions, can also have a serious impact on the security of an application.

Applications that need to export datasets into universally accepted formats, which is pretty much all modern applications, rely on common frameworks and open-source libraries. For instance, an application may need to convert data that a user enters into an HTML form to a PDF, which is then sent to clients or customers. The potential security impact of this scenario, where HTML data is converted into a PDF document, is the focus of this article.

Application flaws resulting from HTML to PDF (and other) conversion functions are often overlooked because they are mistakenly viewed as benign, relatively trivial functionality. However, when the libraries handling the underlying data conversion do not adequately sanitise input data, attackers can achieve Cross-Site Scripting (XSS), Server-Side Request Forgery (SSRF) and, in some cases, code execution.

It is likely that whenever a user clicks ‘Export to PDF’ in your application, form data previously supplied by that user is converted into a PDF document. Another common scenario is when raw HTML is captured by a WYSIWYG editor. The formatted content captured by the editor is usually sent as HTML markup in the request body.

If a user were to complete a ‘messages’ field in such an editor, the resulting request to the application server would typically carry the field's content as HTML markup.
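
As a rough illustration of such a request, the Python sketch below builds a form-encoded body for the ‘messages’ field. The editor output and the choice of form encoding are assumptions made for illustration; a real application may send JSON or multipart data instead.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical WYSIWYG output: the editor serialises formatting as HTML.
editor_html = "<p>Hello <b>world</b></p>"

# The markup travels to the server as an ordinary form field.
body = urlencode({"messages": editor_html})
print(body)

# Server side, decoding the body recovers the markup unchanged,
# ready to be handed to whatever performs the PDF conversion.
received = parse_qs(body)["messages"][0]
print(received)
```

The point of the sketch is that nothing in transit neutralises the markup: whatever the user typed reaches the conversion step intact unless the application intervenes.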

More broadly, applications expect to convert simple strings of text or HTML markup to PDF but fail to adequately sanitise user input.
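
Where a field is only ever meant to hold plain text, one possible safeguard is to escape it before it is embedded in the HTML handed to the converter. The sketch below assumes a hypothetical `render_export_html` helper and a trivial template; it is not taken from any particular application or library.

```python
import html


def render_export_html(message: str) -> str:
    """Embed user text in the HTML passed to a PDF converter, escaped."""
    # html.escape turns <, >, & and quotes into entities, so the converter
    # sees literal text rather than tags it might parse and execute.
    safe = html.escape(message, quote=True)
    return f"<html><body><p>{safe}</p></body></html>"


print(render_export_html("<script>document.write('x')</script>"))
```

Escaping does not suit fields that legitimately carry rich HTML from a WYSIWYG editor; those need a sanitiser that keeps only an explicit allowlist of tags and attributes.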

So how can attackers compromise AWS tenancies via vulnerable PDF export functionality? In a nutshell, if an attacker manages to supply HTML/JavaScript that the application subsequently parses when exporting a PDF document, the JavaScript will execute on the server side and, depending on the payload, can result in Server-Side Request Forgery (SSRF). SSRF is a data validation flaw that lets an attacker make the application issue arbitrary, attacker-controlled requests to retrieve a resource from another domain or host. A common attack vector for SSRF flaws in AWS is to access the AWS instance metadata service and obtain temporary credentials for the underlying EC2 instance running the application.
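
From the defender's side, the SSRF risk described above can be reduced by checking where a URL points before a server-side renderer is allowed to fetch it. The following is a minimal sketch under stated assumptions: `is_risky_destination` is a hypothetical helper, and it only inspects literal IP addresses. AWS documents the instance metadata service at a link-local address, which is why link-local destinations are rejected here.

```python
import ipaddress
from urllib.parse import urlsplit


def is_risky_destination(url: str) -> bool:
    """Return True if a server-side renderer should not fetch this URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return True  # e.g. file:// or other non-web schemes
    host = parts.hostname
    if host is None:
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A hostname rather than a literal IP. A real check would resolve
        # it and test every resulting address; this sketch does not.
        return False
    return addr.is_link_local or addr.is_private or addr.is_loopback


print(is_risky_destination("https://example.com/logo.png"))
print(is_risky_destination("file:///etc/passwd"))
```

A production control would also need to handle redirects and DNS rebinding, and is usually best paired with network-level egress restrictions rather than relied on alone.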

Aurian conducted two recent application penetration tests in which vulnerable HTML to PDF libraries were in use. Both vulnerable applications happened to be hosted on Amazon Web Services (AWS), so Aurian executed the attack path outlined above and successfully compromised the underlying EC2 infrastructure.

Attack Methodology

To identify and exploit vulnerable HTML to PDF converters, Aurian relies on the following testing methodology:

1. Enumerate the PDF export locations within the application. Understand which form fields or application parameters the user supplies for export.
2. Export a PDF document with test data (non-malicious). Inspect the metadata of the exported PDF document. The library responsible for the HTML to PDF conversion is commonly identified here.
3. Cross-reference known vulnerable libraries (list supplied below).
4. Attempt to provide HTML markup as user input to determine if the underlying library is parsing and rendering the input in the exported PDF document. For example, if <b>testing123</b> is provided and the subsequent PDF export has ‘testing123’ in bold typeface, this is a good indication of potential problems. The same applies to a script payload such as <script>document.write('testing123')</script>: if ‘testing123’ appears in the exported PDF, the library is executing JavaScript.
5. Attempt to provide additional HTML/JavaScript payloads to trigger an arbitrary HTTP request (SSRF). Some common payloads are:

<iframe src=http://[aws metadata service]>

<img src=x onerror="location.href='http://[aws metadata service]'">

<link rel=attachment href="http://[aws metadata service]">

<object data="http://[aws metadata service]">

<portal src="http://[aws metadata service]" id=SSRF>
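
Step 2 of the methodology above, inspecting the metadata of an exported PDF, can be roughly sketched in Python. This is a heuristic that assumes the Producer and Creator entries are stored as uncompressed literal strings; the function name is hypothetical, and documents that keep metadata in compressed streams, hex strings or XMP would need a proper PDF parser.

```python
import re

# Matches simple literal-string entries such as "/Producer (Some Library 1.0)".
_INFO_RE = re.compile(rb"/(Producer|Creator)\s*\(((?:\\.|[^\\()])*)\)")


def pdf_generator_hints(pdf_bytes: bytes) -> dict:
    """Collect Producer/Creator strings found in the raw PDF bytes."""
    hints = {}
    for key, value in _INFO_RE.findall(pdf_bytes):
        # Keep the first occurrence of each key.
        hints.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return hints


# Synthetic fragment standing in for an exported file's info dictionary.
sample = b"1 0 obj\n<< /Producer (Example Converter 1.2) /Creator (Example App) >>\nendobj"
print(pdf_generator_hints(sample))
```

Against a real export, the same function would be called with the file's contents, for example the bytes read from a downloaded PDF.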

Assuming a payload succeeds, temporary credentials can be retrieved from the AWS instance metadata service. Savvy attackers will commonly enumerate the permissions assigned to the EC2 instance role and then access whatever further AWS resources the permission policy allows.

Of the two vulnerable applications Aurian tested with this methodology, application A was found to be using the Prince 10 PDF library. Using payloads similar to those shown above, Aurian successfully retrieved temporary credentials via the metadata service. Application B was running the PD4ML PDF library, and none of the above payloads executed successfully against it.

After further research on the PD4ML library and a review of existing security research, Aurian consultants used a custom HTML tag that is unique to the PD4ML library: the <pd4ml:attachment> tag, which allows a user to add attachments to a PDF document. An example payload would be:

<pd4ml:attachment description="attachment.txt" icon="something">file:///etc/passwd</pd4ml:attachment>

Aurian used this tag to attach sensitive local files such as SSH keys and database/application configuration files.
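
Because this attack relies on a converter-specific tag rather than on JavaScript, one possible pre-conversion check is to look for namespaced tags in user input. The sketch below uses Python's standard `html.parser`; the `TagCollector` class and both helper functions are hypothetical, and treating any tag name containing a colon as suspicious is an assumption, not a PD4ML rule.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag the parser recognises in the input."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)  # tag names arrive lower-cased


def find_tags(user_input: str) -> list:
    collector = TagCollector()
    collector.feed(user_input)
    collector.close()
    return collector.tags


def has_converter_tags(user_input: str) -> bool:
    """True if the input contains namespaced tags such as pd4ml:attachment."""
    return any(":" in tag for tag in find_tags(user_input))


print(find_tags("<b>testing123</b>"))
```

Detection of this kind is a supplement at best. Escaping or allowlist sanitisation of the input, together with restricting what the converter may read, is the more dependable control.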

Many commonly used PDF libraries allow unrestricted JavaScript execution by default, including the following:

Node-HTML-PDF — up to version 2.2.0

Go-wkHTML — up to version 1.5.0

DinkToPDF — up to version 1.0.8

wkHTML — up to version 0.12.5

PDFKit — up to version 0.8.2

Prince 10

PD4ML
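
Step 3 of the methodology, cross-referencing against this list, can be sketched as a simple lookup. The names and version bounds below are copied from the list above; the `is_listed` helper and the plain dotted-number version format are assumptions for illustration.

```python
# Upper bounds copied from the list above; None means the list gives no version.
AFFECTED = {
    "node-html-pdf": (2, 2, 0),
    "go-wkhtml": (1, 5, 0),
    "dinktopdf": (1, 0, 8),
    "wkhtml": (0, 12, 5),
    "pdfkit": (0, 8, 2),
    "prince 10": None,
    "pd4ml": None,
}


def parse_version(text: str) -> tuple:
    """Turn '0.12.5' into (0, 12, 5); assumes purely numeric parts."""
    return tuple(int(part) for part in text.split("."))


def is_listed(name: str, version: str = "") -> bool:
    key = name.strip().lower()
    if key not in AFFECTED:
        return False
    bound = AFFECTED[key]
    if bound is None or not version:
        return True  # no bound given, or version unknown: treat as listed
    return parse_version(version) <= bound


print(is_listed("wkHTML", "0.12.5"))
```

A match only indicates that the library warrants testing; whether JavaScript actually executes still depends on how the application configures the converter.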

Remediation and Conclusion

The implications of an attacker successfully exploiting a vulnerable HTML to PDF library can range from information disclosure to the complete compromise of the application and underlying AWS cloud tenancy. Aurian recommends organisations consider the following:

All data within a user’s control (file uploads, form fields, URL parameters, HTTP headers, etc.) should be thoroughly sanitised and strictly validated to determine whether the input is benign or malicious; that is, designed to make the application behave in unexpected ways.
For AWS tenancies, ensure that version 2 of the instance metadata service (IMDSv2) is enforced and version 1 is disabled. If the metadata service is not required, disable it entirely.
Ensure that all PDF export libraries used are the most up-to-date stable release.
If the PD4ML PDF library is in use, developers should whitelist the local directories or remote resources the library is allowed to access.
For AWS tenancies, ensure that IAM permissions are assigned according to the principle of least privilege.
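
To show why version 2 of the metadata service helps against the payloads described earlier, the sketch below builds (but does not send) the two requests that version 2 requires. The header names, token path and link-local address follow AWS documentation; the helper function names are hypothetical.

```python
import urllib.request

IMDS = "http://169.254.169.254"  # documented link-local address of the service


def build_token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: a PUT carrying a TTL header returns a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(token: str, path: str = "/latest/meta-data/") -> urllib.request.Request:
    """Step 2: every metadata read must carry the token in a header."""
    return urllib.request.Request(
        f"{IMDS}{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


req = build_token_request()
print(req.get_method(), req.full_url)
```

Tags such as iframe, img, link and object can only cause a plain GET without custom headers, so with version 1 disabled they cannot complete the first step. This raises the bar considerably but is not a substitute for fixing the injection itself.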