Pastebin Scraping: Golang vs. Python

Pastebin and scraping

Websites where you can post data anonymously are very popular in the IT world. Quickly sharing code snippets, pieces of datasets, or other information makes working together very easy. The data is also available to everyone connected to the internet, so you may even help a stranger out! Pastebin is one of the most popular paste services and is the focus of this post.

Adversaries are also very keen on using these paste sites, because they are anonymous and usually offer an API for automation. E-mail and password datasets or encoded binaries are uploaded regularly. All this together makes Pastebin and similar services an interesting source of Open Source Intelligence (OSINT) for cyber security analysts.

What can you get out of pastes, you wonder? Some interesting use cases are:

  • Analytics on pastes
  • Credential dump monitoring
  • Corporate e-mail monitoring
  • Keywords of interest monitoring
  • Malware hunting (base64 encoded binaries for the 2nd stage)

Some projects that do Pastebin scraping too:

These projects are Dockerized or expect to run in a VM. In this article we will do it with AWS serverless services and compare the performance of two programming languages that are popular for cloud deployments: Python and Go.

Why Go and Python?

When I started developing the Pastebin scraper, Python was the only programming language I knew. Not that long ago, I picked up Golang for its popularity in the cloud and its performance. Paste scraping and parsing was the perfect project to compare both languages on performance, so I later wrote the scraper in Go as well.

Let’s first look at how many pastes are created in a normal week to get a feel for the volume we are going to analyze. Below is a screenshot of the last two weeks of July 2020.

[Screenshot: pastes per day]

This shows that around 15,000 pastes are created per day, or roughly 10 per minute. Of course, some hours are busier than others, and the numbers drop on weekends. We’ll use the same timeframe to compare the performance of both programming languages.

How scraping works

The scraping API lets the developer request the latest X pastes, up to 250. We will run the collection every 2 minutes, so 50 pastes per retrieval should do: at roughly 10 pastes per minute, about 20 new pastes appear between two runs. This balance makes sure we don’t miss any new pastes, but also don’t collect too many pastes we already analyzed.
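As a rough illustration of the collection step, here is a minimal Python sketch. The endpoint URL is my assumption of how the Pastebin scraping API is exposed, so check it against the current documentation; the function names are hypothetical and not taken from the original scraper.

```python
import json
import urllib.request

# Assumed endpoint of the Pastebin scraping API; verify against the docs.
SCRAPE_URL = "https://scrape.pastebin.com/api_scraping.php"
MAX_LIMIT = 250  # upper bound mentioned in the text


def build_scrape_url(limit=50):
    """Return the listing URL for the latest `limit` pastes."""
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    return f"{SCRAPE_URL}?limit={limit}"


def expected_new_pastes(pastes_per_day, interval_minutes):
    """Average number of new pastes that appear between two polls."""
    return pastes_per_day / (24 * 60) * interval_minutes


def fetch_latest(limit=50, opener=None):
    """Fetch and decode the JSON array describing the latest pastes."""
    opener = opener or urllib.request.build_opener()
    with opener.open(build_scrape_url(limit), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With 15,000 pastes per day and a 2-minute interval, `expected_new_pastes(15000, 2)` comes out at about 21, which leaves a limit of 50 with a comfortable margin for busy hours.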

The request returns a JSON array with the 50 latest paste IDs. To speed up the application, the paste IDs are pushed to an SQS queue for the parser Lambda to pick up and collect the contents, instead of having the collection Lambda function do that. Before a paste ID is pushed to the queue, it is checked against a text file stored in an S3 bucket to see if it has been parsed already. If so, the item is not pushed for analysis. Afterwards, the text file is updated with the latest 50 pastes processed.
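The deduplication and queueing step could look roughly like the sketch below. It assumes the list of seen IDs has already been read from the text file in S3, and it takes the SQS client as an argument (for example `boto3.client("sqs")`). The function names and the choice to send one message per paste ID are assumptions for illustration.

```python
def select_new(latest_keys, seen_keys):
    """Return the paste IDs from the latest batch that were not seen before."""
    seen = set(seen_keys)
    return [key for key in latest_keys if key not in seen]


def push_new(latest_keys, seen_keys, sqs_client, queue_url):
    """Queue unseen paste IDs and return (new_keys, updated_seen_list)."""
    new_keys = select_new(latest_keys, seen_keys)
    for key in new_keys:
        # One message per paste ID; the parser Lambda reads it from the body.
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=key)
    # The latest batch becomes the new "seen" list to write back to S3.
    return new_keys, list(latest_keys)
```

Keeping only the latest batch as the seen list keeps the text file small, at the cost of relying on the limit being large enough to overlap with the previous run.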

The parser Lambda is triggered by the SQS queue, downloads the contents of the paste, and analyzes them.
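A minimal sketch of the parser side, assuming each SQS message body holds a single paste ID. An SQS-triggered Lambda receives its messages under `Records`, each with a `body` field. The item URL is my assumption of the scraping API endpoint for a single paste, and all function names are hypothetical.

```python
import urllib.request
from urllib.parse import quote

# Assumed endpoint for fetching a single paste; verify against the docs.
ITEM_URL = "https://scrape.pastebin.com/api_scrape_item.php?i="


def item_url(key):
    """Return the URL used to download the raw contents of one paste."""
    return ITEM_URL + quote(key, safe="")


def download_paste(key, opener=None):
    """Download the raw paste contents as text."""
    opener = opener or urllib.request.build_opener()
    with opener.open(item_url(key), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def process_event(event, download, analyse):
    """Run `analyse` on the contents of every paste ID in an SQS event."""
    results = {}
    for record in event.get("Records", []):
        key = record["body"]  # paste ID pushed by the collection Lambda
        results[key] = analyse(download(key))
    return results
```

A Lambda handler would then be a thin wrapper that calls `process_event(event, download_paste, analyse)` with whatever analysis function is needed.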

[Figure: visual overview of the scraper]

In the findings chapter, a regex is used to find Hotmail e-mail addresses. If an e-mail address is found, the parser Lambda function prints it so that it is logged to CloudWatch.
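The exact pattern used by the scraper is not shown in this post, so the following is only one plausible way to match Hotmail addresses in Python. The pattern and function names are assumptions.

```python
import re

# Illustrative pattern: local part, "@hotmail.", then a TLD such as
# "com" or a two-part suffix such as "co.uk". Not the original regex.
HOTMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@hotmail\.[A-Za-z]{2,}(?:\.[A-Za-z]{2,})?",
    re.IGNORECASE,
)


def find_hotmail(text):
    """Return all Hotmail addresses found in the paste text, in order."""
    return HOTMAIL_RE.findall(text)


def report(text):
    """Print each match; in Lambda, stdout ends up in CloudWatch Logs."""
    for address in find_hotmail(text):
        print(address)
```

A loose pattern like this can over-match on unusual input, so it is better suited to flagging pastes for review than to validating addresses.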

Caveat: Pastebin only allows whitelisted IP addresses to connect to the API, and with serverless that is a challenge. Lambda functions have a different source IP depending on the underlying EC2 instance they are running on when triggered. To tackle this, a proxy with a fixed IP address needs to be configured to connect to the Pastebin service. You whitelist the proxy’s IP address in the Pastebin web console and let the Lambda functions connect through the proxy.
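Routing requests through a proxy can be sketched with the Python standard library as follows. This assumes an HTTP proxy with basic authentication; the host, port, and credentials used in the usage note are placeholders, and the function names are hypothetical.

```python
import urllib.request
from urllib.parse import quote


def proxy_url(host, port, user, password):
    """Build a proxy URL, percent-encoding the credentials."""
    return (
        f"http://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}"
    )


def build_proxy_opener(url):
    """Return an opener that sends HTTP and HTTPS traffic via the proxy."""
    handler = urllib.request.ProxyHandler({"http": url, "https": url})
    return urllib.request.build_opener(handler)
```

For example, `build_proxy_opener(proxy_url("proxy.example.com", 3128, "user", "secret"))` returns an opener whose `open()` method can be used for all Pastebin requests.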

The proxy credentials can be stored in the AWS SSM Parameter Store as a SecureString if you want more security. The password is stored encrypted and is only decrypted when the Lambda function reads it at runtime.
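Reading such a parameter from Python might look like the sketch below. The SSM client is passed in (for example `boto3.client("ssm")`), and the parameter name in the usage note is a made-up example. The Lambda execution role also needs permission to read and decrypt the parameter.

```python
def load_secret(ssm_client, name):
    """Return the decrypted value of a SecureString parameter."""
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


# Usage inside a Lambda function (parameter name is hypothetical):
#   import boto3
#   password = load_secret(boto3.client("ssm"), "/pastebin/proxy/password")
```

Creating the client once outside the handler and caching the value avoids an extra SSM call on every invocation.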

Performance

Let’s get into the performance differences. Below is the average execution time of the parser Lambda for each programming language.

[Screenshots: Python vs. Go comparison]

On average, the Python code takes 322 ms to finish, as seen in the screenshot from CloudWatch. The Go code takes 83.7 ms on average to complete, almost four times faster than Python.
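To put these averages in perspective, the short sketch below computes the speedup and the total execution time per day. It assumes one parser invocation per paste at about 15,000 pastes per day, which is a simplification: SQS can batch messages, and actual billing also depends on the configured memory and billing granularity.

```python
PY_MS = 322.0           # average Python duration from the post
GO_MS = 83.7            # average Go duration from the post
PASTES_PER_DAY = 15000  # approximate daily volume from the post


def speedup(slow_ms, fast_ms):
    """How many times faster the fast run is than the slow one."""
    return slow_ms / fast_ms


def daily_seconds(avg_ms, invocations_per_day):
    """Total execution time per day in seconds (one invocation per paste)."""
    return avg_ms * invocations_per_day / 1000.0


if __name__ == "__main__":
    print(f"speedup: {speedup(PY_MS, GO_MS):.2f}x")
    print(f"Python: {daily_seconds(PY_MS, PASTES_PER_DAY):.1f} s/day")
    print(f"Go:     {daily_seconds(GO_MS, PASTES_PER_DAY):.1f} s/day")
```

Under these assumptions the speedup is about 3.85x, and the daily execution time drops from roughly 4,830 seconds to roughly 1,256 seconds.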

This is also visible over time when checking the average runtime over a one-week timespan in CloudWatch.

[Screenshot: Python vs. Go timeline]

Based on this, developing in Go looks attractive going forward, as Lambda functions are billed based on execution time. Of course, not every project is now going to be written in Go. Python’s rich set of libraries and its ease of coding make it an easy and quick language to get stuff done. But once costs and operational speed come into play, Go can be very interesting for production workloads.

Some findings

This post is all about the speed differences between Python code and Go code in Lambda. For completeness, I want to add a screenshot of Hotmail accounts identified not that long ago (April 2020).

[Screenshot: results]

Conclusion

We discussed services that are used to share text and why these services are used by adversaries. We then looked into two different programming languages and their performance differences within AWS Lambda. Finally, we took a little peek at some of the findings. Let’s find some interesting stuff, in an automated fashion :)