Scrape-and-Spray Email Hacking Tutorial
Step 1: Finding Source(s) for Email Harvesting
Step 2: Writing a Web Scraping Bot
Step 3 (Recommended): Cleaning Scraped Data
Step 4: Guessing Passwords
Step 5: Automate Login Process (Password Spraying)
Both web scraping and password cracking are far from new concepts in the information security world. However, it is rarely brought to light just how easy it is to carry out large-scale email hacking campaigns when the two practices are used in tandem. This method requires very little knowledge of the target and is effective because of the high probability of encountering weak passwords. In this article, I detail my process for creating a program that collects large numbers of email addresses and then attempts to exploit them via common password guesses. In sharing this process, I hope to urge all email service providers to switch to two-factor authentication by default, as relying on a password alone is simply not enough to prevent intruders given the capabilities of today's computers (looking at you, Microsoft).
Step 1: Finding Source(s) for Email Harvesting
First things first. Before you dive into coding a web scraping bot, you need to know where you’re harvesting your data from. With no prior access to email lists, the strategy becomes finding websites that are rich in publicly listed email addresses using the same email provider or domain extension. Going into this project, I knew that a large percentage of companies and educational institutions used Microsoft Outlook for their email. With this in mind, I elected to target university websites. With a single Google search, I was able to find a comprehensive list of website URLs for every college in the United States and Canada (1,930 websites in total). Jackpot.
Step 2: Writing a Web Scraping Bot
Perhaps the most mentally challenging and enjoyable aspect of this attack plan is creating a bot that pivots across the pages of a website, identifying and collecting email addresses. This is not a difficult task for a seasoned programmer, especially given the amount of reusable code available through a Google search or a GitHub link. While there are countless optimizations that can be made to improve the efficiency and accuracy of a web scraping bot, at the base level it consists of four elements/functions:
1. Grabbing the HTML page source from a provided URL
2. Identifying and collecting the desired information from that page source
3. Pivoting to another page via href elements on the page
4. Limiting the scope of the bot’s movement (the internet is a pretty big place, keep your spiders on a leash!)
With a web bot utilizing up to 100 threads per site, I was able to yield a hefty sum of 167,000 email addresses in ~16 hours of processing from the 1,930 target websites. I should note that the security measures put in place by these sites varied greatly, ranging anywhere from blocking my bot entirely, to giving me full access to faculty directories containing hundreds of email contacts.
Step 3 (Recommended): Cleaning Scraped Data
If you’re like me, you can never write a perfect regex the first time around. After over half a day of letting my computer slowly fry, crawling over every nook and cranny of almost two thousand unique websites, quite a few formatting and obfuscation exceptions had slipped into my data. When you're cleaning, look for trends of consistently obfuscated data and then write a second-pass regex to fix them. When you’re working with hundreds of thousands of entries, it won’t be practical to fix every address, so just be as efficient as possible. Furthermore, you can create a term blacklist to exclude specific types of email addresses that you may not want to target. For example, I omitted all email addresses containing the terms ‘upd’, ‘police’, ‘webmaster’, ‘robot’, and ‘emergency’, amongst many others, as my goal was to target personal emails with minimal risk. The same method can be used to exclude particular email extensions and providers (e.g., add ‘@gmail’ to the blacklist to exclude Gmail accounts, ‘@smu’ to exclude Southern Methodist University emails, etc.). After cleaning my email list, I was left with roughly 135,000 target email addresses.
(my second-pass email regex - use at your own risk)
Step 4: Guessing Passwords
Now that an extensive list of usable emails has been stored and cleaned, it’s time to think about how to gain account access. With 50% of the login information already at your fingertips, all it takes is guessing the right password once to consider this attack a success. In order to form the most educated guesses possible, I consulted a number of articles and lists concerning the most common passwords from data breaches. To little surprise, ‘123456’, ‘password’, and ‘qwerty’ were list toppers. However, I knew that most if not all large email providers required more complexity than that, which led me to consult the Microsoft Outlook signup form. Acting as a new user trying to create an account, I tried a handful of obvious passwords and discovered that Microsoft had explicitly forbidden them, even though they met the complexity requirements. However, it was bittersweet to discover that simply adding enough ‘1’s to any common password satisfied their password strength test. I’m sure no legitimate user has ever done that! I ended up deciding on ‘password111’, ‘qwerty11’, and ‘qwertyuiop11’ as my best three options amongst a list of 10 painfully easy-to-guess passwords.
(According to these rules, adding an extra ‘1’ to any common password makes it acceptable…)
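To make the weakness concrete, here is a minimal Python sketch of the kind of check a provider could run at signup. Everything in it is an assumption for illustration: the function names (`naive_policy`, `stricter_policy`), the "8 characters from 2 character classes" rule, and the tiny `COMMON_BASES` set are hypothetical and are not Microsoft's actual implementation. It only shows that an exact-match blocklist lets padded passwords through, while comparing the password with its trailing digits stripped does not.

```python
import string

# Hypothetical blocklist of common base passwords (illustrative only).
COMMON_BASES = {"password", "qwerty", "qwertyuiop", "123456"}


def char_classes(pw: str) -> int:
    """Count how many character classes (lower, upper, digit, symbol) appear."""
    checks = [
        any(c.islower() for c in pw),
        any(c.isupper() for c in pw),
        any(c.isdigit() for c in pw),
        any(c in string.punctuation for c in pw),
    ]
    return sum(checks)


def naive_policy(pw: str) -> bool:
    """Assumed rule: 8+ characters, 2+ classes, and no exact blocklist match."""
    return (
        len(pw) >= 8
        and char_classes(pw) >= 2
        and pw.lower() not in COMMON_BASES
    )


def stricter_policy(pw: str) -> bool:
    """Same rule, but also reject a blocklisted word padded with trailing digits."""
    base = pw.lower().rstrip(string.digits)
    return naive_policy(pw) and base not in COMMON_BASES


if __name__ == "__main__":
    for candidate in ("password111", "qwerty11", "qwertyuiop11"):
        print(candidate, naive_policy(candidate), stricter_policy(candidate))
```

Under these assumptions, all three passwords from the paragraph above pass `naive_policy` and fail `stricter_policy`. A real provider would want a much larger blocklist and more normalization than stripping trailing digits.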
Step 5: Automate Login Process (Password Spraying)
At this point, all that stands between a malicious adversary and their potential target is trying a single easy-to-guess password against every email account in their list. With hundreds of thousands (and potentially many millions) of email addresses being tried, the chance of at least one login working is very, very good. However, as stated in the introduction, password spraying and email hacking are by no means unknown to email providers, and are also no small offense in the eyes of the law. With this in mind, an adversary could go about automating this process with tools such as Burp Suite to monitor and imitate a legitimate HTTP login request, or Selenium to imitate an actual user by opening a browser instance (much slower, but considerably harder for websites to identify as a robot). If an intruder did intend to remain anonymous, it would be in their best interest to obtain a proxy list and cycle through it while sending HTTP requests, so as not to alert a server that a large number of login requests are coming from the same IP. Here’s a great article that goes into more detail: http://www.blackhillsinfosec.com/?p=4694
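From the provider's side, the alert mentioned above (many login attempts from one IP) can be sketched roughly as follows. This is a minimal illustration in Python, not any provider's real detection logic: the `FailedLogin` record, the `flag_spray_sources` function, the one-hour window, and the threshold of five distinct accounts are all assumptions made for the example. It flags a source address that fails to log in to many different accounts within a sliding time window, which is the signature of spraying as opposed to one user mistyping their own password.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class FailedLogin(NamedTuple):
    """Hypothetical log record for one failed login attempt."""
    source_ip: str
    account: str
    timestamp: float  # seconds since epoch


def flag_spray_sources(
    events: Iterable[FailedLogin],
    window_seconds: float = 3600.0,
    max_accounts: int = 5,
) -> Set[str]:
    """Return IPs with failures against more than max_accounts distinct
    accounts inside any sliding window of window_seconds."""
    by_ip = defaultdict(list)
    for event in events:
        by_ip[event.source_ip].append(event)

    flagged = set()
    for ip, evs in by_ip.items():
        evs.sort(key=lambda e: e.timestamp)
        start = 0                  # index of the oldest event still in the window
        counts = defaultdict(int)  # account -> failures inside the window
        for event in evs:
            counts[event.account] += 1
            # Drop events that have fallen out of the window.
            while event.timestamp - evs[start].timestamp > window_seconds:
                old = evs[start]
                counts[old.account] -= 1
                if counts[old.account] == 0:
                    del counts[old.account]
                start += 1
            if len(counts) > max_accounts:
                flagged.add(ip)
                break
    return flagged
```

As the paragraph above points out, counting per IP misses attempts spread across many proxies, so a check like this is at best one signal among several and does not replace a second authentication factor.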
I am not authorized to perform penetration testing on Microsoft Outlook or any of the organizations that use its services. Finding security flaws in software is a hobby of mine driven purely by curiosity. For this reason, I did not execute the attack on the list of well over a hundred thousand emails. Instead, I created 10 dummy Outlook accounts and tried the password spray attack on them using Selenium as a proof of concept. To my delight, the simulated attack was successful and gave me access to the two accounts that I had assigned intentionally easy-to-guess passwords. While I’ll never know for sure, I am very confident that this attack would compromise at least some accounts on a list of 135,000 emails.
(But that's none of my business)
In the age of information, email addresses have become so tightly linked with user identities that they simply cannot afford to be this easy to compromise. Think of all the services attached to your email that are willing to send you a password reset. Think about how many of those services have the same password as your email. Oh, and that thing called personal privacy? Forget about it. Picking an 8-character password made up of two different character classes does not remotely cut it in 2017. However, the solution is not just making passwords longer and harder to guess; it’s having another means of authentication. It is for this reason that two-factor authentication needs to be adopted as an industry standard, not just a poorly supported extra security option for the paranoid.
- List of Colleges/Universities: http://www.searchenginesmarketer.com/list-of-university-and-college-websites/
- Get Selenium: http://www.seleniumhq.org/
- Get Burp Suite: https://portswigger.net/burp/