Building a 250,000+ Email Address Database in 10 Minutes
In this post we will be building a simple email database by scraping a website for names, cleaning them up, and then putting them together with common email domains.
This is just a matter of getting 50 first names, 50 last names, and 10 popular email domains. As a bonus, we will be adding initials and numbers to the mix as well. We will generate over 250,000 email addresses in about 10 minutes.
This is for demonstration purposes only. Do not harass these common name folk.
What We Are Using
1. Python 3.7
2. PyCharm – Community Edition
I enjoy the PyCharm IDE, but you can use any IDE you want. PyCharm is available from the JetBrains website.
This is not meant to be an in-depth tutorial on setting up Python or PyCharm. However, we will walk through every step of how to build the script to make the database.
Coding the Database
Let’s plan how this database builder will work:
- It will need to scrape data from a website
- It will need to store a list of first names, last names, and domain names
- It will need to combine the first names with the last names and then put that with the domain names
- As an extra step, it will also have to combine the numbers 0-9 with the first/last name combinations
- As an extra, extra step, it will have to get the first letter of each first name and combine them with the numbers and domains
Pretty basic stuff, so it should be a quick build.
1. Open your IDE and Import Requests and BeautifulSoup
The only two libraries we will need for this are:
- Requests – For accessing the site
- BeautifulSoup – For parsing the site’s data
2. Find a Site to Scrape
Unless you have lists of names and domains on your PC, you will need to acquire them from somewhere.
To make things easier, I compiled a list of all the data we need and put it on this site. Go ahead and open the page in a new tab if you are following along:
Back to the IDE, copy and paste the above URL into your code. This is the URL we will have Requests get the information from.
Now we can make the main function that will do the work.
3. Build the Function
We are going to make this very simple and make one function that will do everything (since there is not much to do).
Since we are making one function, I’m going to title mine main():
Now we will make our lists and set up Requests with BeautifulSoup.
temp_list = temporary list we use to store the data we scrape from the site
combined_names_list = list used to store the combined first and last names
final_list = the list where we will store our final strings (firstname+lastname+@+domain)
response = the variable that stores what Requests gets back from “requests.get(URL)”. The URL is the rechor site as declared above.
soup = using BeautifulSoup to take the content of response and find what we need
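A minimal sketch of that setup might look like the following. The URL is a placeholder (an assumption, not the actual rechor page), and the timeout, the raise_for_status() check, and the "html.parser" choice are my additions rather than anything required by the steps above.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL (assumption): swap in the page you are actually scraping.
URL = "https://example.com/names"


def main():
    temp_list = []            # raw text scraped from the page
    combined_names_list = []  # first name + last name combinations
    final_list = []           # finished strings: firstname+lastname+@+domain

    # Fetch the page and fail loudly if the request did not succeed.
    response = requests.get(URL, timeout=10)
    response.raise_for_status()

    # Parse the raw HTML so the tags can be searched.
    soup = BeautifulSoup(response.content, "html.parser")
    return soup, temp_list, combined_names_list, final_list


if __name__ == "__main__":
    main()
```

Nothing is scraped yet at this point; the function only fetches the page and hands back the parsed document along with the empty lists.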
At this point we still have to figure out what we need from the rechor URL, so let’s go there now.
Right-Click on any of the first names on the page and select “Inspect Element” (Firefox) or “Inspect” (Chrome).
After clicking “Inspect Element” on the name “Emma” on the page, Firefox points me to a list item (“li”) on the page where “Emma” is housed. Right above the list items you can see the tag “ol” — which stands for ordered list.
I checked the last name and domain sections on the page as well and confirmed all of the data is structured the same way (“ol” -> “li”).
What this tells us is that all of the data we need is encapsulated within the “ol” tags, so let’s use BeautifulSoup to get them.
lists = using the soup data to find all occurrences of “ol” on the page
for name_list in lists… = We are iterating over the “lists” variable (which is a list) to pull out the names in text
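As a rough illustration of that loop, here is a sketch that runs against a small inline HTML sample instead of the live page. The sample markup, the names in it, and the helper name extract_list_items are all stand-ins I made up for the example.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page (assumption): three ordered lists.
SAMPLE_HTML = """
<ol><li>Emma</li><li>Liam</li></ol>
<ol><li>Smith</li><li>Johnson</li></ol>
<ol><li>example.com</li></ol>
"""


def extract_list_items(html):
    """Collect the text of every <li> found inside every <ol>."""
    soup = BeautifulSoup(html, "html.parser")
    temp_list = []
    lists = soup.find_all("ol")  # every ordered list on the page
    for name_list in lists:
        for item in name_list.find_all("li"):
            temp_list.append(item.get_text(strip=True))
    return temp_list


temp_list = extract_list_items(SAMPLE_HTML)
print(temp_list)
# ['Emma', 'Liam', 'Smith', 'Johnson', 'example.com']
```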
We then print the temp_list to make sure it’s working properly:
Looks like it is working fine. This temp_list contains all of the first names, last names, and email domain names.
Now we have to sort them into their own lists…
We are creating three lists out of the one temp_list. Let me explain what is happening here…
Since it is a static page and it won’t change, we can do it this way:
- The first 50 items in the temp_list are first names, so we slice out indexes 0-49 (temp_list[0:50], since the end of a Python slice is exclusive) and send those to first_names_list.
- The second 50 items are last names, so we slice out indexes 50-99 (temp_list[50:100]) and move them to last_names_list.
- The last 10 items are the domain names so, as you guessed, we slice out indexes 100-109 (temp_list[100:110]) and shoot them to domain_list.
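The three slices described above can be sketched like this. The temp_list here is filled with made-up placeholder values (First0, Last0, domain0.example, and so on) so the example runs without the scraped page.

```python
# Placeholder data (assumption): 50 first names, 50 last names, 10 domains,
# in the same order the scraped page is described as providing them.
temp_list = (
    [f"First{i}" for i in range(50)]
    + [f"Last{i}" for i in range(50)]
    + [f"domain{i}.example" for i in range(10)]
)

# The end index of a slice is exclusive, so [0:50] covers indexes 0-49.
first_names_list = temp_list[0:50]
last_names_list = temp_list[50:100]
domain_list = temp_list[100:110]

print(len(first_names_list), len(last_names_list), len(domain_list))
# 50 50 10
```

Hard-coded indexes like these only hold up while the page layout stays the same; if the source lists ever change length, the slices would need to change too.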
We then print out each list to make sure they are correct:
Looking good. We now have three separate lists containing our information.
Now we have to put them all together to make usable email addresses.
All we are doing is iterating over first_names_list while we are iterating over last_names_list and then putting the result in combined_names_list.
In plain English, for every first name we have, put that with every last name we have, and then save every one of those combinations in combined_names_list.
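Here is a small sketch of that nested loop, using a handful of sample names rather than the full lists of 50. Joining the names with no separator is an assumption on my part.

```python
# Sample data (assumption): a few names instead of the full 50 + 50.
first_names_list = ["Emma", "Liam", "Olivia"]
last_names_list = ["Smith", "Johnson"]

combined_names_list = []
for first in first_names_list:      # outer loop: every first name
    for last in last_names_list:    # inner loop: every last name
        combined_names_list.append(first + last)

print(combined_names_list)
# ['EmmaSmith', 'EmmaJohnson', 'LiamSmith', 'LiamJohnson', 'OliviaSmith', 'OliviaJohnson']
print(len(combined_names_list))
# 6
```

Three first names times two last names gives six combinations, which is the same arithmetic that turns 50 x 50 into 2,500.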
Let’s print out a snippet of what it looks like real quick…
Since we have 50 first names, and we also have 50 last names, the size of our list should equal 50 x 50 (2,500). Let’s check…
We still be flying high.
Almost done. Now we just have to add in the domain names to the combined names we have. This will be done almost exactly as we put the first names and last names together.
Again, for each combined name we have (first name + last name), we are going over the domain names and adding each domain name to the end of the combined names. We are then putting them in final_list.
The only significant difference here is that we are adding the “@” symbol between the combined names and the domain names.
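A sketch of that second pass could look like the following. The names and the reserved example.com / example.org domains are placeholders, not the real data set.

```python
# Placeholder inputs (assumption): two combined names and two reserved domains.
combined_names_list = ["EmmaSmith", "LiamJohnson"]
domain_list = ["example.com", "example.org"]

final_list = []
for name in combined_names_list:
    for domain in domain_list:
        # The "@" is the only new piece compared with the name-combining loop.
        final_list.append(name + "@" + domain)

print(final_list)
# ['EmmaSmith@example.com', 'EmmaSmith@example.org', 'LiamJohnson@example.com', 'LiamJohnson@example.org']
```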
Finally, let’s print out the results to make sure our formatting is swell:
This is the same thing I see in my dreams.
Now that we’ve added 10 domain names to the list of 2,500 combined names we have, our list should have expanded 10-fold. Let’s check to be sure:
25,000 email addresses and going up.
We can also throw some numbers into the mix and bump that 25,000 up. All we have to do is put another iteration over a number set before we combine the names with the domain names.
If you are following along, I suggest you make a separate list, as I’m about to, to store the names with numbers variations. If you append the combined names + numbers to the final_list, you may run out of memory and the IDE will stop.
I created a new list called combined_names_list_numbers where I will store the names with the numbers. I still have the original list, combined_names_list, where the names without the numbers will be stored.
I also created a separate final list called final_list_numbers where the combined_names_list_numbers + domains will be stored for the rest of their lives.
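To make the number step concrete, here is a sketch that uses synthetic placeholder names and domains at the same scale as the real lists (2,500 combined names and 10 domains), so the counts can be checked. Putting the digit right after the last name is an assumption about the format.

```python
# Synthetic stand-ins (assumption) sized like the real data:
# 50 x 50 combined names and 10 domains.
combined_names_list = [f"First{i}Last{j}" for i in range(50) for j in range(50)]
domain_list = [f"domain{k}.example" for k in range(10)]

# Step 1: append each digit 0-9 to every combined name.
combined_names_list_numbers = []
for name in combined_names_list:
    for number in range(10):
        combined_names_list_numbers.append(name + str(number))

# Step 2: attach every domain, in a list kept separate from final_list.
final_list_numbers = []
for name in combined_names_list_numbers:
    for domain in domain_list:
        final_list_numbers.append(name + "@" + domain)

print(len(combined_names_list_numbers))  # 25000
print(len(final_list_numbers))           # 250000
```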
After running the above code, we get 250,000 for the length of final_list_numbers and 25,000 for the length of final_list.
And here’s a little taste of the 250,000 final_list_numbers list:
Here is the full code thus far:
If you really want to begin to be thorough, you can create a new list with the first letter of the first name + last name + number + domain. An example output would be “DJohnson9@yahoo.com”.
I’m willing to bet you know at least one person who has an email address with this format.
Very quickly, here’s one way you could do that:
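The sketch below is one possible take on the initials idea, not necessarily the exact approach used for this post. It collects the distinct first letters before looping so that names sharing an initial (David, Daniel) do not produce duplicate addresses; that de-duplication step, the sample names, and the example.com domain are all assumptions.

```python
# Sample data (assumption): two of the first names share the initial "D".
first_names_list = ["David", "Daniel", "Emma"]
last_names_list = ["Johnson", "Smith"]
domain_list = ["example.com"]

# Distinct initials only, so "D" is used once even though two names start with it.
initials = sorted({name[0] for name in first_names_list})

final_list_initial_numbers = []
for initial in initials:
    for last in last_names_list:
        for number in range(10):
            for domain in domain_list:
                final_list_initial_numbers.append(f"{initial}{last}{number}@{domain}")

print(initials)                          # ['D', 'E']
print(final_list_initial_numbers[0])     # DJohnson0@example.com
print(len(final_list_initial_numbers))   # 40
```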
As you see above, we have added another 80,000 email addresses to our database with only a few more lines of code.
Now, this function is becoming a bit convoluted so I would suggest breaking it up into multiple functions or methods within a class.
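As one way that split could go, here is a sketch with three small helpers built on itertools.product, which yields the same pairs as the nested loops. The function names are hypothetical and the sample data is made up.

```python
from itertools import product


def combine_names(first_names, last_names):
    """Every first name joined with every last name."""
    return [first + last for first, last in product(first_names, last_names)]


def add_numbers(names, numbers=range(10)):
    """Every name followed by every number."""
    return [name + str(number) for name, number in product(names, numbers)]


def add_domains(names, domains):
    """Every name joined to every domain with an "@"."""
    return [name + "@" + domain for name, domain in product(names, domains)]


# Sample data (assumption) to show how the pieces chain together.
names = combine_names(["Emma", "Liam"], ["Smith"])
addresses = add_domains(names, ["example.com"])
numbered = add_domains(add_numbers(names), ["example.com"])

print(addresses)      # ['EmmaSmith@example.com', 'LiamSmith@example.com']
print(numbered[0])    # EmmaSmith0@example.com
print(len(numbered))  # 20
```

Each helper does one job, so adding another variation becomes one more small function instead of another level of nesting inside main().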
4. Save it to a File
We now have a working email generator that can be modified to our sweet little heart’s desire.
To finish it off, let’s save the email addresses to a .txt file so they can be permanently stored and taken into a mail client.
Head to the end of your main() function and add the code to save the results to a .txt file. If you added the final_list_initial_numbers list above, don’t forget to add that to this write block:
Here we are appending to a file called EmailFile.txt located at C:/EmailDatabaseTest. Just make sure the directory exists. The .txt file does not have to exist, because opening it in append (“a”) mode creates it.
We are writing every email address from both lists to the .txt file. Additionally, we are separating each address with a comma. If your email client requires another type of separator between email addresses, such as a semi-colon, just replace the comma with that.
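A sketch of that write block might look like this. To keep it runnable anywhere, it writes EmailFile.txt into a temporary directory rather than C:/EmailDatabaseTest. The helper name save_addresses and the sample addresses are assumptions for illustration.

```python
import os
import tempfile


def save_addresses(path, *address_lists, separator=","):
    """Append every address from every list to the file, separator after each."""
    # "a" mode creates the file if it is missing and appends if it exists.
    with open(path, "a", encoding="utf-8") as email_file:
        for addresses in address_lists:
            for address in addresses:
                email_file.write(address + separator)


# Temporary directory (assumption) standing in for C:/EmailDatabaseTest.
output_dir = tempfile.mkdtemp()
output_path = os.path.join(output_dir, "EmailFile.txt")

final_list = ["a@example.com"]
final_list_numbers = ["b1@example.com", "b2@example.com"]
save_addresses(output_path, final_list, final_list_numbers)

with open(output_path, encoding="utf-8") as email_file:
    written = email_file.read()
print(written)
# a@example.com,b1@example.com,b2@example.com,
```

Passing separator=";" would switch to semicolons. Because the file is opened in append mode, running the script twice writes every address twice.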
After we run the script, EmailFile.txt gets written with all of our data. Head to the file and open it and you should see this:
That’s all there is to it. In less than a second, our script made a database of over 250,000 (over 300,000 if you added the first name initial iterations) email addresses and saved them to this file. You can improve this script by utilizing more numbers than just 0-9, adding more names, or anything else your situation dictates.
Now you may say, “But rechor, we have no idea if these email addresses are even real.” Well, we are playing the game of probability — that’s why we went with common email domains and the most common first/last names.
Just to test it, I sent out around 300 emails using addresses from this list asking if their address was still in use. Only five were automatically rejected by daemons because the address did not exist, and I had real people responding within a few minutes.
And apologies to anybody reading this post that has an email address listed here.
– The script we just created generates a database of over 250,000 email addresses in less than a second (0.778 seconds, to be somewhat exact). The entire process to build this script took about 10 minutes.
– It doesn’t take much to send a mass email campaign to the email addresses we generated. Better yet, it takes nothing for a daemon to reject an email if the address isn’t valid. There is very little work to be done to generate a massive email database. This is a good example of “throwing everything against a wall and seeing what sticks.”
– This was just a short, quick example of how spammers and scammers can build an email database in a matter of minutes.
– As a reminder, don’t respond to emails if you don’t know who the sender is. Once you do, the sender then knows that your email is an active target.
– Again, this was for demonstration purposes only. Do not harass the John and Michael Smiths of the world.
Note: rechor has no affiliation with any tools/sources listed.
Python 3.7: https://www.python.org/downloads/release/python-370/