How to Extract Email Addresses Using a Web Scraper?
As a startup, if you want to get more prospective leads, you might need to collect as many business email addresses as possible. This blog shows how to extract email addresses using a web scraper and how 3i Data Scraping can help you with that.
Although there are countless email scraping tools available, the majority of them come with free-usage limits. This tutorial will help you find email addresses on websites anytime, without any limits!
Stage 1: Import Modules
We import six modules for this project.
import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd
- re for regular-expression matching operations
- requests to send HTTP requests
- urlsplit to break URLs down into their component parts
- deque, a list-like container with fast appends and pops on either end
- BeautifulSoup to pull data out of a website's HTML
- pandas to put the emails into a DataFrame for further manipulation
Stage 2: Set Variables
Next, we initialize a deque to hold the URLs still to be scraped, a set for the URLs already scraped, and a set to store the emails successfully extracted from the website.
# read url from input
original_url = input("Enter the website url: ")

# to save urls to be scraped
unscraped = deque([original_url])

# to save scraped urls
scraped = set()

# to save fetched emails
emails = set()
Elements in a set are unique; duplicate elements are not allowed.
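As a quick illustration of this behavior (not part of the scraper itself), adding the same address twice keeps only one copy:

# the sample address is made up for illustration
emails = set()
emails.add("info@example.com")
emails.add("info@example.com")  # duplicate, silently ignored
print(emails)       # {'info@example.com'}
print(len(emails))  # 1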
Stage 3: Start Extracting
1. First, move a URL from unscraped to scraped.
while len(unscraped):
    # move an unscraped url to the scraped set
    url = unscraped.popleft()  # popleft(): remove and return an element from the left side of the deque
    scraped.add(url)
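If deque is new to you, here is a small standalone sketch of the same pattern, using made-up URLs: items are appended on the right and taken from the left, which gives a breadth-first crawling order.

from collections import deque

# made-up URLs, just for illustration
unscraped = deque(["https://example.com", "https://example.com/about"])
scraped = set()

while len(unscraped):
    url = unscraped.popleft()   # remove and return the leftmost URL
    scraped.add(url)            # remember that this URL has been visited
    print("Processing:", url)

print(scraped)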
2. Next, we use urlsplit to extract the different parts of the URL.
parts = urlsplit(url)
urlsplit() returns a 5-tuple: (scheme, network location, path, query, fragment identifier).
Sample inputs and outputs for urlsplit()
Input: "https://www.google.com/example" Output: SplitResult(scheme='https', netloc='www.google.com', path='/example', query='', fragment='')
This way, we can get the base and path parts of the website URL.
base_url = "{0.scheme}://{0.netloc}".format(parts) if '/' in parts.path: path = url[:url.rfind('/')+1] else: path = url
3. Send an HTTP GET request to the website.
print("Crawling URL %s" % url) # Optional try: response = requests.get(url) except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError): # ignore pages with errors and continue with next url continue
4. Extract all the email addresses from the response using a regular expression and add them to the emails set.
# You may edit the regular expression as per your requirement
new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))  # re.I: ignore case
emails.update(new_emails)
If you are unfamiliar with Python regular expressions, see Python RegEx for more details.
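To see what this pattern matches, here is a small standalone check on a made-up string. The second pattern is an illustrative alternative (not the one used in this tutorial) that also accepts top-level domains other than .com:

import re

text = "Contact sales@example.com, support@example.co.uk or visit our site."

# pattern used in this tutorial: only .com addresses
com_only = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", text, re.I)
print(com_only)   # ['sales@example.com']

# looser alternative pattern (illustrative assumption): any TLD of 2+ letters
any_tld = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]{2,}", text, re.I)
print(any_tld)    # ['sales@example.com', 'support@example.co.uk']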
5. Find all the linked URLs on the website.
To do so, we first create a BeautifulSoup object to parse the HTML document.
# create a beautiful soup object for the html document
soup = BeautifulSoup(response.text, 'lxml')
After that, we can collect all the linked URLs in the document by finding the <a> tags, which represent hyperlinks.
for anchor in soup.find_all("a"):
    # extract linked url from the anchor
    if "href" in anchor.attrs:
        link = anchor.attrs["href"]
    else:
        link = ''

    # resolve relative links (starting with /)
    if link.startswith('/'):
        link = base_url + link
    elif not link.startswith('http'):
        link = path + link
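As a self-contained illustration of the link resolution above, here is the same logic applied to a small hand-written HTML snippet (all URLs are invented for the example):

from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="/contact">Contact</a>
  <a href="team.html">Team</a>
  <a href="https://other-site.example.org/">Partner</a>
  <a>No href here</a>
</body></html>
"""

base_url = "https://www.example.com"
path = "https://www.example.com/about/"

soup = BeautifulSoup(html, "lxml")
for anchor in soup.find_all("a"):
    link = anchor.attrs["href"] if "href" in anchor.attrs else ''
    if link.startswith('/'):
        link = base_url + link   # site-root relative
    elif not link.startswith('http'):
        link = path + link       # page relative
    print(link)

# Prints:
# https://www.example.com/contact
# https://www.example.com/about/team.html
# https://other-site.example.org/
# https://www.example.com/about/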
Add the new URL to the unscraped queue if it is not in unscraped or scraped yet.
We also need to skip links such as http://www.medium.com/file.gz, which cannot be scraped.
    if not link.endswith(".gz"):
        if link not in unscraped and link not in scraped:
            unscraped.append(link)
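The manual base_url/path concatenation works for the common cases. As an alternative sketch (my suggestion, not part of the original code), the standard library's urllib.parse.urljoin resolves root-relative, page-relative, and absolute links in one call:

from urllib.parse import urljoin

current_page = "https://www.example.com/blog/post.html"   # hypothetical page being crawled

print(urljoin(current_page, "/contact"))       # https://www.example.com/contact
print(urljoin(current_page, "team.html"))      # https://www.example.com/blog/team.html
print(urljoin(current_page, "../about/"))      # https://www.example.com/about/
print(urljoin(current_page, "https://other.example.org/"))  # https://other.example.org/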
Stage 4: Export Emails to a CSV File
After successfully extracting emails from the website, we can export them to a CSV file.
# pandas cannot build a DataFrame directly from a set, so convert it to a list first
df = pd.DataFrame(list(emails), columns=["Email"])  # replace "Email" with the column name you prefer
df.to_csv('email.csv', index=False)
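To verify the export, you can read the file back with pandas (a quick optional check, not required by the tutorial):

import pandas as pd

check = pd.read_csv('email.csv')
print(check.head())             # first few extracted addresses
print(len(check), "emails saved")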
If you are using Google Colaboratory, you can download the file to your local machine with:
from google.colab import files
files.download("email.csv")
Here is the sample output file in CSV format:
Complete Code
Here is the complete code for the whole procedure:
import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import files

original_url = input("Enter the website url: ")

unscraped = deque([original_url])
scraped = set()
emails = set()

while len(unscraped):
    url = unscraped.popleft()
    scraped.add(url)

    parts = urlsplit(url)

    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
        path = url[:url.rfind('/')+1]
    else:
        path = url

    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        continue

    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))
    emails.update(new_emails)

    soup = BeautifulSoup(response.text, 'lxml')

    for anchor in soup.find_all("a"):
        if "href" in anchor.attrs:
            link = anchor.attrs["href"]
        else:
            link = ''

        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link

        if not link.endswith(".gz"):
            if link not in unscraped and link not in scraped:
                unscraped.append(link)

df = pd.DataFrame(list(emails), columns=["Email"])
df.to_csv('email.csv', index=False)
files.download("email.csv")
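One practical caveat: the while loop keeps following links until the queue is empty, which on a large site can take a very long time. Here is a hedged sketch of one way to cap the crawl; the count variable and the limit of 100 pages are arbitrary choices, not part of the original code:

from collections import deque

unscraped = deque(["https://example.com"])   # hypothetical starting URL
scraped = set()

count = 0
max_pages = 100   # arbitrary illustrative limit on how many pages to crawl

while len(unscraped) and count < max_pages:
    count += 1
    url = unscraped.popleft()
    scraped.add(url)
    # ... the rest of the loop body (request, regex, link discovery) stays the same ...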