Extract domain from URL in Python


Extracting the domain from a URL in Python can be done in several ways, from the standard library to third-party packages built for more complex URL parsing. This guide covers the common approaches.

Using urllib.parse

The urllib.parse module in the standard library splits a URL into its components. The netloc attribute holds the network location: the domain, plus any port or credentials that appear in the URL.

from urllib.parse import urlparse

def extract_domain(url):
    """Return the network location (host, plus any port) of a URL."""
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    return domain

url = "https://www.example.com/path/page.html?query=argument"
domain = extract_domain(url)
print(domain)  # Output: www.example.com
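One caveat worth knowing: urlparse only populates netloc when the URL actually includes a scheme, and netloc keeps any port verbatim. A small sketch of both behaviors (the hostnames are illustrative):

```python
from urllib.parse import urlparse

# Without a scheme, urlparse treats the whole string as a path,
# so netloc comes back empty.
print(urlparse("www.example.com/page").netloc)   # ""

# With a scheme, netloc includes any port; .hostname strips it
# (and lowercases the host).
parsed = urlparse("https://Example.com:8080/page")
print(parsed.netloc)    # Example.com:8080
print(parsed.hostname)  # example.com
```

If you only want the host itself, prefer parsed.hostname over parsed.netloc.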

Extracting the Domain Without Subdomains

If you need just the second-level domain (SLD) and top-level domain (TLD) without subdomains, you can use the tldextract library, which handles various edge cases and is more reliable for complex URLs.

pip install tldextract

Use it as follows:

import tldextract

def extract_sld_tld(url):
    # tldextract splits the host into subdomain, domain, and suffix
    # using the Public Suffix List, so multi-part suffixes like
    # "co.uk" are handled correctly.
    tld_ext = tldextract.extract(url)
    domain = f"{tld_ext.domain}.{tld_ext.suffix}"
    return domain

url = "https://subdomain.example.com/path/page.html?query=argument"
domain = extract_sld_tld(url)
print(domain)  # Output: example.com

Regular Expressions

For simple URL structures, you might consider using regular expressions (regex). However, this method is less reliable for complex URLs and might not handle all edge cases well.

import re

def extract_domain_regex(url):
    # Optional scheme, then everything up to the first slash,
    # question mark, or hash. Stopping at '?' and '#' matters:
    # '[^/]+' alone would swallow a query string on URLs like
    # "http://example.com?x=1".
    pattern = r'^(?:https?://)?([^/?#]+)'
    match = re.search(pattern, url)
    if match:
        return match.group(1)
    return None

url = "http://www.example.com/path/page.html"
domain = extract_domain_regex(url)
print(domain)  # Output: www.example.com
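To see why regex is the fragile option, a quick self-contained sketch: the captured group still contains ports and userinfo verbatim, and an unrecognized scheme is not stripped at all, whereas a real parser separates all of these:

```python
import re

# Optional http/https scheme, then everything up to the first
# slash, question mark, or hash.
pattern = r'^(?:https?://)?([^/?#]+)'

def naive_host(url):
    match = re.search(pattern, url)
    return match.group(1) if match else None

# Ports and userinfo come back embedded in the capture.
print(naive_host("https://user:pw@example.com:8080/x"))  # user:pw@example.com:8080

# A scheme the pattern doesn't know about isn't stripped:
# the capture stops at the first '/', leaving "ftp:".
print(naive_host("ftp://example.com/x"))  # ftp:
```

Each of these cases needs another special-case tweak to the pattern, which is exactly the maintenance burden urlparse avoids.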

Handling Edge Cases

When extracting domains, consider the following edge cases:

- URLs without a scheme (e.g. "www.example.com/path"), for which urlparse leaves netloc empty
- Hosts that include a port ("example.com:8080") or credentials ("user@example.com")
- Multi-part public suffixes such as ".co.uk", where naively taking the last two labels gives the wrong result
- IP addresses and hostnames like "localhost", which have no registrable domain

urllib.parse reliably separates the host from ports, credentials, paths, and query strings, and tldextract additionally gets suffixes like ".co.uk" right, making them preferable for most applications.

For most use cases, using urllib.parse or tldextract is recommended due to their robustness and ability to handle a wide range of URL formats. Regular expressions can be used for simpler tasks but require careful handling to avoid common pitfalls.
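Putting the standard-library pieces together, here is a hypothetical helper (the name and the scheme-defaulting behavior are choices made for this sketch, not part of any library) that tolerates a missing scheme by prefixing one before parsing:

```python
from urllib.parse import urlparse

def hostname(url, default_scheme="https"):
    """Return the lowercased host of a URL, tolerating a missing scheme.

    A sketch: if no scheme is present, prefix a default one so that
    urlparse puts the host in netloc; then .hostname strips any port
    or credentials and lowercases the result.
    """
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    return urlparse(url).hostname

print(hostname("www.Example.com:8080/path"))  # www.example.com
print(hostname("https://sub.example.com/x"))  # sub.example.com
```

Feed the result into tldextract if you then need the registrable domain without subdomains.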