How to extract top-level domain name (TLD) from URL?

Answer

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:])
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)

results in:

abcde.co.uk
    


 

All dns Questions

Ask your interview questions on dns

Write Your comment or Questions if you want the answers on dns from dns Experts
Name* :
Email Id* :
Mob no* :
Question
Or
Comment* :
 





Disclimer: PCDS.CO.IN not responsible for any content, information, data or any feature of website. If you are using this website then its your own responsibility to understand the content of the website

--------- Tutorials ---