Phosphorus and Lime

A Developer's Broadsheet

This blog has been deprecated. Please visit my new blog at klenwell.com/press.
PHP: Get Base Domain
The problem: take a URL (in this case, an absolute URL, http and all) and return only the base domain. For instance, given 'http://www.domain.com' or 'http://sub.subdomain.domain.com', it should return 'domain.com'.

Simple enough, but now consider: 'http://www.example_site.com.pk' or 'http://damnlimies.co.uk' or 'http://username:password@this.is.a.worst.shortly.subdomain.thisIsMyMainWebsite.com.cl'
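To make the difficulty concrete, here is a minimal sketch of the obvious first attempt: keep only the last two dot-separated labels of the host. The helper name naive_base_domain() is made up for illustration and is not part of the solution further down.

```php
<?php
// Naive approach: keep only the last two labels of the host.
// naive_base_domain() is a hypothetical helper used only for this illustration.
function naive_base_domain($url)
{
    $host = parse_url($url, PHP_URL_HOST);
    if ( empty($host) ) return '';

    $labels = explode('.', $host);
    return implode('.', array_slice($labels, -2));
}

echo naive_base_domain('http://www.domain.com') . "\n";     // domain.com
echo naive_base_domain('http://damnlimies.co.uk') . "\n";   // co.uk (not what we want)
```

The second call shows why counting labels is not enough: the code needs to know that 'co.uk' behaves like a single suffix.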

It is an anvil upon which many a hammer has been broken:

http://lists.evolt.org/archive/Week-of-Mon-20031201/152316.html
http://www.webmasterworld.com/forum88/10656.htm

Anyway, I think I have a solution. I won't bother with the details of how I arrived at it. Suffice it to say, you just need to break down your own URL parsing process. It passed the battery of tests at the end:

// get base domain (domain.tld)
/*____________________________________________________________________________*/
function get_base_domain($url)
{
$debug = 0;
$base_domain = '';

// generic tlds (source: http://en.wikipedia.org/wiki/Generic_top-level_domain)
$G_TLD = array(
'biz','com','edu','gov','info','int','mil','name','net','org',
'aero','asia','cat','coop','jobs','mobi','museum','pro','tel','travel',
'arpa','root',
'berlin','bzh','cym','gal','geo','kid','kids','lat','mail','nyc','post','sco','web','xxx',
'nato',
'example','invalid','localhost','test',
'bitnet','csnet','ip','local','onion','uucp',
'co' // note: not technically, but used in things like co.uk
);

// country tlds (source: http://en.wikipedia.org/wiki/Country_code_top-level_domain)
$C_TLD = array(
// active
'ac','ad','ae','af','ag','ai','al','am','an','ao','aq','ar','as','at','au','aw','ax','az',
'ba','bb','bd','be','bf','bg','bh','bi','bj','bm','bn','bo','br','bs','bt','bw','by','bz',
'ca','cc','cd','cf','cg','ch','ci','ck','cl','cm','cn','co','cr','cu','cv','cx','cy','cz',
'de','dj','dk','dm','do','dz','ec','ee','eg','er','es','et','eu','fi','fj','fk','fm','fo',
'fr','ga','gd','ge','gf','gg','gh','gi','gl','gm','gn','gp','gq','gr','gs','gt','gu','gw',
'gy','hk','hm','hn','hr','ht','hu','id','ie','il','im','in','io','iq','ir','is','it','je',
'jm','jo','jp','ke','kg','kh','ki','km','kn','kr','kw','ky','kz','la','lb','lc','li','lk',
'lr','ls','lt','lu','lv','ly','ma','mc','md','mg','mh','mk','ml','mm','mn','mo','mp','mq',
'mr','ms','mt','mu','mv','mw','mx','my','mz','na','nc','ne','nf','ng','ni','nl','no','np',
'nr','nu','nz','om','pa','pe','pf','pg','ph','pk','pl','pn','pr','ps','pt','pw','py','qa',
're','ro','ru','rw','sa','sb','sc','sd','se','sg','sh','si','sk','sl','sm','sn','sr','st',
'sv','sy','sz','tc','td','tf','tg','th','tj','tk','tl','tm','tn','to','tr','tt','tv','tw',
'tz','ua','ug','uk','us','uy','uz','va','vc','ve','vg','vi','vn','vu','wf','ws','ye','yu',
'za','zm','zw',
// inactive
'eh','kp','me','rs','um','bv','gb','pm','sj','so','yt','su','tp','bu','cs','dd','zr'
);


// get domain
if ( !$full_domain = get_url_domain($url) )
{
return $base_domain;
}

// now the fun

// break up domain, reverse
$DOMAIN = explode('.', $full_domain);
if ( $debug ) print_r($DOMAIN);
$DOMAIN = array_reverse($DOMAIN);
if ( $debug ) print_r($DOMAIN);

// first check for ip address
if ( count($DOMAIN) == 4 && is_numeric($DOMAIN[0]) && is_numeric($DOMAIN[3]) )
{
return $full_domain;
}

// if only 2 domain parts, that must be our domain
if ( count($DOMAIN) <= 2 ) return $full_domain;

/*
finally, with 3+ domain parts: obviously D0 is tld
now, if D0 = ctld and D1 = gtld, we might have something like co.uk
so, if D0 = ctld && D1 = gtld && D2 != 'www', domain = D2.D1.D0
else if D0 = ctld && D1 = gtld && D2 == 'www', domain = D1.D0
else domain = D1.D0
these rules are simplified below
*/
if ( in_array($DOMAIN[0], $C_TLD) && in_array($DOMAIN[1], $G_TLD) && $DOMAIN[2] != 'www' )
{
$full_domain = $DOMAIN[2] . '.' . $DOMAIN[1] . '.' . $DOMAIN[0];
}
else
{
$full_domain = $DOMAIN[1] . '.' . $DOMAIN[0];
}

// did we succeed?
return $full_domain;
}
/*____________________________________________________________________________*/


// get domain from url
/*____________________________________________________________________________*/
function get_url_domain($url)
{
$domain = '';

$_URL = parse_url($url);

// sanity check
if ( empty($_URL) || empty($_URL['host']) )
{
$domain = '';
}
else
{
$domain = $_URL['host'];
}

return $domain;
}
/*____________________________________________________________________________*/


// Testbed
/*____________________________________________________________________________*/

if ( 1 )
{
// test code here
$TESTURL[] = 'http://127.0.0.1';
$TESTURL[] = 'http://www.examplesite.com.pk';
$TESTURL[] = 'http://domain.tv.com';
$TESTURL[] = 'http://domain.com.tv';
$TESTURL[] = 'http://domain.tv';
$TESTURL[] = 'http://domain.com';
$TESTURL[] = 'http://secure.email.website.co.uk';
$TESTURL[] = 'http://username:password@this.is.a.worst.shortly.subdomain.thisIsMyMainWebsite.com.cl';

foreach ( $TESTURL as $url )
{
echo $url . ' -> ' . get_base_domain($url) . "\n";
}
}

/*____________________________________________________________________________*/


results:
http://127.0.0.1 -> 127.0.0.1
http://www.examplesite.com.pk -> examplesite.com.pk
http://domain.tv.com -> tv.com
http://domain.com.tv -> domain.com.tv
http://domain.tv -> domain.tv
http://domain.com -> domain.com
http://secure.email.website.co.uk -> website.co.uk
http://username:password@this.is.a.worst.shortly.subdomain.thisIsMyMainWebsite.com.cl -> thisIsMyMainWebsite.com.cl
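Every test URL above carries a scheme. Without one, parse_url() reports no host, so get_url_domain() returns an empty string. The sketch below shows one possible workaround, assuming inputs without a scheme should be treated as http; ensure_scheme() is a hypothetical helper, not part of the code above.

```php
<?php
// ensure_scheme() is a hypothetical helper: it prepends 'http://' when the
// input contains no '://', so that parse_url() can locate a host.
function ensure_scheme($url)
{
    return ( strpos($url, '://') === false ) ? 'http://' . $url : $url;
}

// Without a scheme, parse_url() treats the whole string as a path.
$host = parse_url('www.domain.com', PHP_URL_HOST);
echo ( $host === null ? '(no host)' : $host ) . "\n";

// With the scheme added, the host is found.
$host = parse_url(ensure_scheme('www.domain.com'), PHP_URL_HOST);
echo ( $host === null ? '(no host)' : $host ) . "\n";
```

Whether to guess a scheme or to reject such input outright is a design choice; the functions above take the stricter route and return an empty string.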


The source will eventually end up in the bafflegate repository.
Can I use this in my application? What license are you distributing it under?

Thanks for asking. I should add a note here. It's GPL (v2). So feel free to use it.
Or in my case not to. GPL's a bit too restrictive.
Actually, how does the GPL apply here if I want to use it for a commercial application that I won't distribute?

I don't think it's an issue if you're not distributing it. In any event, I've released the code under an LGPL license at my Google code site:

http://code.google.com/p/klenwell/
It still cannot handle URLs such as:
www.aaa.or.id
www.aaa.ac.id
Where in Wikipedia can I find a list of second-level labels such as "or" and "ac"?
Another case:
aaa.or.id (without www)

Just to be clear, "aaa" is only a placeholder name, like "google". Thanks.
What is it returning? or.id? ac.id? That would be the correct response, at least according to how it's currently programmed and configured.

I suspect that the base domains are or.id and ac.id and that the owner is reselling subdomains like aaa.or.id as if they are base domains. 'or', 'ac', and 'id' are all country domains. So saying mydomain.or.id is like saying mydomain.cn.us. I don't think they're official TLDs.

In any event, adding 'or' and 'ac' to the $G_TLD array (as is the case with 'co') should solve your problem.
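As a rough illustration of that suggestion, the sketch below uses a trimmed-down version of the three-part rule with the two lists passed in as arguments. The function name base_domain_from_host() and the tiny sample arrays are assumptions made for this example; the real function keeps the full lists inside its body.

```php
<?php
// Hypothetical, trimmed-down version of the 3+ label rule from the post.
// $c_tld and $g_tld stand in for the full $C_TLD and $G_TLD arrays.
function base_domain_from_host($host, $c_tld, $g_tld)
{
    $parts = array_reverse(explode('.', $host));
    if ( count($parts) <= 2 ) return $host;

    if ( in_array($parts[0], $c_tld) && in_array($parts[1], $g_tld) && $parts[2] != 'www' )
    {
        return $parts[2] . '.' . $parts[1] . '.' . $parts[0];
    }
    return $parts[1] . '.' . $parts[0];
}

$c_tld = array('id', 'uk');    // tiny sample, not the full country list
$g_tld = array('com', 'co');   // 'co' included as in the post

echo base_domain_from_host('www.aaa.or.id', $c_tld, $g_tld) . "\n";   // or.id

// Treat 'or' and 'ac' the same way 'co' is already treated.
$g_tld = array_merge($g_tld, array('or', 'ac'));

echo base_domain_from_host('www.aaa.or.id', $c_tld, $g_tld) . "\n";   // aaa.or.id
echo base_domain_from_host('aaa.ac.id', $c_tld, $g_tld) . "\n";       // aaa.ac.id
```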
Yes it returns "or.id" and so on. Btw sorry for not giving real world examples before. Now here it is (all sites given are legitimate):
http://www.indonesia.go.id/
http://www.wwf.or.id/
http://www.itb.ac.id/

"go" means government.
"or" means organization.
"ac" means academy.

I only gave examples of second-level domains from my country. I guess other countries have the same thing.
Hey I had to make a regex to handle this

//body of domain ex: iam.bacon.com we want bacon
$domainBody=preg_replace('/^[a-z0-9.-]*?[.]{0,1}([a-z0-9-]*?)\.[a-z.]{2,6}$/i',"$1",$serverHost);
it's pretty simple to change to look for the domain base + tld

more : http://gregsidberry.com/2008/05/11/php-domain-base-regex/
Greg,

You may want to look at what this is doing again and test your regex solution with my test cases. A regex alone won't work because a self-contained solution requires knowledge of the valid TLD set.
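A quick way to see the limitation is to run the regex from the comment above on two hosts that have the same shape. The wrapper name domain_body() is made up for this sketch; the pattern is copied from the comment unchanged.

```php
<?php
// domain_body() is a hypothetical wrapper around the regex quoted above.
function domain_body($serverHost)
{
    return preg_replace(
        '/^[a-z0-9.-]*?[.]{0,1}([a-z0-9-]*?)\.[a-z.]{2,6}$/i',
        '$1',
        $serverHost
    );
}

// Both hosts look the same to the pattern: label, dot, short tail.
echo domain_body('domain.co.uk') . "\n";    // domain
echo domain_body('domain.tv.com') . "\n";   // domain
```

By the test results in the post, the base of 'domain.tv.com' is 'tv.com', so its body should be 'tv'. The pattern cannot tell 'tv.com' apart from 'co.uk' without a list of which suffixes are real.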
Maybe it will be better to use this list - there are exports to txt, serialized array... http://labs.lucien144.net/domains/
Interesting. I think mine is cleaner, but this is a useful alternative.
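For comparison, a list-driven lookup might look like the sketch below. It assumes the list is available as a flat array of suffixes such as 'com' or 'co.uk'; the sample array and the function name base_domain_by_suffix() are invented for illustration, and the format of the export linked above has not been checked.

```php
<?php
// Hypothetical, hand-picked sample; a real list would be loaded from a
// published export rather than typed in.
$SUFFIXES = array('com', 'uk', 'co.uk', 'id', 'or.id', 'ac.id', 'go.id');

// base_domain_by_suffix() is an illustrative name, not from the post.
function base_domain_by_suffix($host, $suffixes)
{
    $labels = explode('.', strtolower($host));
    $n = count($labels);

    // Try the longest candidate suffix first. Start at index 1 so that at
    // least one label is left over to serve as the registered name.
    for ( $i = 1; $i < $n; $i++ )
    {
        $candidate = implode('.', array_slice($labels, $i));
        if ( in_array($candidate, $suffixes) )
        {
            return $labels[$i - 1] . '.' . $candidate;
        }
    }

    // No known suffix: return the host unchanged.
    return $host;
}

echo base_domain_by_suffix('www.wwf.or.id', $SUFFIXES) . "\n";                // wwf.or.id
echo base_domain_by_suffix('secure.email.website.co.uk', $SUFFIXES) . "\n";  // website.co.uk
echo base_domain_by_suffix('domain.tv.com', $SUFFIXES) . "\n";               // tv.com
```

The trade-off is that the result is only as good as the list, which has to be kept up to date.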