The problem: take a URL (in this case an absolute URL, http and all) and return only the base domain. For instance, given 'http://www.domain.com' or 'http://sub.subdomain.domain.com', it should return 'domain.com'.
Simple enough, but now consider: 'http://www.example_site.com.pk' or 'http://damnlimies.co.uk' or 'http://username:password@this.is.a.worst.shortly.subdomain.thisIsMyMainWebsite.com.cl'
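The scheme and credentials in that last URL are the easy part, since PHP's built-in parse_url() already separates them from the host. A minimal sketch (the variable names are only illustrative) also shows why the input has to be absolute: without a scheme, parse_url() treats the whole string as a path and reports no host.

```php
<?php
// Illustrative only: inspect what parse_url() extracts from the ugly URL.
$ugly = 'http://username:password@this.is.a.worst.shortly.subdomain.thisIsMyMainWebsite.com.cl';

// PHP_URL_HOST asks for just the host component; user and password are dropped.
$host = parse_url($ugly, PHP_URL_HOST);
echo $host . "\n";
// this.is.a.worst.shortly.subdomain.thisIsMyMainWebsite.com.cl

// Without a scheme there is no host component: the string is parsed as a path.
$no_scheme = parse_url('www.domain.com', PHP_URL_HOST);
var_dump($no_scheme); // NULL
```

That leaves the hard part, which is deciding how many of the trailing labels belong to the base domain.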
It is an anvil upon which many a hammer has been broken:
http://lists.evolt.org/archive/Week-of-Mon-20031201/152316.html
http://www.webmasterworld.com/forum88/10656.htm
Anyway, I think I have a solution. I won't bother with the details of how I arrived at it. Suffice it to say, you just need to break down your own URL parsing process step by step. It passed the battery of tests at the end:
// get base domain (domain.tld)
/*____________________________________________________________________________*/
function get_base_domain($url)
{
$debug = 0;
$base_domain = '';
// generic tlds (source: http://en.wikipedia.org/wiki/Generic_top-level_domain)
$G_TLD = array(
'biz','com','edu','gov','info','int','mil','name','net','org',
'aero','asia','cat','coop','jobs','mobi','museum','pro','tel','travel',
'arpa','root',
'berlin','bzh','cym','gal','geo','kid','kids','lat','mail','nyc','post','sco','web','xxx',
'nato',
'example','invalid','localhost','test',
'bitnet','csnet','ip','local','onion','uucp',
'co' // note: not technically a gTLD, but used as a second-level label in things like co.uk
);
// country tlds (source: http://en.wikipedia.org/wiki/Country_code_top-level_domain)
$C_TLD = array(
// active
'ac','ad','ae','af','ag','ai','al','am','an','ao','aq','ar','as','at','au','aw','ax','az',
'ba','bb','bd','be','bf','bg','bh','bi','bj','bm','bn','bo','br','bs','bt','bw','by','bz',
'ca','cc','cd','cf','cg','ch','ci','ck','cl','cm','cn','co','cr','cu','cv','cx','cy','cz',
'de','dj','dk','dm','do','dz','ec','ee','eg','er','es','et','eu','fi','fj','fk','fm','fo',
'fr','ga','gd','ge','gf','gg','gh','gi','gl','gm','gn','gp','gq','gr','gs','gt','gu','gw',
'gy','hk','hm','hn','hr','ht','hu','id','ie','il','im','in','io','iq','ir','is','it','je',
'jm','jo','jp','ke','kg','kh','ki','km','kn','kr','kw','ky','kz','la','lb','lc','li','lk',
'lr','ls','lt','lu','lv','ly','ma','mc','md','mg','mh','mk','ml','mm','mn','mo','mp','mq',
'mr','ms','mt','mu','mv','mw','mx','my','mz','na','nc','ne','nf','ng','ni','nl','no','np',
'nr','nu','nz','om','pa','pe','pf','pg','ph','pk','pl','pn','pr','ps','pt','pw','py','qa',
're','ro','ru','rw','sa','sb','sc','sd','se','sg','sh','si','sk','sl','sm','sn','sr','st',
'sv','sy','sz','tc','td','tf','tg','th','tj','tk','tl','tm','tn','to','tr','tt','tv','tw',
'tz','ua','ug','uk','us','uy','uz','va','vc','ve','vg','vi','vn','vu','wf','ws','ye','yu',
'za','zm','zw',
// inactive
'eh','kp','me','rs','um','bv','gb','pm','sj','so','yt','su','tp','bu','cs','dd','zr'
);
// get domain
if ( !$full_domain = get_url_domain($url) )
{
return $base_domain;
}
// now the fun
// break up domain, reverse
$DOMAIN = explode('.', $full_domain);
if ( $debug ) print_r($DOMAIN);
$DOMAIN = array_reverse($DOMAIN);
if ( $debug ) print_r($DOMAIN);
// first check for ip address (validates the whole host, not just the first and last octet)
if ( filter_var($full_domain, FILTER_VALIDATE_IP) )
{
return $full_domain;
}
// with 2 or fewer domain parts, the whole host must be our domain
if ( count($DOMAIN) <= 2 ) return $full_domain;
/*
finally, with 3+ domain parts: obviously D0 is tld
now, if D0 = ctld and D1 = gtld, we might have something like com.uk
so, if D0 = ctld && D1 = gtld && D2 != 'www', domain = D2.D1.D0
else if D0 = ctld && D1 = gtld && D2 == 'www', domain = D1.D0
else domain = D1.D0
these rules are simplified below
*/
if ( in_array($DOMAIN[0], $C_TLD) && in_array($DOMAIN[1], $G_TLD) && $DOMAIN[2] != 'www' )
{
$full_domain = $DOMAIN[2] . '.' . $DOMAIN[1] . '.' . $DOMAIN[0];
}
else
{
$full_domain = $DOMAIN[1] . '.' . $DOMAIN[0];
}
// did we succeed?
return $full_domain;
}
/*____________________________________________________________________________*/
// get domain from url
/*____________________________________________________________________________*/
function get_url_domain($url)
{
$domain = '';
$_URL = parse_url($url);
// sanity check
if ( empty($_URL) || empty($_URL['host']) )
{
$domain = '';
}
else
{
$domain = $_URL['host'];
}
return $domain;
}
/*____________________________________________________________________________*/
// Testbed
/*____________________________________________________________________________*/
if ( 1 )
{
// test code here
$TESTURL[] = 'http://127.0.0.1';
$TESTURL[] = 'http://www.examplesite.com.pk';
$TESTURL[] = 'http://domain.tv.com';
$TESTURL[] = 'http://domain.com.tv';
$TESTURL[] = 'http://domain.tv';
$TESTURL[] = 'http://domain.com';
$TESTURL[] = 'http://secure.email.website.co.uk';
$TESTURL[] = 'http://username:password@this.is.a.worst.shortly.subdomain.thisIsMyMainWebsite.com.cl';
foreach ( $TESTURL as $url )
{
echo $url . ' -> ' . get_base_domain($url) . "\n";
}
}
/*____________________________________________________________________________*/
results:
http://127.0.0.1 -> 127.0.0.1
http://www.examplesite.com.pk -> examplesite.com.pk
http://domain.tv.com -> tv.com
http://domain.com.tv -> domain.com.tv
http://domain.tv -> domain.tv
http://domain.com -> domain.com
http://secure.email.website.co.uk -> website.co.uk
http://username:password@this.is.a.worst.shortly.subdomain.thisIsMyMainWebsite.com.cl -> thisIsMyMainWebsite.com.cl
The source will eventually end up in the bafflegate repository.
Thanks for asking. I should add a note here. It's GPL (v2). So feel free to use it.
I don't think it's an issue if you're not distributing it. In any event, I've released the code under an LGPL license at my Google Code site:
http://code.google.com/p/klenwell/
www.aaa.or.id
www.aaa.ac.id
Where can I find a list of labels such as "or" and "ac" on Wikipedia?
aaa.or.id (without www)
Just want to let you know that "aaa" is just a name just like "google" etc. Thanks.
www.aaa.or.id
www.aaa.ac.id
What is it returning? or.id? ac.id? That would be the correct response, at least according to how it's currently programmed and configured.
I suspect that the base domains are or.id and ac.id and that the owner is reselling subdomains like aaa.or.id as if they were base domains. 'ac' and 'id' are both country-code TLDs ('or' is not), so saying mydomain.ac.id is like saying mydomain.cn.us. I don't think they're official TLDs.
In any event, adding 'or' and 'ac' to the $G_TLD array (as is the case with 'co') should solve your problem.
http://www.indonesia.go.id/
http://www.wwf.or.id/
http://www.itb.ac.id/
"go" means government.
"or" means organization.
"ac" means academy.
I only gave examples of second-level domains from my country. I guess other countries have the same thing as well.
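The labels from these comments could be handled the same way 'co' already is. Below is a trimmed-down sketch, not the full function from the post: sketch_base_domain() and its two short lists are hypothetical stand-ins for $G_TLD and $C_TLD, kept small only for illustration, and the 'www' test mirrors the original rule.

```php
<?php
// Hypothetical, shortened lists for illustration; the real arrays are much longer.
$second_level = array('com', 'co', 'go', 'or', 'ac');
$country      = array('id', 'uk', 'pk');

// Return the base domain of a bare host name, using the same rule as get_base_domain().
function sketch_base_domain($host, array $second_level, array $country)
{
    $parts = array_reverse(explode('.', $host));

    // two parts or fewer: nothing to strip
    if (count($parts) <= 2) {
        return $host;
    }

    // country TLD preceded by a known second-level label: keep three parts
    if (in_array($parts[0], $country) && in_array($parts[1], $second_level) && $parts[2] != 'www') {
        return $parts[2] . '.' . $parts[1] . '.' . $parts[0];
    }

    // otherwise keep the last two parts
    return $parts[1] . '.' . $parts[0];
}

foreach (array('www.indonesia.go.id', 'www.wwf.or.id', 'www.itb.ac.id', 'aaa.or.id') as $host) {
    echo $host . ' -> ' . sketch_base_domain($host, $second_level, $country) . "\n";
}
```

Whether 'ac' belongs in such a list is a judgment call, since it is also a country code in its own right.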
//body of domain ex: iam.bacon.com we want bacon
$domainBody=preg_replace('/^[a-z0-9.-]*?[.]{0,1}([a-z0-9-]*?)\.[a-z.]{2,6}$/i',"$1",$serverHost);
It's pretty simple to change it to look for the domain base + TLD.
More: http://gregsidberry.com/2008/05/11/php-domain-base-regex/
You may want to look at what this is doing again and test your regex solution with my test cases. A regex alone won't work because a self-contained solution requires knowledge of the valid TLD set.
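To make that concrete, here is a small check of the regex from the comment against the hosts of two of the test URLs. The $hosts array and the loop are only illustrative. Both hosts have the same shape, so the pattern captures the same label for each, although get_base_domain() reports tv.com for the first and domain.com.tv for the second.

```php
<?php
// The pattern quoted in the comment, unchanged.
$pattern = '/^[a-z0-9.-]*?[.]{0,1}([a-z0-9-]*?)\.[a-z.]{2,6}$/i';

$hosts  = array('domain.tv.com', 'domain.com.tv');
$bodies = array();
foreach ($hosts as $host) {
    // replace the whole match with the captured "domain body"
    $bodies[$host] = preg_replace($pattern, '$1', $host);
    echo $host . ' -> ' . $bodies[$host] . "\n";
}
// Both lines print "domain": the pattern sees only the shape of the host,
// not which of the trailing labels are real TLDs.
```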