This is a discussion on "Links a crawler should ignore" within the Web Page Design section. This forum and the thread "Links a crawler should ignore" are both part of the Design Your Website category.
Links a crawler should ignore
Hi, I have developed some code that crawls web pages looking for links. I need to filter out irrelevant links, such as those that refer to CSS files, JavaScript functions, and favicons; this is simple enough to achieve with regex. What I need to know is: what other irrelevant links am I likely to find on web pages?
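For illustration, the kind of filter described above might look roughly like the Python sketch below. The crawler's actual code is not shown in the post, so the function name `is_relevant` and both skip lists are assumptions. The `mailto` scheme and the fragment-only check are extra guesses, not something the post mentions.

```python
import re
from urllib.parse import urlparse

# Hypothetical skip lists: only css, javascript and favicons come from the
# post; "mailto" is an assumed extra. Extend these to suit the crawl.
SKIP_SCHEMES = {"javascript", "mailto"}
SKIP_EXTENSIONS = re.compile(r"\.(css|js|ico)$", re.IGNORECASE)


def is_relevant(href: str) -> bool:
    """Return True if the link looks worth following."""
    href = href.strip()
    # Empty hrefs and same-page anchors lead nowhere new (assumed rule).
    if not href or href.startswith("#"):
        return False
    parts = urlparse(href)
    # Pseudo-links such as javascript:void(0) are identified by scheme.
    if parts.scheme.lower() in SKIP_SCHEMES:
        return False
    # Test the path only, so a query string like ?v=3 does not hide the extension.
    return not SKIP_EXTENSIONS.search(parts.path)


if __name__ == "__main__":
    for link in ["http://example.com/page.html", "javascript:void(0)",
                 "/favicon.ico", "style.css?v=3", "#top"]:
        print(link, is_relevant(link))
```

Parsing the URL first and matching the regex against the path alone avoids false negatives on versioned assets such as `style.css?v=3`.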
Also, is there a name for links of the following form?
http://www.bbc.co.uk/go/homepage/www/lht/h2/t/-/http://www.tvlicensing.co.uk/index.jsp
Cheers,
Don
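Whatever such links are called, a crawler could at least detect them. The sketch below assumes that the second URL embedded in the path is the real destination, which the post does not confirm. The function name `embedded_target` is hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches the start of a second absolute URL inside the path or query.
EMBEDDED_URL = re.compile(r"https?://", re.IGNORECASE)


def embedded_target(href: str) -> Optional[str]:
    """Return the URL embedded in href's path or query, or None if there is none."""
    parts = urlparse(href)
    # Search everything after the host, so the link's own scheme is not matched.
    rest = parts.path
    if parts.query:
        rest += "?" + parts.query
    match = EMBEDDED_URL.search(rest)
    return rest[match.start():] if match else None


if __name__ == "__main__":
    link = ("http://www.bbc.co.uk/go/homepage/www/lht/h2/t/-/"
            "http://www.tvlicensing.co.uk/index.jsp")
    print(embedded_target(link))
```

A crawler could then decide whether to follow the wrapper link, the embedded target, or both.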
Tags: unusual links, web crawler