Opened 3 years ago

Last modified 3 years ago

#1781 assigned defect

Crawling: i2p2.i2p recursive source loops

Reported by: k1773r
Owned by: str4d
Priority: minor
Milestone: undecided
Component: www/i2p
Version: 0.9.24
Keywords:
Cc:
Parent Tickets:

Description

While crawling www.i2p2.i2p I get recursive links that lead to a "page not found" page, even though the HTTP status is 200. Those pages contain further nested links, and the cycle starts all over again. Eventually the crawler hits a 404 (as shown below).

Crawler logs:
The first link is the page crawled; the second link is where it came from.

    2016-04-06T**:19:40.798Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_ru.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_ru.html text/html #044 20160406**1940424+346 sha1:66374BVL4IQZ3HBJXFVOAYAZBWU6VGEQ - -		
    2016-04-06T**:19:39.700Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_nl.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_nl.html text/html #018 20160406**1939082+603 sha1:JWWJX7KEBMZCBJSEZW6C3TQPEEA6VG32 - -				
    2016-04-06T**:19:38.583Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_it.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_it.html text/html #047 20160406**1938203+365 sha1:TNDZLJEXSFWTE3UZ3FX4BHELNBQSAW3F - -				
    2016-04-06T**:19:37.853Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_fr.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_fr.html text/html #029 20160406**1937490+336 sha1:UIIBTTZBEW2LHC5TIWALY33YBZPQ4Y5C - -				
    2016-04-06T**:19:37.081Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_zh.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_zh.html text/html #018 20160406**1936671+397 sha1:P6IKCGRG77YEY3U3QGET6JQICO2M274M - -		
    2016-04-06T**:19:36.201Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_es.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_es.html text/html #047 20160406**1935726+448 sha1:GWBZFXTRMUQZQIPJ4EKA3FW4ERRRLYHS - -		
    2016-04-06T**:19:35.361Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_de.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_de.html text/html #040 20160406**1934995+353 sha1:M56A3Y62E7AJYUEURZ224EEEYXS3GYCP - -	
    2016-04-06T**:19:34.526Z   404      22318 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index.html text/html #048 20160406**1934130+372 sha1:MAW4ZNR2RB4RFR6XG2UECOZCKQFT4TFW - -
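
To make the log entries easier to compare, the leading fields can be split out programmatically. The following is a minimal sketch, assuming the fields are whitespace-separated and appear in the order shown above (timestamp, HTTP status, size, crawled URL, a hop-path string, referring URL, MIME type). The names `CrawlEntry`, `parse_crawl_line` and `hop_path` are hypothetical and not taken from the crawler itself.

```python
from typing import NamedTuple


class CrawlEntry(NamedTuple):
    timestamp: str
    status: int
    size: int
    url: str        # the page that was crawled (first link)
    hop_path: str   # assumed meaning: how the crawler reached the page
    referrer: str   # where the link came from (second link)
    mime_type: str


def parse_crawl_line(line: str) -> CrawlEntry:
    """Split one log line into its first seven whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(fields)}")
    timestamp, status, size, url, hop_path, referrer, mime_type = fields[:7]
    return CrawlEntry(timestamp, int(status), int(size), url,
                      hop_path, referrer, mime_type)
```

For the entries above, `parse_crawl_line(line).status` would be 404, and `.referrer` would be the `http://` variant of the crawled `https://` URL.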

The crawler would detect the loop after a few levels of nesting, but for now I have just created an exclude regex.
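
The exclude regex itself is not included in this ticket. As a rough illustration only, a filter of this kind might look like the sketch below, which assumes that a repeated `_static` path segment is a reliable sign of the loop. `EXCLUDE_RE`, `looks_like_loop` and `max_repeats` are hypothetical names, and the regex syntax accepted by a given crawler may differ from Python's.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical exclude pattern: "_static" occurs as a path segment at least
# twice, optionally with other segments (e.g. "styles") in between.
EXCLUDE_RE = re.compile(r"/_static/(?:[^?#]*/)?_static/")


def looks_like_loop(url: str, max_repeats: int = 1) -> bool:
    """Return True if any path segment occurs more than max_repeats times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())
```

The regex variant is specific to `_static`, while the segment-counting variant is more general but could also flag legitimate URLs that repeat a directory name, so `max_repeats` may need tuning.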

Change History (2)

comment:1 Changed 3 years ago by k1773r

  • Version set to 0.9.24

The status code also varies depending on which host is being used:

geti2p.net is 404
i2p-projekt.i2p is 302
i2p2.i2p is 200

For example, on i2p2.i2p:

2016-04-06T**:39:16.911Z 200 6436 http://www.i2p2.i2p/feeds/p/i2p/downloads/_static/_static/styles/_static/styles/_static/_static/styles/_static/_static/donate.html LEEEEEEEEL http://www.i2p2.i2p/feeds/p/i2p/downloads/_static/_static/styles/_static/styles/_static/_static/styles/_static/_static/favicon.ico text/html #003 20160406**3916031+862 sha1:6QTBW2PILK47WKXLN6SKFFE5KWWH2AZR - -

The fact that each site links to the other sites using the invalid/nonexistent page makes it even worse.

comment:2 Changed 3 years ago by zzz

  • Owner set to str4d
  • Status changed from new to assigned