euscan: robots.txt, timeout, user-agent, ...

- Add a blacklist for robots.txt; we *want* to scan sourceforge
- Set a user-agent that doesn't look like a browser
- Handle timeouts more carefully
- If brute forcing detects too many versions, stop to avoid infinite loops
- Handle redirections more carefully (a rough sketch of these pieces follows below)

Signed-off-by: Corentin Chary <corentincj@iksaif.net>
commit 14971584af
parent 8c40a1795c
Author: Corentin Chary
Date: 2011-09-21 10:09:50 +02:00

6 changed files with 95 additions and 17 deletions
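For context, a minimal sketch of how these pieces could fit together. It is illustrative only: USER_AGENT, TIMEOUT, ROBOTS_BLACKLIST, BRUTEFORCE_MAX, can_fetch, fetch and brute_force are hypothetical names and values, not the ones this commit actually introduces.

# Illustrative sketch only; identifiers and values are assumptions,
# not the ones introduced by this commit.
import robotparser
import urllib2
from urlparse import urlparse

USER_AGENT = 'euscan'            # honest UA instead of a fake browser string
TIMEOUT = 15                     # seconds, applied to every fetch
ROBOTS_BLACKLIST = ['sourceforge.net']  # hosts whose robots.txt we ignore
BRUTEFORCE_MAX = 100             # cap on brute-forced version hits

def can_fetch(url):
    """Honor robots.txt, except for blacklisted hosts."""
    host = urlparse(url).netloc
    if any(host.endswith(domain) for domain in ROBOTS_BLACKLIST):
        return True
    rp = robotparser.RobotFileParser()
    rp.set_url('http://%s/robots.txt' % host)
    try:
        rp.read()
    except IOError:
        return True  # unreadable robots.txt: assume fetching is allowed
    return rp.can_fetch(USER_AGENT, url)

def fetch(url):
    """Fetch a URL with our user-agent and a hard timeout."""
    request = urllib2.Request(url, headers={'User-Agent': USER_AGENT})
    # urllib2 follows redirects on its own; response.geturl() exposes the
    # final URL so callers can see where they actually ended up.
    return urllib2.urlopen(request, timeout=TIMEOUT)

def brute_force(candidate_urls):
    """Probe candidate URLs, bailing out when the hit count runs away
    (a server answering 200 for everything would otherwise never stop)."""
    found = []
    for url in candidate_urls:
        if len(found) >= BRUTEFORCE_MAX:
            break
        if not can_fetch(url):
            continue
        try:
            fetch(url)
        except IOError:  # URLError and socket.timeout both derive from IOError
            continue
        found.append(url)
    return found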

TODO

@@ -4,16 +4,12 @@ TODO
euscan
------
- respect robots.txt (portscout)
- check other distros (youri)
- clean blacklist system
- add a way to blacklist versions using standard package tokens
- =x11-drivers/xf86-video-intel-2.14.90*
- >=x11-base/xorg-server-1.10.900
Site Handlers
-------------
- sourceforge: http://sourceforge.net/api/file/index/project-name/vboxgtk/mtime/desc/limit/20/rss http://sourceforge.net/api/release/index/project-id/264534/rss
- ftp.kde.org: doesn't scan the "unstable" tree
- mysql: should use http://downloads.mysql.com/archives/
- mariadb: should use http://downloads.askmonty.org/MariaDB/+releases/
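On the "blacklist versions using standard package tokens" item above, one possible shape is sketched below. It leans on portage's catpkgsplit/vercmp helpers so the operators follow ebuild version semantics; VERSION_BLACKLIST and is_blacklisted are illustrative names, and only the two token forms from the examples are handled.

# Hypothetical sketch for the version-blacklist TODO item above; only
# the two token forms from the examples (= with a glob, >=) are handled.
import fnmatch
from portage.versions import catpkgsplit, vercmp

VERSION_BLACKLIST = [
    '=x11-drivers/xf86-video-intel-2.14.90*',
    '>=x11-base/xorg-server-1.10.900',
]

def is_blacklisted(cpv):
    """Return True if a category/package-version matches a blacklist token."""
    category, package, version, revision = catpkgsplit(cpv)
    for token in VERSION_BLACKLIST:
        if token.startswith('>='):
            tcat, tpkg, tver, trev = catpkgsplit(token[2:])
            if (category, package) == (tcat, tpkg) and vercmp(version, tver) >= 0:
                return True
        elif token.startswith('='):
            if fnmatch.fnmatch(cpv, token[1:]):
                return True
    return False

# e.g. is_blacklisted('x11-base/xorg-server-1.10.901') -> True
#      is_blacklisted('x11-base/xorg-server-1.10.2')   -> False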
@@ -22,3 +18,6 @@ euscanwww
---------
- add progress options for each command
- add last scan in the footer
- add json/xml for each page
- rss scan world + post?
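As for the sourceforge handler item above, a sketch of pulling recent file releases from the per-project RSS feed listed in the TODO. The channel/item/title layout is an assumption about the feed's contents, and sourceforge_files is an illustrative name.

# Sketch of a sourceforge handler built on the RSS feed from the TODO
# above; the channel/item/title layout is an assumption, not verified.
import urllib2
import xml.etree.ElementTree as ElementTree

def sourceforge_files(project, limit=20):
    """List the most recently modified files of a sourceforge project."""
    url = ('http://sourceforge.net/api/file/index/project-name/'
           '%s/mtime/desc/limit/%d/rss' % (project, limit))
    feed = urllib2.urlopen(url, timeout=15)
    tree = ElementTree.parse(feed)
    return [item.findtext('title') for item in tree.findall('channel/item')]

A handler would then run the usual version regexes over those file names.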