Webbot

WebBot v.2.0

Internet page analysis/download utility

This function is a Java-based "web browser" that extracts all links from a web page and displays them. The linked documents can also be downloaded. Its design can serve as a basic example of using the java.net.URL Java class, as well as Perl-style regular expressions.
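As an illustration of the regular-expression side, here is a minimal sketch of pulling link targets out of HTML text. It is not webbot's actual code: the sample HTML, the patterns and the variable names are all assumptions made for the example, and the page source is held in a string so that nothing is fetched from the network.

```matlab
% Sample HTML held in a string so the sketch runs without network access
% (hypothetical content, not taken from any real page).
html = ['<html><body>' ...
        '<a href="page1.html">One</a>' ...
        '<A HREF="http://www.example.com/files/data.ZIP">Data</A>' ...
        '<img src="img/01.gif">' ...
        '</body></html>'];

% Capture the quoted value of every href attribute, ignoring case.
tokens = regexpi(html, 'href\s*=\s*"([^"]*)"', 'tokens');

% Each match is a 1x1 cell holding the captured URL; flatten the result
% into a plain cell array of strings.
links = cellfun(@(t) t{1}, tokens, 'UniformOutput', false);

% Same idea for the src attribute of image tags.
imgTokens = regexpi(html, '<img[^>]*src\s*=\s*"([^"]*)"', 'tokens');
images = cellfun(@(t) t{1}, imgTokens, 'UniformOutput', false);

disp(links);
disp(images);
```

The patterns only handle double-quoted attribute values; real pages also use single quotes or no quotes at all, so a robust extractor would need more cases.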

Content and usage

webbot(URL) lists all links in the HTML file at URL. URL is a string giving the base page address and must point to an HTML file; it can also be a cell vector of URL strings.
Selection
webbot(URL, WHAT) displays only specific links. WHAT is a string:
 'all_links' displays all links (default).
 'page_links' displays all links to an HTML web page*.
 'local_links' displays all local links on the server*.
 'external_links' displays all links to external websites.
 'image_links' displays all links to an image file*.
 'image_tags' displays all image tags <img src="xxx">.
 '.xxx.yyyy.zz' displays all links to files with any of the listed extensions; case is ignored ('zip' will find 'ZiP'). Example: '.zip.gz.gzip.tar.Z'.
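The extension selector can be pictured as follows. This is only a sketch of the matching rule described above (dot-separated extensions, case ignored), not the function's real implementation; the link list and variable names are made up for the example.

```matlab
% Hypothetical list of links, as might be found on a page.
links = {'http://www.example.com/a.ZiP', ...
         'http://www.example.com/b.tar.gz', ...
         'http://www.example.com/index.html', ...
         'http://www.example.com/c.Z'};

% Selector in the form accepted by the WHAT argument.
what = '.zip.gz.gzip.tar.Z';

% Split the selector on dots to get the individual extensions.
exts = regexp(what, '[^.]+', 'match');

% Build one pattern: a dot, any listed extension, then end of string.
pattern = ['\.(' strjoin(exts, '|') ')$'];

% Keep the links whose ending matches, ignoring case.
hits = regexpi(links, pattern, 'match', 'once');
isHit = ~cellfun(@isempty, hits);
selected = links(isHit);

disp(selected);
```

Here 'a.ZiP', 'b.tar.gz' and 'c.Z' are kept, while 'index.html' is dropped. Extensions containing regular-expression metacharacters would need escaping first, for instance with regexptranslate('escape', ...).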
Action
webbot(URL, WHAT, ACT) performs an action on the links found. ACT is a string:
 'noaction' just displays the links (default).
 'download' downloads all links found.
 'cartoons' downloads all image links found on linked pages. This is useful e.g. for cartoon websites where each cartoon (e.g. "01.gif") is on its own HTML page (e.g. "c01.html")*.
 'follow.x' follows links to HTML pages and recursively performs the same action on each resulting page. 'x' is an integer giving the recursion depth (0 is equivalent to 'noaction').
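To make the depth parameter of 'follow.x' concrete, here is a small sketch of depth-limited link following. It works on a hypothetical in-memory "site" rather than on real pages, and it is written as a level-by-level loop instead of a recursive call; both choices are assumptions for the sake of a self-contained example and say nothing about how webbot itself is coded.

```matlab
% Hypothetical site: pages{i} contains the page links listed in pageLinks{i}.
pages     = {'index.html', 'a.html', 'b.html', 'c.html', 'd.html'};
pageLinks = {{'a.html', 'b.html'}, {'c.html'}, {}, {'d.html'}, {}};

depth    = 2;                 % the x in 'follow.x'
frontier = {'index.html'};    % pages to process at the current level
visited  = {};                % pages already processed

for level = 0:depth
    next = {};
    for k = 1:numel(frontier)
        page = frontier{k};
        if any(strcmp(page, visited))
            continue;         % do not process the same page twice
        end
        visited{end+1} = page;                    % "perform the action" here
        found = pageLinks{strcmp(page, pages)};   % links on this page
        next = [next, found];                     % candidates for the next level
    end
    frontier = next;
end

disp(visited);
```

With depth = 2 the base page, its links, and their links are processed, so 'd.html' (three levels down) is never reached. With depth = 0 only the base page is processed, which matches the note that 0 is equivalent to 'noaction'.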
Return value
lks = webbot(URL, ...) returns a cell array with the links of URL{end}.


Version and download

Version 2.0 - Release 20.12.03 [Legal Stuff below]
Download the function as a .zip (6 Kb).

Version 1.0 - Release 15.10.03 [Legal Stuff below]
Download the function as a .m matlab file (18 Kb).

Limitations/Prerequisites

As a standard MATLAB installation comes with a Java Virtual Machine, there are no prerequisites. The function was developed and tested solely on Release 13. This early version has some limitations, which may or may not be addressed later:
  • Only English characters are recognized. A shame for a French speaker living in a German-speaking region: é, à, ç or ï are no better supported than ä, ö or ü. This is usually not a problem, as URLs are usually named using standard "ANSI" characters...
  • With the 'page_links' and 'local_links' flags, links are recognized only if they explicitly point to a .htm or .html URL, i.e. not if they point to a folder with an implicit call to index.html or index.htm. This has been partially fixed in version 2.0, but for lack of testing it is not certain to work seamlessly...
  • Image links (in the flag 'image_links') are recognized solely by file endings, namely the following file types: .jpg .jpeg .gif .pict .bmp .tif .tiff .ras .png (.giff)
  • The return value should be a consolidated list of all links found, not just the ones of the last url visited in the first call.

Changes from 1.0 to 2.0:

In version 1.0, the binary download was unbuffered, i.e. one byte was read at a time. The notes for that version said that this would be changed once I understood how to pass a Java array reference to Java (for an expected speed gain of at least a factor of 5).
 Note that the problem was not in using:
jA = javaArray('java.lang.Byte',n)
but in the fact that the 'read(byte[])' method of an 'inputStream' object does not accept the jA defined above as a parameter.
If somebody can help, please do!
This has now been changed, having learned that:
  • It is impossible in R13 to pass arrays by reference to Java functions.
  • MathWorks provides a cool class (.getInterruptibleStreamCopier) that does the job perfectly. Indeed, MathWorks even provides a built-in function to download files: urlwrite. Well, it is always nice to re-invent the wheel.
So the download is now buffered, and the speed gain is much higher than the expected factor of 5!
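For readers curious about the stream copier, the following sketch shows the general shape of a buffered copy from one Java stream to another. The fully qualified class name com.mathworks.mlwidgets.io.InterruptibleStreamCopier and the copyStream method are assumptions here: the class is undocumented, the text above only names .getInterruptibleStreamCopier, and it may be absent from some releases. In-memory streams stand in for the URL and file streams so that nothing is downloaded.

```matlab
% In-memory source and sink stand in for a URL stream and a file stream.
data = uint8('GIF89a: pretend these are image bytes');
in   = java.io.ByteArrayInputStream(data);
out  = java.io.ByteArrayOutputStream();

% Undocumented MathWorks helper (assumed class path and method name).
isc = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier();
isc.copyStream(in, out);      % the whole copy happens on the Java side

in.close();
out.close();

% toByteArray returns signed bytes (int8); reinterpret them as uint8.
copied = typecast(out.toByteArray(), 'uint8')';

disp(char(copied));
```

In an actual download, the input would be the stream opened on a java.net.URL object and the output a java.io.FileOutputStream. When nothing more than saving a file is needed, the documented urlwrite(url, filename) is the simpler route.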

Example:

Try this out with Dilbert's archives to download a full month's worth of Dilbert's comics of the day!
webbot('http://www.unitedmedia.com/comics/dilbert/archive/', 'local_links', 'cartoons');
Remember that these pictures are copyrighted; support the making of new comics by buying Scott Adams's books!





Legal blahblah & Conditions of use
© CSE - L.Cavin, 2003, 2004
These libraries and functions (THE PROGRAM) are provided "as is" without warranty, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of the program is with the person downloading or running the program (THE USER). Should the program prove defective, the user assumes the cost of all necessary servicing, repair or correction.
In no event will the Author or the Copyright holder be liable to the user for damages, including any general, special, incidental or consequential damages arising out of the use, peruse or inability to use the program (including, but not limited to loss of data or data being rendered inaccurate or losses sustained by the user or third parties or a failure of the program to operate with any other programs), even if the Author or the Copyright holder has been advised of the possibility of such damages.
The fact of downloading or running the program implies acceptance of the present liability limitation by the user.
These libraries and functions are free to use for non-commercial purposes and can be distributed (free of charge) as long as the copyright notices are kept intact. In particular, if the program is distributed further by the user, the user is responsible for including this legal warning and liability limitation in the distribution.
It is also encouraged to improve these functions; please send to the author any improvements - we may want to include them in this distribution under the same conditions.


Do not be afraid, I am confident that the program will work - without warranties of course ;-)
Thanks to the Free Software Foundation for inspiring this nice little legal blahblah... My favorite part is the "inability to use"... even if the "general, special, incidental or consequential" part is also quite fun!
