Saving results from Dogpile's Search Spy
Description:
Update: Dogpile's Search Spy is dead and gone, so this script no longer works. However, I have 4,023.42 megabytes of output ;)
Dogpile uses an XML feed to insert data into its Flash-based "Search Spy" application. This script runs in the background, contacts the XML feed directly, and archives the results in a text file.
Platform:
Background
Dogpile's Search Spy is very handy if you want a general sense of what people search for on the Internet. The application itself is written in Flash, and pulls data from one of two XML feeds:
Unfiltered: http://www.dogpile.com/info.dogpl/searchspy/inc/data.xml
Filtered: http://www.dogpile.com/info.dogpl/searchspy/inc/data.xml?filter=1
Opening either feed in your browser will load the XML file. Refreshing will pull in a different set of keywords.
The 'retriever' script
This script automates the scraping of the XML feed, stripping the extraneous XML markup as it goes and appending the results to an ongoing logfile. It defaults to the 'filtered' feed, but can be switched to the unfiltered feed by changing one variable. It runs until killed, pulling a new copy of the XML feed every 3 seconds or so. This is roughly the same frequency at which the Search Spy application itself contacts the feed.
#!/bin/bash

# We are, of course, IE6 under Windows XP
USERAGENT="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

# Adult (unfiltered) version
# DOGPILE="http://www.dogpile.com/info.dogpl/searchspy/inc/data.xml"
# Safe (filtered) version
DOGPILE="http://www.dogpile.com/info.dogpl/searchspy/inc/data.xml?filter=1"

# Default output file is 'dogpile.out'
OUTFILE="dogpile.out"

# Check to make sure that we have our necessary bits
for tool in lynx sed date; do
    if ! which "$tool" > /dev/null; then
        echo "Error, '$tool' is required."
        exit 1
    fi
done

echo "Retriever: Digging for bones. Output appends to $OUTFILE.mm_dd_yy"

# Run until killed: fetch the feed, turn each XML tag into a line break,
# and append what is left to today's log file
while true; do
    lynx --dump -useragent="$USERAGENT" "$DOGPILE" 2> /dev/null |
        sed -e :a -e 's/<[^>]*>/\n/g;/</N;//ba' >> "$OUTFILE.$(date +%m_%d_%y)"
    sleep 3
done
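Since the feed is gone, the tag-stripping step can still be tried on its own. The following is only a sketch: the sample XML and its element names are made up for illustration (the real feed's layout is not recorded here), the strip_tags helper is a hypothetical name, and the second sed that drops blank lines is an addition that is not part of the retriever script. It assumes GNU sed, which treats \n in the replacement as a newline.

```shell
#!/bin/bash

# Hypothetical sample input; the element names are assumptions, not the real feed format
sample='<searches><keyword>container gardening</keyword><keyword>irs.gov</keyword></searches>'

# strip_tags: read markup on stdin, write the text between tags, one item per line
strip_tags() {
    # First sed: replace every tag with a newline. If a "<" is still left over
    # (a tag split across lines), append the next input line and try again.
    # Second sed (an addition for readability): delete the resulting blank lines.
    sed -e :a -e 's/<[^>]*>/\n/g;/</N;//ba' | sed '/^[[:space:]]*$/d'
}

printf '%s\n' "$sample" | strip_tags
```

With the sample above, this should print "container gardening" and "irs.gov" on separate lines.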
Example Output:
container gardening sector zero virus ali frazier hotmail high school musical irs.gov best legs cds in microwave ovens cesarean ingredients of success buycom promotion code free psp downloads dating advice disney mermaid wand large pendant silver star real estate famous quotes about the holocaust michigan golfing bath room color ideas dewalt dw705r "cab by train lyrics" fun job opening in phoenix das es freud picture frames buck buchanan mount vernon texas employee benefit job crazy frog free ringtone ... and so on and so forth ...
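To get the "general sense" of what people search for out of the archived logs, one option is to count repeated queries. This is a rough sketch, assuming the logs hold one query per line; top_queries is a hypothetical helper name, not something from the retriever script.

```shell
#!/bin/bash

# top_queries FILE...: print the ten most frequent lines across the given log files
top_queries() {
    cat "$@" |
        sed '/^[[:space:]]*$/d' |   # ignore blank lines
        sort |                      # group identical queries together
        uniq -c |                   # prefix each distinct query with its count
        sort -rn |                  # highest counts first
        head -n 10
}
```

It could be run against all of the daily files at once, for example: top_queries dogpile.out.*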