Perl Sitemap Generator

I think everyone with more than 1M videos runs into the problem of building a sitemap. This script counts the packets (40k URLs each) in line 19 and then generates one sitemap per packet.
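To illustrate the packet idea independently of the Perl script, here is a minimal Python sketch. The function names and the example.com URLs are made up for illustration; only the 40k packet size comes from the text. It counts the packets with a ceiling division and renders one sitemap document per packet.

```python
from xml.sax.saxutils import escape

PACKET_SIZE = 40_000  # packet size from the text; the sitemap protocol allows up to 50,000 URLs per file


def packet_count(total_urls, packet_size=PACKET_SIZE):
    """Number of sitemap files needed (ceiling division)."""
    return (total_urls + packet_size - 1) // packet_size


def build_sitemap(urls):
    """Render one <urlset> document for a single packet of URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}</urlset>\n"
    )


def build_packets(urls, packet_size=PACKET_SIZE):
    """Split the full URL list into packets and render each one."""
    return [
        build_sitemap(urls[i:i + packet_size])
        for i in range(0, len(urls), packet_size)
    ]


if __name__ == "__main__":
    # Hypothetical URLs standing in for the video pages.
    urls = [f"https://example.com/video/{n}" for n in range(100_000)]
    sitemaps = build_packets(urls)
    print(packet_count(len(urls)), len(sitemaps))
```

With more than one packet you would normally also write a sitemap index file that lists the individual sitemap files.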

Perl FTP

If you want to upload files via Perl and FTP, you could use this script:
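For comparison, the same upload can be sketched in Python with the standard ftplib module. This is not the Perl script, only an illustration of the steps (connect, log in, store the file); the function names and all arguments are placeholders.

```python
from ftplib import FTP


def upload_file(ftp, fileobj, remote_name):
    """Upload an open binary file object under remote_name; return the server reply."""
    return ftp.storbinary(f"STOR {remote_name}", fileobj)


def upload_path(host, user, password, local_path, remote_name):
    """Connect, log in, upload one local file, and close the session."""
    with FTP(host) as ftp:  # FTP objects work as context managers
        ftp.login(user=user, passwd=password)
        with open(local_path, "rb") as fh:
            return upload_file(ftp, fh, remote_name)
```

Splitting the transfer into its own function keeps it usable with an already open connection, for example when several files go to the same server.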

JavaScript XPath and Sleep

If you need to select an element via JavaScript and XPath, you could use this code. In line 4 I sleep for about 10 seconds and then open the link in the same window.

Hitfaker

I use two types of hit faker. The first one only rebuilds the query cache in my database; you can't see its hits in Google Analytics.

First one:

I use wget and a shell script. The -r parameter makes wget recursive, so be careful with it:

Second one:

For this script I use PhantomJS with Selenium, and the Tor proxy for faking hits. If you want an HTTP Tor proxy, see “Tor Socks Proxy to http Proxy” below.

Lines 7 and 8 check for PhantomJS and start it.

Line 14 creates a new Selenium server; in line 18 you could use the default proxy option from Selenium.

Lines 21 and 22 go to the page and wait there.

Line 26 checks whether our site was loaded.

Line 28 is a command to kill our PhantomJS. First we list all process IDs and process arguments with ps, then we use egrep to find our webdriver, then we cut out the PID, and finally we kill the process.
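The ps, egrep, cut, kill chain can be sketched in Python as follows. This is an illustration of the same steps, not the command from the script; the function names are invented, and the pattern you pass in (for example something matching the phantomjs webdriver arguments) is your own choice.

```python
import os
import re
import signal
import subprocess


def find_pids(ps_output, pattern):
    """Return PIDs from 'ps -eo pid,args' output whose arguments match pattern."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.strip().split(None, 1)  # -> [pid, args]
        if len(parts) == 2 and parts[0].isdigit() and re.search(pattern, parts[1]):
            pids.append(int(parts[0]))
    return pids


def kill_matching(pattern):
    """List processes with ps, filter by pattern, and send SIGTERM to each match."""
    ps_output = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    ).stdout
    for pid in find_pids(ps_output, pattern):
        if pid != os.getpid():  # never signal this script itself
            os.kill(pid, signal.SIGTERM)
```

Keeping the parsing in find_pids separate from the actual kill makes it possible to check which processes would be hit before sending any signal.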

 

Now we need a little script to start our faker script. This Bash script starts faker.pl 10 times and then cleans up.

Tor Socks Proxy to http Proxy

If you use Tor and need an HTTP proxy for crawling, you can use Polipo for that.

Install Polipo:

Edit: “/etc/polipo/config”:

Line 2 is our new proxy port; lines 3 and 4 are the parent proxy address and port.
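Once Polipo is running, a crawler only has to point at it as a normal HTTP proxy. A minimal Python sketch, assuming Polipo listens on 127.0.0.1:8123 (its default port); replace this with the port you set in the config. The function names are made up for the example.

```python
import urllib.request

# Assumption: Polipo listens on 127.0.0.1:8123 (its default port); use the
# port configured in /etc/polipo/config instead.
PROXY_URL = "http://127.0.0.1:8123"


def build_proxy_opener(proxy_url=PROXY_URL):
    """Return a urllib opener that sends HTTP and HTTPS requests via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, opener, timeout=30):
    """Fetch a URL through the given opener and return the response body."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

A call like fetch("http://example.com/", build_proxy_opener()) would then travel through Polipo and on to Tor, provided both services are up.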

main.pl

The main script is triggered by the WATCHDOG.pl script and is able to use the keywords on the page.

Line 6 starts PhantomJS.

Line 11: keywords.

Line 15: main page crawl.

Line 22: keyword crawl.

 

WATCHDOG

This is my watchdog script to regulate the crawling scripts.

Lines 5-6 reset the scripts.

Line 8 starts the Selenium server.

Line 11 is my proxy port pointer. I use several ports, and to set the port I let the pointer iterate.

Lines 12-13 count the phantomjs and main.pl processes.

Lines 15-16 remove the line break.

Lines 18-21 check the maximum values, then start the script and let the pointer iterate.

Then the watchdog sleeps and repeats the work.
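The control loop described above (count, compare against the maximum, start, advance the port pointer, sleep) can be sketched in Python like this. The port numbers, limits, and function names are hypothetical, and the counting and starting are passed in as callables so the logic can be tried without real processes.

```python
import itertools
import time

PROXY_PORTS = [8123, 8124, 8125]  # hypothetical proxy ports
MAX_PHANTOMJS = 5                 # hypothetical limits
MAX_MAIN = 5


def may_start(phantom_count, main_count, max_phantom=MAX_PHANTOMJS, max_main=MAX_MAIN):
    """True while both process counts are below their maximum values."""
    return phantom_count < max_phantom and main_count < max_main


def watchdog(count_processes, start_crawler, rounds, pause=0.0, ports=PROXY_PORTS):
    """Run the check/start/sleep cycle `rounds` times; return the ports used."""
    pointer = itertools.cycle(ports)  # the port pointer wraps around
    used = []
    for _ in range(rounds):
        phantom_count, main_count = count_processes()
        if may_start(phantom_count, main_count):
            port = next(pointer)      # advance only when a crawler starts
            start_crawler(port)
            used.append(port)
        time.sleep(pause)
    return used
```

In a real watchdog, count_processes would count the phantomjs and main.pl processes from ps output, and start_crawler would launch main.pl with the chosen proxy port.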

 

Apache-Cache

If you want to cache data in the user's browser cache to avoid high traffic, you can enable caching. I chose 60 days for images and 2 days for JavaScript and CSS.
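Cache lifetimes in HTTP headers are given in seconds, so the two durations have to be converted. The following Python sketch only shows that arithmetic and the resulting Cache-Control values; it is not the Apache configuration, and the extension lists and function name are assumptions made for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Lifetimes from the text; the extension lists are illustrative assumptions.
CACHE_DAYS = {
    ("jpg", "jpeg", "png", "gif", "ico"): 60,  # images: 60 days
    ("js", "css"): 2,                          # JavaScript and CSS: 2 days
}


def cache_control(filename):
    """Return a Cache-Control value for filename, or None if it is not cached."""
    extension = filename.rsplit(".", 1)[-1].lower()
    for extensions, days in CACHE_DAYS.items():
        if extension in extensions:
            return f"max-age={days * SECONDS_PER_DAY}, public"
    return None
```

So 60 days correspond to max-age=5184000 and 2 days to max-age=172800.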

Edit: “/etc/apache2/apache2.conf”:

 

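Maybe this helps you: if apache2 reports “Invalid command ‘Header’, perhaps misspelled or defined by a module not included in the server configuration”, the headers module (mod_headers) is not enabled. On Debian-style installations you can enable it with a2enmod headers and then restart Apache.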