Perl Sitemap Generator

I think everyone who has more than 1M videos runs into the problem of building a sitemap … this script first counts how many 40k-URL packets are needed (the `round(count(*)/40000+0.5)` query) and then writes one sitemap file per packet.
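The packet count is essentially a ceiling division; a quick sanity check in shell, with an assumed count of one million URLs:

```shell
# ceiling division: how many 40,000-URL sitemap files do we need?
N=1000000      # assumed number of matching videos
BATCH=40000
FILES=$(( (N + BATCH - 1) / BATCH ))
echo $FILES
```

For one million URLs that prints 25 files.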

use strict;
use warnings;
use DBI;

my $host = "";
my $database = "db";
my $tablename = "table";
my $user = "root";
my $pw = "pwd1234";

my $dbh = DBI->connect('DBI:mysql:'.$database, $user, $pw)
    || die "Could not connect to database: $DBI::errstr";

# count how many 40k packets we need
my $sitemap_counter = 0;
my $th = $dbh->prepare('SELECT round(count(*)/40000+0.5) FROM `video` WHERE `bw` <= 80');
$th->execute();
while (my @row = $th->fetchrow_array()) {
    $sitemap_counter = $row[0];
}

# write one sitemap file per packet
for my $SMP (0 .. $sitemap_counter) {
    print "build midd".$SMP.".xml ...".$/;
    open(DATEI, "> /var/www/MAP/midd".$SMP.".xml") or die $!;
    print DATEI '<?xml version="1.0" encoding="UTF-8"?>'.$/;
    print DATEI '<urlset xmlns="">'.$/;
    $th = $dbh->prepare('SELECT CONCAT("<url><loc>", `id`, "</loc><priority>0.8</priority></url>") FROM `video` WHERE `bw` <= 80 ORDER BY `id` LIMIT '.(40000*$SMP).', 40000');
    $th->execute();
    while (my @row = $th->fetchrow_array()) {
        print DATEI $row[0].$/;
    }
    print DATEI '</urlset>';
    close(DATEI);
    print "[DONE]".$/;
}


Perl FTP

If you want to upload files via Perl and FTP, you could use this script:

use strict;
use warnings;
use File::Basename;
use Net::FTP;

my $ordner = time();
my $directory = "backupordnerpfad";
my @parts = split(/\//, $directory);
my $ordnerDir = $parts[-1];    # last component of the path

my $ftp = Net::FTP->new("", Debug => 1)
    or die "Cannot connect to hostname: $@";
$ftp->login("username", "passwort")
    or die "Cannot login ", $ftp->message;
$ftp->cwd($ordnerDir)
    or die "Cannot change working directory ", $ftp->message;
# set binary mode which is needed for image upload
$ftp->binary;

opendir(DIR, $directory) or die $!;
my @files = readdir(DIR);
closedir(DIR);
foreach my $file (@files) {
    if (not -d "$directory/$file") {
        $ftp->put("$directory/$file")
            or warn "Cannot upload $file: ", $ftp->message;
    }
}
$ftp->quit;
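The `split` on `/` is only there to grab the last path component; the same thing in shell is a single parameter expansion (the path here is a made-up example):

```shell
# take the last component of a path, like split(/\//) + $parts[-1] in Perl
directory="backup/daily/2014-05-01"
ordnerDir=${directory##*/}   # strips everything up to the last slash
echo "$ordnerDir"
```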

Javascript XPATH and Sleep

If you need to select an element via JavaScript and XPath you could use this code. The `setTimeout` waits about 10 seconds and then opens the link in the same window.

var XPATH = ('//div[@class="nfo"]/pre/a[contains(@href,"")]');
// 6 = XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, so snapshotItem() works
var link = document.evaluate(XPATH, document.body, null, 6, null).snapshotItem(0);
// wait about 10 seconds, then open the link in the same window
window.setTimeout(function () { window.location.href = link.href; }, 10000);



I use two types of hit fakers. The first one just builds a new query cache in my database; since it never executes the tracking JavaScript, you can't see these hits in Google Analytics.

First one:

I use wget and a shell script; the -r parameter is for recursive crawling, so be careful with it:

wget --spider -r ''
wget --spider -r ''

Second one:

For this script I use PhantomJS with Selenium, and for faking the hits the Tor proxy; if you want an HTTP Tor proxy, look here.

use strict;
use warnings;
use Selenium::Remote::Driver;

# check for PhantomJS and start it, routed through the Tor SOCKS proxy
if (-e "./phantomjs-1.9.7-linux-x86_64/bin/phantomjs") {
    system('./phantomjs-1.9.7-linux-x86_64/bin/phantomjs --webdriver='.($ARGV[1]//4444).' --proxy= --proxy-type=socks5 >> /dev/null 2>&1 &');
    sleep(4);
} else {
    die "cant find phantomjs !";
}

my $driver = new Selenium::Remote::Driver('remote_server_addr' => 'localhost',
                                          'port'         => ($ARGV[1]//4444),
                                          'browser_name' => 'chrome',
                                          'platform'     => 'VISTA',
                                          #'proxy'       => {'proxyType' => 'manual', 'httpProxy' => ''}
                                          );

# go to the page and wait there
$driver->get($ARGV[0]);
sleep(10);

# check whether our site was loaded, then kill our PhantomJS instance
if ($driver->get_current_url() ne "about:blank") {
    system("ps -e -o pid,args -dd | egrep 'phantomjs.+webdriver=".($ARGV[1]//4444)."' | grep -v egrep | awk '{print \$1}' | xargs kill");
}

First the script checks whether PhantomJS exists and starts it.

Then it creates a new Selenium session; instead of the PhantomJS command-line switches you could also use the default proxy option from Selenium (the commented-out 'proxy' entry).

Next it goes to the page and waits there, and then checks if our site was loaded.

The last part is a one-liner to kill our PhantomJS. First we take all process IDs and process arguments with ps, then we make an egrep to get our webdriver, after this we take the first column to get the PID, and then we kill the process.
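The kill pipeline can be tried out safely on a harmless marker process; this sketch uses `sleep` instead of PhantomJS. Note that `awk '{print $1}'` copes with the leading spaces `ps` pads the PID column with, which a plain `cut -d ' ' -f1` would not:

```shell
# start a marker process we are allowed to kill
sleep 300 &
MARKER=$!
# same pipeline idea: list pid+args, grep for the command line,
# drop the grep itself, keep the first column (the PID)
FOUND=$(ps -e -o pid,args | grep 'sleep 300' | grep -v grep | awk '{print $1}' | head -n 1)
kill "$FOUND"
```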


Now we need a little script to start our faker script; this Bash script starts our script ten times and then cleans up.

for i in {4440..4449}; do
    ./ '' $i &
    pidof tor | xargs kill -HUP >/dev/null 2>&1
    sleep 5
done

sleep 4

killall phantomjs

The main script gets triggered by the watchdog script and crawls the keywords on the page: it starts PhantomJS, defines the keywords, crawls the main page first, and then crawls every keyword.

use strict;
use warnings;

# start PhantomJS on the webdriver port passed as first argument
system('./phantomjs-1.9.7-linux-x86_64/bin/phantomjs --webdriver='.$ARGV[0].' >> /dev/null 2>&1 &');
sleep(4);
print 'working '.$$.(time()-$^T).$/;

my @words = ("keyword1","keyword2");

my $run = "";

# main page crawl
$run = "./ '' '$ARGV[0]'";
print $run.$/;
system($run." && sleep 1"); sleep(10);

# keyword crawl
for (0 .. $#words) {
    my $word = $words[$_];
    $run = "./ '".$word."&page=$_' '$ARGV[0]'";
    print $run.$/;
    system($run." && sleep 1"); sleep(10);
}

# kill the PhantomJS instance for this port (note the escaped \$1 for awk)
system("ps -e -o pid,args -dd | egrep '--webdriver=$ARGV[0]' | grep -v egrep | awk '{print \$1}' | xargs kill -s 9");
print "normal exit !".$/;
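The escaping of awk's `$1` in that last `system()` call matters: inside a double-quoted string Perl (just like the shell) would swallow `$1` before awk ever sees it. The effect is easy to reproduce in shell:

```shell
set --   # clear positional parameters so $1 is empty
# with double quotes the shell expands $1 (empty here), so awk gets
# '{print }' and prints the whole line; single quotes keep $1 for awk
whole=$(printf 'one two\n' | awk "{print $1}")
first=$(printf 'one two\n' | awk '{print $1}')
echo "$whole / $first"
```

That prints `one two / one`, which is why the unescaped variant kills nothing (or the wrong thing).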



This is my watchdog script to regulate the crawling scripts.

First it resets everything by killing leftover PhantomJS processes, then it starts the Selenium server.

$pp is my proxy port pointer; I use several ports, and to set the port I let the pointer iterate.

Then I count the running PhantomJS processes and crawl scripts, and remove the trailing line break from the wc output.

If both counts are below the maximum, the script starts the next crawler and lets the pointer iterate; then it sleeps and redoes the work.

use strict;
use warnings;

# reset: kill any leftover PhantomJS processes
`killall phantomjs`;

# start the Selenium server
system('java -jar ./selenium-server-standalone-2.40.0.jar >> /dev/null 2>&1 &');
#`killall -s 9 phantomjs && sleep 1`;
#system('./phantomjs-1.9.7-linux-x86_64/bin/phantomjs --webdriver=8888 >> /dev/$

# proxy port pointer: webdriver ports 8810 .. 8819
my $pp = 10;
while (1) {
    # count the running PhantomJS processes and crawl scripts
    my $count  = `ps aux | grep phantomjs | grep -v grep | wc -l` // 0;
    my $skript = `ps aux | grep | grep -v grep | wc -l` // 0;
    # remove the trailing line break from the wc output
    $count  =~ s/\n//og;
    $skript =~ s/\n//og;
    print "Main : $count - Gesamt : $skript".$/;
    # below the maximum? then start the next script and iterate the pointer
    if ($count < 4 && $skript < 4) {
        system("./ '88$pp' > /dev/null 2>&1 &");
        ++$pp; $pp = 10 if ($pp >= 20);
    }
    sleep(10);
}
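The pointer walks the webdriver ports 8810 to 8819 and then wraps around; the iteration logic, sketched in shell with a dozen iterations to show the wrap:

```shell
# iterate the proxy port pointer: 10..19, then wrap back to 10
pp=10
ports=""
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    ports="$ports 88$pp"
    pp=$((pp + 1))
    if [ "$pp" -ge 20 ]; then pp=10; fi
done
echo "$ports"
```

After 8819 the next two ports are 8810 and 8811 again.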



If you want the data cached in the user's browser cache to avoid high traffic, you can enable cache headers; I chose 60 days for images and 2 days for JavaScript and CSS.

Edit: “/etc/apache2/apache2.conf”:


#cache control 60 days
<FilesMatch "\.(jpg|jpeg|gif)$">
    Header set Cache-Control "max-age=5184000, public"
</FilesMatch>

#cache control 2 days
<FilesMatch "\.(js|css)$">
    Header set Cache-Control "max-age=172800, public"
</FilesMatch>
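The max-age values are just the day counts converted to seconds:

```shell
# 60 days and 2 days in seconds, which is what max-age expects
img_age=$(( 60 * 24 * 60 * 60 ))
js_age=$((  2 * 24 * 60 * 60 ))
echo "$img_age $js_age"
```

That prints `5184000 172800`, the two values used above.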


Maybe this helps you: if apache2 complains "Invalid command 'Header', perhaps misspelled or defined by a module not included in the server configuration", the headers module is not loaded; on Debian/Ubuntu you can enable it with `a2enmod headers` and restart Apache.