Perl Sitemap Generator

I think everyone who has more than 1M videos has the problem of making a sitemap: a single sitemap file may contain at most 50,000 URLs. This script counts the 40k packets in line 19 and then generates one sitemap file per packet.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# MYSQL CONFIG VARIABLES
my $host = "127.0.0.1";
my $database = "db";
my $tablename = "table";
my $user = "root";
my $pw = "pwd1234";


my $dbh = DBI->connect('DBI:mysql:'.$database.';host='.$host, $user, $pw)
    || die "Could not connect to database: $DBI::errstr";


my $sitemap_counter = 0;
my $th = $dbh->prepare('SELECT CEIL(COUNT(*)/40000) FROM `video` WHERE `bw` <= 80');
$th->execute();
while (my @row = $th->fetchrow_array()) {
   $sitemap_counter = $row[0];
}



for my $SMP (0..$sitemap_counter-1){
	print "build midd".$SMP.".xml ...".$/;
	open (DATEI, "> /var/www/MAP/midd".$SMP.".xml") or die $!;
	print DATEI '<?xml version="1.0" encoding="UTF-8"?>'.$/;
	print DATEI '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'.$/;
	$th = $dbh->prepare('SELECT CONCAT( "<url><loc>http://www.example.com/view.php?ID=", `id` , "</loc><priority>0.8</priority></url>" ) FROM `video` WHERE `bw` <=80 ORDER BY `id` LIMIT '.(40000*$SMP).' , 40000;');
	$th->execute();
	while (my @row = $th->fetchrow_array()) {
	   print DATEI $row[0].$/;
	}
	print DATEI '</urlset>';
	close (DATEI);
	print "[DONE]".$/;
}


$dbh->disconnect;
exit;
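Once you have many sitemap files it usually pays to add a sitemap index that points to them, so you only submit one URL to the search engines. A minimal sketch in shell, assuming the midd*.xml naming from above and a made-up public URL prefix (adjust it to wherever /var/www/MAP is actually served):

```shell
# build a sitemap index for midd0.xml .. midd4.xml
# (the URL prefix is an assumption, not from the script above)
N=4
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  for i in $(seq 0 $N); do
    echo "  <sitemap><loc>http://www.example.com/MAP/midd$i.xml</loc></sitemap>"
  done
  echo '</sitemapindex>'
} > sitemap-index.xml
```

Then you only have to keep N in sync with the packet count the Perl script computes.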

Perl FTP

If you want to upload files via Perl and FTP, you could use this script:

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Net::FTP;

my $ordner = time();
my $directory = "backupordnerpfad";
my $ordnerDir = basename($directory); # last path component of the backup folder

my $ftp = Net::FTP->new("www.myftpserver.at", Debug => 1)
    or die "Cannot connect to hostname: $@";
$ftp->login("username", "passwort")
    or die "Cannot login ", $ftp->message;
$ftp->cwd("/www")
    or die "Cannot change working directory ", $ftp->message;
$ftp->mkdir($ordner);
$ftp->cwd($ordner);
# set binary mode, which is needed for image upload
$ftp->binary();
opendir(my $dh, $directory) or die $!;
my @files = readdir($dh);
closedir($dh);
foreach my $file (@files) {
    if (not -d "$directory/$file") {
        $ftp->put("$directory/$file");
    }
}
$ftp->quit();
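The last-path-component logic (split the path on "/" and keep the final element) is exactly what the shell's basename utility does, and File::Basename's basename() in Perl behaves the same way; a quick sanity check:

```shell
# basename keeps only the last path component,
# like splitting on "/" and taking the final element
basename /home/user/backupordnerpfad
# → backupordnerpfad
```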

Javascript XPATH and Sleep

If you need to select an element via JavaScript and XPath, you could use this code. In line 4 I wait about 10 seconds and then open the link in the same window.

var XPATH = '//div[@class="nfo"]/pre/a[contains(@href,"imagecurl.org")]';
//alert(XPATH);
var link = document.evaluate(XPATH, document.body, null, 6 /* XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE */, null).snapshotItem(0);
setTimeout(function(){ window.open(link.href, "_self"); }, 10000);

 

Hitfaker

I use two types of faker. The first one is to build a new query cache in my database; you can't see these hits in Google Analytics.

First one :

I use wget and a shell script; the -r parameter is for recursive crawling, so be careful with that:

wget --spider -r 'http://www.example.com/index.php?M_TOP=send&SEARCH='
wget --spider -r 'http://www.example.com/index.php?M_LAST=send&SEARCH='

Second one:

For this script I use PhantomJS with Selenium, and the Tor proxy for faking the hits; if you want an HTTP Tor proxy, look here.

#!/usr/bin/perl
use strict;
use warnings;
use Selenium::Remote::Driver;


if (-e "./phantomjs-1.9.7-linux-x86_64/bin/phantomjs") {
    system('./phantomjs-1.9.7-linux-x86_64/bin/phantomjs --webdriver='.($ARGV[1]//4444).' --proxy=127.0.0.1:9050 --proxy-type=socks5 >> /dev/null 2>&1 &');
    sleep(2);
} else {
    die "can't find phantomjs!";
}

my $driver = Selenium::Remote::Driver->new('remote_server_addr' => 'localhost',
                                           'port'         => ($ARGV[1]//4444),
                                           'browser_name' => 'chrome',
                                           'platform'     => 'VISTA',
                                           #'proxy'       => {'proxyType' => 'manual', 'httpProxy' => '127.0.0.1:9050'}
                                          );
sleep(2);
$driver->get($ARGV[0]);
$driver->set_implicit_wait_timeout(1000);
sleep(1);


if ($driver->get_current_url() ne "about:blank") {
    $driver->quit();
    system("ps -e -o pid,args -dd | egrep 'phantomjs.+webdriver=".($ARGV[1]//4444)."' | grep -v egrep | cut -d \" \" -f1 | xargs kill");
    exit(1);
}
$driver->quit();
system("ps -e -o pid,args -dd | egrep 'phantomjs.+webdriver=".($ARGV[1]//4444)."' | grep -v egrep | cut -d \" \" -f1 | xargs kill");
exit(0);

Lines 7 and 8 check for PhantomJS and start it.

Line 14 creates a new Selenium driver; in line 18 you could use the default proxy option from Selenium instead.

Lines 21 and 22 go to the page and wait there.

Line 26 checks whether our site was loaded.

Line 28 is the one-liner that kills our PhantomJS: first we take all process IDs and arguments with ps, then we egrep for our webdriver port, after that we cut out the PID, and then we kill the process.
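The kill one-liner from line 28 can be tried stand-alone; here is a sketch with a throw-away sleep process (I use awk instead of cut in this sketch, since ps pads the PID column with spaces):

```shell
# start a throw-away process, find its PID via ps, and kill it;
# grep -v egrep filters the egrep process itself out of the list
sleep 300 &
ps -e -o pid,args | egrep 'sleep 300' | grep -v egrep | awk '{print $1}' | xargs kill
sleep 1
```

Afterwards no matching process should be left.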

 

Now we need a little script to start our faker script; this Bash script starts our faker.pl script 10 times and then cleans up.

#!/bin/bash
for i in {4440..4449}; do
    ./faker.pl 'http://www.example.com' "$i" &
    pidof tor | xargs kill -HUP >/dev/null 2>&1
    sleep 5
done

sleep 4;

killall phantomjs;
killall faker.pl;

 

 

main.pl

The main script gets triggered by the WATCHDOG.pl script and crawls the keywords on the page.

Line 6 starts PhantomJS.

Line 11: the keywords.

Line 15: main page crawl.

Line 22: keyword crawl.

#!/usr/bin/perl
use strict;
use warnings;
$|=1; # enable autoflush for the progress output

system('./phantomjs-1.9.7-linux-x86_64/bin/phantomjs --webdriver='.$ARGV[0].' >> /dev/null 2>&1 &');sleep(4);
print 'working '.$$.(time()-$^T).$/;



my @words = ("keyword1","keyword2");

my $run ="";

for (1..6) {
    $run = "./download.pl 'http://www.example.com/new/$_' '$ARGV[0]'";
    print $run.$/;
    system($run." && sleep 1"); sleep(10);
}
for (0..$#words) {
    my $word = $words[$_];
    for (1..1) {
        $run = "./download.pl 'http://www.example.com/search.php?what=".$word."&page=$_' '$ARGV[0]'";
        print $run.$/;
        system($run." && sleep 1"); sleep(10);
    }
}

system("ps -e -o pid,args -dd | egrep '--webdriver=$ARGV[0]' | grep -v egrep | awk '{print \$1}' | xargs kill -s 9");
print "normal exit !".$/;

 

WATCHDOG

This is my watchdog script to regulate the crawling scripts.

Lines 5-6 reset the scripts.

Line 8 starts the Selenium server.

Line 11 is my proxy port pointer; I use several ports, and to set the port I let the pointer iterate.

Lines 13-14 count the phantomjs and main.pl processes.

Lines 15-16 remove the line breaks.

Lines 18-21 check the maximum values, then start the script and let the pointer iterate.

Then sleep and redo the work.

#!/usr/bin/perl
use strict;
use warnings;

`killall phantomjs`;
`killall main.pl`;

system('java -jar ./selenium-server-standalone-2.40.0.jar >> /dev/null 2>&1 &');
#`killall -s 9 phantomjs && sleep 1`;
#system('./phantomjs-1.9.7-linux-x86_64/bin/phantomjs --webdriver=8888 >> /dev/$
my $pp=10;
do {
    my $count  = `ps aux | grep phantomjs | grep -v grep | wc -l` // 0;
    my $skript = `ps aux | grep main.pl | grep -v grep | wc -l` // 0;
    $count  =~ s/\n//g;
    $skript =~ s/\n//g;
    print "phantomjs: $count - main.pl: $skript".$/;
    if ($count < 4 && $skript < 4) {
        system("./main.pl '88$pp' > /dev/null 2>&1 &");
        ++$pp; $pp = 10 if ($pp >= 20);
    }
    sleep(20);
} while (1);
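The process counting from lines 13-14 can also be tried stand-alone; a sketch with two throw-away sleep processes:

```shell
# count processes matching a pattern; grep -v grep excludes
# the grep process itself from the count
sleep 200 & sleep 200 &
COUNT=$(ps aux | grep 'sleep 200' | grep -v grep | wc -l)
echo "$COUNT"
kill $(jobs -p)
```

With the two background sleeps running, the count comes out as 2.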

 

Apache-Cache

If you want browsers to cache the data in the user's cache to avoid high traffic, you can enable cache headers; I chose 60 days for images and 2 days for JavaScript and CSS.

Edit: “/etc/apache2/apache2.conf”:

 

#cache control 60 days
<FilesMatch "\.(jpg|jpeg|gif)$">
Header set Cache-Control "max-age=5184000, public"
</FilesMatch>

#cache control 2 days
<FilesMatch "\.(js|css)$">
Header set Cache-Control "max-age=172800, public"
</FilesMatch>

Maybe this helps you: if apache2 complains "Invalid command 'Header', perhaps misspelled or defined by a module not included in the server configuration", the mod_headers module is missing; on Debian/Ubuntu enable it with "a2enmod headers" and restart Apache.