Automating Page Visits for CosHtmlCache Static Caching

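To improve page load speed and help SEO, I use the CosHtmlCache plugin to serve my blog as static pages. One shortcoming of the plugin is that a page's cache is generated only when the page is visited while not logged in. The plugin settings in the admin panel offer an option to delete the cache, but none to generate static copies of every page at once. That is very inconvenient, especially once a blog has several hundred posts: clicking through them by hand is clearly unbearable.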

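My first idea was to modify the plugin directly so that the cache is regenerated automatically whenever a post changes. The problem is that the modification would have to be reapplied every time the plugin is upgraded, and it also breaks encapsulation.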

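So I wrote two programs to try to solve the problem. The idea is to have a program send GET requests to every page automatically, which has the same effect as clicking through them by hand. To obtain the addresses of all posts and pages, I use the Google Sitemap XML plugin. I did not install it just to collect URLs: I was already using it for Google search optimization, since submitting the Sitemap it generates in Google's Webmaster Tools helps pages get indexed. The same Sitemap also lists every URL on the blog.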

The Java version processes the XML file with DOM; the PHP version extracts the URLs with a regular expression.
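To make the DOM approach concrete, here is a minimal sketch that pulls every `<loc>` value out of a Sitemap held in a string. The class name `SitemapLocDemo`, the method `extractLocs`, and the example.com URLs are assumptions made for illustration; they are not part of the programs listed in this post.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of the original source.
public class SitemapLocDemo {

    // Collects the trimmed text of every <loc> element in a sitemap document.
    public static List<String> extractLocs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            urls.add(nodes.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sitemap content for illustration only.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://example.com/</loc></url>"
                + "<url><loc>http://example.com/post-1/</loc></url>"
                + "</urlset>";
        for (String url : extractLocs(xml)) {
            System.out.println(url);
        }
    }
}
```

Using `getTextContent()` here avoids a null pointer on an empty `<loc/>` element, at the cost of returning an empty string for it.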

Both the Java and the PHP version require you to supply the address of the Sitemap XML file.

Java version source download: click here to download

PHP version source download: click here to download

You can also implement such a program yourself. For reference, the source code is listed below:

Java Edition

Visitor.java

import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Visitor {
    public static CountDownLatch latch;

    // Pass the address of the XML file on the command line, e.g.: java Visitor http://yourdomain.com/yourxml.xml
    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("No XML file address was given");
        } else {
            String XMLaddr = args[0];
            try {
                visit(XMLaddr);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void visit(String XMLaddr) throws Exception {
        XMLAnalyser xa = new XMLAnalyser(XMLaddr);
        List<String> URLs = xa.parseXML();

        latch = new CountDownLatch(URLs.size());

        ExecutorService exec = Executors.newCachedThreadPool();
        for (int i = 0; i < URLs.size(); i++) {
            exec.execute(new MailMan(URLs.get(i)));
        }
        exec.shutdown();

        latch.await();
        System.out.println("All tasks finished");
    }
}
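Visitor uses a cached thread pool, which may start one thread (and one connection) per URL. On a blog with several hundred posts that could put noticeable load on the server. The sketch below shows one possible variant that caps concurrency with a fixed-size pool and waits with `awaitTermination` instead of a latch. The class name `BoundedVisitorDemo`, the method `runAll`, the `poolSize` parameter, and the placeholder tasks are all assumptions for illustration; in the real program each task would be a `MailMan`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the original source.
public class BoundedVisitorDemo {

    // Runs every task on at most poolSize threads and returns how many
    // tasks finished (whether they succeeded or threw).
    public static int runAll(List<Runnable> tasks, int poolSize) throws InterruptedException {
        final AtomicInteger finished = new AtomicInteger();
        ExecutorService exec = Executors.newFixedThreadPool(poolSize);
        for (final Runnable task : tasks) {
            exec.execute(new Runnable() {
                public void run() {
                    try {
                        task.run();
                    } finally {
                        finished.incrementAndGet();
                    }
                }
            });
        }
        exec.shutdown();                          // accept no new tasks
        exec.awaitTermination(1, TimeUnit.HOURS); // wait for queued tasks to drain
        return finished.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<Runnable> tasks = new ArrayList<Runnable>();
        for (int i = 0; i < 10; i++) {
            final int id = i;
            tasks.add(new Runnable() {
                public void run() {
                    // Placeholder for a page visit.
                    System.out.println("visiting page " + id);
                }
            });
        }
        System.out.println("finished: " + runAll(tasks, 4));
    }
}
```

A pool size of 4 is an arbitrary choice; a suitable value depends on how much load the server can take.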

XMLAnalyser.java

import java.util.LinkedList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XMLAnalyser {
    private String XMLaddr;

    public XMLAnalyser(String XMLaddr) {
        this.XMLaddr = XMLaddr;
    }

    public List<String> parseXML() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(XMLaddr);
        NodeList list = doc.getElementsByTagName("loc");
        List<String> URLs = getTagElements(list);
        return URLs;
    }

    private List<String> getTagElements(NodeList list) {
        List<String> URLs = new LinkedList<String>();
        for (int i = 0; i < list.getLength(); i++) {
            URLs.add(list.item(i).getChildNodes().item(0).getNodeValue());
        }
        return URLs;
    }
}

MailMan.java

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.MalformedURLException;
import java.net.Socket;
import java.net.URL;
import java.util.LinkedList;
import java.util.List;

public class MailMan implements Runnable {
    private String URL;
    private String Domain;
    private Socket socket;
    private BufferedReader br;
    private PrintWriter pw;
    private List<String> output;

    public MailMan(String URL) throws MalformedURLException {
        this.URL = URL;
        this.Domain = new URL(URL).getHost();
        output = new LinkedList<String>();
    }

    public void run() {
        try {
            initialize();
            sendToServer();
            receiveFromServer();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Always release resources and count down, even if the request
            // failed; otherwise Visitor would wait on the latch forever.
            closeConnection();
        }
    }

    private void initialize() throws IOException {
        // "Connection: close" asks the server to close the socket after the
        // response, so readLine() eventually returns null instead of blocking
        // on a keep-alive connection.
        output.add("GET " + URL + " HTTP/1.1");
        output.add("Host: " + Domain);
        output.add("Connection: close");

        // Plain HTTP on port 80 only
        InetAddress addr = InetAddress.getByName(Domain);
        socket = new Socket(addr, 80);
        br = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));
        pw = new PrintWriter(new OutputStreamWriter(socket.getOutputStream(), "UTF-8"));
    }

    private void sendToServer() {
        // HTTP header lines must end with CRLF, so do not rely on println()
        for (int i = 0; i < output.size(); i++) {
            pw.print(output.get(i) + "\r\n");
        }
        pw.print("\r\n"); // a blank line ends the header section
        pw.flush();
    }

    private void receiveFromServer() throws IOException {
        // Read and discard the response; the visit itself triggers cache generation
        while (br.readLine() != null) {
            // nothing to do
        }
    }

    private void closeConnection() {
        try {
            if (br != null) br.close();
            if (pw != null) pw.close();
            if (socket != null) socket.close();
            System.out.println(URL + " Received.");
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            Visitor.latch.countDown();
        }
    }
}
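MailMan speaks HTTP over a raw socket on port 80. As a possible alternative, the JDK's `HttpURLConnection` handles request formatting, chunked responses, redirects, and HTTPS for you. The sketch below sends one GET request and discards the body. The class name `PageFetcher`, the method `fetch`, and the timeout values are assumptions for illustration, not part of the original program.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical demo class, not part of the original source.
public class PageFetcher {

    // Sends a GET request, reads and discards the body, and returns the HTTP status code.
    public static int fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10000); // 10 s, arbitrary choice
        conn.setReadTimeout(30000);    // 30 s, arbitrary choice
        try {
            int status = conn.getResponseCode();
            // Error responses expose their body through getErrorStream()
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            if (in != null) {
                byte[] buffer = new byte[8192];
                while (in.read(buffer) != -1) {
                    // discard: requesting the page is what matters
                }
                in.close();
            }
            return status;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: java PageFetcher <url>");
            return;
        }
        System.out.println(args[0] + " -> " + fetch(args[0]));
    }
}
```

Returning the status code makes it easy to log pages that did not respond with 200 and may therefore not have been cached.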

PHP Edition

<?php
$url = ""; // Fill in the address of your Sitemap XML file
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
curl_close($ch);

// Non-greedy match, so several <loc> elements on one line are captured separately
$pattern = "/(<loc>(.*?)<\/loc>)/";
preg_match_all($pattern, $data, $matches);

$mh = curl_multi_init();

$arr = array(); // stays a valid empty array when the Sitemap contains no URLs
for ($i=0;$i<count($matches[2]);$i++){
    $arr[] = curl_init();
}

for ($i=0;$i<count($arr);$i++){
    echo "Get the page: ".$matches[2][$i]."<br/>";
    curl_setopt($arr[$i], CURLOPT_URL, $matches[2][$i]);
    curl_setopt($arr[$i], CURLOPT_HEADER, 0);
    curl_setopt($arr[$i], CURLOPT_RETURNTRANSFER, 1);
    curl_multi_add_handle($mh,$arr[$i]);
}

$active = null;
do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

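while ($active && $mrc == CURLM_OK) {
    // curl_multi_select() can return -1 on some systems; pause briefly
    // instead of spinning, then keep driving the transfers.
    if (curl_multi_select($mh) == -1) {
        usleep(100);
    }
    do {
        $mrc = curl_multi_exec($mh, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);
}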
echo "<b><font color=\"#FF0000\">Success.</font></b>\n<br/>";
for ($i=0;$i<count($arr);$i++){
    curl_multi_remove_handle($mh, $arr[$i]);
}

curl_multi_close($mh);
?>
