#Python Crawler#--Scraping jandan.net girl pictures with Goose + phantomjs

雨橙


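I recently wanted to scrape some pictures of girls, using jandan.net as the example:
Target URL: http://jandan.net/ooxx/page-1#comments
page-1 is the page number. Browsing the site by hand showed that the highest page is page-95.

The crawl is built mainly on Goose + BeautifulSoup + phantomjs + requests.

First, the usual approach: fetch the HTML document with Goose, then hand it to BeautifulSoup to parse the HTML nodes.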
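from goose import Goose
from bs4 import BeautifulSoup

g = Goose()
url = "http://jandan.net/ooxx/page-1#comments"
article = g.extract(url=url)
# raw_html is the page source before any JS has run
soup = BeautifulSoup(article.raw_html,"html.parser")
post_list = soup.find_all(id="comments")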
 
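The code above did not return the content we wanted, so I suspected that jandan.net uses anti-scraping measures.
Next, open the browser's developer tools (F12) and view the page source: the images we want are not in it. They are only visible in the Elements panel of the debugger.

Then open the XHR tab: there are no Ajax requests either. At this point it is clear that the site loads the images with JS, which is why the usual approach cannot scrape them.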
 
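Then go back to the page source. Analysing the HTML turns up the following JS code.


Opening it shows a minified JS file. We un-minified it with a tool; the result is shown in the figure below:

We located the part marked in red, which matches the nodes shown in the Elements panel.

Reading the code shows that the backend generates a hash for each image URL and embeds it in the page.
JS then reads the hash, decodes it, and replaces the image node's URL with the decoded value. The purpose of this is to defeat scrapers.
The JS file contains several functions for decoding the hash, so the mechanism is now clear.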
 
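Now we need a way to get the real source image URLs.
The plan:
1: Collect the image hashes from each page and put them in a list.
2: Loop over the hashes and decode each one to get the real image URL.
3: Finally, download each real image URL and save it as a local file.

These three steps should solve the problem (if jandan.net limits access by IP or restricts headless browsers, more work may be needed).
Step 1 is easy: loop over the URLs, collect the hash from each node, and put it in a list. The code: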
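# Collect the image hashes from each page
def gethashlist(urllist):
    hashlist = []
    for url in urllist:
        g = Goose()
        article = g.extract(url=url)
        soup = BeautifulSoup(article.raw_html,"html.parser")
        # each image's encoded URL sits in an element with class "img-hash"
        hash_list = soup.find_all(class_="img-hash")
        # "node" rather than "hash", which would shadow the builtin
        for node in hash_list:
            hashlist.append(node.text)
        print("Fetched hashes from: "+url)
    return hashlist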
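
To check the parsing step without hitting the site, the same extraction can be tried on a saved or hand-written HTML fragment. The sketch below is only an illustration: it uses the Python 3 standard library's html.parser instead of BeautifulSoup, and the names ImgHashParser, extract_hashes and the sample markup are my own assumptions, not copied from jandan.net.

```python
from html.parser import HTMLParser


class ImgHashParser(HTMLParser):
    """Collects the text of elements whose class list contains img-hash.

    Assumes each img-hash element holds plain text only (no nested tags).
    """

    def __init__(self):
        super().__init__()
        self.hashes = []
        self._in_hash = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "img-hash" in classes:
            self._in_hash = True
            self.hashes.append("")

    def handle_endtag(self, tag):
        # Any closing tag ends the current hash element.
        self._in_hash = False

    def handle_data(self, data):
        if self._in_hash:
            self.hashes[-1] += data


def extract_hashes(html):
    parser = ImgHashParser()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.hashes]


# Hypothetical markup: a placeholder <img> plus a span carrying the hash.
SAMPLE_HTML = """
<ol id="comments">
  <li><img src="blank.gif"><span class="img-hash">aGFzaC1vbmU</span></li>
  <li><img src="blank.gif"><span class="img-hash">aGFzaC10d28</span></li>
</ol>
"""

print(extract_hashes(SAMPLE_HTML))
```

If the real page nests other tags inside the hash element, the single flag used here would cut the text short, so treat it as a quick offline check rather than a replacement for BeautifulSoup.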


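Step 2 is a little harder: we need to call, or otherwise run, the JS in order to work out the real image URL.
A Baidu search turns up three ways for a Python crawler to deal with JS:
1, Use selenium + phantomjs to simulate a browser running the JS
2, Re-implement the JS decoding logic as a Python function
3, Run the JS in a third-party JS runtime (there are many libraries for this, such as pyexecjs and PyV8)

Weighing these options, I decided to use phantomjs to run the JS and pass its result back to Python for processing.
Because there are many images and the resource cost per request is relatively high, I am not using selenium for now.

My approach:
Build an ASP page on a server that uses the JS to produce the real URL:
Full server-side ASP code (file: getURL.asp):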
<script src="https://cdn.bootcss.com/jquery/3.3.1/jquery.min.js"></script>
<script src="md5.js"></script>
<script src="base64.js"></script>
<script>
var f_o7H53DOKTmj6uHE0paqwaOGxFxrtEQi3 = function(n, x, f) {
		var k = "DECODE";
		var x = x ? x : "";
		var f = f ? f : 0;
		var g = 4;
		x = md5(x);
		var w = md5(x.substr(0, 16));
		var u = md5(x.substr(16, 16));
		if (g) {
			if (k == "DECODE") {
				var b = md5(microtime());
				var d = b.length - g;
				var t = b.substr(d, g)
			}
		} else {
			var t = ""
		}
		var r = w + md5(w + t);
		var m;
		if (k == "DECODE") {
			f = f ? f + time() : 0;
			tmpstr = f.toString();
			if (tmpstr.length >= 10) {
				n = tmpstr.substr(0, 10) + md5(n + u).substr(0, 16) + n
			} else {
				var e = 10 - tmpstr.length;
				for (var p = 0; p < e; p++) {
					tmpstr = "0" + tmpstr
				}
				n = tmpstr + md5(n + u).substr(0, 16) + n
			}
			m = n
		}
		var h = new Array(256);
		for (var p = 0; p < 256; p++) {
			h[p] = p
		}
		var q = new Array();
		for (var p = 0; p < 256; p++) {
			q[p] = r.charCodeAt(p % r.length)
		}
		for (var o = p = 0; p < 256; p++) {
			o = (o + h[p] + q[p]) % 256;
			tmp = h[p];
			h[p] = h[o];
			h[o] = tmp
		}
		var l = "";
		m = m.split("");
		for (var v = o = p = 0; p < m.length; p++) {
			v = (v + 1) % 256;
			o = (o + h[v]) % 256;
			tmp = h[v];
			h[v] = h[o];
			h[o] = tmp;
			l += chr(ord(m[p]) ^ (h[(h[v] + h[o]) % 256]))
		}
		if (k == "DECODE") {
			l = base64_encode(l);
			var c = new RegExp("=", "g");
			l = l.replace(c, "");
			l = t + l
		}
		return l
	};
(function() {
	var b = typeof exports != "undefined" ? exports : typeof self != "undefined" ? self : $.global;
	var c = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";

	function a(d) {
		this.message = d
	}
	a.prototype = new Error;
	a.prototype.name = "InvalidCharacterError";
	b.btoa || (b.btoa = function(g) {
		var j = String(g);
		for (var i, e, d = 0, h = c, f = ""; j.charAt(d | 0) || (h = "=", d % 1); f += h.charAt(63 & i >> 8 - d % 1 * 8)) {
			e = j.charCodeAt(d += 3 / 4);
			if (e > 255) {
				throw new a("'btoa' failed: The string to be encoded contains characters outside of the Latin1 range.")
			}
			i = i << 8 | e
		}
		return f
	});
	b.atob || (b.atob = function(g) {
		var j = String(g).replace(/[=]+$/, "");
		if (j.length % 4 == 1) {
			throw new a("'atob' failed: The string to be decoded is not correctly encoded.")
		}
		for (var i = 0, h, e, d = 0, f = ""; e = j.charAt(d++);~e && (h = i % 4 ? h * 64 + e : e, i++ % 4) ? f += String.fromCharCode(255 & h >> (-2 * i & 6)) : 0) {
			e = c.indexOf(e)
		}
		return f
	})
}());

function base64_encode(a) {
	return window.btoa(a)
}





function base64_decode(a) {
	return window.atob(a)
}(function(g) {
	function o(u, z) {
		var w = (u & 65535) + (z & 65535),
			v = (u >> 16) + (z >> 16) + (w >> 16);
		return (v << 16) | (w & 65535)
	}
	function s(u, v) {
		return (u << v) | (u >>> (32 - v))
	}
	function c(A, w, v, u, z, y) {
		return o(s(o(o(w, A), o(u, y)), z), v)
	}
	function b(w, v, B, A, u, z, y) {
		return c((v & B) | ((~v) & A), w, v, u, z, y)
	}
	function i(w, v, B, A, u, z, y) {
		return c((v & A) | (B & (~A)), w, v, u, z, y)
	}
	function n(w, v, B, A, u, z, y) {
		return c(v ^ B ^ A, w, v, u, z, y)
	}
	function a(w, v, B, A, u, z, y) {
		return c(B ^ (v | (~A)), w, v, u, z, y)
	}
	function d(F, A) {
		F[A >> 5] |= 128 << (A % 32);
		F[(((A + 64) >>> 9) << 4) + 14] = A;
		var w, z, y, v, u, E = 1732584193,
			D = -271733879,
			C = -1732584194,
			B = 271733878;
		for (w = 0; w < F.length; w += 16) {
			z = E;
			y = D;
			v = C;
			u = B;
			E = b(E, D, C, B, F[w], 7, -680876936);
			B = b(B, E, D, C, F[w + 1], 12, -389564586);
			C = b(C, B, E, D, F[w + 2], 17, 606105819);
			D = b(D, C, B, E, F[w + 3], 22, -1044525330);
			E = b(E, D, C, B, F[w + 4], 7, -176418897);
			B = b(B, E, D, C, F[w + 5], 12, 1200080426);
			C = b(C, B, E, D, F[w + 6], 17, -1473231341);
			D = b(D, C, B, E, F[w + 7], 22, -45705983);
			E = b(E, D, C, B, F[w + 8], 7, 1770035416);
			B = b(B, E, D, C, F[w + 9], 12, -1958414417);
			C = b(C, B, E, D, F[w + 10], 17, -42063);
			D = b(D, C, B, E, F[w + 11], 22, -1990404162);
			E = b(E, D, C, B, F[w + 12], 7, 1804603682);
			B = b(B, E, D, C, F[w + 13], 12, -40341101);
			C = b(C, B, E, D, F[w + 14], 17, -1502002290);
			D = b(D, C, B, E, F[w + 15], 22, 1236535329);
			E = i(E, D, C, B, F[w + 1], 5, -165796510);
			B = i(B, E, D, C, F[w + 6], 9, -1069501632);
			C = i(C, B, E, D, F[w + 11], 14, 643717713);
			D = i(D, C, B, E, F[w], 20, -373897302);
			E = i(E, D, C, B, F[w + 5], 5, -701558691);
			B = i(B, E, D, C, F[w + 10], 9, 38016083);
			C = i(C, B, E, D, F[w + 15], 14, -660478335);
			D = i(D, C, B, E, F[w + 4], 20, -405537848);
			E = i(E, D, C, B, F[w + 9], 5, 568446438);
			B = i(B, E, D, C, F[w + 14], 9, -1019803690);
			C = i(C, B, E, D, F[w + 3], 14, -187363961);
			D = i(D, C, B, E, F[w + 8], 20, 1163531501);
			E = i(E, D, C, B, F[w + 13], 5, -1444681467);
			B = i(B, E, D, C, F[w + 2], 9, -51403784);
			C = i(C, B, E, D, F[w + 7], 14, 1735328473);
			D = i(D, C, B, E, F[w + 12], 20, -1926607734);
			E = n(E, D, C, B, F[w + 5], 4, -378558);
			B = n(B, E, D, C, F[w + 8], 11, -2022574463);
			C = n(C, B, E, D, F[w + 11], 16, 1839030562);
			D = n(D, C, B, E, F[w + 14], 23, -35309556);
			E = n(E, D, C, B, F[w + 1], 4, -1530992060);
			B = n(B, E, D, C, F[w + 4], 11, 1272893353);
			C = n(C, B, E, D, F[w + 7], 16, -155497632);
			D = n(D, C, B, E, F[w + 10], 23, -1094730640);
			E = n(E, D, C, B, F[w + 13], 4, 681279174);
			B = n(B, E, D, C, F[w], 11, -358537222);
			C = n(C, B, E, D, F[w + 3], 16, -722521979);
			D = n(D, C, B, E, F[w + 6], 23, 76029189);
			E = n(E, D, C, B, F[w + 9], 4, -640364487);
			B = n(B, E, D, C, F[w + 12], 11, -421815835);
			C = n(C, B, E, D, F[w + 15], 16, 530742520);
			D = n(D, C, B, E, F[w + 2], 23, -995338651);
			E = a(E, D, C, B, F[w], 6, -198630844);
			B = a(B, E, D, C, F[w + 7], 10, 1126891415);
			C = a(C, B, E, D, F[w + 14], 15, -1416354905);
			D = a(D, C, B, E, F[w + 5], 21, -57434055);
			E = a(E, D, C, B, F[w + 12], 6, 1700485571);
			B = a(B, E, D, C, F[w + 3], 10, -1894986606);
			C = a(C, B, E, D, F[w + 10], 15, -1051523);
			D = a(D, C, B, E, F[w + 1], 21, -2054922799);
			E = a(E, D, C, B, F[w + 8], 6, 1873313359);
			B = a(B, E, D, C, F[w + 15], 10, -30611744);
			C = a(C, B, E, D, F[w + 6], 15, -1560198380);
			D = a(D, C, B, E, F[w + 13], 21, 1309151649);
			E = a(E, D, C, B, F[w + 4], 6, -145523070);
			B = a(B, E, D, C, F[w + 11], 10, -1120210379);
			C = a(C, B, E, D, F[w + 2], 15, 718787259);
			D = a(D, C, B, E, F[w + 9], 21, -343485551);
			E = o(E, z);
			D = o(D, y);
			C = o(C, v);
			B = o(B, u)
		}
		return [E, D, C, B]
	}
	function p(v) {
		var w, u = "";
		for (w = 0; w < v.length * 32; w += 8) {
			u += String.fromCharCode((v[w >> 5] >>> (w % 32)) & 255)
		}
		return u
	}
	function j(v) {
		var w, u = [];
		u[(v.length >> 2) - 1] = undefined;
		for (w = 0; w < u.length; w += 1) {
			u[w] = 0
		}
		for (w = 0; w < v.length * 8; w += 8) {
			u[w >> 5] |= (v.charCodeAt(w / 8) & 255) << (w % 32)
		}
		return u
	}
	function k(u) {
		return p(d(j(u), u.length * 8))
	}
	function e(w, z) {
		var v, y = j(w),
			u = [],
			x = [],
			A;
		u[15] = x[15] = undefined;
		if (y.length > 16) {
			y = d(y, w.length * 8)
		}
		for (v = 0; v < 16; v += 1) {
			u[v] = y[v] ^ 909522486;
			x[v] = y[v] ^ 1549556828
		}
		A = d(u.concat(j(z)), 512 + z.length * 8);
		return p(d(x.concat(A), 512 + 128))
	}
	function t(w) {
		var z = "0123456789abcdef",
			v = "",
			u, y;
		for (y = 0; y < w.length; y += 1) {
			u = w.charCodeAt(y);
			v += z.charAt((u >>> 4) & 15) + z.charAt(u & 15)
		}
		return v
	}
	function m(u) {
		return unescape(encodeURIComponent(u))
	}
	function q(u) {
		return k(m(u))
	}
	function l(u) {
		return t(q(u))
	}
	function h(u, v) {
		return e(m(u), m(v))
	}
	function r(u, v) {
		return t(h(u, v))
	}
	function f(v, w, u) {
		if (!w) {
			if (!u) {
				return l(v)
			}
			return q(v)
		}
		if (!u) {
			return r(w, v)
		}
		return h(w, v)
	}
	if (typeof define === "function" && define.amd) {
		define(function() {
			return f
		})
	} else {
		g.md5 = f
	}
}(this));
(function() {
	var b = typeof exports != "undefined" ? exports : typeof self != "undefined" ? self : $.global;
	var c = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";

	function a(d) {
		this.message = d
	}
	a.prototype = new Error;
	a.prototype.name = "InvalidCharacterError";
	b.btoa || (b.btoa = function(g) {
		var j = String(g);
		for (var i, e, d = 0, h = c, f = ""; j.charAt(d | 0) || (h = "=", d % 1); f += h.charAt(63 & i >> 8 - d % 1 * 8)) {
			e = j.charCodeAt(d += 3 / 4);
			if (e > 255) {
				throw new a("'btoa' failed: The string to be encoded contains characters outside of the Latin1 range.")
			}
			i = i << 8 | e
		}
		return f
	});
	b.atob || (b.atob = function(g) {
		var j = String(g).replace(/[=]+$/, "");
		if (j.length % 4 == 1) {
			throw new a("'atob' failed: The string to be decoded is not correctly encoded.")
		}
		for (var i = 0, h, e, d = 0, f = ""; e = j.charAt(d++);~e && (h = i % 4 ? h * 64 + e : e, i++ % 4) ? f += String.fromCharCode(255 & h >> (-2 * i & 6)) : 0) {
			e = c.indexOf(e)
		}
		return f
	})
}());

function time() {
	var a = new Date().getTime();
	return parseInt(a / 1000)
}
function microtime(b) {
	var a = new Date().getTime();
	var c = parseInt(a / 1000);
	return b ? (a / 1000) : (a - (c * 1000)) / 1000 + " " + c
}
function chr(a) {
	return String.fromCharCode(a)
}
function ord(a) {
	return a.charCodeAt()
}
function md5(a) {
	return hex_md5(a)
}
var f_o7H53DOKTmj6uHE0paqwaOGxFxrtEQi3 = function(m, r, d) {
		var e = "DECODE";
		var r = r ? r : "";
		var d = d ? d : 0;
		var q = 4;
		r = md5(r);
		var o = md5(r.substr(0, 16));
		var n = md5(r.substr(16, 16));
		if (q) {
			if (e == "DECODE") {
				var l = m.substr(0, q)
			}
		} else {
			var l = ""
		}
		var c = o + md5(o + l);
		var k;
		if (e == "DECODE") {
			m = m.substr(q);
			k = base64_decode(m)
			/*
			var base64_decode = new Base64();			
			k = base64_decode.decode(m)
			window.alert(m);
			*/
		}
		var h = new Array(256);
		for (var g = 0; g < 256; g++) {
			h[g] = g
		}
		var b = new Array();
		for (var g = 0; g < 256; g++) {
			b[g] = c.charCodeAt(g % c.length)
		}
		for (var f = g = 0; g < 256; g++) {
			f = (f + h[g] + b[g]) % 256;
			tmp = h[g];
			h[g] = h[f];
			h[f] = tmp
		}
		var t = "";
		k = k.split("");
		for (var p = f = g = 0; g < k.length; g++) {
			p = (p + 1) % 256;
			f = (f + h[p]) % 256;
			tmp = h[p];
			h[p] = h[f];
			h[f] = tmp;
			t += chr(ord(k[g]) ^ (h[(h[p] + h[f]) % 256]))
		}
		if (e == "DECODE") {
			if ((t.substr(0, 10) == 0 || t.substr(0, 10) - time() > 0) && t.substr(10, 16) == md5(t.substr(26) + n).substr(0, 16)) {
				t = t.substr(26)
			} else {
				t = ""
			}
		}
		return t
};
	

//var e = "a49dTrzUP8zUu7anKYvfB6evYJxvJYJ3bfufE2mfRve0HCyvyFxuOs5Z/chtenCvLPqsylNPyWO/TmPUkTr3zLK64o1S6st2Gc9xD4HKoqdpNRklRjHGdA";
var c = f_o7H53DOKTmj6uHE0paqwaOGxFxrtEQi3("<%=replace(request("hash")," ","+")%>", "edpeV6OPfaEk6Oreb2dU2sYskzAiem8t");
document.urls = "http:"+c;
</script>
 
The server-side page accepts a hash string and produces the decoded URL, which it stores on document (as document.urls).
Next we need a client-side phantomjs script.
Full code (file: geturl.js):
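var page = require('webpage').create();
var system = require('system');
// the hash to decode is passed as the first command-line argument
var hash = system.args[1];
page.open("http://test.top/api/getURL.asp?hash="+hash, function(status) {
  // stop early if the page could not be loaded
  if (status !== "success") {
    phantom.exit(1);
    return;
  }
  // read the decoded URL that getURL.asp stored on document
  var picurl = page.evaluate(function() {
    return document.urls;
  });
  console.log(picurl);
  phantom.exit();
});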
 
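Running phantomjs geturl.js <hash> on the command line prints the real image URL.

Next we wrap this in a function that decodes each hash into its real image URL and downloads the image.
The code: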
# Decode each JS-obfuscated hash with phantomjs and download the image
def getpic(hashlist):
    for hash_str in hashlist:
        result = os.popen("phantomjs geturl.js {0}".format(hash_str))
        # console.log appends a newline; strip it before using the URL
        url = result.read().strip()
        filename = url.split("/")[-1]
        r = requests.get(url)
        # the mzt_pic directory must already exist
        with open("mzt_pic/"+filename, "wb") as code:
            code.write(r.content)
        print("Saved image for hash: "+url)
 
That completes the feature. The full code:
#!/usr/bin/python
# _*_ coding:utf-8 _*_
# author: Robinn

import os
import requests
from goose import Goose
from bs4 import BeautifulSoup

# Collect the image hashes from each page
def gethashlist(urllist):
    hashlist = []
    for url in urllist:
        g = Goose()
        article = g.extract(url=url)
        soup = BeautifulSoup(article.raw_html,"html.parser")
        # each image's encoded URL sits in an element with class "img-hash"
        hash_list = soup.find_all(class_="img-hash")
        for node in hash_list:
            hashlist.append(node.text)
        print("Fetched hashes from: "+url)
    return hashlist



# Decode each JS-obfuscated hash with phantomjs and download the image
def getpic(hashlist):
    for hash_str in hashlist:
        result = os.popen("phantomjs geturl.js {0}".format(hash_str))
        # console.log appends a newline; strip it before using the URL
        url = result.read().strip()
        filename = url.split("/")[-1]
        r = requests.get(url)
        # the mzt_pic directory must already exist
        with open("mzt_pic/"+filename, "wb") as code:
            code.write(r.content)
        print("Saved image for hash: "+url)


if __name__ == "__main__":
    # 95 pages in total
    pagenum = range(1,96)
    pageList = []
    for n in pagenum:
        url = "http://jandan.net/ooxx/page-"+str(n)+"#comments"
        pageList.append(url)
    hashlist = gethashlist(pageList)
    getpic(hashlist)


 
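Notes:
1: The hashes are collected in a list here. With a large amount of data, building and traversing that list may use too much memory, so iterating with a generator is recommended.
2: I run the JS with phantomjs because this JS uses the window object and some DOM objects that execjs and PyV8 cannot access, and translating the JS decoding functions fully into Python would take a lot of time.
3: When scraping a site with this kind of anti-scraping protection, avoid crawling too fast: add time.sleep calls. For sites with IP limits, also prepare a pool of proxy IPs. These measures solve most problems.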
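
Notes 1 and 3 can be combined in one small sketch: generate the page URLs lazily, yield hashes as they are found, and pause between pages. The function names iter_page_urls and iter_hashes, the fetch_hashes callable, and the one-second default delay are assumptions of mine. In real use fetch_hashes could be something like lambda url: gethashlist([url]); the stub below only keeps the example offline.

```python
import time


def iter_page_urls(first=1, last=95):
    """Yield the page URLs one at a time instead of building a list."""
    for n in range(first, last + 1):
        yield "http://jandan.net/ooxx/page-{0}#comments".format(n)


def iter_hashes(urls, fetch_hashes, delay=1.0, sleep=time.sleep):
    """Yield hashes page by page, pausing `delay` seconds between pages.

    fetch_hashes(url) must return the hashes found on that page.
    """
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # throttle so the crawl does not hit the site too fast
        for item in fetch_hashes(url):
            yield item


def _stub_fetch(url):
    # Stand-in for a real page fetch: pretend each page has two hashes.
    return [url + "|a", url + "|b"]


for item in iter_hashes(iter_page_urls(1, 2), _stub_fetch, delay=0):
    print(item)
```

Because nothing is materialised up front, each hash can be decoded and downloaded as soon as it is yielded, and memory use stays flat however many pages there are.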

Note: this article is entirely original content. If you repost or share it, please credit the source.