Integration Examples
I. Crawler Proxy Integration
1. Keeping an IP / Switching IPs
Suited to crawlers that need precise control over when the IP switches, for example flows that require login or Cookie caching.
- Automatic switching per TCP request means the crawler proxy gives each TCP request made by the crawler program a random proxy IP; within a single TCP session the IP does not change.
- For HTTPS targets, the per-TCP-request switching mode is recommended: the HTTPS protocol keeps the session alive by default (KeepAlive is enabled), so a single HTTPS session stays on one IP.
- Setting Proxy-Connection: Keep-Alive and Connection: Keep-Alive keeps requests within one TCP session, so the proxy IP does not change.
Keeping the IP unchanged
If several requests need to stay on one IP (for example a login request followed by a data request), just make sure that group of requests runs over one TCP (Keep-Alive) session; the group then uses the same proxy IP for the proxy's validity period.
HTTPS
When the crawler proxy is used to access an HTTPS website, the HTTPS protocol keeps KeepAlive enabled by default, so the proxy IP stays the same within one Session (HTTPS session). To force an IP switch on every request, set Proxy-Connection: Close and Connection: Close.
Session
Note that some libraries use connection pooling and always keep a pool of TCP connections for reuse. To force an IP switch on every request, disable the library's connection pooling.
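As an illustration only, a minimal sketch with Python requests (host, port and credentials are placeholders): closing the connection on every request prevents TCP reuse, so each request can receive a new proxy IP.

import requests

# Placeholder proxy credentials; replace with your own
proxies = {
    "http": "http://username:password@t.16yun.cn:31111",
    "https": "http://username:password@t.16yun.cn:31111",
}
# Closing the connection prevents TCP reuse, so every request is a new session
headers = {"Connection": "close", "Proxy-Connection": "close"}

for _ in range(3):
    r = requests.get("http://httpbin.org/ip", proxies=proxies, headers=headers)
    print(r.text)  # each response should show a different outbound IP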
- The crawler program can set the HTTP header Proxy-Tunnel: <random number>; requests carrying the same number reach the target website through the same proxy IP.
Example
If a login request and a data request need to share one IP, just give that group of requests the same Proxy-Tunnel value, for example Proxy-Tunnel: 12345; the group then uses the same proxy IP for the proxy's validity period.
Note
Different request groups can use different Proxy-Tunnel: <random number> values at the same time and crawl data concurrently (see the sketch below).
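A minimal sketch of that idea (not part of the original text; credentials are placeholders): each group picks its own random Proxy-Tunnel value, so the groups run on different proxy IPs while the requests inside one group share an IP.

import random
import requests

proxies = {
    "http": "http://username:password@t.16yun.cn:31111",
    "https": "http://username:password@t.16yun.cn:31111",
}

def fetch_group(name):
    # One random Proxy-Tunnel value per group: requests in the same group
    # share a proxy IP, different groups get different IPs.
    tunnel = str(random.randint(1, 10000))
    headers = {"Proxy-Tunnel": tunnel}
    for _ in range(2):
        r = requests.get("http://httpbin.org/ip", proxies=proxies, headers=headers)
        print(name, tunnel, r.text.strip())

fetch_group("group-A")
fetch_group("group-B")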
Using the same IP to access an HTTPS target site
- Because the proxy handles HTTPS requests in CONNECT mode, make sure the Proxy-Tunnel header is already sent with the CONNECT request. Some libraries are wrapped at a high level, so do confirm that this HTTP header is actually sent to the proxy.
- When the target site is accessed with Connection: keep-alive and Proxy-Connection: keep-alive, the proxy ensures that all requests within one session reach the target site through one IP.
2. Username/Password Authentication
- Identity is verified with a username and password; the credentials are turned into a Proxy-Authorization header that is sent along with each request.
- If authentication fails, the system returns 401 Unauthorized or 407 Proxy Authentication Required.
Example
When using the HTTP tunnel in code, if your HTTP client does not support supplying the credentials as a username/password pair, you have to add a Proxy-Authorization header to every HTTP request yourself, with the value Basic <base64>. Here <base64> is the string obtained by joining the username and the password with ":" and then BASE64-encoding the result. Once this is set correctly, every outgoing request carries an HTTP header of the form: Proxy-Authorization: Basic MTZZVU4xMjM6MTIzNDMyMw==
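A minimal sketch of building that header value in Python (the username and password are placeholders):

import base64

proxy_user = "username"
proxy_pass = "password"
# Join username and password with ":" and BASE64-encode the result
token = base64.b64encode(f"{proxy_user}:{proxy_pass}".encode("utf-8")).decode("ascii")
headers = {"Proxy-Authorization": "Basic " + token}
print(headers)  # {'Proxy-Authorization': 'Basic dXNlcm5hbWU6cGFzc3dvcmQ='}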
Note
Proxy-Authorization is the recommended header for username/password authentication. If Authorization is used instead, that HTTP header is forwarded with the request to the target website. When accessing HTTPS websites, use the proxy-authentication mechanism built into your library: a manually set Proxy-Authorization header is forwarded by the proxy straight to the target site in the HTTPS case, which breaks anonymity.
Domain resolution failures
The crawler proxy domain has a fairly short DNS TTL (multi-site, multi-machine hot standby). If resolving the crawler proxy domain fails, use 114.114.114.114 or your carrier's DNS for resolution.
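If you want to resolve the domain through one of those servers explicitly, here is a sketch using the third-party dnspython package (the package and the calls below are an assumption, not part of the original instructions):

import dns.resolver  # pip install dnspython

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["114.114.114.114"]  # public resolver suggested above
answer = resolver.resolve("t.16yun.cn", "A")
print([rr.to_text() for rr in answer])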
II. Code Examples
1. Python
requests
#! -*- encoding:utf-8 -*-
import requests
import random
# 要访问的目标页面
targetUrl = "http://httpbin.org/ip"
# 要访问的目标HTTPS页面
# targetUrl = "https://httpbin.org/ip"
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
"host" : proxyHost,
"port" : proxyPort,
"user" : proxyUser,
"pass" : proxyPass,
}
# 设置 http和https访问都是用HTTP代理
proxies = {
"http" : proxyMeta,
"https" : proxyMeta,
}
# 设置IP切换头
tunnel = random.randint(1,10000)
headers = {"Proxy-Tunnel": str(tunnel)}
resp = requests.get(targetUrl, proxies=proxies, headers=headers)
print(resp.status_code)
print(resp.text)
#! -*- encoding:utf-8 -*-
import requests
import random
import requests.adapters
# 要访问的目标页面
targetUrlList = [
"https://httpbin.org/ip",
"https://httpbin.org/headers",
"https://httpbin.org/user-agent",
]
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
"host": proxyHost,
"port": proxyPort,
"user": proxyUser,
"pass": proxyPass,
}
# 设置 http和https访问都是用HTTP代理
proxies = {
"http": proxyMeta,
"https": proxyMeta,
}
# 设置IP切换头
tunnel = random.randint(1, 10000)
headers = {"Proxy-Tunnel": str(tunnel)}
class HTTPAdapter(requests.adapters.HTTPAdapter):
def proxy_headers(self, proxy):
headers = super(HTTPAdapter, self).proxy_headers(proxy)
if hasattr(self, 'tunnel'):
headers['Proxy-Tunnel'] = self.tunnel
return headers
# 访问三次网站,使用相同的tunnel标志,均能够保持相同的外网IP
for i in range(3):
s = requests.session()
a = HTTPAdapter()
# 设置IP切换头
a.tunnel = tunnel
s.mount('https://', a)
for url in targetUrlList:
r = s.get(url, proxies=proxies)
print(r.text)
#! -*- encoding:utf-8 -*-
import requests
import random
import requests.adapters
# 要访问的目标页面
targetUrlList = [
"https://httpbin.org/ip",
"https://httpbin.org/headers",
"https://httpbin.org/user-agent",
]
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
"host": proxyHost,
"port": proxyPort,
"user": proxyUser,
"pass": proxyPass,
}
# 设置 http和https访问都是用HTTP代理
proxies = {
"http": proxyMeta,
"https": proxyMeta,
}
# 访问三次网站,使用相同的Session(keep-alive),均能够保持相同的外网IP
s = requests.session()
# 设置cookie
# cookie_dict = {"JSESSION":"123456789"}
# cookies = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
# s.cookies = cookies
for i in range(3):
for url in targetUrlList:
r = s.get(url, proxies=proxies)
print(r.text)
urllib2
#! -*- encoding:utf-8 -*-
from urllib import request
# 要访问的目标页面
targetUrl = "http://httpbin.org/ip"
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
"host" : proxyHost,
"port" : proxyPort,
"user" : proxyUser,
"pass" : proxyPass,
}
proxy_handler = request.ProxyHandler({
"http" : proxyMeta,
"https" : proxyMeta,
})
opener = request.build_opener(proxy_handler)
request.install_opener(opener)
resp = request.urlopen(targetUrl).read()
print (resp)
#! -*- encoding:utf-8 -*-
import urllib2
import random
import httplib
class HTTPSConnection(httplib.HTTPSConnection):
def set_tunnel(self, host, port=None, headers=None):
httplib.HTTPSConnection.set_tunnel(self, host, port, headers)
if hasattr(self, 'proxy_tunnel'):
self._tunnel_headers['Proxy-Tunnel'] = self.proxy_tunnel
class HTTPSHandler(urllib2.HTTPSHandler):
def https_open(self, req):
return urllib2.HTTPSHandler.do_open(self, HTTPSConnection, req, context=self._context)
# 要访问的目标页面
targetUrlList = [
"https://httpbin.org/ip",
"https://httpbin.org/headers",
"https://httpbin.org/user-agent",
]
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
"host": proxyHost,
"port": proxyPort,
"user": proxyUser,
"pass": proxyPass,
}
# 设置 http和https访问都是用HTTP代理
proxies = {
"http": proxyMeta,
"https": proxyMeta,
}
# 设置IP切换头
tunnel = random.randint(1, 10000)
headers = {"Proxy-Tunnel": str(tunnel)}
HTTPSConnection.proxy_tunnel = tunnel
proxy = urllib2.ProxyHandler(proxies)
opener = urllib2.build_opener(proxy, HTTPSHandler)
urllib2.install_opener(opener)
# 访问三次网站,使用相同的tunnel标志,均能够保持相同的外网IP
for i in range(3):
for url in targetUrlList:
r = urllib2.Request(url)
print(urllib2.urlopen(r).read())
urllib2 cannot use Keep-Alive
urllib2 closes the connection by default for HTTP/1.1. Set the same Proxy-Tunnel value to keep the same outbound IP.
scrapy
Create a middlewares.py file in the project (./项目名/middlewares.py)
#! -*- encoding:utf-8 -*-
import base64
import sys
import random
PY3 = sys.version_info[0] >= 3
def base64ify(bytes_or_str):
if PY3 and isinstance(bytes_or_str, str):
input_bytes = bytes_or_str.encode('utf8')
else:
input_bytes = bytes_or_str
output_bytes = base64.urlsafe_b64encode(input_bytes)
if PY3:
return output_bytes.decode('ascii')
else:
return output_bytes
class ProxyMiddleware(object):
def process_request(self, request, spider):
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
# [版本>=2.6.2](https://docs.scrapy.org/en/latest/news.html?highlight=2.6.2#scrapy-2-6-2-2022-07-25)无需添加验证头,会自动在请求头中设置Proxy-Authorization
request.meta['proxy'] = "http://{0}:{1}@{2}:{3}".format(proxyUser,proxyPass,proxyHost,proxyPort)
# 版本<2.6.2 需要手动添加代理验证头
# request.meta['proxy'] = "http://{0}:{1}".format(proxyHost,proxyPort)
# request.headers['Proxy-Authorization'] = 'Basic ' + base64ify(proxyUser + ":" + proxyPass)
# 设置IP切换头(根据需求)
# tunnel = random.randint(1,10000)
# request.headers['Proxy-Tunnel'] = str(tunnel)
# 每次访问后关闭TCP链接,强制每次访问切换IP
request.headers['Connection'] = "Close"
Modify the project configuration file (./项目名/settings.py)
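The file's content is not reproduced here; as a minimal sketch, it typically just registers the middleware defined above (the priority value mirrors the example later on this page):

# ./项目名/settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    '项目名.middlewares.ProxyMiddleware': 100,
}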
Set a random UserAgent
Add the proxy information to the Splash call in start_requests
def start_requests(self):
script = '''
function main(splash)
local url = splash.args.url
assert(splash:go(url))
assert(splash:wait(0.5))
local entries = splash:history()
local last_response = entries[#entries].response
return {
url = splash:url(),
http_status = last_response.status,
cookies = splash:get_cookies(),
html = splash:html(),
headers = last_response.headers,
}
end
'''
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "16111YVL"
proxyPass = "11111"
proxy = "http://{}:{}@{}:{}".format(proxyUser,proxyPass, proxyHost, proxyPort)
try:
for url in self.start_urls:
yield SplashRequest(
url,
self.parse,
endpoint="execute",
args={
"lua_source": script,
"wait": 5,
"timeout": 600,
"target_count": self.target_count,
'proxy': proxy
},
)
except:
raise CloseSpider("Could not load Lua script.")
Create a middlewares.py file in the project (./项目名/middlewares.py)
#! -*- encoding:utf-8 -*-
import base64
import sys
import random
PY3 = sys.version_info[0] >= 3
def base64ify(bytes_or_str):
if PY3 and isinstance(bytes_or_str, str):
input_bytes = bytes_or_str.encode('utf8')
else:
input_bytes = bytes_or_str
output_bytes = base64.urlsafe_b64encode(input_bytes)
if PY3:
return output_bytes.decode('ascii')
else:
return output_bytes
class ProxyMiddleware(object):
def process_request(self, request, spider):
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
request.meta['proxy'] = "http://{0}:{1}".format(proxyHost,proxyPort)
# 添加验证头
encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
# 设置IP切换头(根据需求)
tunnel = random.randint(1,10000)
request.headers['Proxy-Tunnel'] = str(tunnel)
Configure the proxy for Splash: modify the project configuration file (./项目名/settings.py)
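The settings content is likewise not reproduced here. A sketch of a typical scrapy-splash configuration (these values follow the standard scrapy-splash setup and are assumptions; point SPLASH_URL at your own Splash instance and keep the proxy middleware from above registered as well):

# ./项目名/settings.py (sketch for scrapy-splash)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'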
Create a middlewares.py file in the project (./项目名/middlewares.py)
#! -*- encoding:utf-8 -*-
import websockets
from scrapy.http import HtmlResponse
from logging import getLogger
import asyncio
import pyppeteer
import logging
from concurrent.futures._base import TimeoutError
import base64
import sys
import random
pyppeteer_level = logging.WARNING
logging.getLogger('websockets.protocol').setLevel(pyppeteer_level)
logging.getLogger('pyppeteer').setLevel(pyppeteer_level)
PY3 = sys.version_info[0] >= 3
def base64ify(bytes_or_str):
if PY3 and isinstance(bytes_or_str, str):
input_bytes = bytes_or_str.encode('utf8')
else:
input_bytes = bytes_or_str
output_bytes = base64.urlsafe_b64encode(input_bytes)
if PY3:
return output_bytes.decode('ascii')
else:
return output_bytes
class ProxyMiddleware(object):
USER_AGENT = open('useragents.txt').readlines()
def process_request(self, request, spider):
# 代理服务器
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)
# 添加验证头
encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
# 设置IP切换头(根据需求)
tunnel = random.randint(1, 10000)
request.headers['Proxy-Tunnel'] = str(tunnel)
request.headers['User-Agent'] = random.choice(self.USER_AGENT)
class PyppeteerMiddleware(object):
def __init__(self, **args):
"""
init logger, loop, browser
:param args:
"""
self.logger = getLogger(__name__)
self.loop = asyncio.get_event_loop()
self.browser = self.loop.run_until_complete(
pyppeteer.launch(headless=True))
self.args = args
def __del__(self):
"""
close loop
:return:
"""
self.loop.close()
def render(self, url, retries=1, script=None, wait=0.3, scrolldown=False, sleep=0,
timeout=8.0, keep_page=False):
"""
render page with pyppeteer
:param url: page url
:param retries: max retry times
:param script: js script to evaluate
:param wait: number of seconds to wait before loading the page, preventing timeouts
:param scrolldown: how many times to page down
:param sleep: how many long to sleep after initial render
:param timeout: the longest wait time, otherwise raise timeout error
:param keep_page: keep page not to be closed, browser object needed
:param browser: pyppetter browser object
:param with_result: return with js evaluation result
:return: content, [result]
"""
# define async render
async def async_render(url, script, scrolldown, sleep, wait, timeout, keep_page):
try:
# basic render
page = await self.browser.newPage()
await asyncio.sleep(wait)
response = await page.goto(url, options={'timeout': int(timeout * 1000)})
if response.status != 200:
return None, None, response.status
result = None
# evaluate with script
if script:
result = await page.evaluate(script)
# scroll down for {scrolldown} times
if scrolldown:
for _ in range(scrolldown):
await page._keyboard.down('PageDown')
await asyncio.sleep(sleep)
else:
await asyncio.sleep(sleep)
if scrolldown:
await page._keyboard.up('PageDown')
# get html of page
content = await page.content()
return content, result, response.status
except TimeoutError:
return None, None, 500
finally:
# if keep page, do not close it
if not keep_page:
await page.close()
content, result, status = [None] * 3
# retry for {retries} times
for i in range(retries):
if not content:
content, result, status = self.loop.run_until_complete(
async_render(url=url, script=script, sleep=sleep, wait=wait,
scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
else:
break
# if need to return js evaluation result
return content, result, status
def process_request(self, request, spider):
"""
:param request: request object
:param spider: spider object
:return: HtmlResponse
"""
if request.meta.get('render'):
try:
self.logger.debug('rendering %s', request.url)
html, result, status = self.render(request.url)
return HtmlResponse(url=request.url, body=html, request=request, encoding='utf-8',
status=status)
except websockets.exceptions.ConnectionClosed:
pass
@classmethod
def from_crawler(cls, crawler):
return cls(**crawler.settings.get('PYPPETEER_ARGS', {}))
Modify the project configuration file (./项目名/settings.py)
DOWNLOADER_MIDDLEWARES = {
'项目名.middlewares.PyppeteerMiddleware': 543,
'项目名.middlewares.ProxyMiddleware': 100,
}
Download the DEMO
Using the crawler proxy by setting environment variables
Windows
Linux/MacOS
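The platform-specific commands are not reproduced here. A sketch of the same idea from Python, assuming the HTTP library honors the standard http_proxy/https_proxy environment variables (requests does; credentials are placeholders):

import os
import requests

# Equivalent to "set http_proxy=..." on Windows or "export http_proxy=..." on Linux/macOS
os.environ["http_proxy"] = "http://username:password@t.16yun.cn:31111"
os.environ["https_proxy"] = "http://username:password@t.16yun.cn:31111"

# requests picks the proxy up from the environment automatically
print(requests.get("http://httpbin.org/ip").text)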
Create a new file (./项目名/tunnel_fix.py) and add "from . import tunnel_fix" at the very beginning of the project file.
from scrapy.utils.python import to_bytes, to_unicode
import scrapy.core.downloader.handlers.http11
from random import randint
def tunnel_request_data(host, port, proxy_auth_header=None):
host_value = to_bytes(host, encoding='ascii') + b':' + to_bytes(str(port))
tunnel_req = b'CONNECT ' + host_value + b' HTTP/1.1\r\n'
tunnel_req += b'Host: ' + host_value + b'\r\n'
if proxy_auth_header:
tunnel_req += b'Proxy-Authorization: ' + proxy_auth_header + b'\r\n'
# 指定Proxy-Tunnel
proxy_tunnel = '{}'.format(randint(1,9999))
tunnel_req += b'Proxy-Tunnel: '+ to_bytes(proxy_tunnel) + b'\r\n'
tunnel_req += b'\r\n'
return tunnel_req
scrapy.core.downloader.handlers.http11.tunnel_request_data = tunnel_request_data
aiohttp
import aiohttp, asyncio
import random
def main():
targetUrl = "https://httpbin.org/headers"
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
proxyServer = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
"host" : proxyHost,
"port" : proxyPort,
"user" : proxyUser,
"pass" : proxyPass,
}
userAgent = "Chrome/83.0.4103.61"
# 所有请求使用同一个Proxy-Tunnel, 使用固定IP
# proxy_tunnel = "{}".format(random.randint(1,10000))
async def entry():
async with aiohttp.ClientSession(headers={"User-Agent": userAgent}) as session:
while True:
# 随机设Proxy-Tunnel,使用随机IP
proxy_tunnel = "{}".format(random.randint(1,10000))
async with session.get(targetUrl, proxy=proxyServer, proxy_headers={"Proxy-Tunnel":proxy_tunnel}) as resp:
body = await resp.read()
print(resp.status)
print(body)
loop = asyncio.get_event_loop()
loop.run_until_complete(entry())
loop.run_forever()
if __name__ == '__main__':
main()
#! -*- encoding:utf-8 -*-
import aiohttp, asyncio
targetUrl = "http://httpbin.org/ip"
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
proxyServer = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
"host" : proxyHost,
"port" : proxyPort,
"user" : proxyUser,
"pass" : proxyPass,
}
userAgent = "Chrome/83.0.4103.61"
async def entry():
while True:
conn = aiohttp.TCPConnector(verify_ssl=False)
async with aiohttp.ClientSession(headers={"User-Agent": userAgent}, connector=conn) as session:
async with session.get(targetUrl, proxy=proxyServer) as resp:
body = await resp.read()
print(resp.status)
print(body)
loop = asyncio.get_event_loop()
loop.run_until_complete(entry())
loop.run_forever()
Note
The aiohttp library implements TCP connection pooling. If you neither set a random Proxy-Tunnel nor break the TCP connection, multiple requests will never switch IP. To switch IPs, create a new session for every request and set the connector_owner parameter so that closing the session also closes the connection. Alternatively, set Proxy-Tunnel to a random number to use a random IP.
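A minimal sketch of the first option (a new session per request, with connector_owner=True so that closing the session also closes the connection; credentials are placeholders):

import aiohttp
import asyncio

proxyServer = "http://username:password@t.16yun.cn:31111"

async def fetch_once(url):
    # Fresh connector + session per request; connector_owner=True means
    # closing the session closes the TCP connection as well, so the next
    # request is made on a new connection and can get a new proxy IP.
    conn = aiohttp.TCPConnector()
    async with aiohttp.ClientSession(connector=conn, connector_owner=True) as session:
        async with session.get(url, proxy=proxyServer) as resp:
            return await resp.text()

async def main():
    for _ in range(3):
        print(await fetch_once("http://httpbin.org/ip"))

asyncio.run(main())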
httpx
import asyncio
import httpx
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
"host": proxyHost,
"port": proxyPort,
"user": proxyUser,
"pass": proxyPass,
}
# 设置 http和https访问都是用HTTP代理
proxies = {
"http://": proxyMeta,
"https://": proxyMeta,
}
client = httpx.AsyncClient(proxies=proxies)
# 开启http2.0支持,请使用 pip install httpx[http2]
# client = httpx.AsyncClient(http2=True,proxies=proxies)
async def test():
resp = await client.get("https://httpbin.org/ip")
print(resp.http_version)
print(resp.text)
asyncio.run(test())
import httpx
import random
# 代理服务器(产品官网 www.16yun.cn)
proxy_host = "t.16yun.cn"
proxy_port = "31111"
# 代理验证信息
proxy_user = "username"
proxy_pwd = "password"
proxy_url = f"http://{proxy_user}:{proxy_pwd}@{proxy_host}:{proxy_port}"
proxy = httpx.Proxy(
url=proxy_url,
# 设置IP切换头,随机数不变,保持IP不变
headers={"Proxy-Tunnel": f"{random.randint(1, 10000)}"}
)
print(proxy_url)
proxies = {
"http://": proxy,
"https://": proxy,
}
target_url = "https://httpbin.org/ip"
async def test_async():
# 三次请求保持在同一个IP上
# 开启http2.0支持,请使用 pip install httpx[http2]
# client = httpx.AsyncClient(http2=True,proxies=proxies)
for _ in range(3):
async with httpx.AsyncClient(
proxies=proxies,
) as client:
response = await client.get(target_url)
print("test_async:", response.text)
def test():
# 三次请求保持在同一个IP上
for _ in range(3):
# 开启http2.0支持,请使用 pip install httpx[http2]
# client = httpx.Client(http2=True,proxies=proxies)
client = httpx.Client(
proxies=proxies,
)
response = client.get(target_url)
print("test:", response.text)
client.close()
if __name__ == '__main__':
import asyncio
test()
asyncio.run(test_async())
2. C#
// 要访问的目标页面
string targetUrl = "http://httpbin.org/ip";
// 代理服务器(产品官网 www.16yun.cn)
string proxyHost = "http://t.16yun.cn";
string proxyPort = "31111";
// 代理验证信息
string proxyUser = "username";
string proxyPass = "password";
// 设置代理服务器
WebProxy proxy = new WebProxy(string.Format("{0}:{1}", proxyHost, proxyPort), true);
ServicePointManager.Expect100Continue = false;
var request = WebRequest.Create(targetUrl) as HttpWebRequest;
request.AllowAutoRedirect = true;
request.KeepAlive = true;
request.Method = "GET";
request.Proxy = proxy;
//request.Proxy.Credentials = CredentialCache.DefaultCredentials;
request.Proxy.Credentials = new System.Net.NetworkCredential(proxyUser, proxyPass);
// 设置Proxy Tunnel
// Random ran=new Random();
// int tunnel =ran.Next(1,10000);
// request.Headers.Add("Proxy-Tunnel", tunnel.ToString());
//request.Timeout = 20000;
//request.ServicePoint.ConnectionLimit = 512;
//request.UserAgent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36";
//request.Headers.Add("Cache-Control", "max-age=0");
//request.Headers.Add("DNT", "1");
//String encoded = System.Convert.ToBase64String(System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes(proxyUser + ":" + proxyPass));
//request.Headers.Add("Proxy-Authorization", "Basic " + encoded);
using (var response = request.GetResponse() as HttpWebResponse)
using (var sr = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
{
string htmlStr = sr.ReadToEnd();
}
If a Section=ResponseStatusLine exception occurs
Fix it through the configuration file: edit app.config (WinForm) or web.config (Web). For WinForm, app.config does not exist by default; manually create an XML configuration file in the directory at the same level as the Debug folder, containing the system.net settings that resolve this error (typically enabling useUnsafeHeaderParsing). After compiling, a <program name>.exe.config configuration file is created automatically under Debug.
3. PHP
<?php
// 要访问的目标页面
$url = "http://httpbin.org/ip";
$urls = "https://httpbin.org/ip";
// 代理服务器(产品官网 www.16yun.cn)
define("PROXY_SERVER", "tcp://t.16yun.cn:31111");
// 代理身份信息
define("PROXY_USER", "username");
define("PROXY_PASS", "password");
$proxyAuth = base64_encode(PROXY_USER . ":" . PROXY_PASS);
// 设置 Proxy tunnel
$tunnel = rand(1,10000);
$headers = implode("\r\n", [
"Proxy-Authorization: Basic {$proxyAuth}",
"Proxy-Tunnel: ${tunnel}",
]);
$sniServer = parse_url($urls, PHP_URL_HOST);
$options = [
"http" => [
"proxy" => PROXY_SERVER,
"header" => $headers,
"method" => "GET",
'request_fulluri' => true,
],
'ssl' => array(
'SNI_enabled' => true, // Enable SNI for HTTPS over HTTP proxies
'SNI_server_name' => $sniServer
)
];
print($url);
$context = stream_context_create($options);
$result = file_get_contents($url, false, $context);
var_dump($result);
// 访问 HTTPS 页面
print($urls);
$context = stream_context_create($options);
$result = file_get_contents($urls, false, $context);
var_dump($result);
?>
<?php
function curlFile($url,$proxy_ip,$proxy_port,$loginpassw)
{
//$loginpassw = 'username:password';
//$proxy_ip = 't.16yun.cn';
//$proxy_port = '31111';
//$url = 'https://httpbin.org/ip';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXYPORT, $proxy_port);
curl_setopt($ch, CURLOPT_PROXYTYPE, 'HTTP');
curl_setopt($ch, CURLOPT_PROXY, $proxy_ip);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $loginpassw);
// curl used to include a list of accepted CAs, but no longer bundles ANY CA certs. So by default it'll reject all SSL certificates as unverifiable.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$data = curlFile('https://httpbin.org/ip','t.16yun.cn',31111,'username:password');
print($data);
?>
<?php
namespace App\Console\Commands;
use Illuminate\Console\Command;
class Test16Proxy extends Command
{
/**
* The name and signature of the console command.
*
* @var string
*/
protected $signature = 'test:16proxy';
/**
* The console command description.
*
* @var string
*/
protected $description = 'Command description';
/**
* Create a new command instance.
*
* @return void
*/
public function __construct()
{
parent::__construct();
}
/**
* Execute the console command.
*
* @return mixed
*/
public function handle()
{
$client = new \GuzzleHttp\Client();
// 要访问的目标页面
$targetUrl = "http://httpbin.org/ip";
// 代理服务器(产品官网 www.16yun.cn)
define("PROXY_SERVER", "t.16yun.cn:31111");
// 代理身份信息
define("PROXY_USER", "username");
define("PROXY_PASS", "password");
$proxyAuth = base64_encode(PROXY_USER . ":" . PROXY_PASS);
$options = [
"proxy" => PROXY_SERVER,
"headers" => [
"Proxy-Authorization" => "Basic " . $proxyAuth
]
];
//print_r($options);
$result = $client->request('GET', $targetUrl, $options);
var_dump($result->getBody()->getContents());
}
}
?>
4. Java
Accessing HTTP/2 websites through the proxy
Make sure the JDK version supports accessing HTTP/2 websites; only Java 9 and above supports it fully.
407 error
// Java 8 Update 111 and later require the following code: System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "false"); System.setProperty("jdk.http.auth.proxying.disabledSchemes", "false");
import org.apache.commons.httpclient.Credentials;
import org.apache.commons.httpclient.HostConfiguration;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;
import java.io.IOException;
public class Main {
// 代理服务器(产品官网 www.16yun.cn)
private static final String PROXY_HOST = "t.16yun.cn";
private static final int PROXY_PORT = 31111;
public static void main(String[] args) {
HttpClient client = new HttpClient();
HttpMethod method = new GetMethod("https://httpbin.org/ip");
HostConfiguration config = client.getHostConfiguration();
config.setProxy(PROXY_HOST, PROXY_PORT);
client.getParams().setAuthenticationPreemptive(true);
String username = "16ABCCKJ";
String password = "712323";
Credentials credentials = new UsernamePasswordCredentials(username, password);
AuthScope authScope = new AuthScope(PROXY_HOST, PROXY_PORT);
client.getState().setProxyCredentials(authScope, credentials);
try {
client.executeMethod(method);
if (method.getStatusCode() == HttpStatus.SC_OK) {
String response = method.getResponseBodyAsString();
System.out.println("Response = " + response);
}
} catch (IOException e) {
e.printStackTrace();
} finally {
method.releaseConnection();
}
}
}
//*感谢 “情歌”提供的代码
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.net.URI;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.AuthCache;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.config.AuthSchemes;
import org.apache.http.client.entity.GzipDecompressingEntity;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.LayeredConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.client.ProxyAuthenticationStrategy;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.message.BasicHeader;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.NameValuePair;
import org.apache.http.util.EntityUtils;
public class Demo
{
// 代理服务器(产品官网 www.16yun.cn)
final static String proxyHost = "t.16yun.cn";
final static Integer proxyPort = 31000;
// 代理验证信息
final static String proxyUser = "username";
final static String proxyPass = "password";
private static PoolingHttpClientConnectionManager cm = null;
private static HttpRequestRetryHandler httpRequestRetryHandler = null;
private static HttpHost proxy = null;
private static CredentialsProvider credsProvider = null;
private static RequestConfig reqConfig = null;
static {
ConnectionSocketFactory plainsf = PlainConnectionSocketFactory.getSocketFactory();
LayeredConnectionSocketFactory sslsf = SSLConnectionSocketFactory.getSocketFactory();
Registry registry = RegistryBuilder.create()
.register("http", plainsf)
.register("https", sslsf)
.build();
cm = new PoolingHttpClientConnectionManager(registry);
cm.setMaxTotal(20);
cm.setDefaultMaxPerRoute(5);
proxy = new HttpHost(proxyHost, proxyPort, "http");
credsProvider = new BasicCredentialsProvider();
credsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(proxyUser, proxyPass));
reqConfig = RequestConfig.custom()
.setConnectionRequestTimeout(5000)
.setConnectTimeout(5000)
.setSocketTimeout(5000)
.setExpectContinueEnabled(false)
.setProxy(new HttpHost(proxyHost, proxyPort))
.build();
}
public static void doRequest(HttpRequestBase httpReq) {
CloseableHttpResponse httpResp = null;
try {
setHeaders(httpReq);
httpReq.setConfig(reqConfig);
CloseableHttpClient httpClient = HttpClients.custom()
.setConnectionManager(cm)
.setDefaultCredentialsProvider(credsProvider)
.build();
//设置TCP keep alive,访问https网站时保持IP不切换
// SocketConfig socketConfig = SocketConfig.custom().setSoKeepAlive(true).setSoTimeout(3600000).build();
// CloseableHttpClient httpClient = HttpClients.custom()
// .setConnectionManager(cm)
// .setDefaultCredentialsProvider(credsProvider)
// .setDefaultSocketConfig(socketConfig)
// .build();
AuthCache authCache = new BasicAuthCache();
authCache.put(proxy, new BasicScheme());
// 如果遇到407,可以设置代理认证 Proxy-Authenticate
// authCache.put(proxy, new BasicScheme(ChallengeState.PROXY));
HttpClientContext localContext = HttpClientContext.create();
localContext.setAuthCache(authCache);
httpResp = httpClient.execute(httpReq, localContext);
int statusCode = httpResp.getStatusLine().getStatusCode();
System.out.println(statusCode);
BufferedReader rd = new BufferedReader(new InputStreamReader(httpResp.getEntity().getContent()));
String line = "";
while((line = rd.readLine()) != null) {
System.out.println(line);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
if (httpResp != null) {
httpResp.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
/**
* 设置请求头
*
* @param httpReq
*/
private static void setHeaders(HttpRequestBase httpReq) {
// 设置Proxy-Tunnel
// Random random = new Random();
// int tunnel = random.nextInt(10000);
// httpReq.setHeader("Proxy-Tunnel", String.valueOf(tunnel));
httpReq.setHeader("Accept-Encoding", null);
}
public static void doGetRequest() {
// 要访问的目标页面
String targetUrl = "https://httpbin.org/ip";
try {
HttpGet httpGet = new HttpGet(targetUrl);
doRequest(httpGet);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
doGetRequest();
}
}
import java.io.IOException;
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.util.Random;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class Demo
{
// 代理验证信息
final static String ProxyUser = "username";
final static String ProxyPass = "password";
// 代理服务器(产品官网 www.16yun.cn)
final static String ProxyHost = "t.16yun.cn";
final static Integer ProxyPort = 31111;
// 设置IP切换头
final static String ProxyHeadKey = "Proxy-Tunnel";
public static String getUrlProxyContent(String url)
{
Authenticator.setDefault(new Authenticator() {
public PasswordAuthentication getPasswordAuthentication()
{
return new PasswordAuthentication(ProxyUser, ProxyPass.toCharArray());
}
});
// 设置Proxy-Tunnel
Random random = new Random();
int tunnel = random.nextInt(10000);
String ProxyHeadVal = String.valueOf(tunnel);
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(ProxyHost, ProxyPort));
try
{
// 处理异常、其他参数
Document doc = Jsoup.connect(url).timeout(3000).header(ProxyHeadKey, ProxyHeadVal).proxy(proxy).get();
if(doc != null) {
System.out.println(doc.body().html());
}
}
catch (IOException e)
{
e.printStackTrace();
}
return null;
}
public static void main(String[] args) throws Exception
{
// 要访问的目标页面
String targetUrl = "http://httpbin.org/ip";
getUrlProxyContent(targetUrl);
}
}
JSoup cannot use Keep-Alive
JSoup closes the connection by default. For HTTP websites, set the same Proxy-Tunnel value to keep the same outbound IP. For HTTPS websites, use a different library to keep the same outbound IP.
import java.io.IOException;
import java.util.Random;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class Demo {
public static void main(String[] args) {
try{
// 代理服务器(产品官网 www.16yun.cn)
final String ProxyHost = "t.16yun.cn";
final String ProxyPort = "31111";
System.setProperty("http.proxyHost", ProxyHost);
System.setProperty("https.proxyHost", ProxyHost);
System.setProperty("http.proxyPort", ProxyPort);
System.setProperty("https.proxyPort", ProxyPort);
// 代理验证信息
final String ProxyUser = "username";
final String ProxyPass = "password";
System.setProperty("http.proxyUser", ProxyUser);
System.setProperty("http.proxyPassword", ProxyPass);
System.setProperty("https.proxyUser", ProxyUser);
System.setProperty("https.proxyPassword", ProxyPass);
// 设置IP切换头
final String ProxyHeadKey = "Proxy-Tunnel";
// 设置Proxy-Tunnel
Random random = new Random();
int tunnel = random.nextInt(10000);
String ProxyHeadVal = String.valueOf(tunnel);
// 要访问的目标页面
String url = "http://httpbin.org/ip";
// 处理异常、其他参数
Document doc = Jsoup.connect(url).timeout(3000).header(ProxyHeadKey, ProxyHeadVal).get();
if(doc != null) {
System.out.println(doc.body().html());
}
}catch (IOException e)
{
e.printStackTrace();
}
}
}
JSoup cannot use Keep-Alive
JSoup closes the connection by default. For HTTP websites, set the same Proxy-Tunnel value to keep the same outbound IP. For HTTPS websites, use a different library to keep the same outbound IP.
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.URL;
import java.util.Random;
class ProxyAuthenticator extends Authenticator {
private String user, password;
public ProxyAuthenticator(String user, String password) {
this.user = user;
this.password = password;
}
protected PasswordAuthentication getPasswordAuthentication() {
return new PasswordAuthentication(user, password.toCharArray());
}
}
/**
* 注意:下面代码仅仅实现HTTP请求链接,每一次请求都是无状态保留的,仅仅是这次请求是更换IP的,如果下次请求的IP地址会改变
* 如果是多线程访问的话,只要将下面的代码嵌入到你自己的业务逻辑里面,那么每次都会用新的IP进行访问,如果担心IP有重复,
* 自己可以维护IP的使用情况,并做校验。
*/
public class Demo {
public static void main(String args[]) throws Exception {
// Change in Java 8 Update 111 以上版本需要下面代码
// System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "false");
// System.setProperty("jdk.http.auth.proxying.disabledSchemes", "false");
// 要访问的目标页面
String targetUrl = "http://httpbin.org/ip";
// 代理服务器(产品官网 www.16yun.cn)
String proxyServer = "t.16yun.cn";
int proxyPort = 31111;
// 代理验证信息
String proxyUser = "username";
String proxyPass = "password";
try {
URL url = new URL(targetUrl);
Authenticator.setDefault(new ProxyAuthenticator(proxyUser, proxyPass));
// 创建代理服务器地址对象
InetSocketAddress addr = new InetSocketAddress(proxyServer, proxyPort);
// 创建HTTP类型代理对象
Proxy proxy = new Proxy(Proxy.Type.HTTP, addr);
// 设置通过代理访问目标页面
HttpURLConnection connection = (HttpURLConnection) url.openConnection(proxy);
// 设置KeepAlive
// connection.setRequestProperty("Connection", "keep-alive");
// connection.setRequestProperty("Keep-Alive", "timeout=5, max=100");
// 设置Proxy-Tunnel
// Random random = new Random();
// int tunnel = random.nextInt(10000);
// connection.setRequestProperty("Proxy-Tunnel",String.valueOf(tunnel));
// 解析返回数据
byte[] response = readStream(connection.getInputStream());
System.out.println(new String(response));
} catch (Exception e) {
System.out.println(e.getLocalizedMessage());
}
}
/**
* 将输入流转换成字符串
*
* @param inStream
* @return
* @throws Exception
*/
public static byte[] readStream(InputStream inStream) throws Exception {
ByteArrayOutputStream outSteam = new ByteArrayOutputStream();
byte[] buffer = new byte[1024];
int len = -1;
while ((len = inStream.read(buffer)) != -1) {
outSteam.write(buffer, 0, len);
}
outSteam.close();
inStream.close();
return outSteam.toByteArray();
}
}
package htmlunit;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.impl.client.BasicCredentialsProvider;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class HtmlunitDemo {
// 代理服务器(产品官网 www.16yun.cn)
final static String proxyHost = "t.16yun.cn";
final static Integer proxyPort = 31111;
// 代理验证信息
final static String proxyUser = "USERNAME";
final static String proxyPass = "PASSWORD";
public static void main(String[] args) {
CredentialsProvider credsProvider = new BasicCredentialsProvider();
credsProvider.setCredentials(
new AuthScope(proxyHost, proxyPort),
new UsernamePasswordCredentials(proxyUser, proxyPass));
WebClient webClient = new WebClient(BrowserVersion.CHROME,proxyHost, proxyPort);
webClient.setCredentialsProvider(credsProvider);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setActiveXNative(false);
webClient.getOptions().setCssEnabled(false);
HtmlPage page = null;
try {
page = webClient.getPage("http://httpbin.org/ip");
} catch (Exception e) {
e.printStackTrace();
} finally {
webClient.close();
}
webClient.waitForBackgroundJavaScript(30000);
String pageXml = page.asXml();
System.out.println(pageXml);
}
}
import okhttp3.*;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.concurrent.TimeUnit;
public class OkHttp {
// 代理服务器(产品官网 www.16yun.cn)
final static String proxyHost = "t.16yun.cn";
final static Integer proxyPort = 31111;
// 代理验证信息
final static String proxyUser = "USERNAME";
final static String proxyPass = "PASSWORD";
static OkHttpClient client = null;
static {
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));
Authenticator proxyAuthenticator = new Authenticator() {
public Request authenticate(Route route, Response response) {
String credential = Credentials.basic(proxyUser, proxyPass);
return response.request().newBuilder()
.header("Proxy-Authorization", credential)
.build();
}
};
client = new OkHttpClient().newBuilder()
.connectTimeout(5, TimeUnit.SECONDS)
.readTimeout(5, TimeUnit.SECONDS)
.proxy(proxy)
.proxyAuthenticator(proxyAuthenticator)
.connectionPool(new ConnectionPool(5, 1, TimeUnit.SECONDS))
.build();
}
public static Response doGet() throws IOException {
// 要访问的目标页面
String targetUrl = "http://httpbin.org/ip";
Request request = new Request.Builder()
.url(targetUrl)
.build();
Response response = client.newCall(request).execute();
return response;
}
public static void main(String[] args) throws IOException {
Response response1 = doGet();
System.out.println("GET请求返回结果:");
System.out.println(response1.body().string());
}
}
5. Golang
package main
import (
"net/url"
"net/http"
"bytes"
"fmt"
"io/ioutil"
)
// 代理服务器(产品官网 www.16yun.cn)
const ProxyServer = "t.16yun.cn:31111"
type ProxyAuth struct {
Username string
Password string
}
func (p ProxyAuth) ProxyClient() http.Client {
var proxyURL *url.URL
if p.Username != ""&& p.Password!="" {
proxyURL, _ = url.Parse("http://" + p.Username + ":" + p.Password + "@" + ProxyServer)
}else{
proxyURL, _ = url.Parse("http://" + ProxyServer)
}
return http.Client{Transport: &http.Transport{Proxy:http.ProxyURL(proxyURL)}}
}
func main() {
targetURI := "https://httpbin.org/ip"
// 初始化 proxy http client
client := ProxyAuth{"username", "password"}.ProxyClient()
request, _ := http.NewRequest("GET", targetURI, bytes.NewBuffer([] byte(``)))
// 设置Proxy-Tunnel
// rand.Seed(time.Now().UnixNano())
// tunnel := rand.Intn(10000)
// request.Header.Set("Proxy-Tunnel", strconv.Itoa(tunnel) )
response, err := client.Do(request)
if err != nil {
panic("failed to connect: " + err.Error())
} else {
bodyByte, err := ioutil.ReadAll(response.Body)
if err != nil {
fmt.Println("读取 Body 时出错", err)
return
}
response.Body.Close()
body := string(bodyByte)
fmt.Println("Response Status:", response.Status)
fmt.Println("Response Header:", response.Header)
fmt.Println("Response Body:\n", body)
}
}
package main
import (
"crypto/tls"
"fmt"
"io/ioutil"
"net"
"net/http"
"net/http/httputil"
"net/url"
"time"
)
func main() {
for {
// 代理服务器的用户名和密码
proxyUsername := "username"
proxyPassword := "password"
// 代理服务器(产品官网 www.16yun.cn)
// 代理服务器的 URL
proxyURL, err := url.Parse(fmt.Sprintf("http://%s:%s@t.16yun.cn:31111", proxyUsername, proxyPassword))
if err != nil {
fmt.Println(err)
return
}
// 添加自定义头部,
//rand.Seed(time.Now().UnixNano())
//tunnel := rand.Intn(10000)
//proxyHeaders.Add("Proxy-Tunnel", strconv.Itoa(tunnel))
proxyHeaders := http.Header{}
// 设置为固定的数字 ,后面所有请求都会固定到一个IP上
// proxyHeaders.Add("Proxy-Tunnel", "1")
// 定制Transport
tr := &http.Transport{
Proxy: http.ProxyURL(proxyURL),
TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // 如果需要跳过证书验证
// 自定义DialContext函数
DialContext: (&net.Dialer{
Timeout: 30 * time.Second,
KeepAlive: 30 * time.Second,
DualStack: true,
}).DialContext,
ProxyConnectHeader: proxyHeaders,
}
// 定制Client
client := &http.Client{
Transport: tr,
}
// 发起请求
req, err := http.NewRequest("GET", "https://httpbin.org/ip", nil)
if err != nil {
fmt.Println(err)
return
}
// 使用httputil.DumpRequest输出完整的请求
dump, err := httputil.DumpRequestOut(req, true)
if err != nil {
fmt.Println(err)
return
}
fmt.Println(string(dump))
// 发送请求
resp, err := client.Do(req)
if err != nil {
fmt.Println("error")
fmt.Println(err)
return
}
defer resp.Body.Close()
// 读取响应体
body, err := ioutil.ReadAll(resp.Body)
if err != nil {
fmt.Println(err)
return
}
// 打印响应体
fmt.Println("Response Body:", string(body))
time.Sleep(time.Second * 1)
}
}
6. PhantomJS / CasperJS / Playwright
Pass the proxy information as command-line parameters, for example:
phantomjs --proxy-auth=USERNAME:PASSWORD --proxy=http://t.16yun.cn:31111 --ignore-ssl-errors=true http-demo.js
The content of http-demo.js is as follows:
var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 UCBrowser/9.4.1.362 U3/0.8.0 Mobile Safari/533.1';
console.log('The user agent is ' + page.settings.userAgent);
// 生成一个随机 proxy tunnel
var seed = 1;
function random() {
var x = Math.sin(seed++) * 10000;
return x - Math.floor(x);
}
const tunnel = random()*100;
//page.customHeaders = {
// "proxy-tunnel": tunnel,
//};
page.onResourceReceived = function(j) {
for (var i = 0; i < j.headers.length; ++i) {
console.log(j.headers[i].name + ': ' + j.headers[i].value);
}
};
page.open("http://httpbin.org/ip", {}, function(status) {
console.log('status> ' + status);
console.log(page.content);
setTimeout(function() {
phantom.exit();
}, 3000);
});
Pass the proxy information as command-line parameters, for example:
casperjs --proxy-auth=USERNAME:PASSWORD --proxy=http://t.16yun.cn:31111 --ignore-ssl-errors=true --ssl-protocol=any http-demo.js
The content of http-demo.js is as follows:
var casper = require('casper').create();
// 生成一个随机 proxy tunnel
var seed = 1;
function random() {
var x = Math.sin(seed++) * 10000;
return x - Math.floor(x);
}
const tunnel = random()*1000;
casper.on('started', function () {
this.page.customHeaders = {
"User-Agent" : "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate",
"Connection" : "keep-alive",
"Proxy-Tunnel": tunnel
}
});
casper.start("http://httpbin.org/headers");
casper.then(function() {
console.log('First Page: ' + this.page.content);
});
casper.run();
const { chromium, webkit, firefox } = require('playwright');
(async () => {
const browser = await chromium.launch({
proxy: {
server: 'http://t.16yun.cn:31111',
username: 'username',
password: 'password'
}
});
const page = await browser.newPage();
// Subscribe to 'request' and 'response' events.
page.on('request', request =>
console.log('>>', request.method(), request.url()));
page.on('response', response =>
console.log('<<', response.status(), response.url()));
await page.goto('https://httpbin.org/ip');
await browser.close();
})();
7. Node.js
const http = require("http");
const url = require("url");
// 要访问的目标页面
const targetUrl = "http://httpbin.org/ip";
const urlParsed = url.parse(targetUrl);
// 代理服务器(产品官网 www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = "36600";
// 生成一个随机 proxy tunnel
var seed = 1;
function random() {
var x = Math.sin(seed++) * 10000;
return x - Math.floor(x);
}
const tunnel = random()*100;
// 代理验证信息
const proxyUser = "username";
const proxyPass = "password";
const base64 = Buffer.from(proxyUser + ":" + proxyPass).toString("base64");
const options = {
host: proxyHost,
port: proxyPort,
path: targetUrl,
method: "GET",
headers: {
"Host": urlParsed.hostname,
"Proxy-Tunnel": tunnel,
"Proxy-Authorization" : "Basic " + base64
}
};
http.request(options, function (res) {
console.log("got response: " + res.statusCode);
res.pipe(process.stdout);
}).on("error", function (err) {
console.log(err);
}).end();
const https = require("https");
const url = require("url");
const httpsProxyAgent = require('https-proxy-agent');
// 要访问的目标页面
const targetUrl = "https://httpbin.org/ip";
const urlParsed = url.parse(targetUrl);
// 代理服务器(产品官网 www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = "31111";
// 代理验证信息
const proxyUser = "username";
const proxyPass = "password";
var options = urlParsed;
var agent = new httpsProxyAgent("http://" + proxyUser + ":" + proxyPass + "@" + proxyHost + ":" + proxyPort);
options.agent = agent;
https.request(options, function (res) {
console.log("got response: " + res.statusCode);
res.pipe(process.stdout);
}).on("error", function (err) {
console.log(err);
}).end();
const https = require("https");
const url = require("url");
const httpsProxyAgent = require('https-proxy-agent');
// 要访问的目标页面
const targetUrl = "https://httpbin.org/ip";
const urlParsed = url.parse(targetUrl);
// 代理服务器(产品官网 www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = "31111";
// 代理验证信息
const proxyUser = "username";
const proxyPass = "password";
var options = urlParsed;
const proxy_url = "http://" + proxyUser + ":" + proxyPass + "@" + proxyHost + ":" + proxyPort;
var agent_options = url.parse(proxy_url);
agent_options.headers = { "Proxy-Tunnel" : "1" };
var agent = new httpsProxyAgent(agent_options);
options.agent = agent;
for(var i=0;i<10;i++){
https.request(options, function (res) {
console.log("got response: " + res.statusCode);
res.pipe(process.stdout);
}).on("error", function (err) {
console.log(err);
}).end();
}
const request = require("request");
// 要访问的目标页面
const targetUrl = "http://httpbin.org/ip";
// 代理服务器(产品官网 www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = "31111";
// 代理验证信息
const proxyUser = "username";
const proxyPass = "password";
const proxyUrl = "http://" + proxyUser + ":" + proxyPass + "@" + proxyHost + ":" + proxyPort;
const proxiedRequest = request.defaults({'proxy': proxyUrl});
const options = {
url : targetUrl,
headers : {
}
};
proxiedRequest
.get(options, function (err, res, body) {
console.log("got response: " + res.statusCode);
})
.on("error", function (err) {
console.log(err);
})
;
const request = require("superagent");
require("superagent-proxy")(request);
// 要访问的目标页面
const targetUrl = "http://httpbin.org/ip";
// 代理服务器(产品官网 www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = 31111;
// 代理验证信息
const proxyUser = "username";
const proxyPass = "password";
const proxyUrl = "http://" + proxyUser + ":" + proxyPass + "@" + proxyHost + ":" + proxyPort;
request
.get(targetUrl)
.proxy(proxyUrl)
.end(function onResponse(err, res) {
if (err) {
return console.log(err);
}
console.log(res.status, res.headers);
console.log(res.text);
})
;
const axios = require('axios');
// 要访问的目标页面
const targetUrl = "http://httpbin.org/ip";
const targetHttpsUrl = "https://httpbin.org/ip";
// 代理服务器(产品官网 www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = 31111;
// 代理验证信息
const proxyUser = "username";
const proxyPass = "password";
var proxy = {
host: proxyHost,
port: proxyPort,
auth: {
username: proxyUser,
password: proxyPass
}
};
axios.get(targetUrl,{proxy:proxy})
.then(function (response) {
// handle success
console.log(response.data);
})
.catch(function (error) {
// handle error
console.log(error);
})
.finally(function () {
// always executed
});
// For HTTPS target sites, axios's proxy support has a bug and is not recommended
// See https://github.com/axios/axios/issues/4531 for details
8. Selenium
The following problems come up often when running Selenium programs and deserve particular attention:
1. A popup asks you to enter the username and password manually:
- Choose the demo that matches your browser version (for example Chrome >= 92); check the demo titles when selecting.
- The browser version and the browser driver must match; for example, browser version 100 needs the version-100 driver.
2. Permission error on the temporary file directory:
- Make sure the temporary directory is writable, for example plugin_path = r'/tmp/{}_{}@t.16yun.zip'. This file temporarily stores the username and password; if the directory does not exist, the run fails or a popup asks for the username and password.
3. The program is detected by the target website:
- Set the run mode to avoid anti-crawling detection. In a normally running browser, navigator.webdriver should be undefined or false; if it is true, the target site can detect Selenium. Enabling developer mode prevents the target site from detecting it (see the sketch after this list).
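A short sketch of those anti-detection options (they also appear, commented out, in the demos below; it assumes chromedriver is on the PATH and omits the proxy extension for brevity):

from selenium import webdriver

option = webdriver.ChromeOptions()
# Hide the "Chrome is being controlled by automated test software" automation flag
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = webdriver.Chrome(options=option)

# Make navigator.webdriver report undefined before any page script runs
script = '''
Object.defineProperty(navigator, 'webdriver', {
  get: () => undefined
})
'''
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})
driver.get("https://httpbin.org/ip")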
from selenium import webdriver
import string
import zipfile
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
def create_proxy_auth_extension(proxy_host, proxy_port,
proxy_username, proxy_password,
scheme='http', plugin_path=None):
if plugin_path is None:
plugin_path = r'D:/{}_{}@t.16yun.zip'.format(proxy_username, proxy_password)
manifest_json = """
{
"version": "1.0.0",
"manifest_version": 2,
"name": "16YUN Proxy",
"permissions": [
"proxy",
"tabs",
"unlimitedStorage",
"storage",
"",
"webRequest",
"webRequestBlocking"
],
"background": {
"scripts": ["background.js"]
},
"minimum_chrome_version":"22.0.0"
}
"""
background_js = string.Template(
"""
var config = {
mode: "fixed_servers",
rules: {
singleProxy: {
scheme: "${scheme}",
host: "${host}",
port: parseInt(${port})
},
bypassList: ["foobar.com"]
}
};
chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
function callbackFn(details) {
return {
authCredentials: {
username: "${username}",
password: "${password}"
}
};
}
chrome.webRequest.onAuthRequired.addListener(
callbackFn,
{urls: [""]},
['blocking']
);
"""
).substitute(
host=proxy_host,
port=proxy_port,
username=proxy_username,
password=proxy_password,
scheme=scheme,
)
with zipfile.ZipFile(plugin_path, 'w') as zp:
zp.writestr("manifest.json", manifest_json)
zp.writestr("background.js", background_js)
return plugin_path
proxy_auth_plugin_path = create_proxy_auth_extension(
proxy_host=proxyHost,
proxy_port=proxyPort,
proxy_username=proxyUser,
proxy_password=proxyPass)
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")
# 如报错 chrome-extensions
# option.add_argument("--disable-extensions")
option.add_extension(proxy_auth_plugin_path)
# 关闭webdriver的一些标志
# option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = webdriver.Chrome(chrome_options=option)
# 修改webdriver get属性
# script = '''
# Object.defineProperty(navigator, 'webdriver', {
# get: () => undefined
# })
# '''
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})
driver.get("http://httpbin.org/ip")
from selenium import webdriver
import string
import zipfile
# 代理服务器(产品官网 www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "3111"
# 代理验证信息
proxyUser = "username"
proxyPass = "password"
def create_proxy_auth_extension(proxy_host, proxy_port,
proxy_username, proxy_password,
scheme='http', plugin_path=None):
if plugin_path is None:
plugin_path = r'/tmp/{}_{}@t.16yun.zip'.format(proxy_username, proxy_password)
manifest_json = """
{
"version": "1.0.0",
"manifest_version": 2,
"name": "16YUN Proxy",
"permissions": [
"proxy",
"tabs",
"unlimitedStorage",
"storage",
"<all_urls>",
"webRequest",
"webRequestBlocking"
],
"background": {
"scripts": ["background.js"]
},
"minimum_chrome_version":"22.0.0"
}
"""
background_js = string.Template(
"""
var config = {
mode: "fixed_servers",
rules: {
singleProxy: {
scheme: "${scheme}",
host: "${host}",
port: parseInt(${port})
},
bypassList: ["localhost"]
}
};
chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
function callbackFn(details) {
return {
authCredentials: {
username: "${username}",
password: "${password}"
}
};
}
chrome.webRequest.onAuthRequired.addListener(
callbackFn,
{urls: ["<all_urls>"]},
['blocking']
);
"""
).substitute(
host=proxy_host,
port=proxy_port,
username=proxy_username,
password=proxy_password,
scheme=scheme,
)
print(background_js)
with zipfile.ZipFile(plugin_path, 'w') as zp:
zp.writestr("manifest.json", manifest_json)
zp.writestr("background.js", background_js)
return plugin_path
proxy_auth_plugin_path = create_proxy_auth_extension(
proxy_host=proxyHost,
proxy_port=proxyPort,
proxy_username=proxyUser,
proxy_password=proxyPass)
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")
# 如报错 chrome-extensions
# option.add_argument("--disable-extensions")
option.add_extension(proxy_auth_plugin_path)
# 关闭webdriver的一些标志
# option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = webdriver.Chrome(
chrome_options=option,
executable_path="./chromedriver"
)
# 修改webdriver get属性
# script = '''
# Object.defineProperty(navigator, 'webdriver', {
# get: () => undefined
# })
# '''
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})
driver.get("https://httpbin.org/ip")
Demo source code: using Selenium to log in and collect cookies
import os
import random
import time
import zipfile
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
class GenCookies(object):
# 随机useragent
USER_AGENT = open('useragents.txt').readlines()
# 代理服务器(产品官网 www.16yun.cn)
PROXY_HOST = 't.16yun.cn' # proxy or host
PROXY_PORT = 31111 # port
PROXY_USER = 'USERNAME' # username
PROXY_PASS = 'PASSWORD' # password
@classmethod
def get_chromedriver(cls, use_proxy=False, user_agent=None):
manifest_json = """
{
"version": "1.0.0",
"manifest_version": 2,
"name": "Chrome Proxy",
"permissions": [
"proxy",
"tabs",
"unlimitedStorage",
"storage",
"<all_urls>",
"webRequest",
"webRequestBlocking"
],
"background": {
"scripts": ["background.js"]
},
"minimum_chrome_version":"22.0.0"
}
"""
background_js = """
var config = {
mode: "fixed_servers",
rules: {
singleProxy: {
scheme: "http",
host: "%s",
port: parseInt(%s)
},
bypassList: ["localhost"]
}
};
chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
function callbackFn(details) {
return {
authCredentials: {
username: "%s",
password: "%s"
}
};
}
chrome.webRequest.onAuthRequired.addListener(
callbackFn,
{urls: ["<all_urls>"]},
['blocking']
);
""" % (cls.PROXY_HOST, cls.PROXY_PORT, cls.PROXY_USER, cls.PROXY_PASS)
path = os.path.dirname(os.path.abspath(__file__))
chrome_options = webdriver.ChromeOptions()
# 关闭webdriver的一些标志
# chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
if use_proxy:
pluginfile = 'proxy_auth_plugin.zip'
with zipfile.ZipFile(pluginfile, 'w') as zp:
zp.writestr("manifest.json", manifest_json)
zp.writestr("background.js", background_js)
chrome_options.add_extension(pluginfile)
if user_agent:
chrome_options.add_argument('--user-agent=%s' % user_agent)
driver = webdriver.Chrome(
os.path.join(path, 'chromedriver'),
chrome_options=chrome_options)
# 修改webdriver get属性
# script = '''
# Object.defineProperty(navigator, 'webdriver', {
# get: () => undefined
# })
# '''
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})
return driver
def __init__(self, username, password):
# 登录example网站
self.url = 'https://passport.example.cn/signin/login?entry=example&r=https://m.example.cn/'
self.browser = self.get_chromedriver(use_proxy=True, user_agent=random.choice(self.USER_AGENT).strip())
self.wait = WebDriverWait(self.browser, 20)
self.username = username
self.password = password
def open(self):
"""
打开网页输入用户名密码并点击
:return: None
"""
self.browser.delete_all_cookies()
self.browser.get(self.url)
username = self.wait.until(EC.presence_of_element_located((By.ID, 'loginName')))
password = self.wait.until(EC.presence_of_element_located((By.ID, 'loginPassword')))
submit = self.wait.until(EC.element_to_be_clickable((By.ID, 'loginAction')))
username.send_keys(self.username)
password.send_keys(self.password)
time.sleep(1)
submit.click()
def password_error(self):
"""
判断是否密码错误
:return:
"""
try:
return WebDriverWait(self.browser, 5).until(
EC.text_to_be_present_in_element((By.ID, 'errorMsg'), '用户名或密码错误'))
except TimeoutException:
return False
def get_cookies(self):
"""
获取Cookies
:return:
"""
return self.browser.get_cookies()
def main(self):
"""
入口
:return:
"""
self.open()
if self.password_error():
return {
'status': 2,
'content': '用户名或密码错误'
}
cookies = self.get_cookies()
return {
'status': 1,
'content': cookies
}
if __name__ == '__main__':
result = GenCookies(
username='180000000',
password='16yun',
).main()
print(result)
import org.openqa.selenium.Platform;
import org.openqa.selenium.Proxy;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;
import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
import com.gargoylesoftware.htmlunit.WebClient;

public class HtmlUnitDriverProxyDemo
{
    // Proxy authentication credentials
    final static String proxyUser = "username";
    final static String proxyPass = "password";
    // Proxy server
    final static String proxyServer = "t.16yun.cn:31111";

    public static void main(String[] args)
    {
        HtmlUnitDriver driver = getHtmlUnitDriver();
        driver.get("https://httpbin.org/ip");
        String title = driver.getTitle();
        System.out.println(title);
    }

    public static HtmlUnitDriver getHtmlUnitDriver()
    {
        HtmlUnitDriver driver = null;
        Proxy proxy = new Proxy();
        proxy.setHttpProxy(proxyServer);
        DesiredCapabilities capabilities = DesiredCapabilities.htmlUnit();
        capabilities.setCapability(CapabilityType.PROXY, proxy);
        capabilities.setJavascriptEnabled(true);
        capabilities.setPlatform(Platform.WIN8_1);
        driver = new HtmlUnitDriver(capabilities) {
            @Override
            protected WebClient modifyWebClient(WebClient client) {
                // Hand the proxy username/password to HtmlUnit's credentials provider
                DefaultCredentialsProvider creds = new DefaultCredentialsProvider();
                creds.addCredentials(proxyUser, proxyPass);
                client.setCredentialsProvider(creds);
                return client;
            }
        };
        driver.setJavascriptEnabled(true);
        return driver;
    }
}
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;

public class FirefoxDriverProxyDemo
{
    // Proxy tunnel authentication credentials
    final static String proxyUser = "username";
    final static String proxyPass = "password";
    // Proxy server
    final static String proxyHost = "t.16yun.cn";
    final static int proxyPort = 31111;
    final static String firefoxBin = "C:/Program Files/Mozilla Firefox/firefox.exe";

    public static void main(String[] args)
    {
        System.setProperty("webdriver.firefox.bin", firefoxBin);
        FirefoxProfile profile = new FirefoxProfile();
        // Route HTTP and SSL traffic through the proxy
        profile.setPreference("network.proxy.type", 1);
        profile.setPreference("network.proxy.http", proxyHost);
        profile.setPreference("network.proxy.http_port", proxyPort);
        profile.setPreference("network.proxy.ssl", proxyHost);
        profile.setPreference("network.proxy.ssl_port", proxyPort);
        profile.setPreference("username", proxyUser);
        profile.setPreference("password", proxyPass);
        profile.setPreference("network.proxy.share_proxy_settings", true);
        profile.setPreference("network.proxy.no_proxies_on", "localhost");
        FirefoxDriver driver = new FirefoxDriver(profile);
        // Example request through the proxy
        driver.get("https://httpbin.org/ip");
    }
}
<?php
// Assumes the php-webdriver Composer package (Facebook\WebDriver namespace) is installed
require_once 'vendor/autoload.php';

use Facebook\WebDriver\Chrome\ChromeDriver;
use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// Build a temporary Chrome extension that handles the proxy login
$pluginForProxyLogin = '/tmp/a'.uniqid().'.zip';
$zip = new ZipArchive();
$res = $zip->open($pluginForProxyLogin, ZipArchive::CREATE | ZipArchive::OVERWRITE);
$zip->addFile('/path/to/Chrome-proxy-helper/manifest.json', 'manifest.json');
$background = file_get_contents('/path/to/Chrome-proxy-helper/background.js');
$background = str_replace(['%proxy_host', '%proxy_port', '%username', '%password'], ['t.16yun.cn', '31111', 'username', 'password'], $background);
$zip->addFromString('background.js', $background);
$zip->close();
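// Tell php-webdriver where to find the chromedriver binary (read by ChromeDriver::start())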
putenv("webdriver.chrome.driver=/path/to/chromedriver");
$options = new ChromeOptions();
$options->addExtensions([$pluginForProxyLogin]);
$caps = DesiredCapabilities::chrome();
$caps->setCapability(ChromeOptions::CAPABILITY, $options);
$driver = ChromeDriver::start($caps);
$driver->get('https://httpbin.org/ip');
header('Content-Type: image/png');
echo $driver->takeScreenshot();
unlink($pluginForProxyLogin);
9、Puppeteer
const puppeteer = require('puppeteer');
// Proxy server (product site: www.16yun.cn)
const proxyServer = 'http://t.16yun.cn:31111';
const username = 'username';
const password = 'password';

(async () => {
    const browser = await puppeteer.launch({
        args: ['--proxy-server=' + proxyServer, '--no-sandbox', '--disable-setuid-sandbox']
    });
    const page = await browser.newPage();
    // Authenticate against the proxy for this page
    await page.authenticate({ username, password });
    await page.goto('https://www.baidu.com');
    const cookies = await page.cookies();
    console.log(cookies);
    await page.setViewport({width: 320, height: 480});
    await page.screenshot({path: '/screenshots/full.png', fullPage: true});
    await browser.close();
})();
Create a middlewares.py file in the project (./<project name>/middlewares.py)
#! -*- encoding:utf-8 -*-
import websockets
from scrapy.http import HtmlResponse
from logging import getLogger
import asyncio
import pyppeteer
import logging
from concurrent.futures._base import TimeoutError
import base64
import sys
import random

pyppeteer_level = logging.WARNING
logging.getLogger('websockets.protocol').setLevel(pyppeteer_level)
logging.getLogger('pyppeteer').setLevel(pyppeteer_level)

PY3 = sys.version_info[0] >= 3

def base64ify(bytes_or_str):
    if PY3 and isinstance(bytes_or_str, str):
        input_bytes = bytes_or_str.encode('utf8')
    else:
        input_bytes = bytes_or_str
    output_bytes = base64.urlsafe_b64encode(input_bytes)
    if PY3:
        return output_bytes.decode('ascii')
    else:
        return output_bytes
class ProxyMiddleware(object):
    # Load random User-Agents (optional)
    # USER_AGENT = open('useragents.txt').readlines()
    def process_request(self, request, spider):
        # Proxy server
        proxyHost = "t.16yun.cn"
        proxyPort = "31111"
        # Proxy authentication credentials
        proxyUser = "username"
        proxyPass = "password"
        request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)
        # Add the authentication header
        encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
        # Set the IP-switching header (optional)
        tunnel = random.randint(1, 10000)
        request.headers['Proxy-Tunnel'] = str(tunnel)
        # Set a random User-Agent (optional)
        # request.headers['User-Agent'] = random.choice(self.USER_AGENT)
class PyppeteerMiddleware(object):
    def __init__(self, **args):
        """
        init logger, loop, browser
        :param args:
        """
        self.logger = getLogger(__name__)
        self.loop = asyncio.get_event_loop()
        self.browser = self.loop.run_until_complete(
            pyppeteer.launch(headless=True))
        self.args = args

    def __del__(self):
        """
        close loop
        :return:
        """
        self.loop.close()

    def render(self, url, retries=1, script=None, wait=0.3, scrolldown=False, sleep=0,
               timeout=8.0, keep_page=False):
        """
        render page with pyppeteer
        :param url: page url
        :param retries: max retry times
        :param script: js script to evaluate
        :param wait: number of seconds to wait before loading the page, preventing timeouts
        :param scrolldown: how many times to page down
        :param sleep: how long to sleep after the initial render
        :param timeout: the longest wait time, otherwise raise a timeout error
        :param keep_page: keep the page open instead of closing it
        :return: content, result, status
        """
        # define async render
        async def async_render(url, script, scrolldown, sleep, wait, timeout, keep_page):
            try:
                # basic render
                page = await self.browser.newPage()
                await asyncio.sleep(wait)
                response = await page.goto(url, options={'timeout': int(timeout * 1000)})
                if response.status != 200:
                    return None, None, response.status
                result = None
                # evaluate with script
                if script:
                    result = await page.evaluate(script)
                # scroll down for {scrolldown} times
                if scrolldown:
                    for _ in range(scrolldown):
                        await page._keyboard.down('PageDown')
                        await asyncio.sleep(sleep)
                else:
                    await asyncio.sleep(sleep)
                if scrolldown:
                    await page._keyboard.up('PageDown')
                # get html of page
                content = await page.content()
                return content, result, response.status
            except TimeoutError:
                return None, None, 500
            finally:
                # if keep page, do not close it
                if not keep_page:
                    await page.close()

        content, result, status = [None] * 3
        # retry for {retries} times
        for i in range(retries):
            if not content:
                content, result, status = self.loop.run_until_complete(
                    async_render(url=url, script=script, sleep=sleep, wait=wait,
                                 scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
            else:
                break
        # if need to return js evaluation result
        return content, result, status
    def process_request(self, request, spider):
        """
        :param request: request object
        :param spider: spider object
        :return: HtmlResponse
        """
        if request.meta.get('render'):
            try:
                self.logger.debug('rendering %s', request.url)
                html, result, status = self.render(request.url)
                return HtmlResponse(url=request.url, body=html, request=request,
                                    encoding='utf-8', status=status)
            except websockets.exceptions.ConnectionClosed:
                pass

    @classmethod
    def from_crawler(cls, crawler):
        return cls(**crawler.settings.get('PYPPETEER_ARGS', {}))
DOWNLOADER_MIDDLEWARES = {
    'scrapypyppeteer.middlewares.PyppeteerMiddleware': 543,
    'scrapypyppeteer.middlewares.ProxyMiddleware': 100,
}
Download scrapy-pyppeteer
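For reference, the middleware above is triggered per request via request.meta. The sketch below is a hedged illustration, not part of the original example: the spider name and start URL are hypothetical, and the module path 'scrapypyppeteer' is taken from the settings snippet above. Requests with meta={'render': True} are rendered by PyppeteerMiddleware, while every outgoing request passes through ProxyMiddleware.

# Minimal spider sketch (hypothetical names); rendering is requested via meta={'render': True}
import scrapy

class RenderDemoSpider(scrapy.Spider):
    name = 'render_demo'

    def start_requests(self):
        # Ask PyppeteerMiddleware to render this page in a headless browser
        yield scrapy.Request('https://httpbin.org/ip', meta={'render': True})

    def parse(self, response):
        # The rendered HTML arrives as a normal HtmlResponse
        self.logger.info(response.text)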
10、易语言 (Easy Language)
11、mitmproxy
- Download and install mitmproxy (https://www.mitmproxy.org/)
- After installation, start the mitmproxy web UI and open http://127.0.0.1:8081 in a browser to reach the mitmproxy web control page
- In the web control page, click 「mitmproxy」-「Options」-「Edit Options」 to configure
- Set the mode option to upstream:http://t.16yun.cn:31111 (replace the proxy server address and port with your own); a verification sketch follows below
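The sketch below is a hedged illustration, not part of the original steps: it assumes mitmproxy is listening locally on its default port 8080 and simply points a requests client at it, so traffic flows through mitmproxy and on to the upstream crawler proxy configured above. If the upstream proxy requires username/password authentication, mitmproxy's upstream_auth option (username:password) can be set in the same Options page.

# Hedged sketch: send traffic through the local mitmproxy (default listen port 8080),
# which forwards it upstream to the crawler proxy configured in the steps above.
import requests

local_mitm = "http://127.0.0.1:8080"
proxies = {"http": local_mitm, "https": local_mitm}

# verify=False because mitmproxy re-signs HTTPS traffic with its own CA
# unless its certificate has been installed locally.
resp = requests.get("https://httpbin.org/ip", proxies=proxies, verify=False)
print(resp.text)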
12、C++
#include <string>
#include <curl/curl.h>
#include <iostream>
static size_t WriteCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    ((std::string*)userp)->append((char*)contents, size * nmemb);
    return size * nmemb;
}
int main(void)
{
    CURL* curl;
    CURLcode res;
    std::string readBuffer;

    curl = curl_easy_init();
    if (curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "https://httpbin.org/ip");
        curl_easy_setopt(curl, CURLOPT_PROXY, "http://t.16yun.cn:31111");
        curl_easy_setopt(curl, CURLOPT_PROXYUSERPWD, "username:password");
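        /* Optional, hedged sketch (not part of the original example): to keep the same
           exit IP across requests, the Proxy-Tunnel header can be sent to the proxy
           itself with CURLOPT_PROXYHEADER (libcurl >= 7.37.0), for example:
             struct curl_slist *proxy_headers = curl_slist_append(NULL, "Proxy-Tunnel: 12345");
             curl_easy_setopt(curl, CURLOPT_HEADEROPT, CURLHEADER_SEPARATE);
             curl_easy_setopt(curl, CURLOPT_PROXYHEADER, proxy_headers);
           Remember to call curl_slist_free_all(proxy_headers) after curl_easy_cleanup(). */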
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);

        /* Perform the request, res will get the return code */
        res = curl_easy_perform(curl);
        if (res != CURLE_OK) {
            fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
        }
        /* always cleanup */
        curl_easy_cleanup(curl);
        std::cout << readBuffer << std::endl;
    }
    return 0;
}
13、Further crawler application case studies
For users' broader application scenarios, we provide the following case studies; please contact technical support if you need them:
- Crawler basics - development knowledge and tips
- Crawler basics - Selenium anti-scraping
- Crawler basics - Puppeteer rendering with the Scrapy framework
- Crawler basics - generating cookies via Selenium login
- Crawler basics - Splash rendering with the Scrapy framework
- Crawler basics - packet capture with LightProxy
- A comprehensive collection of Python crawler resources (Chinese edition)
- Crawler basics - the Scrapy framework
- Crawler basics - capturing data with Firefox
- Crawler basics - Python crawlers
- Crawler basics - the HTTP protocol flow