
Tunnel Forwarding

I. Connecting via Tunnel Forwarding

Usage Guide

1. Keeping the Same IP / Switching IPs

This mode suits crawlers that need precise control over when the IP changes, such as login flows or cookie/session handling. The crawler sets the HTTP header Proxy-Tunnel: <random number>; requests carrying the same random number reach the target site through the same proxy IP.

Example

If a login request and a data request must share one IP, set the same Proxy-Tunnel on both, e.g. Proxy-Tunnel: 12345; the group then uses the same proxy IP for the validity period of that proxy.

Note

Different request groups can use different Proxy-Tunnel values at the same time, so data can be crawled concurrently (see the sketch below).
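
A minimal sketch with Python requests (host, port, and credentials are the same placeholders used in the examples below): two request groups run with different Proxy-Tunnel values, and all requests within one group exit through one proxy IP.

    import random
    import requests

    proxies = {
        "http": "http://username:password@t.16yun.cn:31111",
        "https": "http://username:password@t.16yun.cn:31111",
    }

    def fetch_group(tunnel):
        # All requests carrying the same Proxy-Tunnel value share one proxy IP
        headers = {"Proxy-Tunnel": str(tunnel)}
        login = requests.get("http://httpbin.org/ip", proxies=proxies, headers=headers)
        data = requests.get("http://httpbin.org/ip", proxies=proxies, headers=headers)
        return login.text, data.text

    # Two independent groups can crawl concurrently under different tunnel values
    print(fetch_group(random.randint(1, 10000)))
    print(fetch_group(random.randint(1, 10000)))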

Accessing an HTTPS target site with the same IP

1. Access the target site with Connection: keep-alive and Proxy-Connection: keep-alive; the proxy then ensures that all requests within one session reach the target site through a single IP. 2. Set the same Proxy-Tunnel on each request; some libraries are wrapped at a high level, so confirm that this HTTP header is actually sent to the proxy.

  • Per-TCP-connection switching means the proxy assigns a random proxy IP to every TCP connection the crawler opens; the IP does not change within one TCP session.
  • Setting Proxy-Connection: Keep-Alive and Connection: Keep-Alive keeps the proxy IP unchanged for the whole session.

Example

If a login request and a data request must share one IP, just keep that group of requests on one TCP (Keep-Alive) session; the group then uses the same proxy IP for the proxy's validity period.

HTTPS

When the proxy is used to access an HTTPS site, KeepAlive is enabled automatically and the proxy IP stays the same for the whole HTTPS session. To force an IP switch on every request, set Proxy-Connection: Close or Connection: Close (see the sketch below).
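
A sketch of forcing a per-request switch over HTTPS, reusing the custom-adapter pattern from the requests examples below; the CloseAdapter subclass is an illustration (proxy_headers is a real requests.adapters.HTTPAdapter hook), not a definitive implementation.

    import requests
    import requests.adapters

    class CloseAdapter(requests.adapters.HTTPAdapter):
        def proxy_headers(self, proxy):
            headers = super(CloseAdapter, self).proxy_headers(proxy)
            # Ask the proxy to drop the tunnel after each request,
            # forcing a fresh exit IP on the next connection
            headers["Proxy-Connection"] = "Close"
            return headers

    proxies = {"https": "http://username:password@t.16yun.cn:31111"}

    s = requests.Session()
    s.mount("https://", CloseAdapter())

    for _ in range(3):
        # Each request should now come from a different proxy IP
        print(s.get("https://httpbin.org/ip", proxies=proxies).text)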

2. Username/Password Authentication

  • Authentication is done with a username and password; the credentials are converted into a Proxy-Authorization header that is sent along with each request.
  • If authentication fails, the system returns 401 Unauthorized or 407 Proxy Authentication Required.

Example

When using the HTTP tunnel in code, if your HTTP client cannot set the credentials as a username/password pair directly, add a Proxy-Authorization header to every HTTP request manually, with the value Basic <base64>, where <base64> is the string obtained by joining "username" and "password" with : and then BASE64-encoding the result. When set correctly, every outgoing request carries a header of the form: Proxy-Authorization: Basic MTZZVU4xMjM6MTIzNDMyMw== (see the sketch below).
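
A minimal sketch of that encoding step in Python (username/password are placeholders):

    import base64

    proxyUser = "username"
    proxyPass = "password"

    # base64("username:password") -> value for the Proxy-Authorization header
    token = base64.b64encode((proxyUser + ":" + proxyPass).encode("utf8")).decode("ascii")
    headers = {"Proxy-Authorization": "Basic " + token}
    print(headers)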

Note

Prefer Proxy-Authorization for credential authentication. If you use Authorization instead, that header is sent on to the target site with the request. When accessing HTTPS sites, use your library's built-in proxy authentication; a manually set Proxy-Authorization header is forwarded by the proxy straight to the target site on HTTPS requests, which breaks anonymity.
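
For example, with requests the credentials can simply be embedded in the proxy URL; the library then sends Proxy-Authorization only to the proxy during the CONNECT handshake, not to the HTTPS target:

    import requests

    # Credentials embedded in the proxy URL stay between client and proxy
    proxies = {"https": "http://username:password@t.16yun.cn:31111"}
    print(requests.get("https://httpbin.org/ip", proxies=proxies).text)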

DNS Resolution Failures

The proxy domain uses a short DNS TTL (multi-server, multi-region hot standby). If resolving the proxy domain fails, use 114.114.114.114 or your ISP's DNS servers for resolution (a resolution sketch follows).
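
A resolution sketch, assuming the third-party dnspython package is installed (pip install dnspython):

    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["114.114.114.114"]  # fallback public DNS

    # Resolve the proxy domain explicitly against the fallback DNS server
    for record in resolver.resolve("t.16yun.cn", "A"):
        print(record.address)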

II. Tunnel Forwarding Code Examples

1. Python

requests

    #! -*- encoding:utf-8 -*-

    import requests
    import random

    # Target page to access
    targetUrl = "http://httpbin.org/ip"

    # Target HTTPS page to access
    # targetUrl = "https://httpbin.org/ip"

    # Proxy server (product site: www.16yun.cn)
    proxyHost = "t.16yun.cn"
    proxyPort = "31111"

    # Proxy credentials
    proxyUser = "username"
    proxyPass = "password"

    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host" : proxyHost,
        "port" : proxyPort,
        "user" : proxyUser,
        "pass" : proxyPass,
    }

    # Route both http and https traffic through the HTTP proxy
    proxies = {
        "http"  : proxyMeta,
        "https" : proxyMeta,
    }


    # Set the IP-switch header
    tunnel = random.randint(1,10000)
    headers = {"Proxy-Tunnel": str(tunnel)}



    resp = requests.get(targetUrl, proxies=proxies, headers=headers)

    print(resp.status_code)
    print(resp.text)

requests + HTTPAdapter (the same Proxy-Tunnel keeps the same IP across HTTPS requests)

    #! -*- encoding:utf-8 -*-
    import requests
    import random
    import requests.adapters

    # Target pages to access
    targetUrlList = [
        "https://httpbin.org/ip",
        "https://httpbin.org/headers",
        "https://httpbin.org/user-agent",
    ]

    # Proxy server (product site: www.16yun.cn)
    proxyHost = "t.16yun.cn"
    proxyPort = "31111"

    # Proxy credentials
    proxyUser = "username"
    proxyPass = "password"

    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }

    # Route both http and https traffic through the HTTP proxy
    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }

    # Set the IP-switch header
    tunnel = random.randint(1, 10000)
    headers = {"Proxy-Tunnel": str(tunnel)}


    class HTTPAdapter(requests.adapters.HTTPAdapter):
        def proxy_headers(self, proxy):
            headers = super(HTTPAdapter, self).proxy_headers(proxy)
            if hasattr(self, 'tunnel'):
                headers['Proxy-Tunnel'] = self.tunnel
            return headers


    # Visit the site three times; the same tunnel value keeps the same outbound IP each time
    for i in range(3):
        s = requests.session()

        a = HTTPAdapter()

        # Set the IP-switch header
        a.tunnel = tunnel
        s.mount('https://', a)

        for url in targetUrlList:
            r = s.get(url, proxies=proxies)
            print(r.text)

requests + Session (keep-alive keeps the same IP across requests)

    #! -*- encoding:utf-8 -*-
    import requests
    import random
    import requests.adapters

    # Target pages to access
    targetUrlList = [
        "https://httpbin.org/ip",
        "https://httpbin.org/headers",
        "https://httpbin.org/user-agent",
    ]

    # Proxy server (product site: www.16yun.cn)
    proxyHost = "t.16yun.cn"
    proxyPort = "31111"

    # Proxy credentials
    proxyUser = "username"
    proxyPass = "password"

    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }

    # Route both http and https traffic through the HTTP proxy
    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }

    # Visit the site three times; the same Session (keep-alive) keeps the same outbound IP each time
    s = requests.session()

    # Set cookies
    # cookie_dict = {"JSESSION":"123456789"}
    # cookies = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
    # s.cookies = cookies

    for i in range(3):
        for url in targetUrlList:
            r = s.get(url, proxies=proxies)
            print(r.text)

urllib (Python 3) / urllib2 (Python 2)
    #! -*- encoding:utf-8 -*-

    from urllib import request

    # Target page to access
    targetUrl = "http://httpbin.org/ip"

    # Proxy server (product site: www.16yun.cn)
    proxyHost = "t.16yun.cn"
    proxyPort = "31111"

    # Proxy credentials
    proxyUser = "username"
    proxyPass = "password"


    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host" : proxyHost,
        "port" : proxyPort,
        "user" : proxyUser,
        "pass" : proxyPass,
    }

    proxy_handler = request.ProxyHandler({
        "http"  : proxyMeta,
        "https" : proxyMeta,
    })        

    opener = request.build_opener(proxy_handler)

    request.install_opener(opener)
    resp = request.urlopen(targetUrl).read()

    print(resp)

    #! -*- encoding:utf-8 -*-
    import urllib2
    import random
    import httplib


    class HTTPSConnection(httplib.HTTPSConnection):

        def set_tunnel(self, host, port=None, headers=None):
            httplib.HTTPSConnection.set_tunnel(self, host, port, headers)
            if hasattr(self, 'proxy_tunnel'):
                self._tunnel_headers['Proxy-Tunnel'] = self.proxy_tunnel


    class HTTPSHandler(urllib2.HTTPSHandler):
        def https_open(self, req):
            return urllib2.HTTPSHandler.do_open(self, HTTPSConnection, req, context=self._context)


    # Target pages to access
    targetUrlList = [
        "https://httpbin.org/ip",
        "https://httpbin.org/headers",
        "https://httpbin.org/user-agent",
    ]

    # Proxy server (product site: www.16yun.cn)
    proxyHost = "t.16yun.cn"
    proxyPort = "31111"

    # Proxy credentials
    proxyUser = "username"
    proxyPass = "password"

    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }

    # Route both http and https traffic through the HTTP proxy
    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }

    # Set the IP-switch header
    tunnel = random.randint(1, 10000)
    headers = {"Proxy-Tunnel": str(tunnel)}
    HTTPSConnection.proxy_tunnel = tunnel


    proxy = urllib2.ProxyHandler(proxies)
    opener = urllib2.build_opener(proxy, HTTPSHandler)
    urllib2.install_opener(opener)

    # Visit the site three times; the same tunnel value keeps the same outbound IP each time
    for i in range(3):
        for url in targetUrlList:
            r = urllib2.Request(url)
            print(urllib2.urlopen(r).read())

urllib2 cannot use Keep-Alive

urllib2 closes HTTP/1.1 connections by default; set the same Proxy-Tunnel to keep the same outbound IP.

scrapy

Create a middlewares.py file in the project (./项目名/middlewares.py, where 项目名 is your project name)

        #! -*- encoding:utf-8 -*-
        import base64            
        import sys
        import random

        PY3 = sys.version_info[0] >= 3

        def base64ify(bytes_or_str):
            if PY3 and isinstance(bytes_or_str, str):
                input_bytes = bytes_or_str.encode('utf8')
            else:
                input_bytes = bytes_or_str

            output_bytes = base64.urlsafe_b64encode(input_bytes)
            if PY3:
                return output_bytes.decode('ascii')
            else:
                return output_bytes

        class ProxyMiddleware(object):                
            def process_request(self, request, spider):
                # Proxy server (product site: www.16yun.cn)
                proxyHost = "t.16yun.cn"
                proxyPort = "31111"

                # Proxy credentials
                proxyUser = "username"
                proxyPass = "password"

                request.meta['proxy'] = "http://{0}:{1}".format(proxyHost,proxyPort)

                # Add the auth header
                encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
                request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass                    

                # Set the IP-switch header (as needed)
                tunnel = random.randint(1,10000)
                request.headers['Proxy-Tunnel'] = str(tunnel)

Modify the project configuration file (./项目名/settings.py)

    DOWNLOADER_MIDDLEWARES = {
        '项目名.middlewares.ProxyMiddleware': 100,
    }

Setting a random User-Agent

See the following example.

Create a middlewares.py file in the project (./项目名/middlewares.py)

        #! -*- encoding:utf-8 -*-
        import base64            
        import sys
        import random

        PY3 = sys.version_info[0] >= 3

        def base64ify(bytes_or_str):
            if PY3 and isinstance(bytes_or_str, str):
                input_bytes = bytes_or_str.encode('utf8')
            else:
                input_bytes = bytes_or_str

            output_bytes = base64.urlsafe_b64encode(input_bytes)
            if PY3:
                return output_bytes.decode('ascii')
            else:
                return output_bytes

        class ProxyMiddleware(object):                
            def process_request(self, request, spider):
                # Proxy server (product site: www.16yun.cn)
                proxyHost = "t.16yun.cn"
                proxyPort = "31111"

                # Proxy credentials
                proxyUser = "username"
                proxyPass = "password"

                request.meta['proxy'] = "http://{0}:{1}".format(proxyHost,proxyPort)

                # Add the auth header
                encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
                request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass                    

                # Set the IP-switch header (as needed)
                tunnel = random.randint(1,10000)
                request.headers['Proxy-Tunnel'] = str(tunnel)

        class SplashProxyMiddleware(object):                
            def process_request(self, request, spider):
                # Proxy server (product site: www.16yun.cn)
                proxyHost = "t.16yun.cn"
                proxyPort = "31111"

                # Proxy credentials
                proxyUser = "username"
                proxyPass = "password"

                request.meta['splash']['args']['proxy'] = "http://{}:{}@{}:{}".format(proxyUser,proxyPass,proxyHost,proxyPort)

To set the proxy for Splash, modify the project configuration file (./项目名/settings.py)

    DOWNLOADER_MIDDLEWARES = {
        '项目名.middlewares.ProxyMiddleware': 100,
        '项目名.middlewares.SplashProxyMiddleware': 500,
    }

Create a middlewares.py file in the project (./项目名/middlewares.py)

        #! -*- encoding:utf-8 -*-    

        import websockets
        from scrapy.http import HtmlResponse
        from logging import getLogger
        import asyncio
        import pyppeteer
        import logging
        from concurrent.futures._base import TimeoutError
        import base64
        import sys
        import random

        pyppeteer_level = logging.WARNING
        logging.getLogger('websockets.protocol').setLevel(pyppeteer_level)
        logging.getLogger('pyppeteer').setLevel(pyppeteer_level)

        PY3 = sys.version_info[0] >= 3


        def base64ify(bytes_or_str):
            if PY3 and isinstance(bytes_or_str, str):
                input_bytes = bytes_or_str.encode('utf8')
            else:
                input_bytes = bytes_or_str

            output_bytes = base64.urlsafe_b64encode(input_bytes)
            if PY3:
                return output_bytes.decode('ascii')
            else:
                return output_bytes


        class ProxyMiddleware(object):
            USER_AGENT = open('useragents.txt').readlines()

            def process_request(self, request, spider):
                # Proxy server
                proxyHost = "t.16yun.cn"
                proxyPort = "31111"

                # Proxy credentials
                proxyUser = "username"
                proxyPass = "password"

                request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)

                # Add the auth header
                encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
                request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

                # Set the IP-switch header (as needed)
                tunnel = random.randint(1, 10000)
                request.headers['Proxy-Tunnel'] = str(tunnel)
                request.headers['User-Agent'] = random.choice(self.USER_AGENT)


        class PyppeteerMiddleware(object):
            def __init__(self, **args):
                """
                init logger, loop, browser
                :param args:
                """
                self.logger = getLogger(__name__)
                self.loop = asyncio.get_event_loop()
                self.browser = self.loop.run_until_complete(
                    pyppeteer.launch(headless=True))
                self.args = args

            def __del__(self):
                """
                close loop
                :return:
                """
                self.loop.close()

            def render(self, url, retries=1, script=None, wait=0.3, scrolldown=False, sleep=0,
                       timeout=8.0, keep_page=False):
                """
                render page with pyppeteer
                :param url: page url
                :param retries: max retry times
                :param script: js script to evaluate
                :param wait: number of seconds to wait before loading the page, preventing timeouts
                :param scrolldown: how many times to page down
                :param sleep: how many long to sleep after initial render
                :param timeout: the longest wait time, otherwise raise timeout error
                :param keep_page: keep page not to be closed, browser object needed
                :param browser: pyppetter browser object
                :param with_result: return with js evaluation result
                :return: content, [result]
                """

                # define async render
                async def async_render(url, script, scrolldown, sleep, wait, timeout, keep_page):
                    try:
                        # basic render
                        page = await self.browser.newPage()
                        await asyncio.sleep(wait)
                        response = await page.goto(url, options={'timeout': int(timeout * 1000)})
                        if response.status != 200:
                            return None, None, response.status
                        result = None
                        # evaluate with script
                        if script:
                            result = await page.evaluate(script)

                        # scroll down for {scrolldown} times
                        if scrolldown:
                            for _ in range(scrolldown):
                                await page._keyboard.down('PageDown')
                                await asyncio.sleep(sleep)
                        else:
                            await asyncio.sleep(sleep)
                        if scrolldown:
                            await page._keyboard.up('PageDown')

                        # get html of page
                        content = await page.content()

                        return content, result, response.status
                    except TimeoutError:
                        return None, None, 500
                    finally:
                        # if keep page, do not close it
                        if not keep_page:
                            await page.close()

                content, result, status = [None] * 3

                # retry for {retries} times
                for i in range(retries):
                    if not content:
                        content, result, status = self.loop.run_until_complete(
                            async_render(url=url, script=script, sleep=sleep, wait=wait,
                                         scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
                    else:
                        break

                # if need to return js evaluation result
                return content, result, status

            def process_request(self, request, spider):
                """
                :param request: request object
                :param spider: spider object
                :return: HtmlResponse
                """
                if request.meta.get('render'):
                    try:
                        self.logger.debug('rendering %s', request.url)
                        html, result, status = self.render(request.url)
                        return HtmlResponse(url=request.url, body=html, request=request, encoding='utf-8',
                                            status=status)
                    except websockets.exceptions.ConnectionClosed:
                        pass

            @classmethod
            def from_crawler(cls, crawler):
                return cls(**crawler.settings.get('PYPPETEER_ARGS', {}))

Modify the project configuration file (./项目名/settings.py) to register the middlewares.

Download the demo

Git repository

You can also use the proxy by setting environment variables

Windows

    C:\>set http_proxy=http://username:password@ip:port
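
requests, for example, honors these environment variables by default (trust_env is on), so no explicit proxies dict is needed; a minimal sketch:

    import os
    import requests

    # Same effect as the "set http_proxy=..." command above
    os.environ["http_proxy"] = "http://username:password@t.16yun.cn:31111"
    os.environ["https_proxy"] = "http://username:password@t.16yun.cn:31111"

    resp = requests.get("http://httpbin.org/ip")  # proxied via the env settings
    print(resp.text)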
aiohttp
    #! -*- encoding:utf-8 -*-

    import aiohttp, asyncio


    targetUrl = "http://httpbin.org/ip"

    # Proxy server (product site: www.16yun.cn)
    proxyHost = "t.16yun.cn"
    proxyPort = "31111"

    # Proxy credentials
    proxyUser = "username"
    proxyPass = "password"

    proxyServer = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host" : proxyHost,
        "port" : proxyPort,
        "user" : proxyUser,
        "pass" : proxyPass,
    }

    userAgent = "Chrome/83.0.4103.61"

    async def entry():
        conn = aiohttp.TCPConnector(verify_ssl=False)

        async with aiohttp.ClientSession(headers={"User-Agent": userAgent}, connector=conn) as session:
            async with session.get(targetUrl, proxy=proxyServer) as resp:
                body = await resp.read()

                print(resp.status)
                print(body)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(entry())

2. C#

// Target page to access
string targetUrl = "http://httpbin.org/ip";


// Proxy server (product site: www.16yun.cn)
string proxyHost = "http://t.16yun.cn";
string proxyPort = "31111";

// Proxy credentials
string proxyUser = "username";
string proxyPass = "password";

// Configure the proxy server
WebProxy proxy = new WebProxy(string.Format("{0}:{1}", proxyHost, proxyPort), true);


ServicePointManager.Expect100Continue = false;

var request = WebRequest.Create(targetUrl) as HttpWebRequest;

request.AllowAutoRedirect = true;
request.KeepAlive = true;
request.Method    = "GET";
request.Proxy     = proxy;

//request.Proxy.Credentials = CredentialCache.DefaultCredentials;

request.Proxy.Credentials = new System.Net.NetworkCredential(proxyUser, proxyPass);

// Set the Proxy-Tunnel header
// Random ran = new Random();
// int tunnel = ran.Next(1, 10000);
// request.Headers.Add("Proxy-Tunnel", tunnel.ToString());


//request.Timeout = 20000;
//request.ServicePoint.ConnectionLimit = 512;
//request.UserAgent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36";
//request.Headers.Add("Cache-Control", "max-age=0");
//request.Headers.Add("DNT", "1");


//String encoded = System.Convert.ToBase64String(System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes(proxyUser + ":" + proxyPass));
//request.Headers.Add("Proxy-Authorization", "Basic " + encoded);

using (var response = request.GetResponse() as HttpWebResponse)
using (var sr = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
{
    string htmlStr = sr.ReadToEnd();
}
If a Section=ResponseStatusLine exception occurs

Fix it through the configuration file: edit app.config (WinForms) or web.config (Web). A WinForms project has no app.config by default; manually create an XML configuration file in the directory at the same level as the Debug folder, with the following content:

    <?xml version="1.0" encoding="utf-8" ?>
    <configuration>
    <system.net>
            <settings>
            <httpWebRequest useUnsafeHeaderParsing="true" />
            </settings>
        </system.net>
    </configuration>
After building, a configuration file named 程序名.exe.config (程序名 = your program name) is created automatically under Debug.

3. PHP

<?php
    // Target pages to access
    $url = "http://httpbin.org/ip";
    $urls = "https://httpbin.org/ip";

    // Proxy server (product site: www.16yun.cn)
    define("PROXY_SERVER", "tcp://t.16yun.cn:31111");

    // Proxy credentials
    define("PROXY_USER", "username");
    define("PROXY_PASS", "password");

    $proxyAuth = base64_encode(PROXY_USER . ":" . PROXY_PASS);

    // Set the Proxy-Tunnel
    $tunnel = rand(1,10000);

    $headers = implode("\r\n", [
        "Proxy-Authorization: Basic {$proxyAuth}",
        "Proxy-Tunnel: ${tunnel}",
    ]);
    $sniServer = parse_url($urls, PHP_URL_HOST);
    $options = [
        "http" => [
            "proxy"  => PROXY_SERVER,
            "header" => $headers,
            "method" => "GET",
            'request_fulluri' => true,
        ],
    'ssl' => array(
            'SNI_enabled' => true, // enable SNI for HTTPS over an HTTP proxy
            'SNI_server_name' => $sniServer
    )
    ];
    print($url);
    $context = stream_context_create($options);
    $result = file_get_contents($url, false, $context);
    var_dump($result);

    // Access the HTTPS page
    print($urls);
    $context = stream_context_create($options);
    $result = file_get_contents($urls, false, $context);
    var_dump($result);
?>
<?php

    function curlFile($url,$proxy_ip,$proxy_port,$loginpassw)
    {
        //$loginpassw = 'username:password';
        //$proxy_ip = 't.16yun.cn';
        //$proxy_port = '31111';
        //$url = 'https://httpbin.org/ip';

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_PROXYPORT, $proxy_port);
        curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
        curl_setopt($ch, CURLOPT_PROXY, $proxy_ip);
        curl_setopt($ch, CURLOPT_PROXYUSERPWD, $loginpassw);

        // curl used to include a list of accepted CAs, but no longer bundles ANY CA certs. So by default it'll reject all SSL certificates as unverifiable.
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);

        $data = curl_exec($ch);
        curl_close($ch);

        return $data;
    }

    $data = curlFile('https://httpbin.org/ip','t.16yun.cn',31111,'username:password');
    print($data);

?>
<?php    
    namespace App\Console\Commands;    
    use Illuminate\Console\Command;

    class Test16Proxy extends Command
    {
        /**
         * The name and signature of the console command.
         *
         * @var string
         */
        protected $signature = 'test:16proxy';

        /**
         * The console command description.
         *
         * @var string
         */
        protected $description = 'Command description';

        /**
         * Create a new command instance.
         *
         * @return void
         */
        public function __construct()
        {
            parent::__construct();
        }

        /**
         * Execute the console command.
         *
         * @return mixed
         */
        public function handle()
        {
            $client = new \GuzzleHttp\Client();
            // Target page to access
            $targetUrl = "http://httpbin.org/ip";

            // Proxy server (product site: www.16yun.cn)
            define("PROXY_SERVER", "t.16yun.cn:31111");

            // Proxy credentials
            define("PROXY_USER", "username");
            define("PROXY_PASS", "password");

            $proxyAuth = base64_encode(PROXY_USER . ":" . PROXY_PASS);

            $options = [
                "proxy"  => PROXY_SERVER,
                "headers" => [
                    "Proxy-Authorization" => "Basic " . $proxyAuth
                ]
            ];
            //print_r($options);
            $result = $client->request('GET', $targetUrl, $options);
            var_dump($result->getBody()->getContents());
        }
    }
?>

4. Java

import org.apache.commons.httpclient.Credentials;
import org.apache.commons.httpclient.HostConfiguration;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;

import java.io.IOException;

public class Main {
    // Proxy server (product site: www.16yun.cn)
    private static final String PROXY_HOST = "t.16yun.cn";
    private static final int PROXY_PORT = 31111;

    public static void main(String[] args) {
        HttpClient client = new HttpClient();
        HttpMethod method = new GetMethod("https://httpbin.org/ip");

        HostConfiguration config = client.getHostConfiguration();
        config.setProxy(PROXY_HOST, PROXY_PORT);

        client.getParams().setAuthenticationPreemptive(true);

        String username = "16ABCCKJ";
        String password = "712323";
        Credentials credentials = new UsernamePasswordCredentials(username, password);
        AuthScope authScope = new AuthScope(PROXY_HOST, PROXY_PORT);

        client.getState().setProxyCredentials(authScope, credentials);

        try {
            client.executeMethod(method);

            if (method.getStatusCode() == HttpStatus.SC_OK) {
                String response = method.getResponseBodyAsString();
                System.out.println("Response = " + response);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            method.releaseConnection();
        }
    }
}
// Thanks to "情歌" for providing this code

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.net.URI;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.AuthCache;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.config.AuthSchemes;
import org.apache.http.client.entity.GzipDecompressingEntity;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.LayeredConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.client.ProxyAuthenticationStrategy;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.message.BasicHeader;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.NameValuePair;
import org.apache.http.util.EntityUtils;

public class Demo
{
    // Proxy server (product site: www.16yun.cn)
    final static String proxyHost = "t.16yun.cn";
    final static Integer proxyPort = 31000;

    // Proxy credentials
    final static String proxyUser = "username";
    final static String proxyPass = "password";




    private static PoolingHttpClientConnectionManager cm = null;
    private static HttpRequestRetryHandler httpRequestRetryHandler = null;
    private static HttpHost proxy = null;

    private static CredentialsProvider credsProvider = null;
    private static RequestConfig reqConfig = null;

    static {
        ConnectionSocketFactory plainsf = PlainConnectionSocketFactory.getSocketFactory();
        LayeredConnectionSocketFactory sslsf = SSLConnectionSocketFactory.getSocketFactory();

        Registry registry = RegistryBuilder.create()
            .register("http", plainsf)
            .register("https", sslsf)
            .build();

        cm = new PoolingHttpClientConnectionManager(registry);
        cm.setMaxTotal(20);
        cm.setDefaultMaxPerRoute(5);

        proxy = new HttpHost(proxyHost, proxyPort, "http");

        credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(proxyUser, proxyPass));

        reqConfig = RequestConfig.custom()
            .setConnectionRequestTimeout(5000)
            .setConnectTimeout(5000)
            .setSocketTimeout(5000)
            .setExpectContinueEnabled(false)
            .setProxy(new HttpHost(proxyHost, proxyPort))
            .build();
    }

    public static void doRequest(HttpRequestBase httpReq) {
        CloseableHttpResponse httpResp = null;

        try {
            setHeaders(httpReq);

            httpReq.setConfig(reqConfig);

            CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionManager(cm)
                .setDefaultCredentialsProvider(credsProvider)
                .build();

            AuthCache authCache = new BasicAuthCache();
            authCache.put(proxy, new BasicScheme());

            HttpClientContext localContext = HttpClientContext.create();
            localContext.setAuthCache(authCache);

            httpResp = httpClient.execute(httpReq, localContext);

            int statusCode = httpResp.getStatusLine().getStatusCode();

            System.out.println(statusCode);

            BufferedReader rd = new BufferedReader(new InputStreamReader(httpResp.getEntity().getContent()));

            String line = "";
            while((line = rd.readLine()) != null) {
                System.out.println(line);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (httpResp != null) {
                    httpResp.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Set request headers
     *
     * @param httpReq
     */
    private static void setHeaders(HttpRequestBase httpReq) {

        // Set Proxy-Tunnel
        // Random random = new Random();
        // int tunnel = random.nextInt(10000);
        // httpReq.setHeader("Proxy-Tunnel", String.valueOf(tunnel));

        httpReq.setHeader("Accept-Encoding", null);

    }


    public static void doGetRequest() {
        // Target page to access
        String targetUrl = "https://httpbin.org/ip";


        try {
            HttpGet httpGet = new HttpGet(targetUrl);

            doRequest(httpGet);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        doGetRequest();


    }
}
import java.io.IOException;
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.util.Random;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;


public class Demo
{
    // Proxy credentials
    final static String ProxyUser = "username";
    final static String ProxyPass = "password";

    // Proxy server (product site: www.16yun.cn)
    final static String ProxyHost = "t.16yun.cn";
    final static Integer ProxyPort = 31111;

    // IP-switch header name
    final static String ProxyHeadKey = "Proxy-Tunnel";


    public static String getUrlProxyContent(String url)
    {
        Authenticator.setDefault(new Authenticator() {
            public PasswordAuthentication getPasswordAuthentication()
            {
                return new PasswordAuthentication(ProxyUser, ProxyPass.toCharArray());
            }
        });
        // Set Proxy-Tunnel
        Random random = new Random();
        int tunnel = random.nextInt(10000);
        String ProxyHeadVal = String.valueOf(tunnel);

        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(ProxyHost, ProxyPort));

        try
        {
            // Handle exceptions and other parameters as needed
            Document doc = Jsoup.connect(url).timeout(3000).header(ProxyHeadKey, ProxyHeadVal).proxy(proxy).get();

            if(doc != null) {
                System.out.println(doc.body().html());
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }

        return null;
    }

    public static void main(String[] args) throws Exception
    {
        // Target page to access
        String targetUrl = "http://httpbin.org/ip";


        getUrlProxyContent(targetUrl);
    }
}

JSoup cannot use Keep-Alive

JSoup closes connections by default. For HTTP sites, set the same Proxy-Tunnel to keep the same outbound IP; for HTTPS sites, use a different library to keep the same outbound IP.

import java.io.IOException;
import java.util.Random;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Demo {

    public static void main(String[] args) {

        try {

            // Target page to access
            String url = "http://httpbin.org/ip";

            // Proxy server (product site: www.16yun.cn)
            String ProxyHost = "t.16yun.cn";
            String ProxyPort = "31111";

            System.setProperty("http.proxyHost", ProxyHost);
            System.setProperty("https.proxyHost", ProxyHost);

            System.setProperty("http.proxyPort", ProxyPort);
            System.setProperty("https.proxyPort", ProxyPort);

            // Proxy credentials
            String ProxyUser = "username";
            String ProxyPass = "password";

            System.setProperty("http.proxyUser", ProxyUser);
            System.setProperty("http.proxyPassword", ProxyPass);

            System.setProperty("https.proxyUser", ProxyUser);
            System.setProperty("https.proxyPassword", ProxyPass);

            // IP-switch header name
            String ProxyHeadKey = "Proxy-Tunnel";

            // Set Proxy-Tunnel
            Random random = new Random();
            int tunnel = random.nextInt(10000);
            String ProxyHeadVal = String.valueOf(tunnel);

            // Handle exceptions and other parameters as needed
            Document doc = Jsoup.connect(url).timeout(3000).header(ProxyHeadKey, ProxyHeadVal).get();

            if (doc != null) {
                System.out.println(doc.body().html());
            }

        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

JSoup cannot use Keep-Alive

JSoup closes connections by default. For HTTP sites, set the same Proxy-Tunnel to keep the same outbound IP; for HTTPS sites, use a different library to keep the same outbound IP.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.URL;
import java.util.Random;

class ProxyAuthenticator extends Authenticator {
    private String user, password;

    public ProxyAuthenticator(String user, String password) {
        this.user     = user;
        this.password = password;
    }

    protected PasswordAuthentication getPasswordAuthentication() {
        return new PasswordAuthentication(user, password.toCharArray());
    }
}

/**
 * Note: the code below performs a plain HTTP request. Each request is stateless:
 * this request goes out through a fresh IP, and the next request's IP may change.
 * For multi-threaded use, embed this code in your own business logic and every
 * call will access through a new IP; if duplicate IPs are a concern, track IP
 * usage yourself and validate.
 */
public class Demo {
    public static void main(String args[]) throws Exception {
        // Target page to access
        String targetUrl = "http://httpbin.org/ip";


        // Proxy server (product site: www.16yun.cn)
        String proxyServer = "t.16yun.cn";
        int proxyPort      = 31111;

        // Proxy credentials
        String proxyUser  = "username";
        String proxyPass  = "password";

        try {
            URL url = new URL(targetUrl);

            Authenticator.setDefault(new ProxyAuthenticator(proxyUser, proxyPass));

            // Create the proxy server address object
            InetSocketAddress addr = new InetSocketAddress(proxyServer, proxyPort);
            // Create an HTTP proxy object
            Proxy proxy = new Proxy(Proxy.Type.HTTP, addr);

            // Access the target page through the proxy
            HttpURLConnection connection = (HttpURLConnection) url.openConnection(proxy);
            // Set Proxy-Tunnel
            // Random random = new Random();
            // int tunnel = random.nextInt(10000);
            // connection.setRequestProperty("Proxy-Tunnel",String.valueOf(tunnel));

            // Parse the response
            byte[] response = readStream(connection.getInputStream());

            System.out.println(new String(response));
        } catch (Exception e) {
            System.out.println(e.getLocalizedMessage());
        }
    }

    /**
     * Read an input stream fully into a byte array
     *
     * @param inStream
     * @return
     * @throws Exception
     */
    public static byte[] readStream(InputStream inStream) throws Exception {
        ByteArrayOutputStream outSteam = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len = -1;

        while ((len = inStream.read(buffer)) != -1) {
            outSteam.write(buffer, 0, len);
        }
        outSteam.close();
        inStream.close();

        return outSteam.toByteArray();
    }
}

Accessing HTTP/2 sites through the proxy

Make sure your JDK can access HTTP/2 sites; full support requires Java 9 or later (see the sketch below).
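
For comparison, a minimal Python sketch of HTTP/2 through the proxy, assuming a recent version of the third-party httpx package with the http2 extra installed (pip install "httpx[http2]"); the proxy= keyword is an assumption tied to newer httpx releases:

    import httpx

    proxy_url = "http://username:password@t.16yun.cn:31111"

    # http2=True negotiates HTTP/2 with the target site when possible;
    # the CONNECT tunnel to the proxy itself stays HTTP/1.1.
    with httpx.Client(http2=True, proxy=proxy_url) as client:
        resp = client.get("https://httpbin.org/ip")
        print(resp.http_version, resp.text)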

package htmlunit;

import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.impl.client.BasicCredentialsProvider;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlunitDemo {
    // Proxy server (product site: www.16yun.cn)
    final static String proxyHost = "t.16yun.cn";
    final static Integer proxyPort = 31111;

    // Proxy credentials
    final static String proxyUser = "USERNAME";
    final static String proxyPass = "PASSWORD";

    public static void main(String[] args) {

        CredentialsProvider credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(

        new AuthScope(proxyHost, proxyPort),
        new UsernamePasswordCredentials(proxyUser, proxyPass));


        WebClient webClient = new WebClient(BrowserVersion.CHROME,proxyHost, proxyPort);


        webClient.setCredentialsProvider(credsProvider);


        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setActiveXNative(false);
        webClient.getOptions().setCssEnabled(false);

        HtmlPage page = null;

        try {
            page = webClient.getPage("http://httpbin.org/ip");
            webClient.waitForBackgroundJavaScript(30000);

            String pageXml = page.asXml();
            System.out.println(pageXml);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            webClient.close();
        }
    }
}
    import okhttp3.*;

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Proxy;
    import java.util.concurrent.TimeUnit;

    public class OkHttp {

        // Proxy server (product site: www.16yun.cn)
        final static String proxyHost = "t.16yun.cn";
        final static Integer proxyPort = 31111;

        // Proxy credentials
        final static String proxyUser = "USERNAME";
        final static String proxyPass = "PASSWORD";

        static OkHttpClient client = null;

        static {
            Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));

            Authenticator proxyAuthenticator = new Authenticator() {
                public Request authenticate(Route route, Response response) {
                    String credential = Credentials.basic(proxyUser, proxyPass);
                    return response.request().newBuilder()
                            .header("Proxy-Authorization", credential)
                            .build();
                }
            };

            client = new OkHttpClient().newBuilder()
                    .connectTimeout(5, TimeUnit.SECONDS)
                    .readTimeout(5, TimeUnit.SECONDS)
                    .proxy(proxy)
                    .proxyAuthenticator(proxyAuthenticator)
                    .connectionPool(new ConnectionPool(5, 1, TimeUnit.SECONDS))
                    .build();
        }

        public static Response doGet() throws IOException {
            // Target page to access
            String targetUrl = "http://httpbin.org/ip";

            Request request = new Request.Builder()
                    .url(targetUrl)
                    .build();
            Response response = client.newCall(request).execute();
            return response;
        }

        public static void main(String[] args) throws IOException {
            Response response1 = doGet();
            System.out.println("GET请求返回结果:");
            System.out.println(response1.body().string());
        }

    }

5. Golang

        package main

        import (
            "net/url"
            "net/http"
            "bytes"
            "fmt"
            "io/ioutil"
        )

        // Proxy server (product site: www.16yun.cn)
        const ProxyServer = "t.16yun.cn:31111"

        type ProxyAuth struct {
            Username string
            Password string
        }

        func (p ProxyAuth) ProxyClient() http.Client {

            var proxyURL *url.URL
            if p.Username != "" && p.Password != "" {
                proxyURL, _ = url.Parse("http://" + p.Username + ":" + p.Password + "@" + ProxyServer)
            } else {
                proxyURL, _ = url.Parse("http://" + ProxyServer)
            }
            return http.Client{Transport: &http.Transport{Proxy:http.ProxyURL(proxyURL)}}
        }

        func main()  {


            targetURI := "https://httpbin.org/ip"


            // Initialize the proxy http client
            client := ProxyAuth{"username",  "password"}.ProxyClient()

            request, _ := http.NewRequest("GET", targetURI, bytes.NewBuffer([] byte(``)))

            // Set Proxy-Tunnel
            // rand.Seed(time.Now().UnixNano())
            // tunnel := rand.Intn(10000)
            // request.Header.Set("Proxy-Tunnel", strconv.Itoa(tunnel) )

            response, err := client.Do(request)

            if err != nil {
                panic("failed to connect: " + err.Error())
            } else {
                bodyByte, err := ioutil.ReadAll(response.Body)
                if err != nil {
                    fmt.Println("读取 Body 时出错", err)
                    return
                }
                response.Body.Close()

                body := string(bodyByte)

                fmt.Println("Response Status:", response.Status)
                fmt.Println("Response Header:", response.Header)
                fmt.Println("Response Body:\n", body)
            }
        }

Source code

6. PhantomJS/CasperJS

Pass the proxy information as command-line arguments, for example:

phantomjs --proxy-auth=USERNAME:PASSWORD --proxy=http://t.16yun.cn:31111 --ignore-ssl-errors=true http-demo.js

The content of http-demo.js is as follows:

    var page = require('webpage').create();
    page.settings.userAgent = 'Mozilla/5.0 UCBrowser/9.4.1.362 U3/0.8.0 Mobile Safari/533.1';

    console.log('The user agent is ' + page.settings.userAgent);

    // Generate a random proxy tunnel value
    var seed = 1;
    function random() {
        var x = Math.sin(seed++) * 10000;
        return x - Math.floor(x);
    }
    const tunnel = Math.floor(random() * 10000);

    //page.customHeaders = {
    //  "proxy-tunnel": tunnel,
    //};

    page.onResourceReceived = function(j) {
      for (var i = 0; i < j.headers.length; ++i) {
        console.log(j.headers[i].name + ': ' + j.headers[i].value);
      }
    };

    page.open("http://httpbin.org/ip", {}, function(status) {
      console.log('status> ' + status);
      console.log(page.content);
      setTimeout(function() {
        phantom.exit();
      }, 3000);
    });

Pass the proxy information as command-line arguments, for example:

casperjs --proxy-auth=USERNAME:PASSWORD --proxy=http://t.16yun.cn:31111  --ignore-ssl-errors=true --ssl-protocol=any http-demo.js

The content of http-demo.js is as follows:

    var casper = require('casper').create();

    // Generate a random proxy tunnel value
    var seed = 1;
    function random() {
        var x = Math.sin(seed++) * 10000;
        return x - Math.floor(x);
    }
    const tunnel = Math.floor(random() * 10000);

    casper.on('started', function () {
        this.page.customHeaders = {
            "User-Agent" : "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection" : "keep-alive",
            "Proxy-Tunnel": tunnel
        }
    });
    casper.start("http://httpbin.org/headers");


    casper.then(function() {
        console.log('First Page: ' + this.page.content);
    });
    casper.run();

7. Node.js

const http = require("http");
const url = require("url");

// Target page to access
const targetUrl = "http://httpbin.org/ip";


const urlParsed = url.parse(targetUrl);

// Proxy server (product site: www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = "36600";

// Generate a random proxy tunnel value
var seed = 1;
function random() {
    var x = Math.sin(seed++) * 10000;
    return x - Math.floor(x);
}
const tunnel = Math.floor(random() * 10000);

// Proxy credentials
const proxyUser = "username";
const proxyPass = "password";

const base64    = Buffer.from(proxyUser + ":" + proxyPass).toString("base64");

const options = {
    host: proxyHost,
    port: proxyPort,
    path: targetUrl,
    method: "GET",
    headers: {
        "Host": urlParsed.hostname,
        "Proxy-Tunnel": tunnel,
        "Proxy-Authorization" : "Basic " + base64
    }
};

http.request(options, function (res) {
    console.log("got response: " + res.statusCode);
    res.pipe(process.stdout);
}).on("error", function (err) {
    console.log(err);
}).end();
    const https = require("https");
    const url = require("url");
    const httpsProxyAgent = require('https-proxy-agent');

    // Target page to access
    const targetUrl = "https://httpbin.org/ip";


    const urlParsed = url.parse(targetUrl);

    // Proxy server (product site: www.16yun.cn)
    const proxyHost = "t.16yun.cn";
    const proxyPort = "31111";


    // Proxy credentials
    const proxyUser = "username";
    const proxyPass = "password";

    var options = urlParsed;
    var agent = new httpsProxyAgent("http://" + proxyUser + ":" + proxyPass + "@" + proxyHost + ":" + proxyPort);
    options.agent = agent;

    https.request(options, function (res) {
        console.log("got response: " + res.statusCode);
        res.pipe(process.stdout);
    }).on("error", function (err) {
        console.log(err);
    }).end();
    const https = require("https");
    const url = require("url");
    const httpsProxyAgent = require('https-proxy-agent');

    // Target page to access
    const targetUrl = "https://httpbin.org/ip";


    const urlParsed = url.parse(targetUrl);

    // Proxy server (product site: www.16yun.cn)
    const proxyHost = "t.16yun.cn";
    const proxyPort = "31111";


    // Proxy credentials
    const proxyUser = "username";
    const proxyPass = "password";

    var options = urlParsed;
    const proxy_url = "http://" + proxyUser + ":" + proxyPass + "@" + proxyHost + ":" + proxyPort;
    var agent_options = url.parse(proxy_url);

    agent_options.headers = { "Proxy-Tunnel" : "1" }
    var agent = new httpsProxyAgent(agent_options);

    options.agent = agent;

    for(var i=0;i<10;i++){
        https.request(options, function (res) {
            console.log("got response: " + res.statusCode);
            res.pipe(process.stdout);
        }).on("error", function (err) {
            console.log(err);
        }).end();
    }
const request = require("request");

// Target page to access
const targetUrl = "http://httpbin.org/ip";

// Proxy server (product site: www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = "31111";


// Proxy credentials
const proxyUser = "username";
const proxyPass = "password";

const proxyUrl = "http://" + proxyUser + ":" + proxyPass + "@" + proxyHost + ":" + proxyPort;

const proxiedRequest = request.defaults({'proxy': proxyUrl});

const options = {
  url     : targetUrl,
  headers : {
          }
};

proxiedRequest
    .get(options, function (err, res, body) {
        console.log("got response: " + res.statusCode);
    })
    .on("error", function (err) {
        console.log(err);
    })
;              
const request = require("superagent");

require("superagent-proxy")(request);

// Target page to access
const targetUrl = "http://httpbin.org/ip";

// Proxy server (product site: www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = 31111;

// Proxy credentials
const proxyUser = "username";
const proxyPass = "password";

const proxyUrl = "http://" + proxyUser + ":" + proxyPass + "@" + proxyHost + ":" + proxyPort;

request
    .get(targetUrl)
    .proxy(proxyUrl)
    .end(function onResponse(err, res) {
        if (err) {
            return console.log(err);
        }

        console.log(res.status, res.headers);
        console.log(res.text);
    })
;              
const axios = require('axios');

// Target page to access
const targetUrl = "http://httpbin.org/ip";

// Proxy server (product site: www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = 31111;

// Proxy credentials
const proxyUser = "username";
const proxyPass = "password";

var proxy = {
    host: proxyHost,
    port: proxyPort,
    auth: {
        username: proxyUser,
        password: proxyPass
    }
};


axios.get(targetUrl,{proxy:proxy})
    .then(function (response) {
        // handle success
        console.log(response.data);
    })
    .catch(function (error) {
        // handle error
        console.log(error);
    })
    .finally(function () {
        // always executed
    });              

8. Selenium

Demo source code

    from selenium import webdriver
    import string
    import zipfile

    # Proxy server (product site: www.16yun.cn)
    proxyHost = "t.16yun.cn"
    proxyPort = "31111"

    # Proxy credentials
    proxyUser = "username"
    proxyPass = "password"

    def create_proxy_auth_extension(proxy_host, proxy_port,
                                   proxy_username, proxy_password,
                                   scheme='http', plugin_path=None):
        if plugin_path is None:
            plugin_path = r'D:/{}_{}@t.16yun.zip'.format(proxy_username, proxy_password)

        manifest_json = """
        {
            "version": "1.0.0",
            "manifest_version": 2,
            "name": "16YUN Proxy",
            "permissions": [
                "proxy",
                "tabs",
                "unlimitedStorage",
                "storage",
                "",
                "webRequest",
                "webRequestBlocking"
            ],
            "background": {
                "scripts": ["background.js"]
            },
            "minimum_chrome_version":"22.0.0"
        }
        """

        background_js = string.Template(
            """
            var config = {
                mode: "fixed_servers",
                rules: {
                    singleProxy: {
                        scheme: "${scheme}",
                        host: "${host}",
                        port: parseInt(${port})
                    },
                    bypassList: ["foobar.com"]
                }
              };

            chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

            function callbackFn(details) {
                return {
                    authCredentials: {
                        username: "${username}",
                        password: "${password}"
                    }
                };
            }

            chrome.webRequest.onAuthRequired.addListener(
                callbackFn,
                {urls: [""]},
                ['blocking']
            );
            """
        ).substitute(
            host=proxy_host,
            port=proxy_port,
            username=proxy_username,
            password=proxy_password,
            scheme=scheme,
        )

        with zipfile.ZipFile(plugin_path, 'w') as zp:
            zp.writestr("manifest.json", manifest_json)
            zp.writestr("background.js", background_js)

        return plugin_path

    proxy_auth_plugin_path = create_proxy_auth_extension(
        proxy_host=proxyHost,
        proxy_port=proxyPort,
        proxy_username=proxyUser,
        proxy_password=proxyPass)

    option = webdriver.ChromeOptions()

    option.add_argument("--start-maximized")

    # If a chrome-extensions error occurs
    # option.add_argument("--disable-extensions")

    option.add_extension(proxy_auth_plugin_path)

    # Disable some of webdriver's automation flags
    # option.add_experimental_option('excludeSwitches', ['enable-automation'])        

    driver = webdriver.Chrome(chrome_options=option)

    # Patch the navigator.webdriver property
    # script = '''
    # Object.defineProperty(navigator, 'webdriver', {
    # get: () => undefined
    # })
    # '''
    # driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})     



    driver.get("http://httpbin.org/ip")

Log in with Selenium to obtain cookies. Demo source code

    import os
    import random
    import time
    import zipfile

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait


    class GenCookies(object):
        # Random User-Agent list
        USER_AGENT = open('useragents.txt').readlines()


        # Proxy server (product site: www.16yun.cn)
        PROXY_HOST = 't.16yun.cn'  #  proxy or host
        PROXY_PORT = 31111  # port
        PROXY_USER = 'USERNAME'  # username
        PROXY_PASS = 'PASSWORD'  # password

        @classmethod
        def get_chromedriver(cls, use_proxy=False, user_agent=None):
            manifest_json = """
            {
                "version": "1.0.0",
                "manifest_version": 2,
                "name": "Chrome Proxy",
                "permissions": [
                    "proxy",
                    "tabs",
                    "unlimitedStorage",
                    "storage",
                    "<all_urls>",
                    "webRequest",
                    "webRequestBlocking"
                ],
                "background": {
                    "scripts": ["background.js"]
                },
                "minimum_chrome_version":"22.0.0"
            }
            """

            background_js = """
            var config = {
                    mode: "fixed_servers",
                    rules: {
                    singleProxy: {
                        scheme: "http",
                        host: "%s",
                        port: parseInt(%s)
                    },
                    bypassList: ["localhost"]
                    }
                };

            chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

            function callbackFn(details) {
                return {
                    authCredentials: {
                        username: "%s",
                        password: "%s"
                    }
                };
            }

            chrome.webRequest.onAuthRequired.addListener(
                        callbackFn,
                        {urls: ["<all_urls>"]},
                        ['blocking']
            );
            """ % (cls.PROXY_HOST, cls.PROXY_PORT, cls.PROXY_USER, cls.PROXY_PASS)
            path = os.path.dirname(os.path.abspath(__file__))
            chrome_options = webdriver.ChromeOptions()

            # Disable some of webdriver's automation flags
            # chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])        


            if use_proxy:
                pluginfile = 'proxy_auth_plugin.zip'

                with zipfile.ZipFile(pluginfile, 'w') as zp:
                    zp.writestr("manifest.json", manifest_json)
                    zp.writestr("background.js", background_js)
                chrome_options.add_extension(pluginfile)
            if user_agent:
                chrome_options.add_argument('--user-agent=%s' % user_agent)
            driver = webdriver.Chrome(
                os.path.join(path, 'chromedriver'),
                chrome_options=chrome_options)

            # Patch the navigator.webdriver getter so it returns undefined
            # script = '''
            # Object.defineProperty(navigator, 'webdriver', {
            # get: () => undefined
            # })
            # '''
            # driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})

            return driver

        def __init__(self, username, password):
            # Log in to the example site
            self.url = 'https://passport.example.cn/signin/login?entry=example&r=https://m.example.cn/'
            self.browser = self.get_chromedriver(use_proxy=True, user_agent=random.choice(self.USER_AGENTS))
            self.wait = WebDriverWait(self.browser, 20)
            self.username = username
            self.password = password

        def open(self):
            """
            打开网页输入用户名密码并点击
            :return: None
            """
            self.browser.delete_all_cookies()
            self.browser.get(self.url)
            username = self.wait.until(EC.presence_of_element_located((By.ID, 'loginName')))
            password = self.wait.until(EC.presence_of_element_located((By.ID, 'loginPassword')))
            submit = self.wait.until(EC.element_to_be_clickable((By.ID, 'loginAction')))
            username.send_keys(self.username)
            password.send_keys(self.password)
            time.sleep(1)
            submit.click()

        def password_error(self):
            """
            判断是否密码错误
            :return:
            """
            try:
                return WebDriverWait(self.browser, 5).until(
                    EC.text_to_be_present_in_element((By.ID, 'errorMsg'), '用户名或密码错误'))
            except TimeoutException:
                return False

        def get_cookies(self):
            """
            获取Cookies
            :return:
            """
            return self.browser.get_cookies()

        def main(self):
            """
            入口
            :return:
            """
            self.open()
            if self.password_error():
                return {
                    'status': 2,
                    'content': '用户名或密码错误'
                }            

            cookies = self.get_cookies()
            return {
                'status': 1,
                'content': cookies
            }


    if __name__ == '__main__':
        result = GenCookies(
            username='180000000',
            password='16yun',
        ).main()
        print(result)
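
The cookie list returned by get_cookies() can be replayed in a plain HTTP client. A minimal sketch, assuming the dict format Selenium's get_cookies() returns (at least 'name' and 'value' keys) and routing the follow-up requests through the same tunnel proxy:

    import requests

    # result['content'] is the cookie list produced by GenCookies above
    cookies = result['content']

    session = requests.Session()
    for c in cookies:
        # each Selenium cookie is a dict with at least 'name' and 'value'
        session.cookies.set(c['name'], c['value'])

    # reuse the tunnel proxy so the logged-in session keeps a consistent exit
    proxy = "http://username:password@t.16yun.cn:31111"
    session.proxies = {"http": proxy, "https": proxy}

    resp = session.get("https://m.example.cn/")
    print(resp.status_code)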
HtmlUnitDriver

import org.openqa.selenium.Platform;
import org.openqa.selenium.Proxy;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;

import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
import com.gargoylesoftware.htmlunit.WebClient;

public class HtmlUnitDriverProxyDemo
{
    // Proxy auth credentials
    final static String proxyUser = "username";
    final static String proxyPass = "password";

    // Proxy server
    final static String proxyServer = "t.16yun.cn:31111";

    public static void main(String[] args)
    {
        HtmlUnitDriver driver = getHtmlUnitDriver();

        driver.get("https://httpbin.org/ip");

        String title = driver.getTitle();
        System.out.println(title);
    }

    public static HtmlUnitDriver getHtmlUnitDriver()
    {
        HtmlUnitDriver driver = null;

        Proxy proxy = new Proxy();

        proxy.setHttpProxy(proxyServer);

        DesiredCapabilities capabilities = DesiredCapabilities.htmlUnit();
        capabilities.setCapability(CapabilityType.PROXY, proxy);
        capabilities.setJavascriptEnabled(true);
        capabilities.setPlatform(Platform.WIN8_1);

        driver = new HtmlUnitDriver(capabilities) {
            @Override
            protected WebClient modifyWebClient(WebClient client) {
                DefaultCredentialsProvider creds = new DefaultCredentialsProvider();
                creds.addCredentials(proxyUser, proxyPass);
                client.setCredentialsProvider(creds);
                return client;
            }
        };

        driver.setJavascriptEnabled(true);

        return driver;
    }
}
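
Note: Selenium's Proxy capability carries only the proxy address, not credentials, which is why the demo overrides modifyWebClient and installs a DefaultCredentialsProvider on the underlying HtmlUnit WebClient to answer the proxy's 407 challenge.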
FirefoxDriver

import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;

public class FirefoxDriverProxyDemo
{
    // Proxy tunnel auth credentials
    final static String proxyUser = "username";
    final static String proxyPass = "password";

    // Proxy server
    final static String proxyHost = "t.16yun.cn";
    final static int proxyPort = 31111;

    final static String firefoxBin = "C:/Program Files/Mozilla Firefox/firefox.exe";

    public static void main(String[] args)
    {
        System.setProperty("webdriver.firefox.bin", firefoxBin);

        FirefoxProfile profile = new FirefoxProfile();

        // 1 = manual proxy configuration
        profile.setPreference("network.proxy.type", 1);

        profile.setPreference("network.proxy.http", proxyHost);
        profile.setPreference("network.proxy.http_port", proxyPort);

        profile.setPreference("network.proxy.ssl", proxyHost);
        profile.setPreference("network.proxy.ssl_port", proxyPort);

        profile.setPreference("username", proxyUser);
        profile.setPreference("password", proxyPass);

        profile.setPreference("network.proxy.share_proxy_settings", true);

        // do not proxy localhost traffic
        profile.setPreference("network.proxy.no_proxies_on", "localhost");

        FirefoxDriver driver = new FirefoxDriver(profile);
        driver.get("https://httpbin.org/ip");
    }
}

9、PuppeteerLink

const puppeteer = require('puppeteer');
// Proxy server (product site: www.16yun.cn)
const proxyServer = 'http://t.16yun.cn:31111';

// Proxy auth credentials
const username = 'username';
const password = 'password';

(async () => {
    const browser = await puppeteer.launch({
        args: ['--proxy-server=' + proxyServer, '--no-sandbox', '--disable-setuid-sandbox']});
    const page = await browser.newPage();
    await page.authenticate({ username, password });
    await page.goto('https://www.baidu.com');
    const cookies = await page.cookies();
    console.log(cookies);
    await page.setViewport({width: 320, height: 480});
    await page.screenshot({path: '/screenshots/full.png', fullPage: true});
    await browser.close();
})();
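
page.authenticate() hands the credentials to Chromium so it can answer the proxy's authentication challenge; no browser extension or user:pass embedded in the proxy URL is required.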

Create a middlewares.py file in the project (./project_name/middlewares.py)

    #! -*- encoding:utf-8 -*-
    import websockets
    from scrapy.http import HtmlResponse
    from logging import getLogger
    import asyncio
    import pyppeteer
    import logging
    from concurrent.futures import TimeoutError
    import base64
    import sys
    import random

    pyppeteer_level = logging.WARNING
    logging.getLogger('websockets.protocol').setLevel(pyppeteer_level)
    logging.getLogger('pyppeteer').setLevel(pyppeteer_level)

    PY3 = sys.version_info[0] >= 3


    def base64ify(bytes_or_str):
        if PY3 and isinstance(bytes_or_str, str):
            input_bytes = bytes_or_str.encode('utf8')
        else:
            input_bytes = bytes_or_str

        output_bytes = base64.urlsafe_b64encode(input_bytes)
        if PY3:
            return output_bytes.decode('ascii')
        else:
            return output_bytes


    class ProxyMiddleware(object):
        # Load a random User-Agent pool (optional)
        # USER_AGENT = open('useragents.txt').readlines()

        def process_request(self, request, spider):
            # Proxy server
            proxyHost = "t.16yun.cn"
            proxyPort = "31111"

            # Proxy auth credentials
            proxyUser = "username"
            proxyPass = "password"

            request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)

            # Add the auth header
            encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

            # Set the IP-switching header (optional)
            tunnel = random.randint(1, 10000)
            request.headers['Proxy-Tunnel'] = str(tunnel)

            # Set a random User-Agent (optional)
            # request.headers['User-Agent'] = random.choice(self.USER_AGENT)

    class PyppeteerMiddleware(object):
        def __init__(self, **args):
            """
            init logger, loop, browser
            :param args:
            """
            self.logger = getLogger(__name__)
            self.loop = asyncio.get_event_loop()
            self.browser = self.loop.run_until_complete(
                pyppeteer.launch(headless=True))
            self.args = args

        def __del__(self):
            """
            close loop
            :return:
            """
            self.loop.close()

        def render(self, url, retries=1, script=None, wait=0.3, scrolldown=False, sleep=0,
                   timeout=8.0, keep_page=False):
            """
            render page with pyppeteer
            :param url: page url
            :param retries: max retry times
            :param script: js script to evaluate
            :param wait: number of seconds to wait before loading the page, preventing timeouts
            :param scrolldown: how many times to page down
            :param sleep: how long to sleep after the initial render
            :param timeout: the longest wait time, otherwise raise a timeout error
            :param keep_page: keep the page open instead of closing it
            :return: content, result, status
            """

            # define async render
            async def async_render(url, script, scrolldown, sleep, wait, timeout, keep_page):
                page = None
                try:
                    # basic render
                    page = await self.browser.newPage()
                    await asyncio.sleep(wait)
                    response = await page.goto(url, options={'timeout': int(timeout * 1000)})
                    if response.status != 200:
                        return None, None, response.status
                    result = None
                    # evaluate with script
                    if script:
                        result = await page.evaluate(script)

                    # scroll down for {scrolldown} times
                    if scrolldown:
                        for _ in range(scrolldown):
                            await page._keyboard.down('PageDown')
                            await asyncio.sleep(sleep)
                    else:
                        await asyncio.sleep(sleep)
                    if scrolldown:
                        await page._keyboard.up('PageDown')

                    # get html of page
                    content = await page.content()

                    return content, result, response.status
                except TimeoutError:
                    return None, None, 500
                finally:
                    # if keep_page is set, leave the page open for reuse
                    if page and not keep_page:
                        await page.close()

            content, result, status = [None] * 3

            # retry for {retries} times
            for i in range(retries):
                if not content:
                    content, result, status = self.loop.run_until_complete(
                        async_render(url=url, script=script, sleep=sleep, wait=wait,
                                     scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
                else:
                    break

            # if need to return js evaluation result
            return content, result, status

        def process_request(self, request, spider):
            """
            :param request: request object
            :param spider: spider object
            :return: HtmlResponse
            """
            if request.meta.get('render'):
                try:
                    self.logger.debug('rendering %s', request.url)
                    html, result, status = self.render(request.url)
                    return HtmlResponse(url=request.url, body=html, request=request, encoding='utf-8',
                                        status=status)
                except websockets.exceptions.ConnectionClosed:
                    pass

        @classmethod
        def from_crawler(cls, crawler):
            return cls(**crawler.settings.get('PYPPETEER_ARGS', {}))
Modify the project settings file (./project_name/settings.py)

    DOWNLOADER_MIDDLEWARES = {
        'scrapypyppeteer.middlewares.PyppeteerMiddleware': 543,
        'scrapypyppeteer.middlewares.ProxyMiddleware': 100,
    }
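
To have a request rendered by PyppeteerMiddleware, set render in its meta; requests without the flag pass straight through with only the proxy applied. A minimal spider sketch (the spider name and URL are placeholders):

    import scrapy


    class RenderSpider(scrapy.Spider):
        name = 'render_demo'  # hypothetical spider name

        def start_requests(self):
            # meta={'render': True} makes PyppeteerMiddleware render this request
            yield scrapy.Request('https://httpbin.org/ip', meta={'render': True})

        def parse(self, response):
            # the body here is the HTML produced by the headless browser
            self.logger.info(response.text)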

Download scrapy-pyppeteer

Source code

10、易语言Link

Download source code Download source code (https)

11、mitmproxyLink

# Set the upstream proxy server at startup
# replace the proxy server address, port, username and password with your own
mitmproxy --mode=upstream:http://t.16yun.cn:31111 --upstream-auth=username:password

To configure through the web UI instead:

  • Download and install mitmproxy (https://www.mitmproxy.org/)
  • After installation, start the web UI (mitmweb) and open http://127.0.0.1:8081 in a browser to reach the mitmproxy web console
  • In the web console, open 「mitmproxy」-「Options」-「Edit Options」
  • Set the mode option to upstream:http://t.16yun.cn:31111 (replace the proxy server address and port with your own)
  • Set the upstream_auth option to username:password (replace the username and password with your own)
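
With mitmproxy chained to the tunnel this way, local clients only need to point at the mitmproxy listener (port 8080 by default); mitmproxy performs the upstream authentication. A minimal sketch, assuming mitmproxy is running locally with its default settings:

    import requests

    # local mitmproxy listener; it forwards traffic to the upstream tunnel proxy
    proxies = {
        "http": "http://127.0.0.1:8080",
        "https": "http://127.0.0.1:8080",
    }

    # verify=False because mitmproxy re-signs HTTPS with its own CA;
    # alternatively install and trust the mitmproxy CA certificate
    resp = requests.get("https://httpbin.org/ip", proxies=proxies, verify=False)
    print(resp.text)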

12、More advanced crawler application casesLink

For broader application scenarios, we provide the following case studies; contact technical support if you need any of them:

  1. Crawler basics - development knowledge and tips
  2. Crawler basics - Selenium anti-bot techniques
  3. Crawler basics - Puppeteer rendering with the Scrapy framework
  4. Crawler basics - generating cookies via a Selenium login
  5. Crawler basics - Splash rendering with the Scrapy framework
  6. Crawler basics - packet capture with LightProxy
  7. A curated Chinese collection of Python crawler resources
  8. Crawler basics - the Scrapy framework
  9. Crawler basics - packet capture with Firefox
  10. Crawler basics - Python crawlers
  11. Crawler basics - the HTTP protocol flow