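Preface
Here are a few notes on using form login with Scrapy. There isn't much to it, so I'll keep this short: the code should be enough to understand it, and I've commented it wherever I could.
Code
Here I log in to my own personal blog.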
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.http import Request, FormRequest


    class LoginFormSpider(scrapy.Spider):
        name = 'Login_Form'
        allowed_domains = ['bywalks.com']
        start_urls = ['http://www.bywalks.com/wp-admin/']

        def parse(self, response):
            content = response.xpath('//*[@id="comment-5"]/div/blockquote/p/text()').extract_first()
            print(content)

        # Login page
        login_url = 'http://www.bywalks.com/wp-login.php'

        # Override the base class's start_requests so the login page is requested first
        def start_requests(self):
            yield Request(self.login_url, callback=self.login)

        def login(self, response):
            # Only the username and password need to be supplied,
            # because from_response fills in the hidden form fields automatically
            fd = {'log': 'XXX', 'pwd': 'XXX'}
            yield FormRequest.from_response(response, formdata=fd, callback=self.parse_login)

        def parse_login(self, response):
            # After a successful login, crawl the content we need.
            # The string below ("Welcome to WordPress") is text shown on the page after login.
            if "欢迎使用WordPress" in response.text:
                yield from super().start_requests()
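To make the comment about hidden form fields more concrete, here is a rough sketch of the idea using only the standard library. It is not Scrapy's implementation (from_response does more, such as picking the form and the submit button); it only shows hidden inputs being read from the page and merged with the username and password. The sample HTML, `FormFieldCollector` and `build_form_data` are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical login page, loosely modelled on a WordPress-style form.
SAMPLE_LOGIN_PAGE = """
<form action="/wp-login.php" method="post">
  <input type="text" name="log" value="">
  <input type="password" name="pwd" value="">
  <input type="hidden" name="redirect_to" value="/wp-admin/">
  <input type="hidden" name="testcookie" value="1">
  <input type="submit" name="wp-submit" value="Log In">
</form>
"""


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from the <input> elements of a page."""

    # Input types this simplified sketch leaves out of the submitted data.
    SKIPPED_TYPES = {"submit", "button", "image", "reset", "checkbox", "radio", "file"}

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        input_type = (attributes.get("type") or "text").lower()
        if name and input_type not in self.SKIPPED_TYPES:
            self.fields[name] = attributes.get("value") or ""


def build_form_data(html, overrides):
    """Start from the values already in the form, then apply our own on top."""
    collector = FormFieldCollector()
    collector.feed(html)
    merged = dict(collector.fields)
    merged.update(overrides)
    return merged


form_data = build_form_data(SAMPLE_LOGIN_PAGE, {"log": "XXX", "pwd": "XXX"})
encoded_body = urlencode(form_data)  # what would go into the POST body
print(form_data)
print(encoded_body)
```

The hidden `redirect_to` and `testcookie` values end up in the request even though only `log` and `pwd` were supplied by hand, which is the convenience the spider above relies on.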
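    # In the project's settings.py
    # Obey robots.txt rules
    # Change this to False so that the spider does not follow the rules in robots.txt.
    # Some sites do not want crawlers in their admin area, so they add a rule for it;
    # a spider that sees this rule and obeys it will not crawl those pages.
    ROBOTSTXT_OBEY = False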
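If you want to see why this setting matters, the small sketch below uses Python's standard urllib.robotparser to evaluate a robots.txt rule against the admin URL. The robots.txt content here is an assumed example for illustration, not a copy of what the blog actually serves, and Scrapy performs its own robots.txt check internally rather than calling this code.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules: keep every crawler out of the admin area.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# A crawler that obeys these rules may fetch the home page but not the admin page.
admin_allowed = parser.can_fetch("*", "http://www.bywalks.com/wp-admin/")
home_allowed = parser.can_fetch("*", "http://www.bywalks.com/")

print("wp-admin allowed:", admin_allowed)
print("home page allowed:", home_allowed)
```

With rules like these, a spider that obeys robots.txt would drop the request to the admin page, so the login would succeed but the final crawl would return nothing.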
Author: Bywalks
Permalink: http://bywalks.com/2017/11/03/python3-study-12/
License: Copyright (c) 2022 CC-BY-NC-4.0 LICENSE