Python + Selenium: scraping lesson titles and durations from NetEase Cloud Classroom

Software installation

selenium

pip install selenium

geckodriver

Github.com/mozilla/gec…

The target page

Study.163.com/course/intr…

At first I fetched the page with a plain HTTP request, but the returned source contained no lesson information, which means the page loads its content dynamically with JavaScript.
Using the browser's developer tools, I found that the browser requests the following address to obtain the lesson details:

Study.163.com/dwr/call/pl…

In the preview pane, you can see the information for each lesson, Unicode-encoded.
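
Escaped text like this can be decoded in Python if you want to read it outside the browser. The following is only a minimal sketch, not the method used in this article: the `raw` fragment and the field name `lessonName` are made up for illustration, and it assumes the value is a double-quoted string literal whose escapes are also valid JSON.

```python
import json

# Hypothetical fragment in the style of a JavaScript response body.
# The real response from the site is not reproduced in this article.
raw = r's0.lessonName="\u7b2c\u4e00\u8bfe";'


def decode_js_string(literal: str) -> str:
    # A double-quoted literal with only backslash-u escapes is also valid
    # JSON, so json.loads can decode it.
    return json.loads(literal)


# Cut out the quoted part, quotes included, and decode it.
start = raw.index('"')
end = raw.rindex('"') + 1
lesson_name = decode_js_string(raw[start:end])
print(lesson_name)
```

String literals that use single quotes or JavaScript-only escapes would need a different approach.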

If you request the address above directly, you get an error, so I used Selenium instead. It is only one page anyway, so performance is not a concern.

Code

Instructions

study163seleniumff.py is the main script.
helper.py is a helper module, placed in the same directory as the main script.
geckodriver.exe must be placed in the relative path ../drivers/.
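
A relative path such as ../drivers/ is resolved against the current working directory, so the script only finds the driver when it is launched from its own folder. One way around this, sketched below, is to build the path from the script's location. `driver_path` is a hypothetical helper that is not part of the original files, and the example path is invented.

```python
from pathlib import Path


def driver_path(script_file: str, name: str = 'geckodriver') -> Path:
    """Return <script dir>/../drivers/<name> as an absolute path."""
    script_dir = Path(script_file).resolve().parent
    return script_dir.parent / 'drivers' / name


# In the main script this would be called as driver_path(__file__).
example = driver_path('/project/src/study163seleniumff.py')
print(example)
```

The result can be converted with `str()` before it is handed to the driver.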

study163seleniumff.py

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from lxml import etree
import csv
from helper import Chapter, Lesson

# Request the page
url = 'https://study.163.com/course/introduction.htm?courseId=1006078212#/courseDetail?tab=1'

options = Options()
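options.add_argument('-headless')  # run Firefox without a visible window
# Selenium 3 style arguments; Selenium 4 expects service=Service(path) and options=options.
driver = Firefox(
    executable_path='../drivers/geckodriver',
    firefox_options=options)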
driver.get(url)
text = driver.page_source
driver.quit()

# Parse the page
html = etree.HTML(text)
chapters = html.xpath('//div[@class="chapter"]')
TABLEHEAD = ['Chapter number', 'Chapter name', 'Lesson number', 'Lesson name', 'Lesson duration']
rows = []

for chapter_el in chapters:
    chapter = Chapter(chapter_el)
    chapter_info = chapter.chapter_info
    for lesson_el in chapter.get_lessons():
        lesson = Lesson(lesson_el)
        # One CSV row: chapter fields followed by lesson fields
        values = (*chapter_info, *lesson.lesson_info)
        row = dict(zip(TABLEHEAD, values))
        rows.append(row)

# Store data
with open('courseinfo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, TABLEHEAD)
    writer.writeheader()
    writer.writerows(rows)
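
The row-building and storage steps above can be tried in isolation. This is a small sketch with made-up values standing in for `chapter.chapter_info` and `lesson.lesson_info`; the column names and the temporary file name are illustrative. It writes one row the same way as the main script and then checks the raw bytes: `utf-8-sig` puts a byte order mark at the start of the file, and `newline=''` leaves line endings to the csv module, as the csv documentation recommends.

```python
import csv
import os
import tempfile

TABLEHEAD = ['Chapter number', 'Chapter name',
             'Lesson number', 'Lesson name', 'Lesson duration']

# Made-up values standing in for chapter.chapter_info and lesson.lesson_info.
chapter_info = ('Chapter 1', 'Getting started')
lesson_info = ('Lesson 1', 'Introduction', '05:30')

# zip pairs each column name with the value at the same position.
row = dict(zip(TABLEHEAD, (*chapter_info, *lesson_info)))

# Write to a throwaway file so the real courseinfo.csv is not touched.
path = os.path.join(tempfile.mkdtemp(), 'courseinfo_demo.csv')
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, TABLEHEAD)
    writer.writeheader()
    writer.writerow(row)

with open(path, 'rb') as f:
    raw_bytes = f.read()

has_bom = raw_bytes.startswith(b'\xef\xbb\xbf')
print(has_bom, raw_bytes.count(b'\r\n'))
```

The byte order mark is what lets spreadsheet programs such as Excel recognise the file as UTF-8.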


helper.py

class Chapter:
    def __init__(self, chapter):
        self.chapter = chapter
        self._chapter_info = None

    def parse_all(self):
        # chapter number
        chapter_num = self.chapter.xpath(
            './/span[contains(@class, "chaptertitle")]/text()')[0]
        # remove the colon at the end of the chapter number
        chapter_num = chapter_num[:-1]
        # chapter name
        chapter_name = self.chapter.xpath(
            './/span[contains(@class, "chaptername")]/text()')[0]
        return chapter_num, chapter_name

    @property
    def chapter_info(self):
        # parse once and reuse the cached result
        if self._chapter_info is None:
            self._chapter_info = self.parse_all()
        return self._chapter_info

    def get_lessons(self):
        return self.chapter.xpath(
            './/div[@data-lesson]')


class Lesson:
    def __init__(self, lesson):
        self.lesson = lesson
        self._lesson_info = None

    @property
    def lesson_info(self):
        # lesson number
        lesson_num = self.lesson.xpath(
            './/span[contains(@class, "ks")]/text()')[0]
        # lesson name
        lesson_name = self.lesson.xpath(
            './/span[@title]/@title')[0]
        # lesson duration
        lesson_len = self.lesson.xpath(
            './/span[contains(@class, "kstime")]/text()')[0]
        self._lesson_info = lesson_num, lesson_name, lesson_len
        return self._lesson_info
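
To see what these XPath expressions select without opening a browser, they can be run against a small inline snippet. The markup below is hypothetical: only the class names and attributes are taken from the expressions in helper.py, and the real page may be structured differently.

```python
from lxml import etree

# Hypothetical markup, written only to exercise the XPath expressions.
SAMPLE = """
<div class="chapter">
  <span class="f-fl chaptertitle">Chapter 1:</span>
  <span class="f-fl chaptername">Getting started</span>
  <div data-lesson="1">
    <span class="f-fl ks">Lesson 1</span>
    <span title="Introduction">Introduction</span>
    <span class="f-fr kstime">05:30</span>
  </div>
</div>
"""

html = etree.HTML(SAMPLE)
chapter = html.xpath('//div[@class="chapter"]')[0]

# Same expressions as Chapter.parse_all(), including the trailing-colon strip.
chapter_num = chapter.xpath(
    './/span[contains(@class, "chaptertitle")]/text()')[0][:-1]
chapter_name = chapter.xpath(
    './/span[contains(@class, "chaptername")]/text()')[0]

# Same expressions as Chapter.get_lessons() and Lesson.lesson_info.
lesson = chapter.xpath('.//div[@data-lesson]')[0]
lesson_num = lesson.xpath('.//span[contains(@class, "ks")]/text()')[0]
lesson_name = lesson.xpath('.//span[@title]/@title')[0]
lesson_len = lesson.xpath('.//span[contains(@class, "kstime")]/text()')[0]

print(chapter_num, chapter_name, lesson_num, lesson_name, lesson_len)
```

Note that `contains(@class, "ks")` also matches the `kstime` span, because "kstime" contains "ks". The lesson number is correct only because `[0]` picks the first match in document order.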


The final result

The final result is saved as courseinfo.csv, in the same directory as the main script.

Completed on 2018.11.16
