Python + Selenium: scraping lesson titles and durations from NetEase Cloud Classroom


Software installation

  • selenium

pip install selenium

  • geckodriver

Github.com/mozilla/gec…


The target page

Study.163.com/course/intr…

  • At first I downloaded the page with an ordinary request, but the source contained no lesson information, which suggests the page loads its content dynamically with JavaScript.
  • Using the browser's developer tools, I found that the browser requests the following address to get the lesson details:

Study.163.com/dwr/call/pl…

  • In the preview pane, you can see that each lesson's information is Unicode-encoded.

  • If you request the address above directly, you get an error. So I simply used Selenium; there is only one page anyway, so performance is not a concern.
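
The Unicode-encoded lesson information mentioned above can be made readable with a few lines of Python. This is a minimal sketch, assuming the response carries literal \uXXXX escape sequences; the function name `decode_unicode_escapes` and the sample fragment (including the `lessonName` field) are hypothetical, since the article does not show the real response.

```python
import re

# Matches a literal backslash-u followed by four hex digits, e.g. \u8bfe
_UNICODE_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_unicode_escapes(raw):
    """Replace each literal \\uXXXX sequence with the character it encodes."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)


# Hypothetical fragment; the real response format is not shown in this article
sample = r's0.lessonName="\u7b2c\u4e00\u8bfe";'
print(decode_unicode_escapes(sample))
```

Text without escape sequences passes through unchanged, so the function is safe to apply to a whole response body. Characters outside the Basic Multilingual Plane, which are escaped as surrogate pairs, are not handled by this sketch.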

Code

Instructions

  • study163seleniumff.py: the main run file
  • helper.py: a helper module, kept in the same directory as the main run file
  • geckodriver.exe: must be placed in the relative path ../drivers/
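
A relative path such as ../drivers/ is resolved against the current working directory, so the layout above only works when the script is launched from its own directory. The sketch below shows one way to build the driver path from the script's location instead; the helper name `driver_path` and the example project layout are assumptions, not part of the original files.

```python
from pathlib import Path


def driver_path(script_file, driver_name='geckodriver.exe'):
    """Return <script dir>/../drivers/<driver_name> as an absolute Path."""
    # resolve() makes the path absolute so .parent is always meaningful
    script_dir = Path(script_file).resolve().parent
    # '..' relative to the script directory is that directory's parent
    return script_dir.parent / 'drivers' / driver_name


# Hypothetical layout; in the main run file this would be driver_path(__file__)
print(driver_path('project/src/study163seleniumff.py'))
```

The result can be converted with `str()` and passed wherever the main run file currently uses the hard-coded relative path.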

study163seleniumff.py

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from lxml import etree
import csv
from helper import Chapter, Lesson

# request data
url = 'https://study.163.com/course/introduction.htm?courseId=1006078212#/courseDetail?tab=1'

options = Options()
options.add_argument('-headless')  # headless parameter
# Selenium 3 style arguments; Selenium 4 passes the driver path via a Service object instead
driver = Firefox(
    executable_path='../drivers/geckodriver',
    options=options)
driver.get(url)
text = driver.page_source
driver.quit()

# parse data
html = etree.HTML(text)
chapters = html.xpath('//div[@class="chapter"]')
TABLEHEAD = ['Chapter number', 'Chapter name', 'Lesson number', 'Lesson name', 'Lesson duration']
rows = []

for chapter_node in chapters:
    chapter = Chapter(chapter_node)
    # Chapter fields are the same for every lesson in this chapter
    chapter_info = chapter.chapter_info
    for lesson_node in chapter.get_lessons():
        lesson = Lesson(lesson_node)
        values = (*chapter_info, *lesson.lesson_info)
        row = dict(zip(TABLEHEAD, values))
        rows.append(row)

# Store data
with open('courseinfo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, TABLEHEAD)
    writer.writeheader()
    writer.writerows(rows)

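The row-building and storage steps of the main run file can be tried in isolation. The sketch below uses made-up tuples in place of `chapter.chapter_info` and `lesson.lesson_info`, illustrative column names, and an in-memory buffer in place of courseinfo.csv; none of these values come from the real page.

```python
import csv
import io

# Illustrative column names: two chapter fields followed by three lesson fields
TABLEHEAD = ['Chapter number', 'Chapter name',
             'Lesson number', 'Lesson name', 'Lesson duration']

# Hypothetical values standing in for chapter.chapter_info and lesson.lesson_info
chapter_info = ('Chapter 1', 'Getting started')
lesson_info = ('Lesson 1', 'Installing Python', '05:30')

# Unpack both tuples into one flat tuple, then pair each value with its column name
values = (*chapter_info, *lesson_info)
row = dict(zip(TABLEHEAD, values))

# Write to an in-memory buffer instead of a file so nothing touches disk
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, TABLEHEAD)
writer.writeheader()
writer.writerows([row])
print(buffer.getvalue())
```

When writing to a real file, `newline=''` lets the csv module control line endings itself, and `encoding='utf-8-sig'` adds a byte order mark, which helps Excel detect UTF-8.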

helper.py

class Chapter:
    def __init__(self, chapter):
        self.chapter = chapter
        self._chapter_info = None

    def parse_all(self):
        # chapter number
        chapter_num = self.chapter.xpath(
            './/span[contains(@class, "chaptertitle")]/text()')[0]
        # Remove the colon at the end of the chapter number
        chapter_num = chapter_num[:-1]
        # chapter name
        chapter_name = self.chapter.xpath(
            './/span[contains(@class, "chaptername")]/text()')[0]
        return chapter_num, chapter_name

    @property
    def chapter_info(self):
        # Parse once and reuse the cached tuple on later accesses
        if self._chapter_info is None:
            self._chapter_info = self.parse_all()
        return self._chapter_info

    def get_lessons(self):
        return self.chapter.xpath(
            './/div[@data-lesson]')


class Lesson:
    def __init__(self, lesson):
        self.lesson = lesson
        self._lesson_info = None

    @property
    def lesson_info(self):
        # lesson number
        lesson_num = self.lesson.xpath(
            './/span[contains(@class, "ks")]/text()')[0]
        # lesson name
        lesson_name = self.lesson.xpath(
            './/span[@title]/@title')[0]
        # lesson duration
        lesson_len = self.lesson.xpath(
            './/span[contains(@class, "kstime")]/text()')[0]
        self._lesson_info = lesson_num, lesson_name, lesson_len
        return self._lesson_info

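The nested lookup that helper.py performs (find each chapter, then find lessons relative to that chapter) can be illustrated without a browser. This sketch swaps lxml for the standard library's ElementTree so it runs with no extra installs; ElementTree's XPath subset has no `contains()` or `text()`, so exact class matches and `.text` are used instead. The sample markup is hypothetical and only mirrors the class names used in helper.py; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the class names used in helper.py
SAMPLE = """
<div>
  <div class="chapter">
    <span class="chaptertitle">Chapter 1:</span>
    <span class="chaptername">Getting started</span>
    <div data-lesson="1">
      <span class="ks">Lesson 1</span>
      <span title="Installing Python">Installing Python</span>
      <span class="kstime">05:30</span>
    </div>
    <div data-lesson="2">
      <span class="ks">Lesson 2</span>
      <span title="Variables">Variables</span>
      <span class="kstime">08:12</span>
    </div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE.strip())
rows = []
for chapter in root.findall(".//div[@class='chapter']"):
    # Strip the trailing colon, as parse_all() does
    chapter_num = chapter.find(".//span[@class='chaptertitle']").text[:-1]
    chapter_name = chapter.find(".//span[@class='chaptername']").text
    # Lessons are searched relative to the current chapter node
    for lesson in chapter.findall(".//div[@data-lesson]"):
        lesson_num = lesson.find(".//span[@class='ks']").text
        lesson_name = lesson.find(".//span[@title]").get('title')
        lesson_len = lesson.find(".//span[@class='kstime']").text
        rows.append((chapter_num, chapter_name,
                     lesson_num, lesson_name, lesson_len))

for row in rows:
    print(row)
```

Starting each inner query with `.//` keeps it scoped to the current node, which is the same reason the XPath expressions in helper.py begin with a dot.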

The final result

The final result is saved as courseinfo.csv, in the same directory as the main run file.


Completed on 2018.11.16