
Preface

Recently, many people have asked me whether there are any good graduation project ideas for big data majors, and I only gave brief replies. Some even asked me directly for the source code….

So I took some time to write up the idea behind my own graduation project. I taught myself big data during my internship, and this was my idea at the time, so treat it simply as a reference.

In the year I graduated, most of my classmates' graduation projects were management systems or XX-mall e-commerce sites developed in Java, and my initial idea was no different. These are very common graduation project topics for computer science majors.

The advantage of this choice is simplicity: there are plenty of templates online. Students with strong hands-on skills can pull source code straight from GitHub, modify it slightly, and have a finished graduation project. Students with weaker hands-on skills can also pay to get one done at low cost.

As for the downside, this kind of project is all too common. Unless the UI and the design thinking are impressive enough to move the teacher, a well-informed examiner will watch your demonstration with an unmoved heart and vacant eyes, then mechanically give you a passing mark.

Of course, most students' inner thought is: passing is enough. But some students worry that their project will look too similar to everyone else's, and that the teacher may then ask detailed (tricky) questions. So it is best to do the design yourself; even if you start from a template, make sure you understand the technology and architecture behind it.

At the same time, if you want a distinctive design, the technology has to be a little “fancy” as well.

Big data design ideas

In the final analysis, a big data graduation design is built around a big data platform. Projects like management systems and shopping malls are oriented toward a programming language, while big data is mainly oriented toward the platform. It’s like when you mention big data, people reply: big data… isn’t that Hadoop?

Yes. That answer is one-sided, but a big data design really does spread out and extend from Hadoop.

I did not major in big data; my dream was to become an outstanding Java developer (of the Ctrl+C/Ctrl+V variety). I came into contact with big data by accident during my internship in 2017 and began studying it on my own, so I completed my graduation project on big data when I graduated in 2018. Here is a brief outline of my graduation design process.

  1. Set up Hadoop, Hive, Kafka, and Spark clusters on VMs
  2. Collected 1.63 million records with Java and stored them in MySQL (Python is recommended)
  3. Used Flume to write the MySQL data to Kafka in real time
  4. Developed a Spark Streaming program in Scala that reads data from Kafka, processes it, and writes the results back to Kafka (see the sketch after this list)
  5. Used Flume to write the Kafka data to HDFS, then loaded it into Hive for HQL analysis
  6. Developed a data management system with Spring Boot and Vue for data queries and graphical display, integrated with ECharts and Baidu Maps
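
To make step 4 concrete: I wrote mine in Scala with Spark Streaming, but the same shape can be sketched today in Python with Structured Streaming and its Kafka connector. The brokers, topic names, transform, and checkpoint path below are placeholder assumptions, not my original code.

```python
# Minimal Structured Streaming sketch: Kafka in -> transform -> Kafka out.
# Brokers, topics, and the checkpoint path are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.appName("poi-stream").getOrCreate()

# Read the raw records from Kafka as a streaming DataFrame.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "vm1:9092,vm2:9092,vm3:9092")
       .option("subscribe", "poi_raw")
       .load())

# Kafka values arrive as bytes; cast to string and apply a trivial transform.
processed = (raw.selectExpr("CAST(key AS STRING) AS key",
                            "CAST(value AS STRING) AS value")
             .withColumn("value", upper(col("value"))))

# Write the processed records back to a second Kafka topic.
query = (processed.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "vm1:9092,vm2:9092,vm3:9092")
         .option("topic", "poi_processed")
         .option("checkpointLocation", "/tmp/ckpt/poi")
         .start())

query.awaitTermination()
```

Run it with spark-submit and the spark-sql-kafka connector package matching your Spark version.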

That’s the gist of it. You can expand on the idea above; below, let’s walk through the specific steps.

Big data graduation project in practice

For some of the big data concepts mentioned below, refer to an article I wrote earlier on big data.

0. Data preparation

Big data, big data: first of all, the data has to be big. How big is big? Since 2018 I have been responsible for ingesting, storing, and processing 10 billion records a day, and I’ve grown a little smug about it ~ colleagues often tell me they need to hook up a file interface with a huge data volume; I ask how much, they say 10 billion records a day, and I usually reply lightly: 10 billion? That really is a lot ~~~

In fact, a graduation project does not need that much data. What matters is that data flows through the big data platform, simulating ETL and real-time processing so as to demonstrate the value of the data. So where does the data come from?

Method one: write a program to generate test data. The problem is that generated data is too uniform to show the value of data analysis, as the sketch below illustrates. So we adopt method two: develop a crawler to collect real data from the Internet.
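
For completeness, a minimal sketch of method one, generating synthetic POI-style records into a CSV; the field names and value ranges are invented for illustration. The output is statistically flat, which is exactly why it makes for dull analysis:

```python
# Generate uniform, synthetic POI-like test records (method one).
import csv
import random

CATEGORIES = ["restaurant", "hotel", "school", "hospital", "park"]

def make_record(i: int) -> dict:
    # Random points roughly inside mainland China's bounding box.
    return {
        "id": i,
        "name": f"poi_{i}",
        "category": random.choice(CATEGORIES),
        "lng": round(random.uniform(73.0, 135.0), 6),
        "lat": round(random.uniform(18.0, 54.0), 6),
    }

with open("test_poi.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "category", "lng", "lat"])
    writer.writeheader()
    for i in range(100_000):
        writer.writerow(make_record(i))
```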

At that time, I developed a crawler in Java and collected 1.63 million POI (point of interest) records, storing them in MySQL to complete the data preparation. For crawler development, though, I still recommend Python. I didn’t know Python in 2017; I started learning it in 2018, did a lot of crawler development afterwards, and wrote a beginning-to-end series of crawler tutorials that you can also refer to.
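
To give method two a rough shape, here is a minimal Python sketch that pages through a hypothetical JSON endpoint and writes POIs into MySQL with pymysql; the URL, response fields, and table schema are placeholders, not the actual source I crawled:

```python
# Minimal crawler sketch (method two): fetch paginated JSON, store in MySQL.
import time

import pymysql
import requests

conn = pymysql.connect(host="localhost", user="root",
                       password="***", database="poi", charset="utf8mb4")

with conn.cursor() as cur:
    cur.execute("""CREATE TABLE IF NOT EXISTS poi (
                       id BIGINT PRIMARY KEY,
                       name VARCHAR(255),
                       lng DOUBLE,
                       lat DOUBLE)""")
    for page in range(1, 101):
        # Hypothetical paginated endpoint; a real source will differ.
        resp = requests.get("https://example.com/api/poi",
                            params={"page": page}, timeout=10)
        for item in resp.json().get("results", []):
            cur.execute("REPLACE INTO poi (id, name, lng, lat) VALUES (%s, %s, %s, %s)",
                        (item["id"], item["name"], item["lng"], item["lat"]))
        conn.commit()
        time.sleep(1)  # be polite to the target site

conn.close()
```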

1. Big data platform construction

To pluck the stars and the moon, you must first raise a tall building from level ground.

As mentioned above, big data revolves around the platform. At the time, I set up three CentOS virtual machines on my laptop, mainly to host the following clusters.

Before setting up the clusters, configure the operating system and environment as follows (a quick sanity-check sketch follows the list).

  1. Install the JDK and Scala
  2. Set up passwordless SSH trust among the three VMs
  3. Install MySQL to hold the Hive metastore
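
A small sanity-check script can confirm all three before moving on; the hostnames and MySQL credentials below are assumptions, so adjust them to your own VMs:

```python
# Quick sanity check for the three prerequisites above.
import subprocess

import pymysql

# 1. JDK and Scala are installed and on the PATH.
subprocess.run(["java", "-version"], check=True)
subprocess.run(["scala", "-version"])

# 2. Passwordless SSH trust works (BatchMode fails instead of prompting).
for host in ["vm1", "vm2", "vm3"]:
    subprocess.run(["ssh", "-o", "BatchMode=yes", host, "hostname"], check=True)

# 3. MySQL is reachable to serve as the Hive metastore database.
pymysql.connect(host="vm1", user="hive", password="***", database="metastore").close()
print("environment looks OK")
```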

Hadoop – Basic core

As the basic infrastructure, the Hadoop cluster is the core of the big data platform: HDFS provides distributed storage, and YARN provides computing resources.
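
To see HDFS acting as distributed storage from a client’s point of view, here is a minimal sketch using the third-party hdfs (HdfsCLI) Python package over WebHDFS; the NameNode address is an assumption (WebHDFS listens on 9870 in Hadoop 3, 50070 in Hadoop 2):

```python
# Write a small file into HDFS, read it back, and list the directory.
from hdfs import InsecureClient

client = InsecureClient("http://vm1:9870", user="hadoop")

client.write("/tmp/hello.txt", data=b"hello hdfs", overwrite=True)
with client.read("/tmp/hello.txt") as reader:
    print(reader.read())
print(client.list("/tmp"))
```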

If you go with a plain master-slave architecture, one NameNode and two DataNodes will do. If you want to play it fancier, choose the HA (high availability) architecture, that is, two masters and two slaves, which requires four virtual machines.

With HA there are two NameNodes, but only one NN is active while the other is standby. You can kill the active NN and let the standby NN take over the cluster.
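
That failover demo can also be driven with the standard hdfs haadmin subcommands; in this hedged sketch the service IDs nn1/nn2 are assumed names from hdfs-site.xml:

```python
# Inspect and switch NameNode HA state via the standard `hdfs haadmin` CLI.
import subprocess

def nn_state(service_id: str) -> str:
    out = subprocess.run(["hdfs", "haadmin", "-getServiceState", service_id],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()  # "active" or "standby"

print("nn1:", nn_state("nn1"))
print("nn2:", nn_state("nn2"))

# Hand the active role from nn1 to nn2. With automatic (ZKFC) failover
# enabled, manual failover is refused unless forced.
subprocess.run(["hdfs", "haadmin", "-failover", "nn1", "nn2"], check=True)
```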

HA, for that matter, is everywhere in big data. In the Hadoop ecosystem, running multiple NNs and multiple DNs in a cluster is HA, and the HDFS replication mechanism is also a form of HA; this can fill quite a few pages of the thesis.

The following is the basic information of the Hadoop cluster NN and DN.

Hive – Offline analysis

Hive’s role in my graduation project is as the data analysis tool; it mainly covers the L (load) stage of the big data ETL and the offline analysis part of the big data platform.

Hive is a data warehouse that performs offline analysis on data in HDFS. Although Hive is not a database, you can use it much like one. I won’t go into the other basic concepts here.
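
To show what “using Hive like a database” looks like, a minimal sketch querying HiveServer2 with the PyHive package; the host, user, and the poi table are assumptions:

```python
# Query Hive through HiveServer2 with PyHive.
from pyhive import hive

conn = hive.Connection(host="vm1", port=10000, username="hadoop", database="default")
cur = conn.cursor()

# A typical offline-analysis HQL query: POI counts per category.
cur.execute("""
    SELECT category, COUNT(*) AS cnt
    FROM poi
    GROUP BY category
    ORDER BY cnt DESC
    LIMIT 10
""")
for category, cnt in cur.fetchall():
    print(category, cnt)

cur.close()
conn.close()
```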

There are many alternatives to Hive, such as ClickHouse (which claims to be up to 800 times faster than Hive) and Druid, but their application scenarios still differ somewhat.

In big data warehousing there are many interesting platform architectures and basic concepts, and ETL describes how data is processed on its way into and through the warehouse.