
Original text: click.aliyun.com/m/27352/

If you’re new to the scene, big data looks scary! Building on a basic understanding, let’s focus on a few key terms you can use to impress your date, boss, family, or anyone else.



Let’s get started:

1. MaxCompute (formerly ODPS). MaxCompute is a big data platform developed independently by Alibaba Cloud in China. It provides fast, fully managed, petabyte-scale data warehouse solutions that analyze and process massive data economically and efficiently, and it powers computing for 1.8 million enterprises across 60 countries. See www.aliyun.com/product/odp… . A comparable open source product is Hadoop; see yq.aliyun.com/articles/78… .

2. Analysis. At the end of the year, you may receive a year-end statement from your credit card company listing all of your transactions for the year. What if you want a closer look at how much you spent on food, clothing, entertainment, and so on? Then you are doing “analysis”: drawing lessons from a pile of raw data to help you make spending decisions for the coming year. Now imagine doing the same exercise with the Twitter or Facebook posts of an entire city, and you have big data analytics. The essence of big data analytics is using large amounts of data to make inferences and tell stories. It comes in three different flavors, covered by the next three terms in turn.

3. Descriptive analysis. If you tell me that 25% of your credit card spending last year went to food, 35% to clothing, 20% to entertainment, and the rest to miscellaneous items, that is descriptive analysis: a summary of what already happened. Of course, you could break the numbers down in even more detail.
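As a toy illustration (hypothetical transactions, plain Python), descriptive analysis is little more than summarizing the raw records:

```python
# Descriptive analysis sketch: summarize last year's (hypothetical) transactions.
transactions = [
    ("food", 120.0), ("clothing", 80.0), ("food", 45.5),
    ("entertainment", 60.0), ("clothing", 200.0), ("other", 30.0),
]

totals = {}
for category, amount in transactions:
    totals[category] = totals.get(category, 0.0) + amount

grand_total = sum(totals.values())
for category, amount in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {amount / grand_total:.0%} of spending")
```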

4. Predictive analysis. If you run the same analysis on five years of credit card history and the pattern is reasonably consistent, you can predict with high probability that next year will look much like the past few. The detail to note here is that this is not “predicting the future”; it is estimating the “probability” of what might happen. In big data predictive analytics, data scientists may use advanced techniques such as machine learning and sophisticated statistical methods (both described below) to forecast things like weather or economic changes.
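A minimal sketch of the idea, assuming five hypothetical yearly totals (real predictive analytics uses far richer models than a straight-line fit):

```python
# Predictive analysis sketch: fit a straight line to five years of
# (hypothetical) yearly spending and extrapolate one year ahead.
years = [2019, 2020, 2021, 2022, 2023]
spend = [8200.0, 8450.0, 8600.0, 8900.0, 9100.0]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(spend) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, spend))
         / sum((x - mean_x) ** 2 for x in years))
intercept = mean_y - slope * mean_x

print(f"Projected 2024 spending: {slope * 2024 + intercept:.0f}")
```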

5. Prescriptive analysis. Sticking with the credit card example, you might want to figure out which categories (food, clothing, entertainment, and so on) have the biggest impact on your overall spending. Prescriptive analysis builds on predictive analysis by including possible “actions” (such as reducing spending on food, clothing, or entertainment) and analyzing their outcomes to “prescribe” the best category to cut in order to lower overall spending. Extend that to big data, and you can imagine how executives make data-driven decisions by looking at the projected impact of various actions.
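A toy sketch of prescribing an action, with entirely hypothetical numbers (the `stickiness` factors stand in for effects a real model would learn from historical data):

```python
# Prescriptive analysis sketch: simulate candidate actions and
# "prescribe" the one with the biggest predicted saving.
baseline = {"food": 3000.0, "clothing": 2500.0, "entertainment": 1500.0}

# Hypothetical: how much of a 10% cut in each category actually sticks
# (e.g., food cuts tend to creep back).
stickiness = {"food": 0.4, "clothing": 0.9, "entertainment": 0.7}

savings = {cat: baseline[cat] * 0.10 * stickiness[cat] for cat in baseline}
best = max(savings, key=savings.get)
print(f"Prescribed action: cut {best} (saves ~{savings[best]:.0f})")
```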

6. Batch processing. Although batch data processing has been around since the mainframe era, big data gives it new significance as a way to bring more processing power to bear on large data sets. Batch processing is an efficient way to handle large volumes of data as a set of transactions collected over a period of time. MaxCompute, introduced earlier, focuses on batch data processing.
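The core idea, in a minimal sketch (hypothetical transaction amounts, fixed-size batches instead of time windows for brevity):

```python
# Batch processing sketch: collect records, then process them together
# in fixed-size batches rather than one at a time.
def batches(records, batch_size=3):
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

day_of_transactions = [12.5, 3.0, 40.0, 7.25, 19.9, 5.0, 88.0]
for n, batch in enumerate(batches(day_of_transactions), start=1):
    print(f"batch {n}: {len(batch)} records, total {sum(batch):.2f}")
```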

7. Cassandra. Cassandra is a popular open source database management system maintained by the Apache Software Foundation. Apache is credited with much of today’s big data technology; Cassandra itself was designed to handle large amounts of data across distributed servers.

8. Cloud computing. Cloud computing has become so ubiquitous that it hardly needs covering, but I will for completeness’ sake. The essence of cloud computing is software and/or data hosted on remote servers and accessible from anywhere over the Internet.

9. Cluster computing. A fancy name for computing that uses the pooled resources of a “cluster” of multiple servers. Once you get deeper into the technology, you will also hear about nodes, cluster management, load balancing, and parallel processing.

10. Dark data. In my opinion, this is a term designed to scare senior managers out of their wits. Dark data is basically data that a business collects and processes but never uses for any meaningful purpose; hence “dark”, it may stay buried forever. It could be social-network feeds, call center logs, meeting notes, and so on. Estimates suggest that 60-90% of all corporate data may be “dark data,” but no one really knows.

11. Data lake. When I first heard the term, I honestly thought someone was playing an April Fools’ joke. But it really is a term! A data lake is a large repository of enterprise data kept in raw format. While we’re here, it’s worth mentioning the data warehouse again, because the two are similar in concept: both are repositories of enterprise-wide data, but a warehouse holds structured data that has been cleansed and integrated with other sources. Data warehouses are typically used for conventional data (though not exclusively). A data lake is supposed to give users easy access to enterprise data, provided they know what they are looking for, how to process it, and how to use it intelligently.

12. Data mining. Data mining refers to using sophisticated pattern recognition techniques to find meaningful patterns and extract insights from large amounts of data. It is closely related to the “analysis” we discussed earlier using your personal data. To extract meaningful patterns, data miners use statistics (yes, good old math), machine learning algorithms, and artificial intelligence.
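A toy flavor of pattern mining, assuming a handful of made-up shopping baskets (real miners use algorithms like Apriori over millions of records):

```python
# Data mining sketch: find item pairs that frequently occur together,
# a miniature version of market-basket pattern mining.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk"}, {"bread", "butter", "milk"},
    {"beer", "bread"}, {"milk", "butter"}, {"bread", "milk", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, "appeared together in", count, "baskets")
```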

13. Data scientist. Talk about a hot profession! Data scientists take raw data, process it, and come up with new insights. The skill set a data scientist needs sounds almost superhuman: analytics, statistics, computer science, creativity, storytelling, and an understanding of the business context. No wonder they command such high salaries.

14. Distributed file systems. Because big data is too large to store on a single system, a distributed file system provides a storage layer that spreads large amounts of data across multiple storage devices, helping reduce the cost and complexity of storing massive data.

15. ETL. ETL stands for extract, transform, and load. It refers to the end-to-end process of “extracting” raw data, “transforming” it through cleaning and grooming into “fit for use” data, and “loading” it into the appropriate repository for the system to use. Although ETL originated with data warehousing, the process also applies in other contexts, such as ingesting data from external sources into a big data system.
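A minimal ETL sketch, assuming hypothetical comma-separated rows and an in-memory list standing in for a warehouse table:

```python
# ETL sketch: extract raw rows, transform (clean/normalize), and load
# into a target store.
raw_rows = ["  Alice , 42 ", "BOB,17", "carol,  not_a_number "]

def extract():
    return raw_rows  # in real life: files, APIs, source databases

def transform(rows):
    clean = []
    for row in rows:
        name, age = (field.strip() for field in row.split(","))
        if age.isdigit():  # drop unusable records
            clean.append({"name": name.title(), "age": int(age)})
    return clean

def load(records, target):
    target.extend(records)  # in real life: a warehouse insert

warehouse_table = []
load(transform(extract()), warehouse_table)
print(warehouse_table)  # [{'name': 'Alice', 'age': 42}, {'name': 'Bob', 'age': 17}]
```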

16. Algorithm. How does “algorithm” relate to big data? Even though algorithm is a generic term for any recipe of computational steps, big data analysis has made the word far more mainstream in modern times.

17. In-memory computing. In general, any computation that avoids touching I/O will be faster than one that requires it. In-memory computing is a technique for moving the entire working data set into the cluster’s collective memory, avoiding writing intermediate computations to disk. Apache Spark is an in-memory computing system, which gives it a huge advantage over I/O-bound systems such as MaxCompute MapReduce.

18. IoT. The latest buzzword is the Internet of Things (IoT). IoT connects computing devices embedded in everyday objects (sensors, wearables, cars, refrigerators, and so on) via the Internet, letting them send and receive data. IoT generates enormous amounts of data, presenting even more opportunities for big data analytics.

19. Machine learning. Machine learning is the design of systems that continuously learn, adjust, and improve based on the data they are fed. Machines use predictive and statistical algorithms to learn “correct” patterns of behavior and insights, improving as more and more data flows into the system. Typical applications include fraud detection and personalized online recommendations.
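The “learns from data” idea in its smallest form, assuming a handful of made-up (x, y) pairs; real systems use far richer models, but the loop is the same:

```python
# Machine learning sketch: a one-parameter linear model "learns" by
# gradient descent, improving as it repeatedly sees the data.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # (x, y) pairs

w = 0.0    # model parameter to learn
lr = 0.01  # learning rate
for epoch in range(500):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # adjust toward lower error

print(f"learned w = {w:.2f} (true slope is about 2)")
```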

20. MapReduce. The concept of MapReduce can be a little confusing, but let me give it a try. MapReduce is a programming model best understood as two separate stages. First it divides the big data set into parts (technically called “tuples,” but I don’t want to get too technical here) that can be distributed to different computers in different locations (the cluster computing described above); that is essentially the “Map” part. The model then collects all the results and “reduces” them into a single report. MapReduce’s data processing model complements MaxCompute’s distributed file system.
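Here is the classic word count, a single-machine sketch that makes the map and reduce phases explicit (real MapReduce runs each phase across a cluster):

```python
# MapReduce sketch: word count with explicit map, shuffle, and reduce phases.
from collections import defaultdict

documents = ["big data is big", "data lakes hold data"]

# Map: each document emits (word, 1) tuples.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: collapse each group to a single result.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # {'big': 2, 'data': 3, ...}
```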

21. NoSQL. While it may sound like a protest against the Structured Query Language (SQL) of traditional relational database management systems (RDBMSs), NoSQL stands for NOT ONLY SQL, meaning “more than SQL.” NoSQL actually refers to database management systems designed to handle large volumes of unstructured or loosely structured data, unlike the tables of a relational database. NoSQL databases are generally well suited to big data systems thanks to their flexibility and the distributed architecture that large unstructured data sets require.
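A toy illustration of the schema-free idea, with a plain list standing in for a hypothetical document collection:

```python
# NoSQL sketch: a "document store" accepts records with differing
# shapes, unlike the fixed columns of a relational table.
collection = []  # stands in for a NoSQL collection

collection.append({"_id": 1, "user": "alice", "tags": ["big data", "spark"]})
collection.append({"_id": 2, "user": "bob", "location": "Berlin"})  # no tags

# Query without a schema: find documents that happen to have a "tags" field.
tagged = [doc for doc in collection if "tags" in doc]
print(tagged)
```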

22. R. Can anyone think of a worse name for a programming language? Yes, “R” is a language that excels at statistical computing, and it is one of the most popular languages in data science. Many would say that if you don’t know “R,” you’re not a data scientist.

23. Spark (Apache Spark). Apache Spark is a fast, in-memory data processing engine that can efficiently run streaming, machine learning, or SQL workloads that require fast, iterative access to data sets. Spark is usually much faster than the MapReduce we discussed earlier.
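A minimal PySpark sketch (assumes the pyspark package and a local Spark runtime are installed): the same word count as the MapReduce sketch above, but the data set is partitioned across the cluster and reduced in memory:

```python
# PySpark sketch: distributed, in-memory word count.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

lines = spark.sparkContext.parallelize(["big data is big", "data lakes hold data"])
counts = (lines.flatMap(lambda line: line.split())   # map phase
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))     # reduce phase

print(counts.collect())
spark.stop()
```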

24. Stream processing. Stream processing is designed to act on real-time, streaming data through “continuous” queries. Combined with streaming analytics (the ability to continuously run mathematical or statistical analysis inside the stream itself), stream processing solutions can handle very large volumes of data in real time.
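The “continuous query” idea in miniature, with a fixed list standing in for a live feed (a real stream is unbounded and arrives over the network):

```python
# Stream processing sketch: a continuous query that keeps a running
# average over a stream, emitting an updated result as each event arrives.
def running_average(stream):
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count  # the result is emitted continuously

sensor_readings = iter([21.0, 21.5, 22.0, 21.8])  # stands in for a live feed
for avg in running_average(sensor_readings):
    print(f"average so far: {avg:.2f}")
```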

25. Structured and unstructured data. This is the “Variety” in the 5 Vs of big data. Structured data is basically anything that can be put into a relational database and organized in tables so that it relates to other data in the same schema. Unstructured data is everything that can’t go directly into a relational database, such as emails, social media posts, voice recordings, and so on.
