Fundamentals of Python data analysis and visualization

I. Introduction to data analysis:
1. Meaning of data analysis:
Data analysis is the process of applying appropriate statistical methods to large amounts of collected data in order to summarize, understand, and digest the data and make the fullest use of it. In other words, it is the detailed study and summarization of data in order to extract useful information and form conclusions.

The mathematical foundations of data analysis were established in the early twentieth century, but data analysis only became practical, and then widespread, with the advent of computers. It is a product of the combination of mathematics and computer science.

2. Common data analysis methods:

1) Tabular method:

Recording data in a table according to a definite pattern is the most common method of recording and processing data. The table should be designed with clear correspondences and a simple, plain layout that helps reveal correlations between the relevant quantities. In addition, the header row should state the name, symbol, order of magnitude, and unit of each quantity. Where needed, columns other than the raw data, such as calculation columns and statistics columns, can also be included.
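As a minimal sketch of the tabular method, the table below is built with pandas (introduced later in this document). The measurements, column names, and units are hypothetical and chosen only for illustration: the header of each column carries the quantity's symbol and unit, one calculation column is derived from the raw data, and a statistics row holds the column means.

```python
import pandas as pd

# Hypothetical raw measurements: voltage across and current through a resistor.
# Each column header states the symbol and the unit of the quantity.
table = pd.DataFrame(
    {
        "U (V)": [1.0, 2.0, 3.0, 4.0],
        "I (mA)": [2.0, 4.1, 5.9, 8.0],
    }
)

# Calculation column: resistance in ohms, R = U / I (current converted from mA to A).
table["R (ohm)"] = table["U (V)"] / (table["I (mA)"] / 1000)

# Statistics row: the mean of every column, kept separate from the raw data.
summary = table.mean().rename("mean")

print(table.round(1))
print(summary.round(1))
```

Keeping the statistics in a separate object, rather than appending them to the raw rows, is one possible design choice; it prevents the mean from being mistaken for a measurement in later calculations.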

2) Graphic method:

A graph can be the most striking way to express how various physical quantities change in relation to one another. Some of the results an experiment requires can be read directly from the plotted line, and some complex functional relationships can be represented graphically after a suitable transformation. [3]
There are two main ways to generate charts and graphs: preparing them by hand and generating them automatically with a program. The latter is done with appropriate software, such as SPSS, Excel, or MATLAB: the survey data is entered into the program, the software processes it, and the final results can be displayed as charts or graphs. Charts and graphs reflect research results directly, which saves the designer a great deal of time and helps designers analyze and predict the products the market needs, paving the way for further design work. These forms of analysis are also used in product sales statistics, where they give a visual picture of recent sales and support timely analysis and prediction of future sales. Data analysis methods are therefore very widely used in industrial design and are extremely important.
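To illustrate how a transformation can turn a complex functional relationship into a simple graph, here is a rough sketch using NumPy and Matplotlib (both introduced later in this document). The data is synthetic and the constants A = 5.0 and k = 0.8 are assumptions made for the example: an exponential decay y = A·exp(-k·t) becomes a straight line when ln(y) is plotted against t, and the slope of that line gives -k.

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Synthetic data following y = A * exp(-k * t), with assumed A = 5.0 and k = 0.8.
t = np.linspace(0.0, 5.0, 20)
y = 5.0 * np.exp(-0.8 * t)

# Taking the logarithm linearizes the relationship: ln(y) = ln(A) - k * t.
slope, intercept = np.polyfit(t, np.log(y), 1)
k_est = -slope
a_est = np.exp(intercept)

# Left panel: the raw curve. Right panel: the transformed data and the fitted line.
fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(8, 3))
ax_raw.plot(t, y, "o-")
ax_raw.set(xlabel="t", ylabel="y", title="Raw data")
ax_log.plot(t, np.log(y), "o", label="ln(y)")
ax_log.plot(t, intercept + slope * t, label="fitted line")
ax_log.set(xlabel="t", ylabel="ln(y)", title="After log transform")
ax_log.legend()
fig.tight_layout()
fig.savefig("linearized.png")  # hypothetical output file name

print(f"estimated k = {k_est:.3f}, estimated A = {a_est:.3f}")
```

Because the synthetic data contains no noise, the fit recovers the assumed constants almost exactly; with real measurements the points would scatter around the line.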

II. Introduction to data visualization:

1. Meaning of data visualization:

Data visualization is the scientific and technical study of the visual representation of data. Here, a visual representation of data is defined as information abstracted in some schematic form, including the attributes and variables of the corresponding units of information.

It is an evolving concept whose boundaries keep expanding. It mainly refers to more advanced technical methods that use graphics, image processing, computer vision, and user interfaces to interpret data visually through representation, modeling, and the display of three-dimensional forms, surfaces, attributes, and animations. Compared with specific techniques such as solid modeling, data visualization covers a much broader range of technical approaches.

2. Software tools:

There are a variety of data visualization tools on the market, of which Excel can be considered a typical entry-level tool. For automating data visualization, programming in Python is recommended. Python has many libraries for data visualization; the more common ones are: Matplotlib (powerful, complex), Seaborn (based on Matplotlib, simple), pyecharts (based on ECharts, visually striking), plotnine (a Python port of R's ggplot2, grammar of graphics), PyQtGraph (interactive, high performance).

III. Common data analysis and visualization libraries:

1. NumPy:

NumPy (Numerical Python) is an open-source numerical computing extension for Python. It can store and process large matrices far more efficiently than Python's own nested lists (which can also be used to represent matrices), supports multidimensional arrays and matrices, and provides a large library of mathematical functions that operate on arrays.
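The contrast between nested lists and NumPy arrays can be sketched as follows. The 2×2 matrix is an arbitrary example; the point is that an operation which needs explicit loops over a nested list becomes a single array expression.

```python
import numpy as np

# A matrix stored as a plain nested list: every operation needs explicit loops.
nested = [[1, 2], [3, 4]]
doubled_list = [[2 * value for value in row] for row in nested]

# The same matrix as a NumPy array: operations apply to the whole array at once.
matrix = np.array(nested)
doubled = matrix * 2             # element-wise multiplication
product = matrix @ matrix        # matrix multiplication
col_means = matrix.mean(axis=0)  # mean of each column

print(doubled.tolist())    # [[2, 4], [6, 8]]
print(product.tolist())    # [[7, 10], [15, 22]]
print(col_means.tolist())  # [2.0, 3.0]
```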

2. pandas:

pandas is a NumPy-based tool created for data analysis tasks. It brings together a large number of libraries and several standard data models, and provides the tools needed to work with large datasets efficiently. It also offers a large number of functions and methods for handling data quickly and easily. As you will soon discover, it is one of the key factors that make Python a powerful and efficient environment for data analysis.
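A short sketch of a typical pandas workflow is given below. The sales records are invented for the example, and the column names are hypothetical: a derived column is computed for every row at once, and the rows are then grouped and summed.

```python
import pandas as pd

# Hypothetical sales records; the values are made up for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "units": [10, 7, 5, 8],
        "price": [2.0, 3.0, 2.0, 3.0],
    }
)

# Derived column computed for all rows in one vectorized expression.
sales["revenue"] = sales["units"] * sales["price"]

# Group the rows by region and total the revenue of each group.
by_region = sales.groupby("region")["revenue"].sum()

print(by_region.to_dict())  # {'north': 30.0, 'south': 45.0}
```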

3. Matplotlib:

Matplotlib is a 2D plotting library for Python that generates publication-quality graphics in a variety of hardcopy formats and cross-platform interactive environments.
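As a small illustration of the "variety of hardcopy formats", the sketch below draws one bar chart and renders it as PNG, PDF, and SVG. The categories and values are placeholders. The figure is written to in-memory buffers here so the example leaves no files behind; passing a file name such as "chart.png" to savefig works the same way.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Placeholder data for a simple bar chart.
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(["A", "B", "C"], [3, 7, 5])
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Example bar chart")
fig.tight_layout()

# Render the same figure in several output formats.
outputs = {}
for fmt in ("png", "pdf", "svg"):
    buffer = io.BytesIO()
    fig.savefig(buffer, format=fmt)
    outputs[fmt] = buffer.getvalue()

for fmt, data in outputs.items():
    print(fmt, len(data), "bytes")
```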

4. SciPy:

SciPy is a Python-based open-source library used primarily for scientific computing. It provides algorithms and mathematical tools for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, and statistics.
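Two of the problem types listed above, integration and optimization, can be sketched with scipy.integrate.quad and scipy.optimize.minimize_scalar. The integrand and the objective function are simple examples chosen because their exact answers are known.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize f(x) = (x - 2)^2 + 1, whose minimum is at x = 2.
def objective(x):
    return (x - 2.0) ** 2 + 1.0

result = optimize.minimize_scalar(objective)

print(f"integral = {area:.6f}")
print(f"minimum at x = {result.x:.4f}, f(x) = {result.fun:.4f}")
```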

5. Scikit-learn:

Scikit-learn (also known as sklearn) is a free software machine learning library for the Python programming language [1]. It features a variety of classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to work together with the Python numerical and scientific libraries NumPy and SciPy.
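The following is a minimal sketch of one of the algorithms named above, k-means clustering, applied to NumPy data. The two point clouds are synthetic, and their centers, spread, and random seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated groups of 2D points (assumed centers and spread).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
group_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
points = np.vstack([group_a, group_b])

# Fit k-means with two clusters and get a cluster label for every point.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)

print("cluster centers:")
print(model.cluster_centers_.round(2))
```

The label numbers themselves (0 or 1) are arbitrary; what matters is that points from the same group receive the same label.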

6. WordCloud:

The WordCloud library presents the keywords in a text as a word cloud, which makes the topic and key terms of the text easy to see at a glance. It not only makes generating a word cloud simple, but also provides a wealth of configuration options that let users customize the appearance and layout of the word cloud to their needs. As an open-source Python library, WordCloud has been used in a wide variety of applications, including data analytics, natural language processing, and social media analytics.

7. jieba:

jieba is a Chinese word segmentation component for Python with good performance. It works by using a Chinese dictionary to determine the probability of association between Chinese characters, and groups characters with a high probability of association into words to form the segmentation result. In addition to segmenting text, jieba lets users add custom words.

IV. Relevant references:
1. Baidu Encyclopedia of Data Analysis: /item/%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90/6577123?fr=ge_ala

2. Baidu Encyclopedia of Data Visualization: /item/%E6%95%B0%E6%8D%AE%E5%8F%AF%E8%A7%86%E5%8C%96/1252367

3. NumPy: /item/numpy/5678437?fr=ge_ala

4. pandas: /item/pandas?fromModule=lemma_search-box

5. Matplotlib: /item/Matplotlib?fromModule=lemma_search-box

6. SciPy: /p/686571574

7. Scikit-learn: /item/Scikit-learn/3006276?fr=ge_ala

8. WordCloud: /qq_23091073/article/details/137687456

9. jieba: /m0_63244368/article/details/126837925
