💡 Author: Han Xinzi @ShowMeAI
📘 Data Analysis Skill Improvement Series: https://www.showmeai.tech/tutorials/33
📘 Data Analysis in Practice Series: https://www.showmeai.tech/tutorials/40
📘 This article: https://www.showmeai.tech/article-detail/284
📢 Disclaimer: All rights reserved. To reprint, please contact the platform and the author, and indicate the source.
In practice, we often rely on business data analysis to formulate business strategies. This process requires frequent data analysis and mining to discover patterns in the data. For algorithm engineers, putting an effective AI system into production takes more than a model: data is the underlying driver.
A typical "machine learning workflow" contains 6 key steps, of which "Exploratory Data Analysis (EDA)" is a crucial one.
- Define the problem
- Data acquisition and ETL
- Exploratory data analysis
- Data preparation
- Modeling (model training and selection)
- Deployment and monitoring
Wiki: In statistics, exploratory data analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling and thereby contrasts traditional hypothesis testing.
Exploratory data analysis usually relies on data visualization methods such as statistical graphics to explore the structure and patterns of the data and to summarize its main characteristics. The process typically involves many fine-grained processing steps and analytical operations.
Common Tools for Exploratory Data Analysis (EDA)
Good tools can simplify this process considerably, and some can even generate an analysis report with one click. In this article, ShowMeAI summarizes the most popular EDA tool libraries as of 2022. Let's try them together!
Generally, we have the following 3 ways to do EDA:
- Approach 1: Manual analysis using library/framework in Python/R
- Approach 2: Using an automated EDA library in Python/R
- Approach 3: Using a tool like Microsoft Power BI or Tableau
We have sorted out the best tools for each of the three approaches below. Readers who are interested in automated data analysis can jump directly to the "Automated EDA Tool Library" section.
Method 1: Manual analysis tool library
💡 Matplotlib
Matplotlib is a Python library for plotting and interactive visualization. Many other Python visualization tools build on it or can use it as a backend, including Seaborn, HoloViews, ggplot, and some of the automated EDA tools mentioned later.
With Matplotlib, a few lines of code are enough to draw scatter plots, histograms, bar charts, error plots, and boxplots, which help us understand the data before moving on to later steps.
You can learn from the official 📘 User Guide, 📘 Tutorials, and 📘 Code Samples, or watch the 📘 video tutorials on Bilibili. We also recommend downloading and bookmarking ShowMeAI's 📘 Matplotlib cheat sheet so that you can quickly look up the functions you need.
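As a rough illustration of the chart types mentioned above, the following minimal sketch draws a histogram, a scatter plot, and boxplots side by side. The data is synthetic, and the variable names and the output file name are assumptions made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=50, scale=10, size=200)
y = 2 * x + rng.normal(scale=15, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: distribution of a single variable
axes[0].hist(x, bins=20)
axes[0].set_title("Histogram of x")

# Scatter plot: relationship between two variables
axes[1].scatter(x, y, s=10)
axes[1].set_title("x vs. y")

# Boxplots: spread and potential outliers of each variable
axes[2].boxplot([x, y])
axes[2].set_xticks([1, 2])
axes[2].set_xticklabels(["x", "y"])
axes[2].set_title("Boxplots")

fig.tight_layout()
fig.savefig("matplotlib_eda.png")  # hypothetical output file name
```

In a notebook, the `Agg` line can be dropped and `plt.show()` used instead of `savefig`.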
💡 Seaborn
Seaborn is another popular Python data visualization framework. It offers a more concise interface than Matplotlib and adds many statistical analysis functions and presentation styles.
You can learn it through Seaborn's 📘 User Guide and Tutorials, or watch the 📘 video tutorials. You are also welcome to read the 📘 Seaborn Cheat Sheet compiled by ShowMeAI, as well as the Seaborn visualization tutorial "Seaborn Tools and Data Visualization".
💡 Plotly
Plotly is another open-source Python library for creating interactive data visualizations. Built on top of the Plotly JavaScript library (plotly.js), it can be used to create web-based visualizations that can be displayed in Jupyter notebooks, embedded in web applications using Dash, or saved as standalone HTML files.
It provides more than 40 chart types, including scatter, histogram, line, bar, pie, error bars, boxplot, multi-axis, sparkline, treemap, and 3-D charts, as well as contour plots, which some other data visualization libraries lack. You can learn and use it through the 📘 Official User Guide.
💡 Bokeh
Bokeh is a Python library for creating interactive visualizations for modern web browsers. It can build beautiful graphs, from simple plots to complex dashboards with streaming datasets. With Bokeh, you can create JavaScript-based visualizations without having to write any JavaScript yourself.
You can learn how to use it through Bokeh's 📘 official website and 📘 sample library. We also recommend downloading and bookmarking ShowMeAI's 📘 Bokeh Cheat Sheet so that you can quickly look up the functions you need.
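As a small illustration of producing a browser-based chart without writing JavaScript, the following sketch draws a line with markers and writes it to an HTML page. The data, tool selection, and file name are assumptions chosen for this example.

```python
import numpy as np
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Synthetic data, for illustration only
x = np.linspace(0, 10, 100)
y = np.sin(x)

# The tools string enables pan/zoom/hover interactions in the browser
p = figure(title="sin(x)", x_axis_label="x", y_axis_label="y",
           tools="pan,wheel_zoom,box_zoom,reset,hover")
p.line(x, y, line_width=2)
p.scatter(x, y, size=4)

# Render the plot to a complete HTML document (BokehJS is loaded from the CDN)
html = file_html(p, CDN, "Bokeh example")
with open("bokeh_example.html", "w", encoding="utf-8") as f:  # hypothetical file name
    f.write(html)
```

Opening the generated file in a browser should show the interactive plot. For quick experiments, `bokeh.plotting.show(p)` is a common alternative.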
💡 Altair
Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite. Altair's API is simple and friendly, producing beautiful and effective visualizations with minimal code. You can learn how to use Altair through the official 📘 Altair Notebook Examples.
Method 2: Automated EDA Tool Library
💡 pandas-profiling
Many people who have done data analysis in Python are familiar with the describe function of pandas. pandas-profiling extends this functionality through a low-code interface and presents the results as a report. The library automatically generates profiling reports from pandas DataFrames, and the whole process takes only two or three lines of code.
pandas-profiling analyzes both individual columns and the relationships between columns. For each column (field) of the dataset, it computes the following and renders the results in an interactive HTML report:
- Type inference: detects the type of each column
- Essentials: type, unique values, missing values
- Quantile statistics: min, Q1, median, Q3, max, range, interquartile range
- Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness, etc.
- Histograms: categorical and numerical
- Correlations: Spearman, Pearson, and Kendall matrices
- Missing values: matrix, count, heatmap, and dendrogram of missing values
- Text analysis: categories (uppercase, space), scripts (Latin, Cyrillic), and blocks (ASCII) of text data
- File and image analysis: extracts file sizes, creation dates, and dimensions, and scans for truncated images or images containing EXIF information
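For comparison, several of the statistics listed above can be computed by hand with plain pandas. The sketch below is not how pandas-profiling works internally. It only shows the manual equivalent on a tiny made-up DataFrame whose column names are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataset with a few missing values, for illustration only
df = pd.DataFrame({
    "age": [23, 35, 31, np.nan, 52, 46, 29, 38],
    "income": [3200, 5400, 4100, 3900, 8800, np.nan, 3600, 6100],
    "city": ["A", "B", "A", "C", "B", "B", None, "A"],
})
numeric = df[["age", "income"]]

types = df.dtypes                    # type inference
unique_counts = df.nunique()         # unique values per column (missing values excluded)
missing_counts = df.isna().sum()     # missing values per column

quantiles = numeric.quantile([0.25, 0.5, 0.75])   # Q1, median, Q3
iqr = quantiles.loc[0.75] - quantiles.loc[0.25]   # interquartile range

descriptive = numeric.agg(["mean", "std", "skew", "kurt"])  # descriptive statistics
spearman = numeric.corr(method="spearman")                  # rank correlation matrix

print(types, unique_counts, missing_counts, sep="\n\n")
print(quantiles, iqr, descriptive, spearman, sep="\n\n")
```

Writing these steps out for every column quickly becomes repetitive, which is the work an automated report saves.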
Detailed usage instructions are available on the pandas-profiling 📘 GitHub page. A basic analysis report can be generated with the following one-line command (run on the command line).
pandas_profiling --title "Example Profiling Report" --config_file default.yaml data.csv report.html
Or in Python with the following lines of code:
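# Imports (pandas-profiling must be installed)
from pathlib import Path
import pandas as pd
from pandas_profiling import ProfileReport

# Read the data (file_name is a placeholder for the path to your CSV file)
df = pd.read_csv(file_name)
# Analyze the data
profile = ProfileReport(df, title="Data Report", explorative=True)
# Generate the HTML report
profile.to_file(Path("data_report.html"))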
💡 Sweetviz
Sweetviz is functionally very similar to pandas-profiling. It is an open-source Python library that generates attractive, information-dense reports and kick-starts exploratory data analysis with just two lines of code. The output is a fully self-contained, interactive HTML report.
Features of Sweetviz:
- Type inference
- Summary information
- Target field analysis
- Correlation analysis of the target column with other features
- Visualization and comparison
The official code for Sweetviz can be found on 📘 GitHub. Analysis and report generation require only the following two lines of code:
import sweetviz as sv

# Analyze the data (data is a pandas DataFrame loaded beforehand)
my_report = sv.analyze(data)
# Generate the report
my_report.show_html()
The image below is a report generated using Sweetviz.
💡 AutoViz
AutoViz is another automated EDA framework, functionally similar to Sweetviz and pandas-profiling. AutoViz can automatically visualize any dataset with a single line of code. It can also select fields automatically, finding the most important features to analyze and visualize, and it runs quickly.
AutoViz can be combined with Bokeh for interactive data exploration and analysis. Detailed tutorials are available in the official 📘 AutoViz Sample Notebook. The core code is as follows:
from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
_ = AV.AutoViz(filename)  # filename: path to the data file to visualize
The image below shows a report generated using AutoViz.
Method 3: Data analysis tool software
💡 Microsoft Power BI
Power BI is an interactive data visualization product developed by Microsoft with a primary focus on business intelligence. It is part of the Microsoft Power Platform. Power BI is a collection of software services, applications, and connectors that work together to turn disparate data sources into coherent, visually immersive, and interactive insights. Data can be loaded directly from databases, web pages, and spreadsheets, or from structured files such as CSV, XML, and JSON.
However, Power BI is not open source: it is a paid enterprise tool with a free desktop version. You can learn Power BI from the 📘 Official Study Guide.
💡 Tableau
Tableau is a leading data visualization tool for data analysis and business intelligence. Gartner's Magic Quadrant ranks Tableau as a leader in analytics and business intelligence. Tableau is changing the way we use data to solve problems, enabling people and organizations to get the most out of their data.
The image below shows a report generated using Tableau. You can watch the 📘 one-hour quick-start video tutorial on Bilibili.
References
- 📘 Matplotlib official tutorial: https://matplotlib.org/stable/tutorials/index.html
- 📘 Matplotlib Knowledge Cheat Sheet: https://www.showmeai.tech/article-detail/103
- 📘 Seaborn Data Visualization Tutorial: https://www.showmeai.tech/article-detail/151
- 📘 Seaborn Knowledge Cheat Sheet: https://www.showmeai.tech/article-detail/105
- 📘 Plotly official tutorial: https://plotly.com/python/getting-started/
- 📘 Bokeh official tutorial: http://docs.bokeh.org/en/latest/
- 📘 Bokeh Knowledge Cheat Sheet: https://www.showmeai.tech/article-detail/104
- 📘 Altair Notebook Examples: https://github.com/altair-viz/altair_notebooks
- 📘 pandas-profiling detailed tutorial: https://github.com/ydataai/pandas-profiling
- 📘 Sweetviz official code: https://github.com/fbdesignpro/sweetviz
- 📘 AutoViz Examples: https://github.com/AutoViML/AutoViz/tree/master/Examples
- 📘 Power BI Official Learning Guide: https://powerbi.microsoft.com/en-ca/learning/