Recommended: The Four Steps of Data Preprocessing


  Source: LearningYard学苑

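  Data preprocessing methods

  Data is easily corrupted by noise, missing values, and inconsistencies. Data sets are often very large and mostly come from multiple heterogeneous sources. Feeding low-quality data directly into analysis produces low-quality mining results. Data preprocessing techniques can substantially improve the overall quality of the mined patterns and reduce the time the actual mining takes.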

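  Steps

  Data preprocessing generally goes through the following steps:

  1. Data cleaning

  Correct inconsistent and noisy data: fill in missing values, smooth out noise, and identify outliers.

  2. Data transformation

  Normalize the data into a smaller interval, which improves the accuracy and efficiency of mining algorithms that judge similarity with a distance measure.

  3. Data reduction

  Shrink the data set to be analyzed, for example by aggregation or sampling, or by deleting redundant attributes (features).

  For data sets drawn from heterogeneous sources, these steps amount to data cleaning, normalization and discretization (data transformation), and data reduction.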
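  The normalization mentioned under data transformation can be sketched as a min-max rescaling. This is only one possible normalization, chosen here for illustration; the function name `min_max_normalize`, the target interval [0, 1], and the sample ages are assumptions, not part of the original article.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread to rescale: map every value to new_min.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]


# Hypothetical attribute values, for illustration only.
ages = [20, 30, 40, 60]
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]
```

  After rescaling, attributes measured on very different scales contribute comparably to a distance computation, which is the reason the step helps distance-based algorithms.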

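  For missing data, the following methods are commonly used:

  1. Ignore the tuple

  Discard the data object. This is only appropriate when its class label is missing or when too many of its other attributes are empty.

  2. Fill in the missing values manually

  Not feasible when the data set is very large.

  3. Fill in with a global constant

  Simple, but if handled carelessly it can make the results of the analysis ambiguous.

  4. Fill in with the most probable value

  The value is usually inferred with inference-based tools such as regression or Bayesian methods.

  5. Fill in with the mean or median of all samples in the same class as the given tuple

  When the data distribution is skewed, the median is the better choice.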
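  Method 5 can be sketched as follows, using the median because the sample values are skewed. The function name `fill_with_class_median`, the dictionary layout, and the records themselves are hypothetical and serve only to illustrate the idea.

```python
from statistics import median


def fill_with_class_median(rows, attr, label):
    """Return copies of rows in which a missing attr (None) is replaced by the
    median of attr over the rows that share the same class label."""
    observed = {}
    for row in rows:
        if row[attr] is not None:
            observed.setdefault(row[label], []).append(row[attr])
    medians = {cls: median(vals) for cls, vals in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        if new_row[attr] is None and new_row[label] in medians:
            new_row[attr] = medians[new_row[label]]
        filled.append(new_row)
    return filled


# Hypothetical records: class "a" is skewed by one very large income.
records = [
    {"income": 30, "cls": "a"},
    {"income": 35, "cls": "a"},
    {"income": 400, "cls": "a"},
    {"income": None, "cls": "a"},
    {"income": 50, "cls": "b"},
    {"income": None, "cls": "b"},
]
filled = fill_with_class_median(records, "income", "cls")
print(filled[3]["income"], filled[5]["income"])  # 35 50
```

  With these records the mean of class "a" would be 155, far from most of its members, while the median stays at 35.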

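  As mentioned in the previous section, a box plot can be used to analyze the data and find the outliers that need handling. The two usual techniques for handling them are binning and regression.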
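  A box plot's outlier check can be sketched with the common 1.5 × IQR whisker convention. The article does not say which rule it uses, so the rule, the factor `k=1.5`, and the function name `find_outliers` are assumptions; the sample values are the nine numbers from this article's worked example.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [8, 24, 15, 41, 6, 10, 18, 67, 25]
print(find_outliers(data))  # [67]
```

  For this data the quartiles are 10 and 25, so the upper fence is 47.5 and only 67 is flagged.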

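  Binning: smooth a stored data value by consulting its "neighbors" (the values around it). The aim is to remove noise, or to discretize continuous data onto a coarser granularity (a discretization technique).

  Binning comes in two forms, equal-depth and equal-width, which differ in how they place the bin boundaries.

  Equal-depth binning: also called equal-frequency binning. Every bin holds the same number of values.

  Equal-width binning: every bin covers an interval of the same width, so the width of each bin's value range is a constant.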
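  The two partitioning schemes can be compared with a minimal sketch. The function names are hypothetical, and two details are choices made here rather than taken from the article: leftover values go to the first bins when the count does not divide evenly, and the maximum is clamped into the last equal-width bin.

```python
def equal_depth_bins(values, n_bins):
    """Sort the values and cut them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def equal_width_bins(values, n_bins):
    """Cut [min, max] into n_bins intervals of equal width and bucket the values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum sits on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        bins[idx].append(v)
    return bins


data = [8, 24, 15, 41, 6, 10, 18, 67, 25]
print(equal_depth_bins(data, 3))  # [[6, 8, 10], [15, 18, 24], [25, 41, 67]]
print(equal_width_bins(data, 3))  # [[6, 8, 10, 15, 18, 24, 25], [41], [67]]
```

  On skewed data such as this, equal-width binning crowds most values into one bin, while equal-depth binning keeps the bins balanced.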

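  To make the data easier to work with, the values in each bin are then smoothed in one of three ways:

  1. Smoothing by bin medians: every value in a bin is replaced by the median of that bin.

  2. Smoothing by bin means: every value in a bin is replaced by the mean of that bin.

  3. Smoothing by bin boundaries: the minimum and maximum of a bin are taken as its boundaries, and every value in the bin is replaced by the boundary closest to it.

  Denoising by regression: another way to handle noisy data is to fit a regression function to the data and use the fitted function to smooth the noisy points.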

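  Example: take the nine numbers 8, 24, 15, 41, 6, 10, 18, 67, 25. First sort them in ascending order: 6, 8, 10, 15, 18, 24, 25, 41, 67. Then split them into 3 equal-depth bins: bin 1 holds 6, 8, 10; bin 2 holds 15, 18, 24; bin 3 holds 25, 41, 67.

  Smoothing by bin means:

  Bin 1: 8, 8, 8. The mean of 6, 8 and 10 is 8.

  Smoothing by bin medians:

  Bin 2: 18, 18, 18. Every value in the bin is replaced by the bin median, 18.

  Smoothing by bin boundaries:

  Bin 3: 25, 25, 67. The minimum and maximum of the bin are taken as its boundaries, and every value is replaced by the closer boundary. 41 is closer to 25 than to 67, so 41 becomes 25.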
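  The three smoothing rules in this example can be checked with a short sketch. The function names are hypothetical, and sending a value that is equally far from both boundaries to the lower one is an arbitrary choice made here; the article does not cover ties.

```python
from statistics import mean, median


def smooth_by_mean(bin_values):
    """Replace every value in the bin by the bin mean."""
    m = mean(bin_values)
    return [m for _ in bin_values]


def smooth_by_median(bin_values):
    """Replace every value in the bin by the bin median."""
    m = median(bin_values)
    return [m for _ in bin_values]


def smooth_by_boundaries(bin_values):
    """Replace every value by the nearer of the bin's minimum and maximum."""
    lo, hi = min(bin_values), max(bin_values)
    # Ties go to the lower boundary (an assumption of this sketch).
    return [lo if v - lo <= hi - v else hi for v in bin_values]


# The three equal-depth bins from the example above.
bins = [[6, 8, 10], [15, 18, 24], [25, 41, 67]]
print(smooth_by_mean(bins[0]))
print(smooth_by_median(bins[1]))
print(smooth_by_boundaries(bins[2]))  # [25, 25, 67]
```

  Applied to bins 1, 2 and 3 respectively, the functions reproduce the smoothed values 8, 8, 8; 18, 18, 18; and 25, 25, 67.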

  The regression method is covered in the next section.


  
