結(jié)構(gòu)化數(shù)據(jù)清洗技術(shù)綜述
2. 清華大學(xué) 計(jì)算機(jī)科學(xué)與技術(shù)系, 數(shù)據(jù)庫(kù)組, 北京 100084Survey of structured data cleaning methodsHAO Shuang1,2, LI Guoliang2, FENG Jianhua2, WANG Ning11. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China;
2. Database Group, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China 摘要圖/表參考文獻(xiàn) 相關(guān)文章Metrics 摘要 數(shù)據(jù)清洗是對(duì)臟數(shù)據(jù)進(jìn)行檢測(cè)和糾正的過(guò)程,是進(jìn)行數(shù)據(jù)分析和管理的基礎(chǔ)。該文對(duì)經(jīng)典和新興的數(shù)據(jù)清洗技術(shù)進(jìn)行分類和總結(jié),為進(jìn)一步的研究工作提供方向。形式化定義了數(shù)據(jù)清洗問(wèn)題,對(duì)數(shù)據(jù)缺失、數(shù)據(jù)冗余、數(shù)據(jù)沖突和數(shù)據(jù)錯(cuò)誤這4種數(shù)據(jù)噪聲的檢測(cè)技術(shù)進(jìn)行詳細(xì)闡述。按照數(shù)據(jù)清洗方式對(duì)數(shù)據(jù)噪聲的消除技術(shù)進(jìn)行分類概述,包括基于完整性約束的數(shù)據(jù)清洗算法、基于規(guī)則的數(shù)據(jù)清洗算法、基于統(tǒng)計(jì)的數(shù)據(jù)清洗算法和人機(jī)結(jié)合的數(shù)據(jù)清洗算法。介紹了常用的測(cè)評(píng)數(shù)據(jù)集和噪聲注入工具,并對(duì)未來(lái)重點(diǎn)的研究方向進(jìn)行了探討和展望。服務(wù)把本文推薦給朋友加入引用管理器 E-mail AlertRSS收藏文章作者相關(guān)文章郝爽李國(guó)良馮建華王寧關(guān)鍵詞 :數(shù)據(jù)清洗, 數(shù)據(jù)噪聲, 噪聲檢測(cè), 噪聲消除 Abstract:Data cleaning is the process of detecting and repairing dirty data which is often needed in data analysis and management. This paper classifies and summarizes the traditional and advanced data cleaning techniques and identifies potential directions for further work. This study first formally defines the cleaning problem for structured data and then describes error detection methods for missing data, redundant data, conflicting data and erroneous data. The data cleaning methods are then summarized based on their error elimination method, including constraint-based data cleaning, rule-based data cleaning, statistical data cleaning and human-in-the-loop data cleaning. Some important datasets and noise injection tools are introduced as well. Open research problems and future research directions are also discussed.Key words:data cleaning dirty data error detection error elimination收稿日期: 2018-07-31 出版日期: 2018-12-13基金資助:國(guó)家重點(diǎn)研發(fā)計(jì)劃項(xiàng)目(2018YFC0809800);國(guó)家自然科學(xué)基金項(xiàng)目(61373024,61632016,61422205,61521002)通訊作者:李國(guó)良,副教授,E-mail:liguoliang@tsinghua.edu.cn E-mail: liguoliang@tsinghua.edu.cn引用本文: 郝爽, 李國(guó)良, 馮建華, 王寧. 結(jié)構(gòu)化數(shù)據(jù)清洗技術(shù)綜述[J]. 清華大學(xué)學(xué)報(bào)(自然科學(xué)版), 2018, 58(12): 1037-1050.
HAO Shuang, LI Guoliang, FENG Jianhua, WANG Ning. Survey of structured data cleaning methods. Journal of Tsinghua University(Science and Technology), 2018, 58(12): 1037-1050.鏈接本文: http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.22.053 或 http://jst.tsinghuajournals.com/CN/Y2018/V58/I12/1037
[1] SHALLCROSS S. 2 Reasons why your data is lying to you[R/OL]. (2016-09-15)[2018-06-25]. http://hollistibbetts.sys-con.com/node/1975126.
[2] BUCKLEY J. Causes of dirty data and how to combat them[R/OL]. (2018-07-13)[2018-06-25]. https://www.qubole.com/blog/causes-of-dirty-data-and-how-to-combat-them.
[3] GALHARDAS H, FLORESCU D, SHASHA D, et al. An extensible framework for data cleaning[C]//IEEE 16th International Conference on Data Engineering. San Diego, USA, 2000.
[4] REKATSINAS T, CHU X, ILYAS I F, et al. HoloClean:Holistic data repairs with probabilistic inference[J]. Proceedings of the VLDB Endowment, 2017, 10(11):1190-1201.
[5] MAYFIELD C, NEVILLE J, PRABHAKAR S. ERACER:A database approach for statistical inference and data cleaning[C]//2010 ACM SIGMOD International Conference on Management of Data. Indianapolis, USA, 2010:75-86.
[6] YAKOUT M, BERTI-éQUILLE L, ELMAGARMID A K. Don't be SCAREd:Use scalable automatic repairing with maximal likelihood and bounded changes[C]//2013 ACM SIGMOD International Conference on Management of Data. New York, USA, 2013:553-564.
[7] WANG J N, KRASKA T, FRANKLIN M J, et al. CrowdER:Crowdsourcing entity resolution[J]. Proceedings of the VLDB Endowment, 2012, 5(11):1483-1494.
[8] CHAI C, LI G L, LI J, et al. Cost-effective crowdsourced entity resolution:A partial-order approach[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:969-984.
[9] 郭志懋, 周傲英. 數(shù)據(jù)質(zhì)量和數(shù)據(jù)清洗研究綜述[J]. 軟件學(xué)報(bào), 2002, 13(11):2076-2082. GUO Z M, ZHOU A Y. Research on data quality and data cleaning:A survey[J]. Journal of Software, 2002, 13(11):2076-2082. (in Chinese)
[10] 楊輔祥, 劉云超, 段智華. 數(shù)據(jù)清理綜述[J]. 計(jì)算機(jī)應(yīng)用研究, 2002, 19(3):3-5. YANG F X, LIU Y C, DUAN Z H. An overview of data cleaning[J]. Application Research of Computers, 2002, 19(3):3-5. (in Chinese)
[11] 王曰芬, 章成志, 張蓓蓓, 等. 數(shù)據(jù)清洗研究綜述[J]. 現(xiàn)代圖書(shū)情報(bào)技術(shù), 2007, 2(12):50-56. WANG Y F, ZHANG C Z, ZHANG B B, et al. A survey of data cleaning[J]. New Technology of Library and Information Service, 2007, 2(12):50-56. (in Chinese)
[12] 趙一凡, 卞良, 叢昕. 數(shù)據(jù)清洗方法研究綜述[J]. 軟件導(dǎo)刊, 2017(12):222-224. ZHAO Y F, BIAN L, CONG X. Survey of data cleaning method in data preprocessing[J]. Software Guide, 2017(12):222-224. (in Chinese)
[13] ELMAGARMID A K, IPEIROTIS P G, VERYKIOS V S. Duplicate record detection:A survey[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1):1-16.
[14] CHU X, ILYAS I F, KRISHNAN S, et al. Data cleaning:Overview and emerging challenges[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:2201-2206.
[15] ILYAS I F, CHU X. Trends in cleaning relational data:Consistency and deduplication[J]. Foundations and Trends in Databases, 2015, 5(4):281-393.
[16] FAN W F, GEERTS F, JIA X B, et al. Conditional functional dependencies for capturing data inconsistencies[J]. ACM Transactions on Database Systems (TODS), 2008, 33(2):6.
[17] BOHANNON P, FAN W F, GEERTS F, et al. Conditional functional dependencies for data cleaning[C]//IEEE 23rd International Conference on Data Engineering. Istanbul, Turkey, 2007:746-755.
[18] ABITEBOUL S, HULL R, VIANU V. Foundations of databases:The logical level[M]. Boston, USA:Addison-Wesley Longman Publishing, 1995.
[19] KOUDAS N, SAHA A, SRIVASTAVA D, et al. Metric functional dependencies[C]//IEEE 25th International Conference on Data Engineering. Shanghai, China, 2009:1275-1278.
[20] SONG S X, CHEN L. Differential dependencies:Reasoning and discovery[J]. ACM Transactions on Database Systems (TODS), 2011, 36(3):16.
[21] WANG J N, TANG N. Towards dependable data repairing with fixing rules[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:457-468.
[22] WANG J N, TANG N. Dependable data repairing with fixing rules[J]. Journal of Data and Information Quality (JDIQ), 2017, 8(3-4):16.
[23] FAN W F, LI J Z, MA S, et al. Towards certain fixes with editing rules and master data[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2):173-184.
[24] FAN W F, LI J Z, MA S, et al. Towards certain fixes with editing rules and master data[J]. The VLDB Journal, 2012, 21(2):213-238.
[25] INTERLANDI M, TANG N. Proof positive and negative in data cleaning[C]//IEEE 31st International Conference on Data Engineering. Seoul, Republic of Korea, 2015:18-29.
[26] HAO S, TANG N, LI G L, et al. Cleaning relations using knowledge bases[C]//IEEE 33rd International Conference on Data Engineering. San Diego, USA, 2017:933-944.
[27] HAO S, TANG N, LI G L, et al. Distilling relations using knowledge bases[J]. The VLDB Journal, 2018, 27(4):497-519.
[28] HUA M, PEI J. DiMaC:A disguised missing data cleaning tool[C]//14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA, 2008:1077-1080.
[29] QAHTAN A A, ELMAGARMID A, CASTRO FERNANDEZ R, et al. FAHES:A robust disguised missing values detector[C]//24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. London, UK, 2018:2100-2109.
[30] K?PCKE H, RAHM E. Frameworks for entity matching:A comparison[J]. Data & Knowledge Engineering, 2010, 69(2):197-210.
[31] HE Q L, LI Z H, ZHANG X. Data deduplication techniques[C]//2010 IEEE International Conference on Future Information Technology and Management Engineering (FITME). Changzhou, China, 2010:430-433.
[32] BAXTER R, CHRISTEN P, CHURCHES T. A comparison of fast blocking methods for record linkage[C]//2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation. Washington, DC, USA, 2003, 3:25-27.
[33] TEJADA S, KNOBLOCK C A, MINTON S. Learning object identification rules for information integration[J]. Information Systems, 2001, 26(8):607-633.
[34] ELFEKY M G, VERYKIOS V S, ELMAGARMID A K. TAILOR:A record linkage toolbox[C]//18th International Conference on Data Engineering. San Jose, USA, 2002.
[35] TEJADA S, KNOBLOCK C A, MINTON S. Learning domain-independent string transformation weights for high accuracy object identification[C]//International Conference on Knowledge Discovery and Data Mining. Edmonton, Canada, 2002:350-359.
[36] CHRISTEN P. Febrl:A freely available record linkage system with a graphical user interface[C]//2nd Australasian Workshop on Health Data and Knowledge Management. Wollongong, Australia, 2008:17-25.
[37] MCCALLUM A, NIGAM K, UNGAR L H. Efficient clustering of high-dimensional data sets with application to reference matching[C]//6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, USA, 2000:169-178.
[38] BAYARDO R J, MA Y M, SRIKANT R. Scaling up all pairs similarity search[C]//16th International Conference on World Wide Web. Banff, Canada, 2007:131-140.
[39] CHAUDHURI S, GANTI V, KAUSHIK R. A primitive operator for similarity joins in data cleaning[C]//22nd International Conference on Data Engineering. Atlanta, USA, 2006.
[40] XIAO C, WANG W, LIN X M, et al. Efficient similarity joins for near-duplicate detection[J]. ACM Transactions on Database Systems (TODS), 2011, 36(3):15.
[41] ARASU A, Ré C, SUCIU D. Large-scale deduplication with constraints using Dedupalog[C]//2009 IEEE 25th International Conference on Data Engineering. Shanghai, China, 2009:952-963.
[42] HERNáNDEZ M A, STOLFO S J. The merge/purge problem for large databases[C]//1995 ACM SIGMOD International Conference on Management of Data. San Jose, USA, 1995:127-138.
[43] LEE M L, LING T W, LOW W L. IntelliClean:A knowledge-based intelligent data cleaner[C]//Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, USA, 2000:290-294.
[44] WANG J N, LI G L, YU J X, et al. Entity matching:How similar is similar[J]. Proceedings of the VLDB Endowment, 2011, 4(10):622-633.
[45] BILENKO M, MOONEY R J. Adaptive duplicate detection using learnable string similarity measures[C]//Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, DC, USA, 2003:39-48.
[46] PAPENBROCK T, EHRLICH J, MARTEN J, et al. Functional dependency discovery:An experimental evaluation of seven algorithms[J]. Proceedings of the VLDB Endowment, 2015, 8(10):1082-1093.
[47] LIU J X, LI J Y, LIU C F, et al. Discover dependencies from data:A review[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(2):251-264.
[48] HUHTALA Y, K?RKK?INEN J, PORKKA P, et al. TANE:An efficient algorithm for discovering functional and approximate dependencies[J]. The Computer Journal, 1999, 42(2):100-111.
[49] YAO H, HAMILTON H J. Mining functional dependencies from data[J]. Data Mining and Knowledge Discovery, 2008, 16(2):197-219.
[50] NOVELLI N, CICCHETTI R. FUN:An efficient algorithm for mining functional and embedded dependencies[C]//International Conference on Database Theory. London, UK, 2001:189-203.
[51] WYSS C, GIANNELLA C, ROBERTSON E. FastFDs:A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances extended abstract[C]//International Conference on Data Warehousing and Knowledge Discovery. Munich, Germany, 2001:101-110.
[52] LOPES S, PETIT J M, LAKHAL L. Efficient discovery of functional dependencies and Armstrong relations[C]//International Conference on Extending Database Technology. Konstanz, Germany, 2000:350-364.
[53] FAN W F, GEERTS F, LI J Z, et al. Discovering conditional functional dependencies[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(5):683-698.
[54] DE MARCHI F, LOPES S, PETIT J M. Efficient algorithms for mining inclusion dependencies[C]//8th International Conference on Extending Database Technology. Prague, Czech Republic, 2002:464-476.
[55] BOHANNON P, FAN W F, FLASTER M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]//2005 ACM SIGMOD International Conference on Management of Data. Baltimore, USA, 2005:143-154.
[56] CHIANG F, MILLER R J. A unified model for data and constraint repair[C]//2011 IEEE 27th International Conference on Data Engineering. Hannover, Germany, 2011:446-457.
[57] CHU X, ILYAS I F, PAPOTTI P. Holistic data cleaning:Putting violations into context[C]//2013 IEEE 29th International Conference on Data Engineering (ICDE). Brisbane, Australia, 2013:458-469.
[58] KOLAHI S, LAKSHMANAN L V S. On approximating optimum repairs for functional dependency violations[C]//12th International Conference on Database Theory. St. Petersburg, Russia, 2009:53-62.
[59] HAO S, TANG N, LI G L, et al. A novel cost-based model for data repairing[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(4):727-742.
[60] ARENAS M, BERTOSSI L, CHOMICKI J. Consistent query answers in inconsistent databases[C]//18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Philadelphia, USA, 1999:68-79.
[61] BRAVO L, BERTOSSI L. Logic programs for consistently querying data integration systems[C]//18th International Joint Conference on Artificial Intelligence. Acapulco, Mexico, 2003:10-15.
[62] CALí A, LEMBO D, ROSATI R. On the decidability and complexity of query answering over inconsistent and incomplete databases[C]//22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. San Diego, USA, 2003:260-271.
[63] CALI A, LEMBO D, ROSATI R. Query rewriting and answering under constraints in data integration systems[C]//18th International Joint Conference on Artificial Intelligence. Acapulco, Mexico, 2003:16-21.
[64] CHOMICKI J, MARCINKOWSKI J. Minimal-change integrity maintenance using tuple deletions[J]. Information and Computation, 2005, 197(1-2):90-121.
[65] GERTZ M, LIPECK U W. A diagnostic approach for repairing constraint violations in databases[M]//JAJODIA S, LIST W, MCGREGOR G, et al, eds. Integrity and internal control in information systems. ⅡCIS 1997. Boston, USA:Springer, 1997, 1:89-111
[66] GRECO G, GRECO S, ZUMPANO E. A logical framework for querying and repairing inconsistent databases[J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(6):1389-1408.
[67] YAKOUT M, ELMAGARMID A K, NEVILLE J, et al. Guided data repair[J]. Proceedings of the VLDB Endowment, 2011, 4(5):279-289.
[68] CHU X, MORCOS J, ILYAS I F, et al. KATARA:A data cleaning system powered by knowledge bases and crowdsourcing[C]//2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia, 2015:1247-1261.
[69] COHEN W, RAVIKUMAR P, FIENBERG S. A comparison of string metrics for matching names and records[C]//KDD Workshop on Data Cleaning and Object Consolidation. Washington, DC, USA, 2003:73-78.
[70] JIANG Y, LI G L, FENG J H, et al. String similarity joins:An experimental evaluation[J]. Proceedings of the VLDB Endowment, 2014, 7(8):625-636.
[71] EBAID A, ELMAGARMID A, ILYAS I F, et al. NADEEF:A generalized data cleaning system[J]. Proceedings of the VLDB Endowment, 2013, 6(12):1218-1221.
[72] CONG G, FAN W F, GEERTS F, et al. Improving data quality:Consistency and accuracy[C]//33rd International Conference on Very Large Data Bases. Vienna, Austria, 2007:315-326.
[73] GEERTS F, MECCA G, PAPOTTI P, et al. The LLUNATIC data-cleaning framework[J]. Proceedings of the VLDB Endowment, 2013, 6(9):625-636.
[74] BESKALES G, ILYAS I F, GOLAB L. Sampling the repairs of functional dependency violations under hard constraints[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2):197-207.
[75] KHAYYAT Z, ILYAS I F, JINDAL A, et al. BIGDANSING:A system for big data cleansing[C]//2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia, 2015:1215-1230.
[76] BESKALES G, ILYAS I F, GOLAB L, et al. On the relative trust between inconsistent data and inaccurate constraints[C]//2013 IEEE 29th International Conference on Data Engineering. Brisbane, Australia, 2013:541-552.
[77] VOLKOVS M, CHIANG F, SZLICHTA J, et al. Continuous data cleaning[C]//2014 IEEE 30th International Conference on Data Engineering. Chicago, USA, 2014:244-255.
[78] RISSANEN J. Modeling by shortest data description[J]. Automatica, 1978, 14(5):465-471.
[79] COVER T M, THOMAS J A. Elements of information theory[M]. New York, USA:John Wiley & Sons, 1991.
[80] HE J, VELTRI E, SANTORO D, et al. Interactive and deterministic data cleaning[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:893-907.
[81] AROCENA P C, GLAVIC B, MECCA G, et al. Messing up with BART:Error generation for evaluating data-cleaning algorithms[J]. Proceedings of the VLDB Endowment, 2015, 9(2):36-47.
[82] ANANTHAKRISHNA R, CHAUDHURI S, GANTI V. Eliminating fuzzy duplicates in data warehouses[C]//28th International Conference on Very Large Databases. Hong Kong, China, 2002:586-597.
[83] WANG J N, KRISHNAN S, FRANKLIN M J, et al. A sample-and-clean framework for fast and accurate query processing on dirty data[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:469-480.
[84] KRISHNAN S, WANG J N, FRANKLIN M J, et al. SampleClean:Fast and reliable analytics on dirty data[C/OL]//Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2015:59-75[2018-06-19]. https://www.ocf.berkeley.edu/~sanjayk/old/pubs/sc.pdf.
[85] LI G L, WANG J N, ZHENG Y D, et al. Crowdsourced data management:A survey[C]//2017 IEEE 33rd International Conference on Data Engineering. San Diego, USA, 2017:39-40.
[86] PARK H, WIDOM J. CrowdFill:Collecting structured data from the crowd[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:577-588.
[87] WHANG S E, LOFGREN P, GARCIA-MOLINA H. Question selection for crowd entity resolution[J]. Proceedings of the VLDB Endowment, 2013, 6(6):349-360.
[88] TONG Y X, CAO C C, ZHANG C J, et al. CrowdCleaner:Data cleaning for multi-version data on the web via crowdsourcing[C]//2014 IEEE 30th International Conference on Data Engineering. Chicago, USA, 2014:1182-1185.
[89] MICHALSKI R S. A theory and methodology of inductive learning[M]//MICHALSKI R S, CARBONELL J G, MITCHELL T M. Machine learning. Berlin, Germany:Springer, 1983:83-134.
Abstract
Cited
Shared Discussed相關(guān)知識(shí)
橋梁健康監(jiān)測(cè)數(shù)據(jù)分析研究綜述.pdf
醫(yī)療健康大數(shù)據(jù)研究綜述.pdf
健康數(shù)據(jù)分析技術(shù)的發(fā)展及應(yīng)用
FESE綜述
可穿戴設(shè)備與健康醫(yī)療數(shù)據(jù)采集技術(shù)
中國(guó)健康醫(yī)療大數(shù)據(jù)研究綜述——基于期刊論文的分析
醫(yī)療健康數(shù)據(jù)分析
如何通過(guò)算法和數(shù)據(jù)庫(kù)技術(shù)實(shí)現(xiàn)健康生活的數(shù)據(jù)分析和預(yù)測(cè)
健康數(shù)據(jù)分析與預(yù)測(cè)
構(gòu)筑健康醫(yī)療大數(shù)據(jù)共享平臺(tái)
網(wǎng)址: 結(jié)構(gòu)化數(shù)據(jù)清洗技術(shù)綜述 http://m.u1s5d6.cn/newsview340805.html
推薦資訊
- 1發(fā)朋友圈對(duì)老公徹底失望的心情 12775
- 2BMI體重指數(shù)計(jì)算公式是什么 11235
- 3補(bǔ)腎吃什么 補(bǔ)腎最佳食物推薦 11199
- 4性生活姿勢(shì)有哪些 盤(pán)點(diǎn)夫妻性 10425
- 5BMI正常值范圍一般是多少? 10137
- 6在線基礎(chǔ)代謝率(BMR)計(jì)算 9652
- 7一邊做飯一邊躁狂怎么辦 9138
- 8從出汗看健康 出汗透露你的健 9063
- 9早上怎么喝水最健康? 8613
- 10五大原因危害女性健康 如何保 7826