首頁資訊結構化數據清洗技術綜述

結構化數據清洗技術綜述

來源：泰然健康網時間：2024年12月07日 15:31

結構化數據清洗技術綜述1. 北京交通大學計算機與信息技術學院, 北京 100044;
2. 清華大學計算機科學與技術系, 數據庫組, 北京 100084Survey of structured data cleaning methodsHAO Shuang1,2, LI Guoliang2, FENG Jianhua2, WANG Ning11. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China;
2. Database Group, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China 摘要圖/表參考文獻相關文章Metrics 摘要數據清洗是對臟數據進行檢測和糾正的過程，是進行數據分析和管理的基礎。該文對經典和新興的數據清洗技術進行分類和總結，為進一步的研究工作提供方向。形式化定義了數據清洗問題，對數據缺失、數據冗余、數據沖突和數據錯誤這4種數據噪聲的檢測技術進行詳細闡述。按照數據清洗方式對數據噪聲的消除技術進行分類概述，包括基于完整性約束的數據清洗算法、基于規(guī)則的數據清洗算法、基于統(tǒng)計的數據清洗算法和人機結合的數據清洗算法。介紹了常用的測評數據集和噪聲注入工具，并對未來重點的研究方向進行了探討和展望。服務把本文推薦給朋友加入引用管理器 E-mail AlertRSS收藏文章作者相關文章郝爽李國良馮建華王寧關鍵詞：數據清洗, 數據噪聲, 噪聲檢測, 噪聲消除 Abstract：Data cleaning is the process of detecting and repairing dirty data which is often needed in data analysis and management. This paper classifies and summarizes the traditional and advanced data cleaning techniques and identifies potential directions for further work. This study first formally defines the cleaning problem for structured data and then describes error detection methods for missing data, redundant data, conflicting data and erroneous data. The data cleaning methods are then summarized based on their error elimination method, including constraint-based data cleaning, rule-based data cleaning, statistical data cleaning and human-in-the-loop data cleaning. Some important datasets and noise injection tools are introduced as well. Open research problems and future research directions are also discussed.Key words：data cleaning dirty data error detection error elimination收稿日期: 2018-07-31 出版日期: 2018-12-13基金資助:國家重點研發(fā)計劃項目（2018YFC0809800）；國家自然科學基金項目（61373024，61632016，61422205，61521002）通訊作者:李國良,副教授,E-mail:liguoliang@tsinghua.edu.cn E-mail: liguoliang@tsinghua.edu.cn引用本文: 郝爽, 李國良, 馮建華, 王寧. 結構化數據清洗技術綜述[J]. 清華大學學報（自然科學版）, 2018, 58(12): 1037-1050.
HAO Shuang, LI Guoliang, FENG Jianhua, WANG Ning. Survey of structured data cleaning methods. Journal of Tsinghua University(Science and Technology), 2018, 58(12): 1037-1050.鏈接本文: http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.22.053 或 http://jst.tsinghuajournals.com/CN/Y2018/V58/I12/1037

[1] SHALLCROSS S. 2 Reasons why your data is lying to you[R/OL]. (2016-09-15)[2018-06-25]. http://hollistibbetts.sys-con.com/node/1975126.
[2] BUCKLEY J. Causes of dirty data and how to combat them[R/OL]. (2018-07-13)[2018-06-25]. https://www.qubole.com/blog/causes-of-dirty-data-and-how-to-combat-them.
[3] GALHARDAS H, FLORESCU D, SHASHA D, et al. An extensible framework for data cleaning[C]//IEEE 16th International Conference on Data Engineering. San Diego, USA, 2000.
[4] REKATSINAS T, CHU X, ILYAS I F, et al. HoloClean:Holistic data repairs with probabilistic inference[J]. Proceedings of the VLDB Endowment, 2017, 10(11):1190-1201.
[5] MAYFIELD C, NEVILLE J, PRABHAKAR S. ERACER:A database approach for statistical inference and data cleaning[C]//2010 ACM SIGMOD International Conference on Management of Data. Indianapolis, USA, 2010:75-86.
[6] YAKOUT M, BERTI-éQUILLE L, ELMAGARMID A K. Don't be SCAREd:Use scalable automatic repairing with maximal likelihood and bounded changes[C]//2013 ACM SIGMOD International Conference on Management of Data. New York, USA, 2013:553-564.
[7] WANG J N, KRASKA T, FRANKLIN M J, et al. CrowdER:Crowdsourcing entity resolution[J]. Proceedings of the VLDB Endowment, 2012, 5(11):1483-1494.
[8] CHAI C, LI G L, LI J, et al. Cost-effective crowdsourced entity resolution:A partial-order approach[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:969-984.
[9] 郭志懋, 周傲英. 數據質量和數據清洗研究綜述[J]. 軟件學報, 2002, 13(11):2076-2082. GUO Z M, ZHOU A Y. Research on data quality and data cleaning:A survey[J]. Journal of Software, 2002, 13(11):2076-2082. (in Chinese)
[10] 楊輔祥, 劉云超, 段智華. 數據清理綜述[J]. 計算機應用研究, 2002, 19(3):3-5. YANG F X, LIU Y C, DUAN Z H. An overview of data cleaning[J]. Application Research of Computers, 2002, 19(3):3-5. (in Chinese)
[11] 王曰芬, 章成志, 張蓓蓓, 等. 數據清洗研究綜述[J]. 現代圖書情報技術, 2007, 2(12):50-56. WANG Y F, ZHANG C Z, ZHANG B B, et al. A survey of data cleaning[J]. New Technology of Library and Information Service, 2007, 2(12):50-56. (in Chinese)
[12] 趙一凡, 卞良, 叢昕. 數據清洗方法研究綜述[J]. 軟件導刊, 2017(12):222-224. ZHAO Y F, BIAN L, CONG X. Survey of data cleaning method in data preprocessing[J]. Software Guide, 2017(12):222-224. (in Chinese)
[13] ELMAGARMID A K, IPEIROTIS P G, VERYKIOS V S. Duplicate record detection:A survey[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1):1-16.
[14] CHU X, ILYAS I F, KRISHNAN S, et al. Data cleaning:Overview and emerging challenges[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:2201-2206.
[15] ILYAS I F, CHU X. Trends in cleaning relational data:Consistency and deduplication[J]. Foundations and Trends in Databases, 2015, 5(4):281-393.
[16] FAN W F, GEERTS F, JIA X B, et al. Conditional functional dependencies for capturing data inconsistencies[J]. ACM Transactions on Database Systems (TODS), 2008, 33(2):6.
[17] BOHANNON P, FAN W F, GEERTS F, et al. Conditional functional dependencies for data cleaning[C]//IEEE 23rd International Conference on Data Engineering. Istanbul, Turkey, 2007:746-755.
[18] ABITEBOUL S, HULL R, VIANU V. Foundations of databases:The logical level[M]. Boston, USA:Addison-Wesley Longman Publishing, 1995.
[19] KOUDAS N, SAHA A, SRIVASTAVA D, et al. Metric functional dependencies[C]//IEEE 25th International Conference on Data Engineering. Shanghai, China, 2009:1275-1278.
[20] SONG S X, CHEN L. Differential dependencies:Reasoning and discovery[J]. ACM Transactions on Database Systems (TODS), 2011, 36(3):16.
[21] WANG J N, TANG N. Towards dependable data repairing with fixing rules[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:457-468.
[22] WANG J N, TANG N. Dependable data repairing with fixing rules[J]. Journal of Data and Information Quality (JDIQ), 2017, 8(3-4):16.
[23] FAN W F, LI J Z, MA S, et al. Towards certain fixes with editing rules and master data[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2):173-184.
[24] FAN W F, LI J Z, MA S, et al. Towards certain fixes with editing rules and master data[J]. The VLDB Journal, 2012, 21(2):213-238.
[25] INTERLANDI M, TANG N. Proof positive and negative in data cleaning[C]//IEEE 31st International Conference on Data Engineering. Seoul, Republic of Korea, 2015:18-29.
[26] HAO S, TANG N, LI G L, et al. Cleaning relations using knowledge bases[C]//IEEE 33rd International Conference on Data Engineering. San Diego, USA, 2017:933-944.
[27] HAO S, TANG N, LI G L, et al. Distilling relations using knowledge bases[J]. The VLDB Journal, 2018, 27(4):497-519.
[28] HUA M, PEI J. DiMaC:A disguised missing data cleaning tool[C]//14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA, 2008:1077-1080.
[29] QAHTAN A A, ELMAGARMID A, CASTRO FERNANDEZ R, et al. FAHES:A robust disguised missing values detector[C]//24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. London, UK, 2018:2100-2109.
[30] K?PCKE H, RAHM E. Frameworks for entity matching:A comparison[J]. Data & Knowledge Engineering, 2010, 69(2):197-210.
[31] HE Q L, LI Z H, ZHANG X. Data deduplication techniques[C]//2010 IEEE International Conference on Future Information Technology and Management Engineering (FITME). Changzhou, China, 2010:430-433.
[32] BAXTER R, CHRISTEN P, CHURCHES T. A comparison of fast blocking methods for record linkage[C]//2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation. Washington, DC, USA, 2003, 3:25-27.
[33] TEJADA S, KNOBLOCK C A, MINTON S. Learning object identification rules for information integration[J]. Information Systems, 2001, 26(8):607-633.
[34] ELFEKY M G, VERYKIOS V S, ELMAGARMID A K. TAILOR:A record linkage toolbox[C]//18th International Conference on Data Engineering. San Jose, USA, 2002.
[35] TEJADA S, KNOBLOCK C A, MINTON S. Learning domain-independent string transformation weights for high accuracy object identification[C]//International Conference on Knowledge Discovery and Data Mining. Edmonton, Canada, 2002:350-359.
[36] CHRISTEN P. Febrl:A freely available record linkage system with a graphical user interface[C]//2nd Australasian Workshop on Health Data and Knowledge Management. Wollongong, Australia, 2008:17-25.
[37] MCCALLUM A, NIGAM K, UNGAR L H. Efficient clustering of high-dimensional data sets with application to reference matching[C]//6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, USA, 2000:169-178.
[38] BAYARDO R J, MA Y M, SRIKANT R. Scaling up all pairs similarity search[C]//16th International Conference on World Wide Web. Banff, Canada, 2007:131-140.
[39] CHAUDHURI S, GANTI V, KAUSHIK R. A primitive operator for similarity joins in data cleaning[C]//22nd International Conference on Data Engineering. Atlanta, USA, 2006.
[40] XIAO C, WANG W, LIN X M, et al. Efficient similarity joins for near-duplicate detection[J]. ACM Transactions on Database Systems (TODS), 2011, 36(3):15.
[41] ARASU A, Ré C, SUCIU D. Large-scale deduplication with constraints using Dedupalog[C]//2009 IEEE 25th International Conference on Data Engineering. Shanghai, China, 2009:952-963.
[42] HERNáNDEZ M A, STOLFO S J. The merge/purge problem for large databases[C]//1995 ACM SIGMOD International Conference on Management of Data. San Jose, USA, 1995:127-138.
[43] LEE M L, LING T W, LOW W L. IntelliClean:A knowledge-based intelligent data cleaner[C]//Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, USA, 2000:290-294.
[44] WANG J N, LI G L, YU J X, et al. Entity matching:How similar is similar[J]. Proceedings of the VLDB Endowment, 2011, 4(10):622-633.
[45] BILENKO M, MOONEY R J. Adaptive duplicate detection using learnable string similarity measures[C]//Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, DC, USA, 2003:39-48.
[46] PAPENBROCK T, EHRLICH J, MARTEN J, et al. Functional dependency discovery:An experimental evaluation of seven algorithms[J]. Proceedings of the VLDB Endowment, 2015, 8(10):1082-1093.
[47] LIU J X, LI J Y, LIU C F, et al. Discover dependencies from data:A review[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(2):251-264.
[48] HUHTALA Y, K?RKK?INEN J, PORKKA P, et al. TANE:An efficient algorithm for discovering functional and approximate dependencies[J]. The Computer Journal, 1999, 42(2):100-111.
[49] YAO H, HAMILTON H J. Mining functional dependencies from data[J]. Data Mining and Knowledge Discovery, 2008, 16(2):197-219.
[50] NOVELLI N, CICCHETTI R. FUN:An efficient algorithm for mining functional and embedded dependencies[C]//International Conference on Database Theory. London, UK, 2001:189-203.
[51] WYSS C, GIANNELLA C, ROBERTSON E. FastFDs:A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances extended abstract[C]//International Conference on Data Warehousing and Knowledge Discovery. Munich, Germany, 2001:101-110.
[52] LOPES S, PETIT J M, LAKHAL L. Efficient discovery of functional dependencies and Armstrong relations[C]//International Conference on Extending Database Technology. Konstanz, Germany, 2000:350-364.
[53] FAN W F, GEERTS F, LI J Z, et al. Discovering conditional functional dependencies[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(5):683-698.
[54] DE MARCHI F, LOPES S, PETIT J M. Efficient algorithms for mining inclusion dependencies[C]//8th International Conference on Extending Database Technology. Prague, Czech Republic, 2002:464-476.
[55] BOHANNON P, FAN W F, FLASTER M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]//2005 ACM SIGMOD International Conference on Management of Data. Baltimore, USA, 2005:143-154.
[56] CHIANG F, MILLER R J. A unified model for data and constraint repair[C]//2011 IEEE 27th International Conference on Data Engineering. Hannover, Germany, 2011:446-457.
[57] CHU X, ILYAS I F, PAPOTTI P. Holistic data cleaning:Putting violations into context[C]//2013 IEEE 29th International Conference on Data Engineering (ICDE). Brisbane, Australia, 2013:458-469.
[58] KOLAHI S, LAKSHMANAN L V S. On approximating optimum repairs for functional dependency violations[C]//12th International Conference on Database Theory. St. Petersburg, Russia, 2009:53-62.
[59] HAO S, TANG N, LI G L, et al. A novel cost-based model for data repairing[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(4):727-742.
[60] ARENAS M, BERTOSSI L, CHOMICKI J. Consistent query answers in inconsistent databases[C]//18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Philadelphia, USA, 1999:68-79.
[61] BRAVO L, BERTOSSI L. Logic programs for consistently querying data integration systems[C]//18th International Joint Conference on Artificial Intelligence. Acapulco, Mexico, 2003:10-15.
[62] CALí A, LEMBO D, ROSATI R. On the decidability and complexity of query answering over inconsistent and incomplete databases[C]//22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. San Diego, USA, 2003:260-271.
[63] CALI A, LEMBO D, ROSATI R. Query rewriting and answering under constraints in data integration systems[C]//18th International Joint Conference on Artificial Intelligence. Acapulco, Mexico, 2003:16-21.
[64] CHOMICKI J, MARCINKOWSKI J. Minimal-change integrity maintenance using tuple deletions[J]. Information and Computation, 2005, 197(1-2):90-121.
[65] GERTZ M, LIPECK U W. A diagnostic approach for repairing constraint violations in databases[M]//JAJODIA S, LIST W, MCGREGOR G, et al, eds. Integrity and internal control in information systems. ⅡCIS 1997. Boston, USA:Springer, 1997, 1:89-111
[66] GRECO G, GRECO S, ZUMPANO E. A logical framework for querying and repairing inconsistent databases[J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(6):1389-1408.
[67] YAKOUT M, ELMAGARMID A K, NEVILLE J, et al. Guided data repair[J]. Proceedings of the VLDB Endowment, 2011, 4(5):279-289.
[68] CHU X, MORCOS J, ILYAS I F, et al. KATARA:A data cleaning system powered by knowledge bases and crowdsourcing[C]//2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia, 2015:1247-1261.
[69] COHEN W, RAVIKUMAR P, FIENBERG S. A comparison of string metrics for matching names and records[C]//KDD Workshop on Data Cleaning and Object Consolidation. Washington, DC, USA, 2003:73-78.
[70] JIANG Y, LI G L, FENG J H, et al. String similarity joins:An experimental evaluation[J]. Proceedings of the VLDB Endowment, 2014, 7(8):625-636.
[71] EBAID A, ELMAGARMID A, ILYAS I F, et al. NADEEF:A generalized data cleaning system[J]. Proceedings of the VLDB Endowment, 2013, 6(12):1218-1221.
[72] CONG G, FAN W F, GEERTS F, et al. Improving data quality:Consistency and accuracy[C]//33rd International Conference on Very Large Data Bases. Vienna, Austria, 2007:315-326.
[73] GEERTS F, MECCA G, PAPOTTI P, et al. The LLUNATIC data-cleaning framework[J]. Proceedings of the VLDB Endowment, 2013, 6(9):625-636.
[74] BESKALES G, ILYAS I F, GOLAB L. Sampling the repairs of functional dependency violations under hard constraints[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2):197-207.
[75] KHAYYAT Z, ILYAS I F, JINDAL A, et al. BIGDANSING:A system for big data cleansing[C]//2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia, 2015:1215-1230.
[76] BESKALES G, ILYAS I F, GOLAB L, et al. On the relative trust between inconsistent data and inaccurate constraints[C]//2013 IEEE 29th International Conference on Data Engineering. Brisbane, Australia, 2013:541-552.
[77] VOLKOVS M, CHIANG F, SZLICHTA J, et al. Continuous data cleaning[C]//2014 IEEE 30th International Conference on Data Engineering. Chicago, USA, 2014:244-255.
[78] RISSANEN J. Modeling by shortest data description[J]. Automatica, 1978, 14(5):465-471.
[79] COVER T M, THOMAS J A. Elements of information theory[M]. New York, USA:John Wiley & Sons, 1991.
[80] HE J, VELTRI E, SANTORO D, et al. Interactive and deterministic data cleaning[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:893-907.
[81] AROCENA P C, GLAVIC B, MECCA G, et al. Messing up with BART:Error generation for evaluating data-cleaning algorithms[J]. Proceedings of the VLDB Endowment, 2015, 9(2):36-47.
[82] ANANTHAKRISHNA R, CHAUDHURI S, GANTI V. Eliminating fuzzy duplicates in data warehouses[C]//28th International Conference on Very Large Databases. Hong Kong, China, 2002:586-597.
[83] WANG J N, KRISHNAN S, FRANKLIN M J, et al. A sample-and-clean framework for fast and accurate query processing on dirty data[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:469-480.
[84] KRISHNAN S, WANG J N, FRANKLIN M J, et al. SampleClean:Fast and reliable analytics on dirty data[C/OL]//Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2015:59-75[2018-06-19]. https://www.ocf.berkeley.edu/~sanjayk/old/pubs/sc.pdf.
[85] LI G L, WANG J N, ZHENG Y D, et al. Crowdsourced data management:A survey[C]//2017 IEEE 33rd International Conference on Data Engineering. San Diego, USA, 2017:39-40.
[86] PARK H, WIDOM J. CrowdFill:Collecting structured data from the crowd[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:577-588.
[87] WHANG S E, LOFGREN P, GARCIA-MOLINA H. Question selection for crowd entity resolution[J]. Proceedings of the VLDB Endowment, 2013, 6(6):349-360.
[88] TONG Y X, CAO C C, ZHANG C J, et al. CrowdCleaner:Data cleaning for multi-version data on the web via crowdsourcing[C]//2014 IEEE 30th International Conference on Data Engineering. Chicago, USA, 2014:1182-1185.
[89] MICHALSKI R S. A theory and methodology of inductive learning[M]//MICHALSKI R S, CARBONELL J G, MITCHELL T M. Machine learning. Berlin, Germany:Springer, 1983:83-134.