DataPatterns

Page history last edited by PBworks 14 years, 3 months ago

This project will be driven by analysis and understanding of commonly encountered data patterns. Algorithmic optimizations will be considered in light of the data patterns elicited from real-world datasets.

 

Data Patterns

  • Same name
  • Similar name
  • Same name, same address
  • Similar name, same address
  • Same name, similar address
  • Similar name, similar address
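The six patterns above can be read as the outcomes of comparing two records on name and address. The following is a minimal sketch of that reading, not the project's matching algorithm: the helper names (`compare`, `classify_pair`) and the 0.8 similarity cutoff are assumptions made for illustration, and the standard library's `difflib.SequenceMatcher` stands in for whatever string-similarity measure the project would actually use.

```python
from difflib import SequenceMatcher

SIMILAR_THRESHOLD = 0.8  # hypothetical cutoff; would need tuning on real data


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def compare(a, b, threshold=SIMILAR_THRESHOLD):
    """Return 'same', 'similar', or 'different' for two strings."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return "same"
    ratio = SequenceMatcher(None, a, b).ratio()
    return "similar" if ratio >= threshold else "different"


def classify_pair(name_a, name_b, addr_a=None, addr_b=None):
    """Map a pair of records to one of the listed patterns, or None."""
    name = compare(name_a, name_b)
    if name == "different":
        return None
    label = f"{name.capitalize()} name"
    # Fall back to the name-only pattern when addresses are missing or unrelated.
    if addr_a is None or addr_b is None:
        return label
    address = compare(addr_a, addr_b)
    if address == "different":
        return label
    return f"{label}, {address} address"
```

For example, two records named "Kaiser Permanente Medical Center" at "1150 Veterans Blvd" and "1154 Veterans Blvd" fall under "Same name, similar address" with this cutoff.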

 

Company Data Patterns

  • Vertical Specific Tokens

 

Example Data Set:

Duke University School of Medicine

Stanford University School of Medicine

University of Kansas Medical Center

Vanderbilt University Medical Center

Harvard Medical School

Indiana University School of Medicine

Boston University School of Medicine

Creighton University School of Medicine

Johns Hopkins Medicine

Yale University School of Medicine
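In the data set above, tokens such as "University", "School", and "Medicine" recur across most records and so carry little weight for telling one institution from another. A rough way to surface such tokens is to count how many names each token appears in. This is only a sketch: the function names and the 50% cutoff are assumptions, not part of the project.

```python
from collections import Counter

NAMES = [
    "Duke University School of Medicine",
    "Stanford University School of Medicine",
    "University of Kansas Medical Center",
    "Vanderbilt University Medical Center",
    "Harvard Medical School",
    "Indiana University School of Medicine",
    "Boston University School of Medicine",
    "Creighton University School of Medicine",
    "Johns Hopkins Medicine",
    "Yale University School of Medicine",
]


def token_shares(names):
    """Return, for each token, the share of names that contain it."""
    counts = Counter()
    for name in names:
        counts.update(set(name.lower().split()))  # count each name once per token
    return {token: n / len(names) for token, n in counts.items()}


def vertical_tokens(names, min_share=0.5):
    """Tokens present in at least min_share of the names (hypothetical cutoff)."""
    return {t for t, share in token_shares(names).items() if share >= min_share}


def distinctive_part(name, common):
    """Drop the vertical-specific tokens and keep what distinguishes the record."""
    return " ".join(t for t in name.lower().split() if t not in common)
```

With the ten names above, `vertical_tokens(NAMES)` yields `university`, `school`, `of`, and `medicine`, and `distinctive_part` reduces "Duke University School of Medicine" to `duke`.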

 

  • Semantically Intermixed Data

 

 

Example Data Set:

Citi® Platinum Select® Card

Citi® Dividend Platinum Select® Card

Citi® Simplicity Card

Citi PremierPass Card

Citi® Diamond Preferred® Rewards Card

Citi® Simplicity Rewards Card

Citi® Upromise® Card

Citi® Bronze® AAdvantage® MasterCard®

Citi® Professional(SM) Card

Citi® Driver's Edge® Platinum Select® Card

Citi® Platinum Select® AAdvantage® World MasterCard®
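One reading of this pattern is that each value mixes several kinds of information in one string: the issuer brand, the product line, partner or network names, and legal marks such as ® and (SM). As a first normalization step under that assumption, the marks can be stripped so that otherwise identical names compare equal. The regular expression and function name below are illustrative only.

```python
import re

# Registered/trademark symbols and their parenthesized ASCII forms.
MARKS = re.compile(r"[®™]|\((?:SM|TM|R)\)", re.IGNORECASE)


def strip_marks(name):
    """Remove legal marks and collapse the whitespace left behind."""
    return " ".join(MARKS.sub(" ", name).split())
```

Applied to the data set above, "Citi® Professional(SM) Card" becomes "Citi Professional Card". Separating brand, product line, and network into their own fields would need a vocabulary that this sketch does not attempt to supply.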

 

  • Semantically Overloaded Column

 

Example Data Set:

George David, MD

Paul Thos David, MD

Richard Danl David, MD

Victor Alexander David, MD

VA Castle OPC

Sierra Foothills Outpatient Clinic

Capitola Clinic

Chico Outpatient Clinic

Eureka Veterans Clinic

Fairfield Outpatient Clinic

Stockton Clinic

Los Angeles Ambulatory Care Center

West Los Angeles Ambulatory Care Center

Martinez Outpatient Clinic

Modesto Clinic

Anna Maria Camaya David, MD

Oakland Mental Health Clinic

Oakland Outpatient Clinic

Redding Outpatient Clinic

McClellan Dental Clinic

McClellan Outpatient Clinic

Sacramento Mental Health Clinic

Mission Valley

VA 13th & Mission Outpatient Clinic

San Jose Clinic

Frederick Chas, MD

Santa Barbara Community Based Outpatient Clinic

Santa Rosa Clinic

Monterey Clinic

Sepulveda Ambulatory Care Center

Sonora Clinic

VA South Valley OPC

VA Ukiah Community Based Outpatient Clinic

Mare Island Outpatient Clinic

George David, MD
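The column above holds both physicians and facilities. A simple way to separate the two, assuming that person records reliably end in a professional suffix such as ", MD", is sketched below. The suffix list and function names are hypothetical, and real data would likely need more signals than a suffix.

```python
import re

# Hypothetical list of suffixes that mark a value as a person.
PERSON_SUFFIX = re.compile(r",\s*(?:MD|DO|DDS|PhD)\.?$", re.IGNORECASE)


def entity_type(value):
    """Classify a column value as 'person' or 'organization' by its suffix."""
    return "person" if PERSON_SUFFIX.search(value.strip()) else "organization"


def split_column(values):
    """Split an overloaded column into (people, organizations)."""
    people = [v for v in values if entity_type(v) == "person"]
    organizations = [v for v in values if entity_type(v) == "organization"]
    return people, organizations
```

Values without a suffix, such as "Mission Valley", default to `organization` here; that default is a design choice for this data set rather than a general rule.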

 

Person Data Patterns

  • Last Name Distinctiveness

 

 

Example Data Set:

Source: US Census (http://www.census.gov/genealogy/names/names_files.html)

  ---------------------------------------
  Variables in Names Files:
  name
  freq = Frequency in percent  
  cum.freq = Cumulative Frequency in percent
  rank
  ---------------------------------------
  First ten entries in dist.all.last
  ---------------------------------------
  name           freq   cum.freq  rank
  SMITH          1.006  1.006      1
  JOHNSON        0.810  1.816      2
  WILLIAMS       0.699  2.515      3
  JONES          0.621  3.136      4
  BROWN          0.621  3.757      5
  DAVIS          0.480  4.237      6
  MILLER         0.424  4.660      7
  WILSON         0.339  5.000      8
  MOORE          0.312  5.312      9
  TAYLOR         0.311  5.623     10
  ---------------------------------------
  First ten entries in dist.female.first
  ---------------------------------------
  name           freq   cum.freq  rank
  MARY           2.629  2.629      1
  PATRICIA       1.073  3.702      2
  LINDA          1.035  4.736      3
  BARBARA        0.980  5.716      4
  ELIZABETH      0.937  6.653      5
  JENNIFER       0.932  7.586      6
  MARIA          0.828  8.414      7
  SUSAN          0.794  9.209      8
  MARGARET       0.768  9.976      9
  DOROTHY        0.727 10.703     10
  ---------------------------------------
  First ten entries in dist.male.first
  ---------------------------------------
  name           freq   cum.freq  rank
  JAMES          3.318  3.318      1
  JOHN           3.271  6.589      2
  ROBERT         3.143  9.732      3
  MICHAEL        2.629 12.361      4
  WILLIAM        2.451 14.812      5
  DAVID          2.363 17.176      6
  RICHARD        1.703 18.878      7
  CHARLES        1.523 20.401      8
  JOSEPH         1.404 21.805      9
  THOMAS         1.380 23.185     10
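The frequency column gives a direct handle on distinctiveness: the rarer a name, the more weight a match on it can carry. One common way to express this, used here as an assumption rather than as the project's chosen measure, is the negative base-2 logarithm of the name's relative frequency. The sample rows are copied from the table above; the floor value for names missing from the table is hypothetical.

```python
import math

LAST_NAME_SAMPLE = """\
SMITH          1.006  1.006      1
JOHNSON        0.810  1.816      2
WILLIAMS       0.699  2.515      3
MOORE          0.312  5.312      9
TAYLOR         0.311  5.623     10
"""


def parse_names_file(text):
    """Parse 'name freq cum.freq rank' rows into {name: freq in percent}."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip separators and blank lines
        try:
            table[parts[0]] = float(parts[1])
        except ValueError:
            continue  # skip the header row
    return table


def distinctiveness(name, table, floor_percent=0.001):
    """Score in bits; names absent from the table get the (hypothetical) floor."""
    freq_percent = table.get(name.upper(), floor_percent)
    return -math.log2(freq_percent / 100.0)
```

Under this scoring, a match on TAYLOR counts for more than a match on SMITH, and a name absent from the table counts for more than either.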

 

 

Reference Source Data Patterns

  • Multiple Global DUNS Numbers For A Corp

 

Site Data Patterns

  • Shopping Mall

Barnes & Noble, Sixty 31st Avenue #BN, San Mateo, CA 94403

Bishop's Hallmark, Sixty 31st Avenue #262, San Mateo, CA 94403

Brookstone, Sixty 31st Avenue #320, San Mateo, CA 94403

Sharper Image, The, Sixty 31st Avenue #160, San Mateo, CA 94403

Things Remembered, Sixty 31st Avenue #241, San Mateo, CA 94403

Baby Gap, Sixty 31st Avenue #156, San Mateo, CA 94403

Build-A-Bear Workshop, Sixty 31st Avenue #328, San Mateo, CA 94403

Children's Place, The, Sixty 31st Avenue #394, San Mateo, CA 94403

Gap Kids, The, Sixty 31st Avenue #340, San Mateo, CA 94403

Gymboree, The, Sixty 31st Avenue #324, San Mateo, CA 94403
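In the shopping mall pattern, many unrelated businesses share one street address and differ only in the unit number. A sketch of how that could be detected is to strip the unit and group by the remaining address. The `#`-prefixed unit format is taken from the examples above; the function names are assumptions.

```python
import re
from collections import defaultdict

UNIT = re.compile(r"\s*#\s*\w+")  # unit markers such as "#BN" or "#262"


def base_address(address):
    """Remove the unit number so that tenants of one site share a key."""
    return " ".join(UNIT.sub("", address).split())


def group_by_site(records):
    """Group (name, address) records by address with the unit removed."""
    sites = defaultdict(list)
    for name, address in records:
        sites[base_address(address)].append(name)
    return dict(sites)
```

A site key with many distinct names under it is a hint that an address match alone should not be treated as evidence that two records are the same business.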

  • Multi Address Facility

Kaiser Permanente Medical Center, 1150 Veterans Blvd, Redwood City, CA 94063

Kaiser Permanente Medical Center, 1154 Veterans Blvd, Redwood City, CA 94063
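Here one facility appears under two nearby house numbers on the same street. One possible heuristic, offered only as a sketch, treats records as the same facility when the names match, the street matches, and the house numbers are close. The `max_gap` of 10 and the function names are assumptions.

```python
import re

LEADING_NUMBER = re.compile(r"^\s*(\d+)\s+(.*)$")


def split_street(street):
    """Split '1150 Veterans Blvd' into (1150, 'veterans blvd')."""
    match = LEADING_NUMBER.match(street)
    if not match:
        return None, " ".join(street.lower().split())
    return int(match.group(1)), " ".join(match.group(2).lower().split())


def likely_same_facility(name_a, street_a, name_b, street_b, max_gap=10):
    """True when names and street match and house numbers differ by <= max_gap."""
    if name_a.strip().lower() != name_b.strip().lower():
        return False
    num_a, rest_a = split_street(street_a)
    num_b, rest_b = split_street(street_b)
    if num_a is None or num_b is None or rest_a != rest_b:
        return False
    return abs(num_a - num_b) <= max_gap
```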

  • Different City Same Zip

Administaff, 19001 Crescent Springs Drive, Kingwood, Texas 77339

Administaff, 19001 Crescent Springs Drive, Humble, Texas 77339

Redwood City, CA 94062

Emerald Hills, CA 94062

Woodside, CA 94062
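Because several city names can be valid for one ZIP code, comparing addresses on the city field can split records that refer to the same place. A minimal sketch of two helpers follows: one builds a comparison key that leaves the city out, and one reports ZIP codes that appear with more than one city. The record layout (dictionaries with `street`, `city`, `state`, `zip`) is an assumption for illustration.

```python
def locality_key(record):
    """Comparison key built from street, state, and ZIP, ignoring the city."""
    street = " ".join(record["street"].lower().split())
    return (street, record["state"].lower(), record["zip"])


def cities_sharing_zip(records):
    """Return {zip: sorted city names} for ZIP codes seen with several cities."""
    by_zip = {}
    for record in records:
        by_zip.setdefault(record["zip"], set()).add(record["city"])
    return {z: sorted(c) for z, c in by_zip.items() if len(c) > 1}
```

With this key, the two Administaff records above compare equal even though one lists Kingwood and the other Humble.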

 

DataSets
