Decorating data

An early Christmas present has arrived from the DabbleDB team for the 100 million or so of us that have to work with data on a day to day basis.

They’ve created a do-what-I-mean web tool that lets you show how you want data to be restructured and bang! it’s done. Check out the video.

Cleanup data in action

It’s a great idea and a elegant, easy to use interface. There are so many directions I’d love to see them take this tool.

Cleanupdata is a great name, but they’re really giving you better ways to restructure data. This tool won’t help you find and fix errors and anomalies in data. At least not yet.

I also hope they extend cleanupdata to let people automate these data restructuring operations. If only you could apply a cleanup created in cleanupdata.com to 1,000 Excel spreadsheets or to a database table.

If you like this, it’s worth checking out DabbleDB. They have rethought the database with a database/spreadsheet/web forms/visualizer platypus of a tool. It lets your data be pliable in ways that databases don’t allow, while retaining structure that spreadsheets don’t recognize.

Added: Avi Bryant, one of the authors of the cleanupdata.com service notes that the example in the screencast is motivated by this post on cleaning data in Excel. Compare and contrast. I know most people would prefer to avoid ="("&MID(H2,1,3)&") "&MID(H2,4,3)&"-"&MID(H2,7,4) in order to format a phone number.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. All source code is released under a BSD License unless otherwise specified.

7 comments | Show all comments only the last 5 are shown


November 30, 2008
Bjoern said:

I agree with Tom. I'd love this to run locally, but transmitting the data to dabble db and then have it mailed back to me. I don't know about US regulations but in Europe I would get seriously smacked on the head for that, especially for data like social security numbers, names and contact data that was shown in the example video.


November 30, 2008
Chris Gemignani said:

Thanks for the clarification, Avi. I've updated the text. Congrats to the DabbleDB team. We love it.


December 4, 2008
Michel Guillet said:

Hi Chris,

I've been using birst, a comparable product, with some success.


December 9, 2008
Mike Chelen said:

Is the free version always available or only for a 30 day trial? Also, it would be nice if more that 15000 rows were supported, overall looks like a pretty cool site and services :)


December 10, 2008
Ted said:

Sorry, I much prefer a spreadsheet and formulas. They give you complete control (the code in this product is going to make assumptions that are wrong at times), they are very repeatable (i.e. if I do the same thing every day, I shouldn't need to repeat my actions in a browser and run batch jobs), and you can look back at a later point in time and see what you (or soemone else) did very easily. Maybe MID is complicated, but I don't think so.

Your name

Email (optional, will not be shared)

Type the word "juice" (required to confuse the spammers)

Your comment


Add a comment





Analytics Roundup: Open knowledge resources

Comprehensive Knowledge Archive Network
CKAN is a registry of open knowledge packages and projects... the place to search for open knowledge resources as well as register your own—be that a set of Shakespeare's works, a global population density database, the voting records of MPs, and more.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. All source code is released under a BSD License unless otherwise specified.

0 comments | Add a comment

Your name

Email (optional, will not be shared)

Type the word "juice" (required to confuse the spammers)

Your comment






ANDs, ORs, and IFs: Comparing big lists in Excel

One problem we face when manipulating large amounts of data in Excel is checking to see if two lists of the same length contain the same items. For instance, we might be given a list of products that a company has for sale this month, running to thousands of items, then the following month, we get another list of products for sale and we need to see if there has been any change between those two lists. This isn’t too hard to deal with when you only have a hundred or so items, but it gets a little thorny when your list runs to tens of thousands.

What we do is line the two lists up, side by side, in sorted order.

Two lists

Use the simple “A1=B1” formula to compare pairs of items in the lists.

Formula for comparing elements in the lists

If the pairs are the same, this will be true, otherwise they’ll be false.

Results of comparing elements in the lists

Copy this formula down for all your rows. Then use the AND function and give it the entire range of comparison formulas.

Checking comparisons

This will only be true if every single one of the values in your list are exactly matches. If even one comparison is false, this big AND statement will evaluate to false.

This is a quick and dirty approach. For tougher problems, we use a slightly more complicated formula in the comparison where we evaluate it to 1 if the value is true, 0 if the value is false. This gives us more flexibility to combine comparisons, but that’s a topic for another post.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. All source code is released under a BSD License unless otherwise specified.

19 comments | Show all comments only the last 5 are shown


May 5, 2007
Robbin Steif said:

To the person who wrote, this is too easy a post for Juice: you probably have a strong command of the topic, but not everyone does. And like someone else wrote: it's nice to be reminded of tricks again.

To the person who complained that then you have an error somewhere and you can find it: all you need to do is sort all the rows by the column with the proof in it, so that you can see where the problem is. You can sort Ascending, since F comes before T (how's that for easy advice?)

And finally, you can use a similar trick to dedupe email addresses, or other lists where the info will be identical. I am sure you Juicers are all over this one. But since the data are on two lines, you have to do it like this =a1=b1 And copy all the way down. Then you copy the results and repaste them right where they were, but PASTE SPECIAL - Values. Then you sort on the IF column


May 7, 2007
mike harding said:

well in the world of statistical genetics, excel doesn't have enough room in one tab. it would be useful to know how to do this in R or SAS.


May 9, 2007
AW said:

Mike,
In SAS, you'd want to use a sort-sort-merge or some inner joins.
R is on my schedule to learn in the next 2 months; so I got nothing for you there...


June 25, 2007
Brenda said:

Looks like it works for text string comparison as well!


October 11, 2007
Will said:

I think the use of MATCH is a much better option. Using the data as you have displayed it, in cell D3, you would have =IF(ISNA(VLOOKUP(C4,B$3:B$xxxxx,2,FALSE)),1,0), and then in cell D4 on downwards you would have =IF(ISNA(VLOOKUP(C4,B$3:B$xxxxx,2,FALSE)),D3+1,D3) where xxxxxx is the final row in your set. In cell D2, you have =MAX(D3:Dxxxxx). This will then tell you how many new datapoints you have.

Your name

Email (optional, will not be shared)

Type the word "juice" (required to confuse the spammers)

Your comment


Add a comment





Analytics Roundup: Late April edition

Population Estimates Data Sets
US census data

Are you generic? / Wilson Miner Live
Wilson Miners post re: Django's generic views

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. All source code is released under a BSD License unless otherwise specified.

0 comments | Add a comment

Your name

Email (optional, will not be shared)

Type the word "juice" (required to confuse the spammers)

Your comment






Analytics Roundup

Many Eyes: IBM's collaborative visualization
Fernando Vargas' vision comes to light.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. All source code is released under a BSD License unless otherwise specified.

0 comments | Add a comment

Your name

Email (optional, will not be shared)

Type the word "juice" (required to confuse the spammers)

Your comment