Decorating data
By Chris Gemignani
November 29, 2008
Find more about:
data
analytics
excel
An early Christmas present has arrived from the DabbleDB team for the 100 million or so of us that have to work with data on a day to day basis.
They’ve created a do-what-I-mean web tool that lets you show how you want data to be restructured and bang! it’s done. Check out the video.
It’s a great idea and a elegant, easy to use interface. There are so many directions I’d love to see them take this tool.
Cleanupdata is a great name, but they’re really giving you better ways to restructure data. This tool won’t help you find and fix errors and anomalies in data. At least not yet.
I also hope they extend cleanupdata to let people automate these data restructuring operations. If only you could apply a cleanup created in cleanupdata.com to 1,000 Excel spreadsheets or to a database table.
If you like this, it’s worth checking out DabbleDB. They have rethought the database with a database/spreadsheet/web forms/visualizer platypus of a tool. It lets your data be pliable in ways that databases don’t allow, while retaining structure that spreadsheets don’t recognize.
Added: Avi Bryant, one of the authors of the cleanupdata.com service notes that the example in the screencast is motivated by this post on cleaning data in Excel. Compare and contrast. I know most people would prefer to avoid ="("&MID(H2,1,3)&") "&MID(H2,4,3)&"-"&MID(H2,7,4) in order to format a phone number.
Analytics Roundup: Open knowledge resources
By Zach Gemignani
July 6, 2007
Find more about:
data
- Comprehensive Knowledge Archive Network
- CKAN is a registry of open knowledge packages and projects... the place to search for open knowledge resources as well as register your own—be that a set of Shakespeare's works, a global population density database, the voting records of MPs, and more.
ANDs, ORs, and IFs: Comparing big lists in Excel
By Chris Gemignani
May 1, 2007
Find more about:
data
excel
One problem we face when manipulating large amounts of data in Excel is checking to see if two lists of the same length contain the same items. For instance, we might be given a list of products that a company has for sale this month, running to thousands of items, then the following month, we get another list of products for sale and we need to see if there has been any change between those two lists. This isn’t too hard to deal with when you only have a hundred or so items, but it gets a little thorny when your list runs to tens of thousands.
What we do is line the two lists up, side by side, in sorted order.

Use the simple “A1=B1” formula to compare pairs of items in the lists.

If the pairs are the same, this will be true, otherwise they’ll be false.

Copy this formula down for all your rows. Then use the AND function and give it the entire range of comparison formulas.

This will only be true if every single one of the values in your list are exactly matches. If even one comparison is false, this big AND statement will evaluate to false.
This is a quick and dirty approach. For tougher problems, we use a slightly more complicated formula in the comparison where we evaluate it to 1 if the value is true, 0 if the value is false. This gives us more flexibility to combine comparisons, but that’s a topic for another post.
19 comments | Show all comments only the last 5 are shown
Tre said:
Are you serious? This is an extremely rudimentary tutorial for this blog.
EK said:
I find these quick snapshot overviews very useful. After 20 years in analytics I am never too old to learn or be reminded of tips and techniques
derek said:
I appreciated it, although I prefer to use an array function to do the same thing. jeez, I didn't even know you could do that with Excel Boolean functions, I thought they only took parameters separated by commas, like CONCATENATE.
If only CONCATENATE did the same thing, it would actually have a use. Seriously, does anyone know what you can do with CONCATENATE(a,b,c...) that you can't do with a&b&c...?
(the first version of thos comment died without error message and was lost, so that I thought it had just gone somewhere for moderation; this may just be a problem at my end, or it may have been my failure to type "juice")
derek said:
I appreciated it, although I prefer to use an array function to do the same thing. jeez, I didn't even know you could do that with Excel Boolean functions, I thought they only took parameters separated by commas, like CONCATENATE.
If only CONCATENATE did the same thing, it would actually have a use. Seriously, does anyone know what you can do with CONCATENATE(a,b,c...) that you can't do with a&b&c...?
(earlier versions of this comment died without error message and were lost, even though I typed "juice"; it may just be a problem at my end)
derek said:
Wonderful. The first attempt disappears permanently, even though I waited for it, so the second attempt only temporarily fails to appear, so that I double-post with the third attempt :-(
James B. said:
Great new design. I love it.
I noticed that the links to state data are not working on this page:
http://juiceanalytics.com/writing/2006/03/census-data-in-google-earth/
Specifically, I'm looking for the Pennsylvania data.
Adam said:
I often use a similar technique to make sure that all the items in one list are in both. The problem is that if I just do =a2=b2, if I insert cells to add items from one list to the other, Excel automatically adjusts the references in the boolean cells and they get all screwed up. The solution I've found is to use "OFFSET" -- even if you insert cells, the references stay static, so the boolean variable is looking in the right place.
Scott H. said:
I have this problem all the time as well. Any suggestions on how to line up the rows so that they then match? I usually have to do this by hand .. inserting cells and moving the text in order to manually line the rows up.
Jason said:
Oh man, I'm the newest employee and my cube was recently moved across from the GUI jock (frequent audible sighs, grumbles and guffaws, smokers cough/cackle). And I've been learning how to run various sales reports from various databases on various days to send to various people.
At first I was happy to be learning something new, and I was feeling more like a productive member of the team, so thanks for clueing me in/bursting my bubble. But I've just finished my first year of my MBA, so there may be hope.
David said:
Juice- yeah, I've been doing this for years, but it is still news to some folks. There are some neat tricks with ones and zeroes, and even pivot tables to find missing items.
Derek- CONCATENATE() and the ampersand have very similar functions... the only benefit that I have seen to the CONCATENATE command is that it makes it easier to just click away at cells, as opposed to interrupting with an ampersand.
Jason- It's all over for you, man.
(just kidding)
David G. said:
Great tip, but it's by no means fool-proof. If just one pair doesn't match it could cause the rest of your list to come up as false - which makes it difficult to figure out which items are new.
Example:
1001 1001
1002 1002
1003 1004
1004 1005
... ...
1999 2000
2000 3000
Your entire list will be off, but you won't be able to provide a listing of what has changed - which is what I would expect the point of this exercise to be.
Wouldn't it be better to insert a column and use a combination of INDEX and MATCH to give a true comparison list that you could sort on?
ponyfizz said:
If it's just numbers, I would subtract one from the other and sort by the formula column.
All of the non-matching numbers would be grouped (at the top or bottom of the list). Simple
Rolen said:
I usually use the function, countif to handle this..
Countif(B:B, C3)
Chris Gemignani said:
3. Sorry to hear about the commenting woes. We'll fix comments in the future to appear faster. If only CONCATENATE worked on ranges!
Robbin Steif said:
To the person who wrote, this is too easy a post for Juice: you probably have a strong command of the topic, but not everyone does. And like someone else wrote: it's nice to be reminded of tricks again.
To the person who complained that then you have an error somewhere and you can find it: all you need to do is sort all the rows by the column with the proof in it, so that you can see where the problem is. You can sort Ascending, since F comes before T (how's that for easy advice?)
And finally, you can use a similar trick to dedupe email addresses, or other lists where the info will be identical. I am sure you Juicers are all over this one. But since the data are on two lines, you have to do it like this =a1=b1 And copy all the way down. Then you copy the results and repaste them right where they were, but PASTE SPECIAL - Values. Then you sort on the IF column
mike harding said:
well in the world of statistical genetics, excel doesn't have enough room in one tab. it would be useful to know how to do this in R or SAS.
AW said:
Mike,
In SAS, you'd want to use a sort-sort-merge or some inner joins.
R is on my schedule to learn in the next 2 months; so I got nothing for you there...
Brenda said:
Looks like it works for text string comparison as well!
Will said:
I think the use of MATCH is a much better option. Using the data as you have displayed it, in cell D3, you would have =IF(ISNA(VLOOKUP(C4,B$3:B$xxxxx,2,FALSE)),1,0), and then in cell D4 on downwards you would have =IF(ISNA(VLOOKUP(C4,B$3:B$xxxxx,2,FALSE)),D3+1,D3) where xxxxxx is the final row in your set. In cell D2, you have =MAX(D3:Dxxxxx). This will then tell you how many new datapoints you have.
Add a comment
Analytics Roundup: Late April edition
By Chris Gemignani
April 27, 2007
Find more about:
census
data
juicesite
population
powerpoint
presentation
presentations
pubapp
statistics
tools
- Population Estimates Data Sets
- US census data
- Are you generic? / Wilson Miner Live
- Wilson Miners post re: Django's generic views
Analytics Roundup
By Chris Gemignani
January 25, 2007
Find more about:
data
ibm
visualization
- Many Eyes: IBM's collaborative visualization
- Fernando Vargas' vision comes to light.






7 comments | Show all comments only the last 5 are shown
Avi Bryant said:
Yes, allowing this to be automated is definitely in our future plans. We just wanted to get something out quickly for people to start playing with so we could see how well it worked in the wild first.
Minor nitpick: I'm not the sole author of this service, everyone on the Dabble DB team worked on it to some degree.
Tom said:
Looks great, but I'm not sure about the prospect of sharing all of my data. Both from the perspective of sharing it with Dabble as a data processor, but also the need to transmit it over the internet.. less good!
Bjoern said:
I agree with Tom. I'd love this to run locally, but transmitting the data to dabble db and then have it mailed back to me. I don't know about US regulations but in Europe I would get seriously smacked on the head for that, especially for data like social security numbers, names and contact data that was shown in the example video.
Chris Gemignani said:
Thanks for the clarification, Avi. I've updated the text. Congrats to the DabbleDB team. We love it.
Michel Guillet said:
Hi Chris,
I've been using birst, a comparable product, with some success.
Mike Chelen said:
Is the free version always available or only for a 30 day trial? Also, it would be nice if more that 15000 rows were supported, overall looks like a pretty cool site and services :)
Ted said:
Sorry, I much prefer a spreadsheet and formulas. They give you complete control (the code in this product is going to make assumptions that are wrong at times), they are very repeatable (i.e. if I do the same thing every day, I shouldn't need to repeat my actions in a browser and run batch jobs), and you can look back at a later point in time and see what you (or soemone else) did very easily. Maybe MID is complicated, but I don't think so.
said:
Add a comment