Session 2: Technical Hurdles, Research Solutions

Journalists on the panel will identify specific technical problems in dealing with government records at federal, state, local, and tribal levels.

Comments/Talking points from David Donald

I sometimes liken my data work to cooking. In the kitchen, the cook spends much time in preparation, technique, and methods. It can take a lot of time. And the fun part, sitting down to eat with friends and family, can go by so quickly. In preparing data analysis for my work at the Center for Public Integrity, a lot of time goes into preparation, technique, and methods. Sometimes the fun part, the analysis, goes by quickly.

What can really throw off the cook, however, is the technical problem of bad or inadequate food. For me as an analyst, it's the technical hurdles presented by bad, incomplete, and inadequate data.

Much of what I work with is contained in a government database. The database remains a fundamental level of government information in the information age. When government records are not stored in a database but are kept on paper or an electronic version of paper (the PDF format, for instance), I often have those turned into databases. I mostly, then, work in columns and rows, variables and cases, fields and records, whatever you want to label the fundamental data matrix.

Here are some of the technical hurdles in accessing data in usable columns and rows:

The electronic format used to defeat electronic release of records.
The PDF format is too often used by government officials, especially state and local, as an electronic release of records. They will jump through hoops to turn something as simple as an Excel table into a PDF. The PDF is not a data format. While we often can pull the data out of a PDF, it's more successful in some instances than in others.

Missing metadata.
This can mean a data dictionary is incomplete (if present at all), code sheets are not listed, or import code is too platform-specific. The attitude seems to be: let's put the data out there but keep them guessing.

Platform assumption.
Government officials try to be helpful by anticipating the platform that the end user will use to analyze the data. They actually make it more difficult for those users who use other platforms. In investigative reporting, we're taught to assume nothing. Otherwise, the agency favors some customers over others.

File corruption.
Government officials point the user to the data online, only for the user to find that the data have become corrupted and don't import. A backup isn't provided (assuming the backup isn't prohibitively large), and the government agency refuses to fix the corrupted file.

Government agency as retailer.
I don't mean just that agencies charge retail prices for the data. That's not so much a technical problem as a freedom of information problem. What I'm talking about is treating the user as an end consumer, someone who needs to look up one record to solve a simple problem. Hence, too much government data hides behind look-up forms. Instead of being someone who is buying one tomato for tonight's salad, I need to buy bunches of tomatoes to find out what's going on in the market. I can distribute the individual tomatoes myself. In effect, I'm the retailer, not the end customer. Those working with government data should be thought of as retailers, not final consumers. That makes the agency a wholesaler.

Unstructured data.
A federal form doesn't require information to be entered as columns and rows. We get unstructured text. Even though the data are in the forms, extracting the data in a regular pattern is difficult, if not nearly impossible.

While many solutions exist, I'm sure, here is my government data dream. All data releasable under FOIA would be provided in a wholesale manner as:

Machine readable (likely a text file)
With complete metadata
Maintained with service in mind
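To make the first two items on that wish list concrete, here is a minimal sketch of what complete metadata buys the person receiving a plain text file. Everything in it is an assumption for illustration: the column names, the sample rows, and the `load_with_dictionary` helper are hypothetical and do not come from any agency's actual release.

```python
import csv
import io

# Hypothetical data dictionary: column name -> (type, description).
# A real release would ship this alongside the text file.
DATA_DICTIONARY = {
    "permit_id": (int, "Unique permit number"),
    "issued": (str, "Issue date, YYYY-MM-DD"),
    "fee": (float, "Fee in dollars"),
}


def load_with_dictionary(text_file, dictionary):
    """Read a comma-delimited text file and convert each field by its documented type."""
    reader = csv.DictReader(text_file)
    # Fail early if the file lacks a column the dictionary documents.
    missing = set(dictionary) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"columns missing from file: {sorted(missing)}")
    rows = []
    # The header is line 1, so the first record is line 2.
    for line_number, record in enumerate(reader, start=2):
        try:
            rows.append(
                {name: cast(record[name]) for name, (cast, _) in dictionary.items()}
            )
        except ValueError as err:
            raise ValueError(f"line {line_number}: {err}") from err
    return rows


# Made-up sample standing in for a wholesale text release.
SAMPLE = "permit_id,issued,fee\n101,2011-03-01,25.00\n102,2011-03-02,40.50\n"
rows = load_with_dictionary(io.StringIO(SAMPLE), DATA_DICTIONARY)
```

With the dictionary in hand, the import is the same few lines on any platform, and a file that does not match its own documentation fails loudly instead of leaving the analyst guessing.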
Data.gov shows promise (and its potential cut in funds is disturbing). Advances in text mining are encouraging.

The final problem is one that may be hard to solve with increasing privacy concerns. What makes government data technically difficult to work with across agencies and across federal, state and local levels is the inability to link entities: the people, organizations and other groups in the databases. Yes, Social Security numbers need to be protected. Releasing dates of birth is only a partial, if controversial, solution. I have heard some advocate a non-purposed federal, state or local identification number. It would connect to nothing and would serve only to connect people across data. Some have suggested unique IDs that link to nothing but the database reference. Others have suggested metadata, such as Semantic Web RDF/XML tags (see http://rdfabout.com/intro/).

I'll leave it there by just saying it's part of my dream of serving up government data that would satisfy my appetite.

________________________________
David Donald
The Center for Public Integrity
Managing Editor Data
910 17th Street NW, 7th Floor
Washington, DC 20006
Office: (202) 481-1247
Mobile: (703) 622-7174
www.publicintegrity.org