Data Visualization Chunk 1 Blog Post
Hello, I am once again blogging about my foray into the world of data and data visualization. This time, however, I will mostly be discussing the work I am doing rather than committing half of my post to a bit of a rant about a particularly biased set of data I found on Kaggle. If you haven’t read that post and are now curious, here it is.
I’d like to think I am relatively good at researching and engaging with data, by which I mean that if you want data that has already been compiled into understandable statistics and want those stats analyzed, I’m your guy. [1] The difficulty for me is taking the raw data and making it into a visual, [2] although as my last post made clear, choosing a dataset can be hard for me as well. My professor’s comment on said post reminded me that I didn’t have to use Kaggle and could use other sources to find datasets, such as the Census Bureau, and as soon as I read that I felt silly for thinking otherwise in the first place haha. Nevertheless, I went looking for datasets on the other sites he recommended, [3] and I admit I still had some trouble finding datasets that I could and/or wanted to work with. A bunch of datasets were specific to a city or state, and while that’s not normally a deal breaker, I don’t know a lot about the state of Washington or the city of Louisville, which made me feel like I couldn’t properly assess the data. Some datasets were again collected prior to 2015 or were not easily understandable to me. What was most frustrating was that I also happened to find datasets whose main data were unavailable. [4] Given my continued troubles, I figured I would try searching the Census database because I have used it before and thought I could find something useful and/or interesting there. [5]
Thankfully, I found a really interesting set of tables that was used in a report entitled “Dynamics of Economic Well-Being: Poverty, 2005-2011” pretty quickly. The years of data collection are broken up into 2005-2007 and 2009-2011 (not sure where 2008 went). I downloaded the files for the years 2009-2011 and was about to start the process of combining the various Excel sheets into one larger central workbook when I realized that the year is 2020 and that this would be data from ten years ago, not five. I silently wept as I kept the files in the rare case that I decided to say fork it and use them; for the moment I went back to the drawing board.
Given the trials and tribulations of my search for data, when I saw 2019 data reports from the Current Population Survey (CPS) I was happy to use the first one I found, which was on food security. Alas, it used so many acronyms that I could not make heads or tails of any of the data. Eventually, I was able to find CPS data that used simple, accessible language about poverty levels in 2019. I then had to figure out how exactly I wanted to collate the data because the CPS is very thorough and has multiple Excel sheets that explore various factors such as school enrollment, household size, etc., which are then broken down even further based on race as well as percentage below the poverty line. [6] During that process I happened across this article from the Center on Budget and Policy Priorities website which discusses how the pandemic disrupted data collection by the CPS and will likely lower the poverty estimate by about one million. While understandable, that’s also a really large amount. The article mentions that another Census Bureau survey, the American Community Survey, would not have a similar problem because its data collection ended prior to the pandemic. I tried to access that data instead, but for some reason I couldn’t easily download it. [7] Stuck in such a manner, I was struck by the life-saving idea to use the 2018 CPS version of the same data. I further limited my data by focusing only on those who were 200% below the poverty level. Thus my quest to find a suitable dataset has ended!
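For anyone who finds the poverty-level percentages easier to follow as arithmetic, here is a minimal Python sketch. It assumes the cutoff works the way the tables label it: income divided by the poverty threshold, expressed as a percentage, with a household counted in the group when that figure is under 200. The function names and the threshold value are made up for illustration; real thresholds vary by year and household size.

```python
# Hypothetical threshold used only for illustration; not an official figure.
EXAMPLE_THRESHOLD = 25_000


def income_to_poverty_ratio(income, threshold):
    """Return income as a percentage of the poverty threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return 100 * income / threshold


def is_below(income, threshold, percent=200):
    """True when income is strictly under `percent` percent of the threshold."""
    return income_to_poverty_ratio(income, threshold) < percent


if __name__ == "__main__":
    for income in (25_000, 40_000, 50_000):
        ratio = income_to_poverty_ratio(income, EXAMPLE_THRESHOLD)
        print(income, ratio, is_below(income, EXAMPLE_THRESHOLD))
```

With this made-up threshold, an income of 40,000 sits at 160% of the poverty level and falls inside the group, while an income of exactly 50,000 sits at 200% and does not.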
The next step is to explore the data and design questions. Well, in the time it took me to find this dataset I became quite familiar with the way the CPS organizes and collects its data. I knew for each racial category what percentage of that population were 200% below the poverty level as well as how many of the total enrolled and not enrolled populations for an age group were living 200% below the poverty level. The chart also delineates between “both sexes,” “female,” and “male.” What I didn’t know was how any of this information would look in comparison with each other. For instance, how many public-school aged Black males [8] are enrolled and not enrolled in school in comparison to other ages? What about in comparison to public-school aged Black females? What about the comparison between the public-school aged kids of all incomes and the ones who are categorized as living below the poverty line? What about for White public-school aged males/females and Black public-school aged males/females? In this way I was able to set about using Excel to create visualizations of the data to help me answer these questions.
While I professed a certain, albeit limited, expertise in creating visualizations using Excel in my self-assessment, which you can read here, I realized two things as I went through this process. First, it had been quite some time since I had to use Excel for data visualization, and as such I had to re-familiarize myself with the best ways to do so, which took a while. Second, as the professor who first taught me the ins and outs of Excel used to say: Excel is really smart, but it’s also really dumb. [9] The CPS, and most of the tables I’ve found and worked with on the Census Bureau’s website, structure and display their data with the human on the other side of the screen in mind. What that means is that they have the title and a short description preceding the table, and in the table itself there are superscripts next to certain headings which lead the viewer to the bottom of the screen to see a definition of said heading. Moreover, some of the data points are symbols rather than numbers. In the tables I was using, for example, multiple cells had (B) meaning “base less than 75,000” while others had (X) meaning “estimate is not applicable or not available.” All of that information is important for the human viewer to be able to properly understand all of the data that they are seeing, yet the same information that is helpful for a human makes it extremely hard for a computer [10] to create a visualization of the data.
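For readers who would rather script this kind of cleanup than do it by hand, here is a rough Python sketch. To be clear, it is not what was done for this project (that was all Excel), and the sample table, its numbers, and the `clean_table` function are invented for illustration. It assumes the table has been saved as CSV text with title lines above the header, a superscript marker in a heading, symbol cells such as (B) and (X), and a footnote line at the bottom.

```python
import csv
import io
import re

# Made-up excerpt laid out like a human-oriented table export.
# None of these numbers come from the CPS.
RAW = """Table 1. Example title (hypothetical)
Numbers in thousands
Group,Total,Below poverty\u00b9
Group A,1200,300
Group B,(B),(B)
Group C,950,(X)
\u00b9 Hypothetical footnote text.
"""

# Symbols that stand in for numbers in the table.
SYMBOLS = {"(B)", "(X)"}

# Superscript digits that mark footnotes in headings.
SUPERSCRIPTS = re.compile(r"[\u00b9\u00b2\u00b3\u2070-\u2079]")


def clean_table(text, header_first_cell="Group"):
    """Return the data rows as dicts, with symbol cells turned into None."""
    rows = list(csv.reader(io.StringIO(text)))
    # The header is the first row whose first cell matches the expected label;
    # everything above it (title, description) is skipped.
    start = next(i for i, r in enumerate(rows) if r and r[0] == header_first_cell)
    header = [SUPERSCRIPTS.sub("", h).strip() for h in rows[start]]

    cleaned = []
    for row in rows[start + 1:]:
        if len(row) != len(header):
            continue  # footnotes and blank lines do not match the column count
        record = {header[0]: row[0]}
        for name, cell in zip(header[1:], row[1:]):
            cell = cell.strip()
            record[name] = None if cell in SYMBOLS else float(cell.replace(",", ""))
        cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    for record in clean_table(RAW):
        print(record)
```

Turning (B) and (X) into empty values rather than zeros is a deliberate choice here: a zero would show up in a chart as a real measurement, while a missing value can simply be left out.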
Due to this snag, I had to figure out the best way to reconfigure the data in a way that Excel understood. Eventually, I figured out that I would make a new workbook, copy and paste only the necessary information from the tables that I wanted to look at and compare, and then create separate sheets to view all of the charts I could create from said data. The resulting visualizations were a mix of bar graphs, line graphs, combined graphs where some variables were bars and others were lines, and on one memorable occasion I even tried displaying the data in a treemap. Overall, however, I found that the data were generally easiest to digest when they were displayed in some type of bar graph, which meant that, in my opinion, from a purely aesthetic perspective the charts were a bit boring to look at. I tried creating a pivot table/chart [11] from the data, but I think I need to either re-familiarize myself with pivot tables or reorganize the table to make it better suited for creating a pivot table/chart. Regardless of all the data drama (haha), I ended up being quite happy with what I was able to accomplish. If you would like to look at what my hard work, sweat, and tears produced, you can do so here.
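On the pivot table front, the usual advice is that pivots work best when the data is in a "long" layout, with one row per observation and one column per variable. Below is a small Python sketch of what a pivot does under that assumption. The records, counts, and the `pivot` function are all hypothetical and only stand in for the idea; this is not Excel's implementation and these are not CPS figures.

```python
from collections import defaultdict

# Made-up long-format records: one row per observation.
RECORDS = [
    {"sex": "Female", "enrolled": "Enrolled", "count": 40},
    {"sex": "Female", "enrolled": "Not enrolled", "count": 10},
    {"sex": "Male", "enrolled": "Enrolled", "count": 35},
    {"sex": "Male", "enrolled": "Not enrolled", "count": 15},
    {"sex": "Male", "enrolled": "Enrolled", "count": 5},
]


def pivot(records, row_key, col_key, value_key):
    """Sum value_key for every (row_key, col_key) pair, like a basic pivot table."""
    table = defaultdict(lambda: defaultdict(int))
    for rec in records:
        table[rec[row_key]][rec[col_key]] += rec[value_key]
    # Convert the nested defaultdicts to plain dicts for easier printing.
    return {row: dict(cols) for row, cols in table.items()}


summary = pivot(RECORDS, "sex", "enrolled", "count")

if __name__ == "__main__":
    for row_label, cols in summary.items():
        print(row_label, cols)
```

Swapping the `row_key` and `col_key` arguments flips the table on its side, which is roughly what dragging fields between the row and column areas does in a spreadsheet pivot.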
[1] I’m putting this in the footnotes because I think I might have said some of this already in previous posts and I also don’t want to seem like I’m bragging: I’ve done (relatively) extensive research into gentrification, the school to prison pipeline (including educational attainment, school discipline, prison structures, prison populations, poverty, etc), and social class in the U.S. (such as social mobility, wealth inequality, wealth v income inequality). I linked to my gentrification research previously but for ease/new readers here it is again, and here is the link to my school to prison pipeline project. I did a presentation on the social class research so I can’t easily link that.
[2] Actually, do my bookmarks count as data visualization? I hadn’t thought they did, but one of the sites we could use to visualize data is canva which is where I made the bookmarks and I guess they are similar to infographics? Hmmm, food for thought I suppose.
[3] In case any of my readers are not my professor, here is the list he directed me to.
[4] Granted this only really happened twice, once where it said there was a dataset but the link led to a screen that said there were no data files available and another where the main data was hidden behind a paywall.
[5] Half-way through the process outlined in the preceding paragraph I thought that I might want to use some of the more interesting datasets I got from Kaggle, primarily the OkCupid dating profile and the Goodreads reviews. Funnily enough, when I tried to input the data into a data visualization site, the file was so large it kept crashing my web browser. I am playing with the Goodreads reviews on the visualization tool Tableau Public but have yet to find a way to create a meaningful visualization. Ultimately, I think I will not use either of those datasets for this project.
[6] This took me a while to understand because it is not overtly explained, but being at 100% of the poverty level means that the household/person’s annual income is equal to the income that the state has designated as being the financial threshold for poverty and thus for some federal assistance programs. It follows that someone at 200% of the poverty level has an income twice that of the poverty level, so the group I describe as “200% below the poverty level” is everyone whose income is less than twice the poverty level. I found this by looking at this web page which details federal and state poverty guidelines. By the way, according to that same webpage, “[t]he poverty guidelines are sometimes loosely referred to as the “federal poverty level” (FPL), but that phrase is ambiguous and should be avoided, especially in situations (e.g., legislative or administrative) where precision is important.”
[7] The datasets were in CSV files and for some reason the data would not properly translate into Excel. This was particularly strange given that many files on Kaggle are CSV and I had no problem opening those in Excel.
[8] For the rest of this post I am planning on just writing male and female to refer to the participants in the survey. We know for sure that, at the very least, they physically appeared to be male and female, but we don’t know if they identified as men and women or something else entirely so rather than assume they are all cisgender, I am opting to use the sex terminology.
[9] I have kept the word dumb in an effort to be faithful to what the professor actually said (although now I can’t remember if he said “dumb” or “stupid”), but I recently learned that “dumb” has ableist origins/history. As in “dumb and deaf” to indicate a Deaf, hard of hearing person, and/or someone who has linguistic/speech difficulties. When I was looking up exactly why it’s ableist I found this great glossary of ableist terms and alternatives to use that also explains why we should avoid ableist language. What I found particularly interesting was that the author acknowledged that not everyone who has a disability will think certain phrases/words are offensive and that you’re not a terrible person if you’ve used them (excluding actual slurs of course), but that if you are aware of it, trying to use different language is something important to consider/try. I know I’m not great at avoiding ableist language even though I’ve been aware of this overall issue for a few years, and honestly, it’s probably because no one in my life has a disability that would make me become more aware of my language and then remember to be aware. However, I feel bad about that because I don’t think you need to be personally impacted by an oppressive system in order to care and try to fight against it. I’m partially writing this to inform you, mystery reader, and to show you that no one is perfect, but I’m also partially writing this in an attempt to keep myself accountable for my words now that I’ve put it out in writing.
[10] I don’t think my terminology is right, but I’m not sure what I should use instead—program? Algorithm? Application? Software?
[11] In case you are unfamiliar with the concept of a pivot table/chart, it is a feature in Excel (other spreadsheet applications such as Google Sheets have it too, under the same name) that lets you group and summarize the information in a table in various ways, for example totals for each category, far more quickly than if you had to do it manually. If that doesn’t make sense, let me know and I’ll try to rephrase it!