My Struggles with and Thoughts on Kaggle Datasets

For our latest project we have to find a dataset, explore it, pose and answer questions based on that data, create a visualization of said data, and finally post about the experience. Currently, I have yet to even find a dataset, let alone do all the other tasks. The problem arises from a difficulty in finding data that involves a topic I care about, appears reliable, is recent enough that I feel the data is still relevant, and is accessible enough to my humanities-centered brain that I can formulate questions about it. “Why are you writing a post then?” I hear some of you ask. Well, read on and find out!

Some of you readers might be saying “Ezra, that’s a lot of criteria to fill, of course you’re not finding anything that fits your exacting desires.” There might be a point to that. However, the issue so far seems to mainly rest on the fact that the data I do find was either collected more than five years ago (the oldest that data could be if I wanted to use it in any sociological course that I took, so I figured it was a good parameter to use) or is bereft of any interesting demographic information. For instance, I thought I would look into advertising because it deals with language and attempts to gain people’s attention, so it is likely to utilize social norms that I would be interested in seeing in action. Yet a lot of the datasets that I was finding on Kaggle, the site my professor recommended to my class for raw datasets about anything and everything, have only basic information about what the ad was about rather than how it was presented, e.g. the exact kind of car that was advertised, not whether the ad was aimed at viewers by, say, posing the car as a way to get women’s attention.[1]

For a while I was also trying to look into car accidents in the U.S. and how many, if any, involved alcohol and/or other drugs, but the datasets I was finding were too old, were not from the U.S.,[2] did not discuss the use of drugs/alcohol, or had a low usability rating.[3] This series of failures to acquire good data continued for many of the topics I looked into, such as prisons and their intersection with education, school shootings,[4] employment and gender and/or race, dating, books, etc. I actually found a good dataset for books, about the number of reviews for, and the average rating of, books on Goodreads, but I wasn’t sure if I had any interesting questions to ask, nor was I sure how to properly organize the data.[5] I also got an interesting dataset for the employment question, and even though the data is specific to L.A. I was going to work with it, but I found two problems. First, there was one dataset whose collection date I knew and another whose collection date I didn’t. Second, despite my initial impression, the data was not cleanly organized,[6] so when I tried to create a function that would judge the job description text based on whether it had masculine-coded words, feminine-coded words, or neither,[7] it would come up with an error after about twenty entries because the job description was spread out over multiple text boxes in that column rather than one text box in the same row as the job title. Not sure if that made sense, but trust me when I say it would’ve been a pain to fix, so I said to heck with that and tried to look for something else. Lastly, I also found a reliable and interesting dataset that catalogued OkCupid profiles from a now-forgotten year; the issue was that the dataset was really large and included multiple paragraphs of text per profile, since you write up your interests and a small bio in a dating profile, so I got overwhelmed and tried to find something else that would work.[8]
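For anyone curious, here is a minimal sketch of one way the multi-row problem described above might be handled. It is not the function from the original attempt and does not use the actual L.A. dataset. It assumes each row is a (title, text) pair and that a blank title means the text continues the previous job's description. The names `merge_fragments` and `classify`, the sample rows, and the two tiny word lists are all hypothetical placeholders, not a researched lexicon. It also adds the "both" label that footnote [7] wishes for.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only;
# a real analysis would need a researched lexicon.
MASCULINE_CODED = {"competitive", "dominant", "assertive"}
FEMININE_CODED = {"supportive", "collaborative", "nurturing"}


def merge_fragments(rows):
    """Join description fragments that spill onto rows with an empty title.

    Assumes a blank title means "this text continues the previous job".
    """
    merged = {}
    current = None
    for title, text in rows:
        if title:  # a new job posting starts here
            current = title
            merged.setdefault(current, [])
        if current is not None and text:
            merged[current].append(text)
    return {title: " ".join(parts) for title, parts in merged.items()}


def classify(text):
    """Label text as 'masculine', 'feminine', 'both', or 'neither'."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_masculine = bool(words & MASCULINE_CODED)
    has_feminine = bool(words & FEMININE_CODED)
    if has_masculine and has_feminine:
        return "both"
    if has_masculine:
        return "masculine"
    if has_feminine:
        return "feminine"
    return "neither"


# Made-up sample rows, not taken from any real dataset.
rows = [
    ("Analyst", "Seeking a competitive,"),
    ("", "assertive self-starter."),
    ("Coordinator", "A supportive and collaborative role"),
    ("", "in a competitive field."),
    ("Clerk", "Files paperwork."),
]

labels = {title: classify(text) for title, text in merge_fragments(rows).items()}
print(labels)
```

Merging the fragments first means the classifier only ever sees one complete description per job title, which is the shape the function expected in the first place. Note that in this sketch rows sharing the same title would be merged into a single description, which may or may not be what a real dataset needs.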

If any of you are still here, waiting for that surprise reason as to why I’m writing this now, your wait is over! The reason is that while I was once again scrolling through the public datasets in search of the ever-elusive perfect dataset, I found this abomination. Is “abomination” hyperbole? Probably, to be honest, but it was almost funny how flawed their research premise is.[9] The title intrigued me, and the first sentence of the context section is perfectly benign: “gender is a social construct.” Yes, good, that’s a basic gender studies fact; now where are you going to go with it? Their second sentence is ok, although I take issue with their wording: “The way males and females are treated differently since birth moulds their behaviour and personal preferences into what society expects for their gender.” Technically, that’s totally accurate because our society assumes that sex and gender are not only synonymous but also that one causes the other,[10] and thus people are treated differently based on their assigned sex at birth. However, it’s also not that simple because your body (sex) and your gender are different, and continuing the erasure of those differences makes things harder for literally everyone.[11] The description continues, saying that the dataset was “designed to provide an idea about whether a person’s gender can be predicted. . .based on their personal preferences,” which relies on a lot of gender-based stereotyping, but maybe that was their point, so I moved past that only to read in their inspiration section that:

“With the rise of feminism, the difference between males and females in terms of their personal preferences has decreased in recent years. For instance, historically in many cultures, warm colors such as red and pink were thought of as feminine colors while cool colors such as blue were considered masculine. Today, such ideas are considered outdated.”

I’m pretty sure my jaw dropped when I read that. What world do they live in that the differences between the personal preferences of men and women have decreased? Ok, I know that there’s some truth to that, but largely that’s because it’s become acceptable for women to admit to enjoying, and participating in, masculine-dominated hobbies and fields.[12]

I suppose their most egregious error was the last sentence in that first paragraph. What a Euro-centric and shallow investigation into history must they have done to be able to say that “historically in many cultures” the colors pink/blue were coded as feminine/masculine? I recognize that I am being too harsh because that is the narrative that we are taught at school, so it makes sense they would uncritically take that as fact. At the same time, however, this is not the first time this semester[13] that I have had to deal with people investigating gender and gender stereotypes without actually interrogating any of their own biases first or knowing anything about previously completed research. This is a frustration that has built slowly over the semester and I guess this dataset was the last straw.

Thank you so much for reading this gigantic post of mine!

 [1] To be fair, I did find a cool dataset about the ads during the Super Bowl, which I thought would work well, but it has all of the ads from 1967 to 2020, which is ridiculously large and thus extremely daunting.

 [2] Perhaps I should try looking at different countries? But I live in the U.S. and that is the context I am most familiar with, so in some ways I wouldn’t be able to properly analyze said data.

 [3] The low usability rating meant that many aspects of the data were either not complete or not verified, which made me hesitant to use it.

 [4] There were understandably more datasets about police shootings than school shootings specifically. It is definitely a morbid topic but it is an important one; the one dataset on Kaggle that really went into this topic uses a study done by researchers from Northwestern University and Wikipedia, and it was the latter which really made me question the data’s reliability because of the years of Wikipedia wariness drilled into me haha.

 [5] Thinking about it, maybe I can do something with authors and the amount of ratings they get? Hmmm, I’ll think on that.

 [6] Whether the data is “clean” or not is lingo I have learned from scouring the datasets on Kaggle. :D

 [7] I realize now that I should have also made a way for it to recognize if the description has both but oh well, hindsight is 20/20.

 [8] Oh I should have asked my professor for help you say? Well, yes, you’re right and I just might after I finish this post, but now I’m second guessing whether even this dataset would prove fruitful for me. Although that could just be the melodramatic in me speaking.

 [9] Or should I say was because the data collection occurred in 2015? If you have the answer please let me know in the comments haha.

 [10] As the adage goes, correlation does not equal causation, and the same goes for sex and gender.

 [11] Using the word for sex (male or female, although there’s also intersex which is a whole other topic) when you really mean gender (man or woman, I’m ignoring the other possibilities for simplicity’s sake) is a big pet peeve of mine, probably because of my personal connection. Half of my gender dysphoria is caused by other people connecting all the physical dots of my person that equal the category “female” and then assuming that must mean I’m a “woman.” I was planning on writing a separate post about this topic based on a very interesting and well-intentioned but somewhat poorly executed Aeon article about the biological category of sex. I might still write it, but with my schedule lately, that is becoming less and less likely, so here’s a snapshot of the post: the author was arguing that in biology there really are only two sexes and the definition of them is very narrow, thereby allowing the definitions to act as a necessary tool for thinking about life and how it comes to be while simultaneously allowing for the diversity of life found in nature. I was going to argue that the author has a point, but he also misses the point of gender theory: the whole idea of there being more than two sexes arose because we were always taught that sex is determined by chromosomes (a characteristic the author says doesn’t even work as a proper definition), and that easily bolstered the idea that gender diversity could not occur because our bodies disallow the possibility. In this way, challenging that narrative was never about changing the academic discipline of biology; it was about changing the way mainstream society thought about these concepts so that the diversity of human life could exist.

 [12] That’s also a whole separate issue because it conflates gender expression with sexuality, which I also hate because it’s very heteronormative and further restricts how a person can or cannot act.

 [13] There’s a class I’m taking this semester that wants to focus on gender, and while it is well-intentioned, it does so uncritically, so I find that they are not helping in the way they want to or think they are. I honestly sometimes feel uncomfortable/on the spot when talking to the professor because I am the one trans non-binary student.

Comments

  1. Ezra,
    As usual, your blog post is very thorough and thoughtful.
    Although your journey seems as though it might have been painful, you have revealed a fundamental truth about dealing with data: data are never neutral. They always rely on several factors: 1) the quality and consistency with which they were gathered; 2) the biases/prejudices of those collecting them; 3) the analysis and interpretation they all require, which carry the risk of your own biases and prejudices.
    All that being said, if the power is in the analysis, it frees us up (and our students as well). We have a mind and a voice that we can bring to these tasks.
    You may want to explore some resources beyond Kaggle (which I included only as a starting point). RIT has a very cool Open Data project underway (https://www.rit.edu/research/open#resources). And this link provides a number of Open Data resources (https://www.freecodecamp.org/news/https-medium-freecodecamp-org-best-free-open-data-sources-anyone-can-use-a65b514b0f2d/).
    Good luck on the journey and keep us posted.
