# Ethics of Data Science: Students Weigh In

Students from Linda Clark’s new course, Data Science Fluency (DATA 0200), have been considering the ethical and societal implications of data and data science through the lenses of their class projects. The course is an introduction to data science for students from all fields of study, and students get hands-on experience working with data, which includes considering important questions about how data is collected, what data is collected (or not), what that data can’t tell us, and how data can be misused, as well as how it permeates daily life. Students wrote short essays on their thoughts for the class. Here are three that show the range of topics addressed:

##### Dataset Limitations

by Rachel Okin

Since learning about statistical inference in class and starting to work on my group project, I've run into limitations of statistical inference that may cause ethical issues in my project. For the sake of this blog post, I will rephrase our group project question as one that better reflects a probability, such as: "What is the probability that a student who comes from a family in a higher income bracket will earn more (than a student from a lower family income level) when both students are 20 years out of university, if both students went to a top-tier university?" For this question, the population would be the millions of students who attend colleges or universities throughout the US and these students' families, but with the dataset we've chosen, the sample size is only ~1515 students. One limitation of using statistical inference in this case is that we are using a relatively small amount of data to draw conclusions about a population that is orders of magnitude larger. I wonder whether our small sample is skewed, or whether our distribution would look completely different if we had 1,000 more subjects or even 1,000,000 more.
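One way to get a feel for the sample-size question is to simulate it. The sketch below is only an illustration and makes several assumptions that are not part of the class project: the "population" is a made-up log-normal distribution of earnings, the samples are drawn completely at random, and `margin_of_error` is a hypothetical helper that uses the usual normal approximation for a 95% interval around the sample mean.

```python
import random
import statistics


def margin_of_error(sample, z=1.96):
    """Approximate 95% margin of error for the mean of a simple random sample."""
    return z * statistics.stdev(sample) / len(sample) ** 0.5


random.seed(0)

# Hypothetical population of earnings (log-normal), purely for illustration.
# It is NOT the dataset used in the project.
population = [random.lognormvariate(11, 0.6) for _ in range(200_000)]

# Compare the project's sample size with 1,000 more subjects and a much larger sample.
for n in (1515, 2515, 100_000):
    sample = random.sample(population, n)
    print(n, round(statistics.mean(sample)), round(margin_of_error(sample)))
```

Under these assumptions the margin of error shrinks roughly with the square root of the sample size, so adding 1,000 subjects narrows it only modestly. Note that this only describes random sampling error. If the sample is skewed in how it was collected, a larger sample drawn the same way would not fix that.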

Another ethical issue that comes with using statistical inference for our specific college and university dataset is that the subjects are anonymized even though each individual subject is incredibly different. In this dataset, the only substantial information we know about the anonymous subjects is the tier of school they attended, their family’s income level, and their “earnings” when they left college. There are many more substantial factors about these individuals that, if recorded, could help us make more informed conclusions about the population, such as where each subject grew up, how many siblings their parents were providing for, or whether they’ve encountered impediments in their job search. Because the world we live in is full of biases that may determine who gets a job or how much one person gets paid in comparison to another, we must evaluate each individual’s situation separately. Therefore, it feels ethically wrong to come to some conclusion about the probability that an individual will come out of an X-tier college with X earnings because their parents were in X income bracket when we are completely disregarding the various other reasons why or how that individual has that outcome. While we cannot always realistically collect data for and study every single subject in a population of millions, we must recognize that the alternative, using statistical inference, also often has ethical flaws.

##### Personal Choices

by Katie O'Leary

A friend of mine made an off-hand comment the other day as she was searching through her phone: “Do you ever start closing your apps and then realize how many are actually there?”

No. Actually, I don’t. Maybe it’s a difference in personality, maybe in generation, or maybe it’s a new-found awareness in myself. I tend to be on the cautious side, so my location is turned off, I close my applications after using them, and I always manage my cookie settings when I enter a website. That’s my normal for managing my data.

On the other hand, my friend recently mentioned that an ad came up for a candle she had been talking about. This data harvesting from audible conversations is real and can make some people uncomfortable, but it depends on those settings you read (but really didn’t) and accepted. She had also searched for the candle on her browser and opened a link to the seller’s website. She gets targeted for ads, and that’s her normal for managing her data.

Neither approach is inherently better, but these interactions start a dialogue about what happens to our data and who we let collect it. The choices themselves are important, but not as critical as the ability to have the conversation in the first place. It all starts with honesty.

I firmly believe that honesty and transparency are the pillars of ethics in data science, and they are something I work hard toward in my own data projects. Knowing that I have to explain my choices in data science makes them more thoughtful. When data analysts, programmers, and scientists are honest, end-users attain self-actualization and personal choice. Honesty empowers. Data is given, not taken. The users are given the dignity of knowing, which lets them calculate their own data risks and prevents misuse.

In the end, we’re all learning and working through the digital age together, making our choices, and learning how to interact with data collection and use. The issue of ethics in data science is a continuous conversation, but end-users need honesty from data scientists in order to understand the choices they have. Plain-language honesty is paramount. It allows my friends and me to be empowered by choice and spurs ethical thought in data scientists.

##### Misrepresenting Data

by Isaiah Spencer