
Danger! You’re Using The Wrong Data To Teach AI!

Data is the fuel for artificial intelligence. The more data we have, the better the AI will learn and find those hidden patterns, right? Unfortunately, not so much. We have the ability to collect LOTS of data. Consider the nearly 31 billion IoT devices producing information for machine consumption. However, lots of data does not translate into good data. As humans, we have not fully grasped which slice of data creates the real value in developing AI solutions. At the heart of this challenge are three major obstacles: 1) understanding what the real data is, 2) validity of belief, and 3) implicit bias.

So, what is “real data?” Simply put, it is the data the machine really needs to learn and perform work. We have fallen into the trap of thinking that having big data gives us the key information to enable AI learning. The problem, though, is that more data can lead to more misconstructions and more opportunities for bias. Consider what Dr. De Kai from the Hong Kong University of Science and Technology has shared: it takes an AI system hearing roughly 100 million words to learn a language, but a human child needs to hear only approximately 15 million words to learn one. Why is there such a delta? We don’t fully know, but there is a strong argument that it is particular words and phrases that really demonstrate the intricacies of language, not sheer volume. This suggests the secret lies in medium data, not big data. In other words, true AI skill development lies in using the critical data, not just large volumes.

To see the power of medium data, we can look at fake news detection. Unfortunately, there is a lot of fake news out there with a very large amount of variability. More variability means more data the AI needs to learn from. However, at the University of Washington, the computer scientists at the Allen Institute for AI took a different approach. They created a system called Grover that learned how to write fake news so that it could better detect fake news. To write fake news articles, Grover had to learn what real news looks like by reading real news articles, which have much less variability than fake news. In effect, through their training strategy, they simplified the amount of data needed and went from vast big-data requirements to a reduced data set.
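As a rough illustration of this idea (not Grover’s actual architecture, which is a large neural text generator), a toy sketch can model only the lower-variability “real” class and then flag text that scores poorly under that model. Everything here is hypothetical: the function names, the mini-corpus, and the scoring scheme are illustrative assumptions, not the researchers’ method.

```python
import math
from collections import Counter

def train_word_model(real_articles):
    """Count word frequencies across a small 'real news' corpus.

    Returns (word -> smoothed probability, probability for unseen words).
    Simple add-one-style smoothing keeps unseen words at a small nonzero value.
    """
    counts = Counter()
    for article in real_articles:
        counts.update(article.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    model = {w: (c + 1) / (total + vocab + 1) for w, c in counts.items()}
    return model, 1 / (total + vocab + 1)

def avg_log_prob(model, unseen_prob, text):
    """Average per-word log-probability; lower means less 'real'-looking."""
    words = text.lower().split()
    return sum(math.log(model.get(w, unseen_prob)) for w in words) / len(words)

# Hypothetical low-variability 'real news' corpus.
real_corpus = [
    "the city council approved the budget on tuesday",
    "the mayor said the budget funds new schools",
    "officials approved funding for city schools",
]
model, unseen = train_word_model(real_corpus)

real_like = avg_log_prob(model, unseen, "the council approved new funding")
fake_like = avg_log_prob(model, unseen, "shocking secret cabal controls everything")
print(real_like > fake_like)  # text close to the modeled distribution scores higher
```

The design point mirrors the article’s claim: by modeling only the less-variable class (real news), the detector needs far less data than it would to enumerate every variety of fakery.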

The obstacle of validity of belief is trickier to manage. Fundamentally, each of us holds assumptions that we consider true and wind up treating as fact. For example, what color is the sun? Most people would say yellow, maybe red or orange at sunset. However, the sun is actually white. (Sorry, Superman fans, but he shouldn’t really get any powers from a yellow sun.) Most people believe the sun is yellow because the Earth’s atmosphere scatters out the short-wavelength colors, so it winds up looking yellow. What’s the big deal, right? Imagine that we teach AI systems the sun is yellow as a fact. How would that impact their learning when they try to reconcile pictures of red-orange sunsets? Would the AI draw false conclusions that could impact other work in astronomy? Could it create false impressions about where life could exist in the universe? Think about other misconceptions we hold: different parts of our tongues detect different tastes (the receptors are actually spread across the tongue), peanuts are nuts (they’re actually legumes), sugar makes kids hyperactive (sorry, parents, it’s not true), people have five senses (there are at least nine, and scientists believe there could be twenty-one), and vitamin C helps fight a cold (there’s no scientific proof of this). AI systems only learn what we teach them. If we tell a machine something is fact when it is only an assumption, our validity of belief will limit, and possibly corrupt, the AI’s ability to find the hidden patterns that yield meaningful insights.

Finally, it is no big secret that people suffer from implicit bias. We all have subconscious stereotypes (positive or negative) that can warp our perceptions of and attitudes about things or people. Unfortunately, data can also suffer from implicit bias. Imagine a future where people’s careers are chosen by an AI system based on their performance in school from kindergarten through twelfth grade. The same standards are applied to each student. Seems fair, right? Well, are there any other factors that should come into play? What if some families could afford tutors while others could not? How about the overall caliber of each school and its teachers? How about access to items like mobile devices that could spur development in science and math at a younger age? How about the court system? In the article AI Taking A Knee: Action To Improve Equal Treatment Under The Law, the author discussed the implications of creating AI robot judges. Is there implicit bias in the court data that may cause disparate impact? The author cites the example of Batey and Turner, two college athletes convicted of rape who received very different sentences. If an AI system were to look at this data, it would wonder why Turner received six months of jail while Batey received fifteen years. The AI would start looking at differences between the two. While there are a lot of similarities, one profound difference is their ethnicity. Would this be a determining factor for the AI? If it could find other instances where judgments were drastically different for different ethnicities, then yes, this implicit bias would become a factor. That’s a problem, especially since we struggle to see our own implicit biases.
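To sketch how a naive learner could latch onto such a bias, consider a toy example. The records below are entirely hypothetical (they are not the actual case data), and the “learner” is just a grouping statistic, but it shows the mechanism: in skewed data, the feature that best separates outcomes is the one a pattern-finding system will key on, even when that feature is ethnicity.

```python
from collections import defaultdict

# Hypothetical, simplified sentencing records (invented for illustration).
# Fields: (crime, prior_record, ethnicity, sentence_years)
records = [
    ("assault", "none", "white", 0.5),
    ("assault", "none", "white", 1.0),
    ("assault", "none", "black", 10.0),
    ("assault", "none", "black", 12.0),
]

def sentence_spread_by(records, feature_index):
    """Group sentences by one feature; return the gap between group means.

    A big gap means the feature 'explains' a lot of the sentencing variation.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[feature_index]].append(rec[-1])
    means = [sum(v) / len(v) for v in groups.values()]
    return max(means) - min(means) if len(means) > 1 else 0.0

# A naive learner ranks features by how much sentencing variation they explain.
spread_by_feature = {
    "crime": sentence_spread_by(records, 0),
    "prior_record": sentence_spread_by(records, 1),
    "ethnicity": sentence_spread_by(records, 2),
}
best = max(spread_by_feature, key=spread_by_feature.get)
print(best)  # in this skewed toy data, ethnicity looks most 'predictive'
```

Because crime and prior record are identical across the toy records, ethnicity is the only feature left that separates the outcomes, so the naive learner treats the bias in the data as signal.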

Is there a real danger that we’re using the wrong data to train our AI systems? Yes, there is. However, that does not mean all hope is lost. Just as we are constantly trying to find more effective ways to teach our children, we must do the same in teaching AI. To solve the real data problem, we have to get better at understanding what the meaningful data is. This means stepping away from the notion that the secret is buried somewhere in big data, and understanding the true drivers of learning and skill development. To break the validity of belief challenge, we have to question what are really facts versus assumptions. That is, do we have concrete proof that what we believe is actually true? To address implicit bias, we must first acknowledge that we are biased. Then, we must embrace diversity and inclusion to bring more perspectives to the table, and consider both what is happening and the outcomes that will be generated based on these viewpoints. For AI to be a truly effective tool for humanity, we have to address these three things to ensure the machine is getting the proper fuel.
