What are we talking about when we talk about AI?

Real world with data

We live in a real world filled with data. Every day, data on the order of exabytes ($10^{18}$ bytes) is generated worldwide [1]. One question people always care about is how to make use of this overwhelming amount of data. One simple solution is to store the data and, whenever some of it is needed, replay what is stored in our database. In most situations, however, we are not lucky enough to have stored exactly the data we want. There are also cases where the data is too complicated (i.e., many aspects, or dimensions, are needed to describe a single data point), or where there is so much data that people cannot easily search within it.

To recap, the real world is always much more complicated than we imagine, and describing it with only the raw data we have collected is usually a hard task. What we need in this case is a “model”, which has some ability to predict the pattern of real-world data (so-called pattern recognition). How to model the data is the central problem in machine learning and deep learning.

You may wonder what the so-called “pattern” of the data is. We have a powerful tool for describing data distributions: statistics. Many readers may already be familiar with this subject, so I will give only a brief introduction to the topic.

Probability distributions and datasets

From the perspective of machine learning, we can model the dataset with a parameterized distribution $P_\theta(X_{data})$, where $\theta$ denotes the parameters and $X_{data}$ is the dataset. The dataset may be either labeled (as in image classification, where each image is assigned to a class) or unlabeled (as in clustering, where the data is grouped into different clusters).
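
A minimal sketch of what a parameterized distribution $P_\theta(X_{data})$ can look like in practice. It assumes, purely for illustration, that $P_\theta$ is a one-dimensional Gaussian with $\theta = (\mu, \sigma)$, that the samples are independent, and that the dataset is synthetic. The helpers `fit_gaussian` and `log_likelihood` are hypothetical names, not taken from any library, and maximum likelihood is only one common way to choose $\theta$.

```python
import math
import random


def fit_gaussian(data):
    """Estimate theta = (mu, sigma) of a 1-D Gaussian by maximum likelihood."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, math.sqrt(var)


def log_likelihood(data, mu, sigma):
    """Return log P_theta(X_data), assuming the samples are independent."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


# Synthetic, unlabeled "dataset" (an assumption made for this example):
# 1000 samples drawn from a Gaussian with mean 5 and standard deviation 2.
rng = random.Random(0)
x_data = [rng.gauss(5.0, 2.0) for _ in range(1000)]

# Fit the parameters theta to the data.
mu_hat, sigma_hat = fit_gaussian(x_data)
print("fitted theta:", mu_hat, sigma_hat)

# Compare the fitted parameters with an arbitrary guess theta = (0, 1).
print("log-likelihood (fitted):", log_likelihood(x_data, mu_hat, sigma_hat))
print("log-likelihood (guess): ", log_likelihood(x_data, 0.0, 1.0))
```

The fitted parameters give a higher log-likelihood than the arbitrary guess $\theta = (0, 1)$, which is one way of saying that they describe the pattern of this dataset better.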

References

[1] https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read