# AI

# usage

Comprendre et utiliser les modèles de langage d'IA (Sébastien COLLET) - Devoxx 2023

# articles

Exploring AI - Kent Beck - 20240125

# training

Yann LeCun on AI training, on LinkedIn

Animals and humans get very smart very quickly with vastly smaller amounts of training data than current AI systems.

Current LLMs are trained on text data that would take 20,000 years for a human to read.
And still, they haven't learned that if A is the same as B, then B is the same as A.
Humans get a lot smarter than that with comparatively little training data.
Even corvids, parrots, dogs, and octopuses get smarter than that very, very quickly, with only 2 billion neurons and a few trillion "parameters."

My money is on new architectures that would learn as efficiently as animals and humans.
Using more text data (synthetic or not) is a temporary stopgap made necessary by the limitations of our current approaches.
The salvation is in using sensory data, e.g. video, which has higher bandwidth and more internal structure.

The total amount of visual data seen by a 2-year-old is larger than the amount of data used to train LLMs, but still pretty reasonable.
2 years = 2 x 365 x 12 x 3600, or roughly 32 million seconds (assuming 12 waking hours per day).
We have 2 million optic nerve fibers, each carrying roughly ten bytes per second.
That's a total of about 6E14 bytes. The volume of data for LLM training is typically 1E13 tokens, which is about 2E13 bytes.
It's a factor of about 30.
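The back-of-the-envelope numbers above can be checked in a few lines (the 12 waking hours per day, the 2 million fibers, and the 10 bytes/s per fiber are the post's own assumptions):

```python
# Rough check of the optic-nerve vs. LLM-training-data comparison.
WAKING_HOURS_PER_DAY = 12
seconds = 2 * 365 * WAKING_HOURS_PER_DAY * 3600     # ~3.2e7 seconds in 2 years
visual_bytes = 2_000_000 * 10 * seconds             # 2M fibers x 10 bytes/s
llm_bytes = int(1e13 * 2)                           # 1e13 tokens x ~2 bytes/token

print(f"{seconds:.2e}")                     # ~3.2e7
print(f"{visual_bytes:.2e}")                # ~6.3e14
print(f"{visual_bytes / llm_bytes:.1f}")    # ~31.5, i.e. "a factor of 30"
```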

Importantly, there is more to learn from video than from text because it is more redundant.
It tells you a lot about the structure of the world.

TLDR: Next-gen AI needs to use video instead of text.

For comparison, see this interview with Jean-Baptiste Kempf (VLC) about how video encoding works.

  • an image is an array of pixels; each pixel is a color
  • a video is a sequence of images (typically between 24 and 60 images per second)
  • CODEC = compression/decompression algorithm used to transmit video
  • raw, pixel-by-pixel video is around 10 to 40 Gb/s
  • the goal of a CODEC is to divide the bandwidth used by a factor of 100, 200, ... up to 1,000
  • reducing bandwidth means destroying information (lossy compression)
  • the techniques are based on how the human eye behaves: some colors are perceived better than others, so some color information can be discarded without visibly degrading the image
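The 10-40 Gb/s figure for raw video can be sanity-checked for common resolutions. The 24 bits per pixel (8-bit RGB) below is my assumption, not the interview's:

```python
def raw_gbps(width: int, height: int, fps: int, bits_per_pixel: int = 24) -> float:
    """Bitrate of uncompressed video, in gigabits per second."""
    return width * height * bits_per_pixel * fps / 1e9

print(raw_gbps(1920, 1080, 60))   # full HD 60 fps: ~3 Gb/s
print(raw_gbps(3840, 2160, 60))   # 4K 60 fps:     ~12 Gb/s
print(raw_gbps(7680, 4320, 60))   # 8K 60 fps:     ~48 Gb/s
```

So the 10-40 Gb/s range quoted above roughly corresponds to 4K-8K material, and a 1,000x codec brings 4K60 from ~12 Gb/s down to ~12 Mb/s, a typical streaming bitrate.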

Every CODEC behaves the same way: it discards data the eye cannot see, and it looks for redundant blocks of data within an image or between consecutive images.
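One concrete instance of "discarding data the eye cannot see" is chroma subsampling: the eye is less sensitive to color detail than to brightness, so codecs keep the luma plane (Y) at full resolution and store the two chroma planes (U, V) at quarter resolution ("4:2:0"). A size comparison for a single 1080p frame at 8 bits per sample:

```python
W, H = 1920, 1080

# 4:4:4 - every pixel stores luma plus two full-resolution chroma samples
full = W * H * 3

# 4:2:0 - full-resolution luma, chroma planes at half width and half height
subsampled = W * H + 2 * (W // 2) * (H // 2)

print(full, subsampled, full / subsampled)   # 6220800 3110400 2.0
```

Halving the data before any actual compression runs, with no visible loss for most content.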

MPEG-1 (1993) ---> MPEG-2 (1995) = DVD ---> DIVX (1999) (=MPEG-4) ---> H.264 (2003) ---> HEVC (2013) ---> VP9 (2013)
  • H.264 is the most widely used CODEC in the world, around 80% of usage.
  • HEVC is hampered by royalties: it is used in television but remains largely unused on the web, around 5%.
  • VP9, created by Google, is royalty-free and open source; YouTube and Facebook use it.
  • AV1 (and later AV2) were created by the Alliance for Open Media, initiated by Google.
  • AV1 is decoded by dav1d, a VideoLAN/VLC project of around 210K lines of assembly + 30K lines of C. This implementation is widely used by the GAFAM.

# misc

ChatGPT guide for developers

vocabulary: www.frenchweb.fr

Aux origines de l'intelligence artificielle - www.franceculture.fr - 20180331

Machine Learning: The High Interest Credit Card of Technical Debt - 2014

Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is to highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

Taken from "Machine learning and tech debt: A publication from Google" on www.funfunforum.com:

Another worry for real-world systems lies in hidden feedback loops. Systems that learn from world behavior are clearly intended to be part of a feedback loop. For example, a system for predicting the click through rate (CTR) of news headlines on a website likely relies on user clicks as training labels, which in turn depend on previous predictions from the model. This leads to issues in analyzing system performance, but these are the obvious kinds of statistical challenges that machine learning researchers may find natural to investigate [2].
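The CTR feedback loop described above is easy to reproduce in a toy simulation (all numbers below are hypothetical): a greedy serving policy only gathers click feedback on the headlines it chooses to show, so its estimates for everything else stay frozen, and a genuinely better item can remain invisible forever.

```python
true_ctr = {"headline A": 0.50, "headline B": 0.90}   # B is genuinely better
est_ctr  = {"headline A": 0.60, "headline B": 0.10}   # stale initial estimates
impressions = {"headline A": 0, "headline B": 0}

for _ in range(1000):
    shown = max(est_ctr, key=est_ctr.get)   # always serve the predicted best
    impressions[shown] += 1
    # The model retrains only on feedback from items it showed:
    # its training labels depend on its own previous predictions.
    est_ctr[shown] += 0.1 * (true_ctr[shown] - est_ctr[shown])

print(impressions)   # {'headline A': 1000, 'headline B': 0}
print(est_ctr)       # A converges to its true 0.5; B stays frozen at a wrong 0.10
```

Headline A's estimate converges to its true rate, but headline B is never shown, so the system never learns it would perform better.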

Exponential growth of supercomputing power, 1995-2060 (logarithmic scale)

Human-level artificial intelligence could be achieved "within five to ten years", say experts - www.futuretimeline.net - 20180925

Santé : nos données personnelles peuvent-elles sauver des vies ?

38:20: Patient records sell for between 100 and 150 euros each on the dark web (used to feed medical-domain AIs)

Éric Sadin : l'asservissement par l'Intelligence Artificielle ? - Thinkerview - 20181108

The goal of AI development by the GAFA is to eliminate free will: by analyzing an individual's successive states, the system can steer the choices it offers them.

"It's not possible to avoid AI's mistakes," says Luc Julia - 20231220

Luc Julia, co-creator of Siri, interviewed on France Inter

# tools

TensorFlow

PyTorch

Hugging Face

ML-related tools, plus a hub of tools and models

# ai self-hosting

# articles

Get Started with Mistral 7B Locally in 6 Minutes

# solutions

# jan.ai

https://jan.ai/ janhq/jan - github.com

AI finally set free! A free, local, open-source ChatGPT

# mistral

github.com/mistralai

docs.mistral.ai (documentation, source code)

mistralai/client-js - github.com

You can use the Mistral JavaScript client to interact with the Mistral AI API.
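The repo above is the official JavaScript client, but the underlying API is a plain OpenAI-style HTTP endpoint, so any language works. A minimal sketch follows; the endpoint path and the `mistral-small-latest` model name are my assumptions from the public docs, so check docs.mistral.ai before relying on them:

```python
import json
import urllib.request

API_URL = "https://api.mistral.ai/v1/chat/completions"  # assumed endpoint

def build_payload(prompt: str, model: str = "mistral-small-latest") -> dict:
    """Chat-completion request body, in the OpenAI-style schema Mistral uses."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(api_key: str, prompt: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:   # network call: needs a real API key
        return json.load(resp)["choices"][0]["message"]["content"]
```

Usage would be `chat(api_key, "Hello!")` with a key from the Mistral console; the official clients add retries and streaming on top of this same request shape.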