I just watched Theo skim through Claude’s “Constitution” in this video. Along the way he explained what synthetic data means, and I thought the explanation was gold.
Think of training a model to colorize an image
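You can generate tons of synthetic training data: take color images and convert them to black and white. Now you have perfectly labeled input/output pairs, where the black-and-white image is the input and the original color image is the expected output. The black-and-white inputs are all synthetic/generated, so nobody has to label anything by hand.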
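To make the colorization example concrete, here is a minimal sketch of how such pairs could be generated. It is not from the video. The function names (`to_grayscale`, `make_training_pairs`), the random stand-in images, and the choice of the BT.601 luma weights for the conversion are all my own assumptions for illustration.

```python
import numpy as np

# ITU-R BT.601 luma weights for R, G, B (one common grayscale conversion;
# picking this particular formula is an assumption, not something from the video).
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def to_grayscale(color_image: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, 3) RGB image into an (H, W) grayscale image."""
    return color_image @ LUMA_WEIGHTS


def make_training_pairs(color_images):
    """Return (grayscale_input, color_target) pairs for a colorization model."""
    return [(to_grayscale(img), img) for img in color_images]


# Hypothetical stand-in for a real photo collection:
# three random 4x4 RGB images with values in [0, 1].
rng = np.random.default_rng(0)
color_images = [rng.random((4, 4, 3)) for _ in range(3)]

pairs = make_training_pairs(color_images)
```

Each pair holds the generated grayscale image as the model input and the untouched color image as the target, so the labels come for free from the data you already had.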
Buying companies just for their codebase
He also mentioned research labs buying companies just for their codebase and git history. All of that data (PRs, bug fixes, etc.) could be great training data.
Interesting stuff.