The world is being quietly rearranged by people who write very long documents.


The title they went with Self-Directed Task Identification Noisy translates that to

Machine learning model learns to guess what you're trying to predict without being told


Researchers built a system that can look at a dataset and figure out on its own which column contains the answer you're actually looking for — no human labeling needed. Right now, someone has to manually mark every dataset to say 'this is what we're predicting.' This system does that step automatically, which could speed up the pipeline from raw data to a working model.
Data labeling is the grinding, expensive part of building machine learning systems — someone has to sit down and mark thousands of examples by hand. If a model can do this autonomously, the bottleneck moves upstream and the cost of building new systems drops. The catch is that this is a proof of concept on synthetic benchmarks, not real-world messy data, so the question is whether the trick survives contact with actual datasets that don't have clean structure.
Whether this works on real datasets from industry or research — not the synthetic benchmarks in the paper, but actual messy data where the target variable isn't obvious or where multiple columns could plausibly be the answer.

If you insist
Read the original →