The world is being quietly rearranged by people who write very long documents.


The title they went with:
CountsDiff: A Diffusion Model on the Natural Numbers for Generation and Imputation of Count-Based Data

Noisy translates that to:

Researchers built a diffusion model that works on count data — and it matches RNA-seq imputation methods


A new machine learning technique called CountsDiff can generate count-based data (such as gene expression measurements) and fill in its missing values, much as other generative models handle images or text. This matters because RNA-seq data is noisy and incomplete by nature, and the method performs as well as specialized tools built specifically for that problem.
Until now, the standard machine learning techniques that excel at generating images or text didn't work well on count data: the discrete, ordinal numbers that show up everywhere in biology (gene expression levels, cell counts, protein abundances). Researchers had to build entirely separate, domain-specific tools. This paper shows that a general-purpose diffusion model, with the right tweaks, can handle counts just as well.
What changes: biologists may no longer need specialized imputation methods for RNA-seq data, because a general tool now solves the problem as well as they do. That's a structural shift: count-based work moves closer to the broader ecosystem of generative AI tooling.
Watch whether this gets adopted into standard RNA-seq analysis pipelines at research institutions or biotech companies within the next 12 months, or whether it remains a preprint with no downstream uptake.
