Newscycle's tech doctors win new ML clustering patent



A new technique that clusters like with like in real time is the subject of a patent granted to Newscycle Solutions.

The 'unsupervised learning and document clustering' technique is already in use in Newscycle's NewsEdge content-as-a-service platform. It provides exact categorisation, whereas previous methods were slower and only approximate.

It was developed by Lawrence Rafsky, chief artificial intelligence/machine learning scientist at Newscycle subsidiary Acquire Media, in collaboration with Jonathan Marshall.

Rafsky says the challenge of unsupervised learning is to take a huge sample of individuals - "or individual things, like news articles" - and group like-with-like.

He says that while there are numerous other techniques for unsupervised learning, the Newscycle-patented solution is unique in that it runs in real time and gets an exact answer to the underlying "combinatorial minimization" problem.

"It's super-fast and we get the exact answer, whereas other techniques just arrive at an approximation," he said.
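For readers unfamiliar with the term, a "combinatorial minimization" problem in clustering can be written in a generic textbook form such as the one below. This is an illustrative assumption only: the article does not say which objective the patented method minimizes, and the symbols $X$, $k$, $d$ and $\Pi_k(X)$ are introduced here purely for the example.

```latex
% Generic clustering objective (illustrative, not taken from the patent):
% choose the partition of the data set X into k non-empty clusters
% that minimises the total dissimilarity d(x, y) within clusters.
\[
  \min_{\mathcal{P} \in \Pi_k(X)} \;
  \sum_{C \in \mathcal{P}} \;
  \sum_{\{x, y\} \subseteq C} d(x, y)
\]
% \Pi_k(X) is the set of all partitions of X into k non-empty clusters.
% Its size is the Stirling number of the second kind S(n, k), with n = |X|.
```

Because the number of candidate partitions grows explosively with the number of items, exhaustive search is impractical for problems of this kind, which is why approximate methods are common.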

The NewsEdge content-as-a-service solution runs the algorithm continuously throughout the day to group more than 750,000 news articles a day into buckets of single-theme news events. The technique could also have applications in other industries.

Dr Marshall says the method - developed to support news article topic clustering - works equally well in medical data analysis, ecommerce transaction analytics, advertising segmentation, database similarity-joins, and other areas.

"For more than 50 years, textbooks have explained why clustering slows down with each arriving data item. Now, for the first time, we have produced a data clustering method that does not slow down - it is just as fast on the millionth data item as on the first. This makes it feasible and practical to find clusters in much larger data sets than before."
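The slowdown described in the quote can be illustrated with a toy comparison. The sketch below is not the patented method, whose details are not given in this article. It assumes, purely for illustration, that two items belong together when their tag sets match exactly, and the names `naive_cluster` and `hashed_cluster` are hypothetical. The first function compares each arriving item with every existing cluster, so its work grows as clusters accumulate. The second uses a hash lookup, so the work per item stays roughly constant.

```python
from collections import defaultdict


def naive_cluster(items):
    """Scan all existing clusters for each arriving item and count comparisons."""
    clusters = []  # list of (key, member_ids) pairs
    comparisons = 0
    for item_id, tags in items:
        key = frozenset(tags)
        for cluster_key, members in clusters:
            comparisons += 1
            if cluster_key == key:
                members.append(item_id)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append((key, [item_id]))
    return clusters, comparisons


def hashed_cluster(items):
    """Place each arriving item with one dictionary lookup and count lookups."""
    clusters = defaultdict(list)
    lookups = 0
    for item_id, tags in items:
        lookups += 1
        clusters[frozenset(tags)].append(item_id)
    return clusters, lookups


# Hypothetical stream: 2,000 items spread over 500 distinct single-tag topics.
items = [(i, {f"topic-{i % 500}"}) for i in range(2000)]

naive_result, naive_cost = naive_cluster(items)
hashed_result, hashed_cost = hashed_cluster(items)

print(f"clusters found: {len(naive_result)} (naive), {len(hashed_result)} (hashed)")
print(f"work done: {naive_cost} comparisons (naive), {hashed_cost} lookups (hashed)")
```

Both functions find the same clusters, but the naive version does far more work in total. Real similarity-based clustering is harder than exact tag matching, so this only shows why per-item cost matters, not how the constant-time behaviour is achieved in practice.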

Tests involving clustering a data set of more than 10 million news articles found the technique was 10,000 times faster than typical industry-standard solutions.

Rafsky says they cracked a problem "most data scientists realise they don't have an exact solution for".

The technique is particularly suited to Newscycle's NewsEdge content-as-a-service platform, which has a proprietary real-time taxonomy for tagging news with very specific controlled-vocabulary keywords.

Together, the processes enable NewsEdge products to provide searchable theme-based content bundles to users in milliseconds, with near-zero post-publishing latency. US Patent No. 10,216,829 was issued on February 26, 2019 (see https://patents.google.com/patent/US10216829).