NLP News Cypher | 05.24.20
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
One for the Road
ACL 2020 coming in July to a computer near you. Papers and accepted demos/frameworks are already on display:
Accepted Papers
Note that the titles/authors may change and papers may be withdrawn. For the final titles/authors, please refer to the…
acl2020.org
Facebook AI released their Blender chatbot earlier this month. I was playing around with the 2.7B model and it talks fairly well 👇. A common problem with chit-chat dialogue systems is that the model disregards a user's statement, which disrupts continuity. But as you can see, Blender does a great job of maintaining continuity.
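To make the continuity point concrete, here is a minimal sketch of how a chat loop can keep earlier turns in view. It is not Blender's actual mechanism: the `DialogueSession` class, the newline separator, the `max_turns` window, and the `echo_last_user_turn` stand-in model are all assumptions made up for illustration.

```python
from collections import deque
from typing import Callable, Deque


class DialogueSession:
    """Keeps a rolling window of turns and feeds it to a reply function."""

    def __init__(self, reply_fn: Callable[[str], str], max_turns: int = 6) -> None:
        self.reply_fn = reply_fn  # hypothetical stand-in for a real model call
        # A deque with maxlen drops the oldest turn once the window is full.
        self.history: Deque[str] = deque(maxlen=max_turns)

    def build_context(self) -> str:
        # Assumed format: one turn per line, oldest first.
        return "\n".join(self.history)

    def say(self, user_text: str) -> str:
        self.history.append(f"user: {user_text}")
        reply = self.reply_fn(self.build_context())
        self.history.append(f"bot: {reply}")
        return reply


def echo_last_user_turn(context: str) -> str:
    """Toy reply function: refers back to the most recent user line."""
    user_lines = [ln for ln in context.split("\n") if ln.startswith("user: ")]
    return "You said: " + user_lines[-1][len("user: "):]


session = DialogueSession(echo_last_user_turn, max_turns=4)
first_reply = session.say("I have two dogs.")
second_reply = session.say("They love the beach.")
print(session.build_context())
```

A model that only ever saw the latest line would have no way to connect "They" to the dogs; passing the accumulated context is what gives it the chance.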
Here's the 90M param model to run inference via Colab (the larger models make Colab explode):
Google Colaboratory
colab.research.google.com
Paper:
Happy Memorial Day Weekend! 🌭
&
Eid Mubarak
This Week:
The Stormy Seas of Deployment
BERTweet
Microsoft Build Recap
HackerEarth Survey
Hugging Reformer
Dataset of the Week: ATOMIC
The Stormy Seas of Deployment
AI2 and company put out great research in NLP. This week, one of their engineers wrote a blog post on how they manage to deploy all their great demos. They let us into their world of mass deployment and show how they run such a complex back-end to serve real-time inference at scale. You don't get to see inside the machine very often, so have a look at their architecture (Kubernetes alert 💥):
Skiff: Taming the Stormy Seas of the Modern Web
Whereas many research organizations are hidden behind closed doors inherited by the stringent restrictions placed upon…
medium.com
BERTweet
About time someone managed to pre-train BERT on massive amounts of English tweets. Having dealt with tweet data for my own demos, I expect this to come in handy, since Twitter is one of the obvious choices for harvesting text data. By the way, it includes COVID-related tweets, which helps if your model needs to be aware of COVID vocabulary.
Details:
850M English Tweets (16B word tokens ~ 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the COVID-19 pandemic
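A quick back-of-the-envelope check on those figures, as a rough sketch: the only inputs are the numbers quoted above, and treating "80GB" as decimal gigabytes is my assumption.

```python
# Figures quoted in the details above.
total_tweets = 850_000_000
streamed_tweets = 845_000_000
covid_tweets = 5_000_000
word_tokens = 16_000_000_000
corpus_bytes = 80 * 10**9  # assumption: decimal gigabytes

# The two sub-corpora should add up to the headline total.
parts_match_total = streamed_tweets + covid_tweets == total_tweets

tokens_per_tweet = word_tokens / total_tweets
bytes_per_tweet = corpus_bytes / total_tweets
covid_share_pct = 100 * covid_tweets / total_tweets

print(f"parts add up: {parts_match_total}")
print(f"about {tokens_per_tweet:.1f} word tokens per tweet")
print(f"about {bytes_per_tweet:.0f} bytes per tweet")
print(f"COVID tweets: {covid_share_pct:.2f}% of the corpus")
```

So the COVID-19 slice is well under 1% of the pre-training data, and the average tweet is short, at under 20 word tokens.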
GitHub:
VinAIResearch/BERTweet
BERTweet is the first public large-scale language model pre-trained for English Tweets. BERTweet is…
github.com
Microsoft Build Recap
One of the biggest developer conferences took place this week, and Microsoft unveiled some really cool features, like an auto-completing VS Code and a supercomputer for OpenAI running on Azure. Find the highlights here:
The most important announcements from Microsoft Build, its annual conference for software…
The coronavirus didn't stop Microsoft from issuing a flood of news in its annual Build conference this week. Now in its…
www.cnbc.com
The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server.
Microsoft announces new supercomputer, lays out vision for future AI work – The AI Blog
Microsoft has built one of the top five publicly disclosed supercomputers in the world, making new infrastructure…
blogs.microsoft.com
HackerEarth Survey
So what are developers up to nowadays? Let's see what 16,655 developers from 76 countries have to say:
Highlights:
Among students (29%) and experienced developers (32%), Go has emerged as the clear winner for the most sought-after programming language.
YouTube is the 2nd most used resource for learning about development, among both students and professionals.
Survey:
Hugging Reformer
Hugging Face introduced the Reformer to their growing list of transformers this past week, and even included a nice Colab for text generation. If you are having trouble keeping up with the list of all these transformers, you are not alone; it's a good problem to have. 🤗 documentation
Colab of the Week:
Google Colaboratory
colab.research.google.com
Dataset of the Week: ATOMIC
What is it?
The dataset is a knowledge graph of 877K triples, each a textual description of inferential knowledge.
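For a feel of what such triples look like in code, here is a small sketch. The relation names follow my recollection of the nine if-then dimensions in the ATOMIC paper, and the rows are illustrative examples written for this sketch, not records copied from the dataset; the `Triple` type and `index_by_event` helper are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    event: str      # the "if" part, e.g. something PersonX does
    relation: str   # the inference dimension
    inference: str  # the "then" part, as free text


# Assumed relation set: x* is about PersonX, o* is about others.
ATOMIC_RELATIONS = (
    "xIntent", "xNeed", "xAttr", "xEffect", "xReact", "xWant",
    "oEffect", "oReact", "oWant",
)

# Illustrative rows only, not taken from the actual data files.
triples = [
    Triple("PersonX pays PersonY a compliment", "xIntent", "to be nice"),
    Triple("PersonX pays PersonY a compliment", "oReact", "flattered"),
    Triple("PersonX makes coffee", "xNeed", "to buy coffee beans"),
]


def index_by_event(rows: List[Triple]) -> Dict[str, Dict[str, List[str]]]:
    """Group inferences as event -> relation -> list of inference strings."""
    index: Dict[str, Dict[str, List[str]]] = {}
    for row in rows:
        index.setdefault(row.event, {}).setdefault(row.relation, []).append(row.inference)
    return index


index = index_by_event(triples)
print(index["PersonX pays PersonY a compliment"])
```

Grouping by event first mirrors how you would typically query such a graph: pick an event, then ask what it implies along one dimension.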
Sample:
Where is it?
ATOMIC Knowledge Graph Browser
For some events, annotations are quite diverse, does this mean the data is noisy? Importantly, some events invoke…
homes.cs.washington.edu
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
If you enjoyed this article, help us out and share with friends!
For complete coverage, follow our Twitter: @Quantum_Stat
Published via Towards AI