Implementing a Large Concept Model with PyTorch

Last Updated on January 14, 2025 by Editorial Team

Author(s): Arthur Lagacherie

Originally published on Towards AI.

Implement Meta’s recent Large Concept Model step by step

A few days ago, Meta published a research paper titled “Large Concept Models: Language Modeling in a Sentence Representation Space”, which introduces a new type of model: the LCM (Large Concept Model). In this article, I will explain how it works and implement it step by step with PyTorch.

Image created by me with Figma

But first: what is an LCM?

Current NLP models generally use the Transformer architecture and reason at the token level: to generate text, they just predict the next token. The LCM works differently: it predicts the next concept.

A concept is a part of a sentence that represents an idea, for example in the sentence:

Tim wasn’t very athletic, he thought that would change if he joined a sport, he tried out for several teams but he didn’t make the cut for any of them.

The different concepts will be:

  • Tim wasn’t very athletic
  • he thought that would change if he joined a sport
  • he tried out for several teams
  • but he didn’t make the cut for any of them.

We can see that each concept represents an idea of the sentence.
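
To make the difference in granularity concrete, here is a toy sketch. The `naive_concepts` helper is hypothetical: it only splits on commas and before the word "but", which is not how the paper segments text, but it reproduces the split above and shows how much shorter the concept sequence is than the word sequence.

```python
import re

SENTENCE = (
    "Tim wasn't very athletic, he thought that would change if he joined a sport, "
    "he tried out for several teams but he didn't make the cut for any of them."
)

def naive_concepts(text):
    """Hypothetical stand-in for a real segmenter: split on commas and before 'but'."""
    parts = re.split(r",\s*|\s+(?=but\b)", text)
    return [p.strip() for p in parts if p.strip()]

words = SENTENCE.split()               # rough proxy for token-level units
concepts = naive_concepts(SENTENCE)    # concept-level units

print(len(words), "words vs", len(concepts), "concepts")
for c in concepts:
    print("-", c)
```

A token-level model would have to take one step per word piece here, while a concept-level model takes one step per idea.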

image from paper

If you have already read the paper or an article about it, you may know that the authors tried four different architectures to predict the next concept:

  • Base-LCM
  • One-Tower diffusion LCM
  • Two-Tower Diffusion LCM
  • Quantized LCM

In this article, I will just implement the Base-LCM based on the Transformer decoder architecture.

The forward function of an LCM is composed of three steps. Let’s implement and explain them one by one.
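
Before going into the details, the flow of data through these three steps can be sketched with placeholder functions. Everything below is hypothetical: `lcm_forward_pass` and the `toy_*` stand-ins are illustration names, and the real splitter, encoder, transformer, and decoder are introduced in the following sections.

```python
def lcm_forward_pass(text, split, embed, predict, decode):
    """Chain the three steps; every callable is a placeholder supplied by the caller."""
    concepts = split(text)          # step 1a: cut the text into concepts
    vectors = embed(concepts)       # step 1b: one embedding per concept
    predicted = predict(vectors)    # step 2: transformer over the embeddings
    return decode(predicted)        # step 3: embeddings back to text

# Toy stand-ins, only to show the flow of data (these are not real models).
toy_split = lambda t: [c.strip() for c in t.split(",")]
toy_embed = lambda cs: [len(c) for c in cs]
toy_predict = lambda vs: [v + 1 for v in vs]
toy_decode = lambda vs: [f"<concept {v}>" for v in vs]

result = lcm_forward_pass("hello, world!", toy_split, toy_embed, toy_predict, toy_decode)
print(result)
```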

The embedding step

The first step of the forward function is the embedding step. It is composed of two processes: cutting the text into concepts and embedding those concepts.

Cut step

The first thing we need to do is cut the sentences into concepts. The paper’s solution to this problem is to use Segment any Text (SaT), which offers a suite of models and adapters that predict sentence boundaries at the token level.

Screen capture of their model’s benchmarks

The library to use this model is wtpsplit (under MIT license). So let’s install it.

pip install wtpsplit

There are a lot of models available, but the ones we are interested in are those with the “-sm” suffix. The number followed by an “l” is the number of transformer layers used by the model. In our case, we don’t need extremely good performance, so the “sat-1l-sm” model is enough.

from wtpsplit import SaT
sat_sm = SaT("sat-1l-sm")

The model takes approximately 10 seconds to load. To split a sentence with it, you just need one line.

sat_sm.split("Tim wasn't very at[...]e cut for any of them.", threshold=0.05)
Output:
["Tim wasn't very athletic, he thought that would change ",
'if he joined a sport, he tried out for several teams ',
"but he didn't make the cut for any of them."]

You can see it works (after some tests, I set the threshold to 0.05 to get smaller concepts).
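
To see why a lower threshold gives smaller concepts, here is a toy illustration of thresholding per-token boundary probabilities. It is not wtpsplit’s internal code: the `split_at_boundaries` helper and the probabilities are made up, and the sketch only assumes that a cut happens wherever the predicted boundary probability reaches the threshold.

```python
def split_at_boundaries(tokens, boundary_probs, threshold):
    """Cut after every token whose boundary probability reaches the threshold."""
    segments, current = [], []
    for token, prob in zip(tokens, boundary_probs):
        current.append(token)
        if prob >= threshold:
            segments.append(" ".join(current))
            current = []
    if current:  # keep whatever is left after the last cut
        segments.append(" ".join(current))
    return segments

tokens = ["he", "tried", "out", "for", "several", "teams", "but", "he", "failed"]
probs = [0.0, 0.0, 0.0, 0.0, 0.0, 0.08, 0.0, 0.0, 0.9]  # made-up probabilities

coarse = split_at_boundaries(tokens, probs, threshold=0.5)
fine = split_at_boundaries(tokens, probs, threshold=0.05)
print(coarse)
print(fine)
```

With the high threshold only the confident end-of-sentence boundary is kept. With the low one, the weak boundary after "teams" also becomes a cut.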

SONAR Embeddings

The next step is to convert these concepts into embeddings. To do this, Meta uses the SONAR encoder, a multilingual and multimodal text-to-embeddings / speech-to-embeddings model.

image from SONAR github

SONAR supports 200 languages for text and 76 for speech.

To install SONAR you need to install sonar-space, not sonar. You also need fairseq2 which is a toolkit used by sonar for sequence modeling.

pip install sonar-space
# GPU
pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cu124 --upgrade
# CPU
pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cpu

Note: for the CPU installation, you need the CPU-only build of PyTorch.

Like wtpsplit, sonar is easy to use. You can download the model with one line and embed any text in one line.

# to download the model and import the library
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")
# run model
sentences = ["Tim wasn't very athletic, he thought that would change ",
'if he joined a sport, he tried out for several teams ',
"but he didn't make the cut for any of them."]
embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")
# Output:
# tensor of shape (3, 1024)

The Transformer

image from paper

The transformer is simple, just a decoder, with a PreNet and a PostNet to adapt the transformer’s input and output to embedding dimensions.

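import torch
from torch import nn

class Transformer(nn.Module):
    # device is not used here; the caller moves the module with .to(device)
    def __init__(self, embd_dim, dim, layers, heads, device):
        super().__init__()
        self.embd_dim = embd_dim
        self.dim = dim
        self.layers = layers
        self.heads = heads

        # PreNet: normalize the SONAR embeddings and project them to the model dimension
        self.prenet = nn.Sequential(
            nn.LayerNorm(embd_dim),
            nn.Linear(embd_dim, dim)
        )

        self.decoder = nn.ModuleList([nn.TransformerDecoderLayer(d_model=dim, nhead=heads) for i in range(layers)])

        # PostNet: project back to the embedding dimension and normalize
        self.postnet = nn.Sequential(
            nn.Linear(dim, embd_dim),
            nn.LayerNorm(embd_dim)
        )

    def forward(self, x):
        x = self.prenet(x)
        for l in self.decoder:
            # the same sequence is passed as both target and memory
            x = l(x, x)
        return self.postnet(x)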

In the original paper, they use a dedicated normalizer to normalize the features, but here I just use a LayerNorm.
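
The layers above are called without an attention mask, so each position can attend to every other position. If you want each concept to attend only to earlier concepts, one possible variant is sketched below. This is an assumption on my side, not the article’s or the paper’s code: the `CausalConceptDecoder` name is hypothetical, and it swaps `nn.TransformerDecoderLayer` for a stack of `nn.TransformerEncoderLayer` blocks with a causal mask and batch-first inputs.

```python
import torch
from torch import nn

class CausalConceptDecoder(nn.Module):
    """Hypothetical variant: self-attention blocks with a causal mask."""

    def __init__(self, embd_dim=1024, dim=512, layers=3, heads=8):
        super().__init__()
        self.prenet = nn.Sequential(nn.LayerNorm(embd_dim), nn.Linear(embd_dim, dim))
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers, enable_nested_tensor=False)
        self.postnet = nn.Sequential(nn.Linear(dim, embd_dim), nn.LayerNorm(embd_dim))

    def forward(self, x):
        # x: (batch, n_concepts, embd_dim)
        n = x.size(1)
        # True above the diagonal: position i may not attend to positions after i
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        return self.postnet(self.blocks(self.prenet(x), mask=mask))

torch.manual_seed(0)
model = CausalConceptDecoder(embd_dim=16, dim=8, layers=2, heads=2).eval()
x = torch.randn(1, 4, 16)  # 1 sequence of 4 toy "concept embeddings"
with torch.no_grad():
    out = model(x)
print(out.shape)
```

With the mask in place, changing the last concept should leave the outputs at the earlier positions unchanged, which is the property next-concept training relies on.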

Transform embeddings into text

Now that we have the output embeddings, we just have to convert them into text, and we will have all the components of the final model. To do that, we will use SONAR, the same system we used to create the embeddings.

# import and load the model
from sonar.inference_pipelines.text import EmbeddingToTextModelPipeline
vec2text_model = EmbeddingToTextModelPipeline(decoder="text_sonar_basic_decoder", tokenizer="text_sonar_basic_encoder")
# reconstruct the text from the embeddings
reconstructed = vec2text_model.predict(embeddings, target_lang="eng_Latn", max_seq_len=512)
Output:
["Tim wasn't very athletic, he thought that would change",
'if he joined a sport, he tried out for several teams',
"but he didn't make the cut for any of them."]

We can see that the sentences are exactly the same.
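
A quick way to check this programmatically is to compare both lists after normalizing whitespace, since the only visible difference in the output above is the trailing space. The `same_text` helper is a hypothetical name introduced for this sketch, and the two lists are copied from the snippets above.

```python
original = [
    "Tim wasn't very athletic, he thought that would change ",
    "if he joined a sport, he tried out for several teams ",
    "but he didn't make the cut for any of them.",
]
reconstructed = [
    "Tim wasn't very athletic, he thought that would change",
    "if he joined a sport, he tried out for several teams",
    "but he didn't make the cut for any of them.",
]

def same_text(a, b):
    """Compare two strings after collapsing runs of whitespace."""
    return " ".join(a.split()) == " ".join(b.split())

matches = [same_text(a, b) for a, b in zip(original, reconstructed)]
print(matches)
```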

Construct the Model

Now that we have all the components of an LCM model, it’s time to assemble them. First, I create a config class to store the arguments.

class LCMConfig:
    def __init__(self):
        self.device = "cuda"
        # Transformer args
        self.embd_dim = 1024 # dim of sonar embeddings
        self.dim = 512 # dim of the transformer
        self.layers = 3
        self.heads = 8

        # Sonar args
        self.lang = "eng_Latn"
        self.max_seq_len = 512
        self.sonar_enc = "text_sonar_basic_encoder"
        self.sonar_dec = "text_sonar_basic_decoder"

        # wtpsplit args
        self.model_name = "sat-1l-sm"
        self.threshold = 0.05
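
If you want to override a few values without editing the class, or run on a machine without a GPU, a dataclass version is one option. This is only a sketch of an alternative, not the code from the repository: `LCMConfigDC` and `pick_device` are hypothetical names, and the default values are copied from the config above.

```python
from dataclasses import dataclass, field

def pick_device():
    """Return "cuda" when PyTorch reports a GPU, otherwise "cpu"."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

@dataclass
class LCMConfigDC:
    device: str = field(default_factory=pick_device)
    # Transformer args
    embd_dim: int = 1024   # dim of sonar embeddings
    dim: int = 512         # dim of the transformer
    layers: int = 3
    heads: int = 8
    # Sonar args
    lang: str = "eng_Latn"
    max_seq_len: int = 512
    sonar_enc: str = "text_sonar_basic_encoder"
    sonar_dec: str = "text_sonar_basic_decoder"
    # wtpsplit args
    model_name: str = "sat-1l-sm"
    threshold: float = 0.05

small = LCMConfigDC(dim=256, layers=2)  # override only what changes
print(small.device, small.dim, small.layers)
```

Keep in mind that `dim` has to stay divisible by `heads`, because the attention layers split the model dimension across heads.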

Then, initialize the model class:

class LCMModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.sat_sm = SaT(config.model_name)
        print("splitter initialized")

        self.t2vec_model = TextToEmbeddingModelPipeline(encoder=config.sonar_enc, tokenizer=config.sonar_enc, device=torch.device(config.device))
        print("t2vec_model initialized")

        self.transformer = Transformer(config.embd_dim, config.dim, config.layers, config.heads, config.device).to(config.device)
        print("transformer initialized")

        self.vec2text_model = EmbeddingToTextModelPipeline(decoder=config.sonar_dec, tokenizer=config.sonar_dec, device=torch.device(config.device))
        print("vec2text_model initialized")

The model will have three functions:

  • split_into_concepts: Splits the text into concepts using wtpsplit.
  • forward: Runs the transformer.
  • generate: Generates the text.

class LCMModel(nn.Module):
    def split_into_concepts(self, text):
        return self.sat_sm.split(text, threshold=self.config.threshold)

Here, we use the model initialized in __init__ to split the sentences into concepts.

class LCMModel(nn.Module):
    def forward(self, embeddings):
        out_embeddings = self.transformer.forward(embeddings)
        return out_embeddings

For the forward function, we simply call the forward function of the transformer decoder.

class LCMModel(nn.Module):
    def generate(self, text, num_generated_concepts=1):
        with torch.no_grad():
            concepts = self.split_into_concepts(text)
            for c in range(num_generated_concepts):
                # re-embed every concept collected so far
                embeddings = self.t2vec_model.predict(concepts, source_lang=self.config.lang)
                out_embeddings = self.forward(embeddings)
                # decode the output embeddings back to text
                next_concept = self.vec2text_model.predict(out_embeddings, target_lang=self.config.lang, max_seq_len=self.config.max_seq_len)
                # keep the first decoded sentence as the new concept
                concepts.append(next_concept[0])
        return "".join(concepts)

Test

Now that we have our model, we can try it.

config = LCMConfig()
lcm = LCMModel(config).to("cuda")
o = lcm.generate("hello", num_generated_concepts=2)
print(o)
Output:
helloUe Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
[15 other lines of the same thing]
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue@ @ @ @ @ @ @ @ @ @ @ @ @ @ @
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
[9 other lines of the same thing]
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @

We can see that it works perfectly 😂 (with one small detail: the model is not trained, so it generates random output).
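
To get meaningful output, the transformer would have to be trained to predict the embedding of the next concept from the previous ones. Here is a minimal sketch of what such a training loop could look like, assuming a mean-squared-error objective on the next embedding. The random tensors stand in for SONAR embeddings, the `nn.Linear` layer is only a placeholder for the Transformer defined earlier, and the sizes and learning rate are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-ins (both hypothetical): random "concept embeddings" and a tiny model.
embeddings = torch.randn(6, 16)   # 6 concepts, 16 dims instead of SONAR's 1024
model = nn.Linear(16, 16)         # placeholder for the Transformer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Shift by one: the target at position t is the embedding of concept t+1.
inputs, targets = embeddings[:-1], embeddings[1:]

losses = []
for step in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

In a real setup, the inputs would be batches of SONAR embeddings computed from a text corpus, and the model would need to see only the preceding concepts when predicting each position.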

End

If you want to try it yourself, you can install the GitHub repository that I created.

GitHub - styalai/LCM-torch: A simple and easy to use implementation of a Large Concept Model with pytorch (github.com)

That’s the end of this article. I hope you enjoyed it, and if so, you can clap for it and/or follow me =).

Here are some of my best articles:

Can a LLM beat you at Chess?

Use Outlines to answer to this question.

pub.towardsai.net

Let’s Create an Agentic Multimodal Chatbot from Scratch.

A model to generate images, understand images, generate audio, generate and understand text.

pub.towardsai.net

Implement the xLSTM paper from scratch with Pytorch

You want to implement a simple research paper? Or just find out more about xLSTM? You’ve come to the right place.

medium.com


Published via Towards AI
