Implementing a Large Concept Model with PyTorch
Last Updated on January 14, 2025 by Editorial Team
Author(s): Arthur Lagacherie
Originally published on Towards AI.
Implement Meta's recent model, the Large Concept Model, step by step
A few days ago, Meta released a research paper titled "Large Concept Models: Language Modeling in a Sentence Representation Space". It introduces a new type of model, the LCM (Large Concept Model). In this article, I will explain how it works and implement it step by step with PyTorch.
But first: what is an LCM?
Current NLP models generally use the Transformer architecture and reason at the token level: to generate text, they simply predict the next token. The LCM is different; it predicts the next concept instead.
A concept is a part of a sentence that represents an idea. For example, take the sentence:
Tim wasn't very athletic, he thought that would change if he joined a sport, he tried out for several teams but he didn't make the cut for any of them.
The different concepts will be:
- Tim wasn't very athletic
- he thought that would change if he joined a sport
- he tried out for several teams
- but he didn't make the cut for any of them.
We can see that each concept represents an idea of the sentence.
If you have already read the paper or an article about it, you may know that the authors tried four different architectures to predict the next concept:
- Base-LCM
- One-Tower diffusion LCM
- Two-Tower Diffusion LCM
- Quantized LCM
In this article, I will implement only the Base-LCM, which is based on the Transformer decoder architecture.
The forward function of an LCM is composed of three steps. Let's implement and explain them one by one.
The embedding step
The first step of the forward function is the embedding step, which is composed of two processes: splitting the text into concepts and embedding them.
Splitting step
The first thing we need to do is split the sentences into concepts. The paper's solution to this problem is Segment any Text (SaT), which offers a suite of models and adapters that predict sentence boundaries at the token level.
The library for this model is wtpsplit (under MIT license), so let's install it.
pip install wtpsplit
There are a lot of models available, but the ones we are interested in have the "-sm" suffix. The number before the "l" indicates the number of transformer layers used by the model. For our purposes, we don't need extremely good performance, so the "sat-1l-sm" model is enough.
from wtpsplit import SaT
sat_sm = SaT("sat-1l-sm")
The model takes approximately 10 seconds to load. To split a sentence with it, you just need one line.
sat_sm.split("Tim wasn't very at[...]e cut for any of them.", threshold=0.05)
Output:
["Tim wasn't very athletic, he thought that would change ",
'if he joined a sport, he tried out for several teams ',
"but he didn't make the cut for any of them."]
You can see that it works (after some testing, I set the threshold to 0.05 to get smaller concepts).
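If you want larger concepts instead, you can raise the threshold: a higher value makes the model more conservative about inserting boundaries, so you get fewer, longer segments. A minimal sketch (the exact segments will depend on the model and text):

# Higher threshold -> fewer predicted boundaries -> longer concepts.
# (Illustrative sketch; output depends on the model and text.)
text = "Tim wasn't very athletic, he thought that would change if he joined a sport."
for threshold in (0.05, 0.25, 0.5):
    segments = sat_sm.split(text, threshold=threshold)
    print(threshold, len(segments), segments)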
SONAR Embeddings
The next step is to convert these concepts into embeddings. To do this, Meta uses the SONAR encoder, a multilingual and multimodal text-to-embedding / speech-to-embedding model.
SONAR supports 200 languages for text and 76 for speech.
To install SONAR you need to install sonar-space, not sonar. You also need fairseq2, a sequence-modeling toolkit that SONAR depends on.
pip install sonar-space
# GPU
pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cu124 --upgrade
# CPU
pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cpu
Note: for the CPU installation, you need the CPU-only build of PyTorch.
Like wtpsplit, SONAR is easy to use. You can download the model with one line and embed any text with one more.
# to download the model and import the library
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")
# run model
sentences = ["Tim wasn't very athletic, he thought that would change ",
'if he joined a sport, he tried out for several teams ',
"but he didn't make the cut for any of them."]
embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")
# Output:
# tensor of shape (3, 1024)
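As a quick sanity check, we can verify the shape and see how similar the concept embeddings are to each other with cosine similarity (a small sketch; the exact values will vary):

import torch.nn.functional as F

print(embeddings.shape)  # torch.Size([3, 1024])

# Pairwise cosine similarity between the concept embeddings.
normed = F.normalize(embeddings, dim=-1)
print(normed @ normed.T)  # 3x3 matrix with 1.0 on the diagonal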
The Transformer
The transformer is simple: just a decoder, with a PreNet and a PostNet to map between the SONAR embedding dimension and the transformer's hidden dimension.
import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, embd_dim, dim, layers, heads, device):
        super().__init__()
        self.embd_dim = embd_dim
        self.dim = dim
        self.layers = layers
        self.heads = heads

        # PreNet: map SONAR embeddings (embd_dim) to the transformer dimension.
        self.prenet = nn.Sequential(
            nn.LayerNorm(embd_dim),
            nn.Linear(embd_dim, dim)
        )
        self.decoder = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=dim, nhead=heads)
            for _ in range(layers)
        ])
        # PostNet: map back to the SONAR embedding dimension.
        self.postnet = nn.Sequential(
            nn.Linear(dim, embd_dim),
            nn.LayerNorm(embd_dim)
        )

    def forward(self, x):
        x = self.prenet(x)
        for layer in self.decoder:
            # The sequence serves as both target and memory (self-attention only).
            # Note: no causal mask is applied here; one would be needed for training.
            x = layer(x, x)
        return self.postnet(x)
In the original paper, they use a special feature normalizer, but here I just use a LayerNorm.
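As a quick shape check, we can feed a dummy sequence of SONAR-sized embeddings through the module (a minimal sketch with random data):

import torch

model = Transformer(embd_dim=1024, dim=512, layers=3, heads=8, device="cpu")

# A fake sequence of 3 "concept" embeddings of dimension 1024.
x = torch.randn(3, 1024)
out = model(x)
print(out.shape)  # torch.Size([3, 1024]) -- same shape in and out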
Transform embeddings into text
Now that we have the output embeddings, we just have to convert them back into text, and we will have all the components of the final model. To do that, we will use SONAR, the same system we used to create the embeddings.
# import and load the model
from sonar.inference_pipelines.text import EmbeddingToTextModelPipeline
vec2text_model = EmbeddingToTextModelPipeline(decoder="text_sonar_basic_decoder", tokenizer="text_sonar_basic_encoder")
# reconstruct the text from the embeddings
reconstructed = vec2text_model.predict(embeddings, target_lang="eng_Latn", max_seq_len=512)
Output:
["Tim wasn't very athletic, he thought that would change",
'if he joined a sport, he tried out for several teams',
"but he didn't make the cut for any of them."]
We can see that the reconstructed sentences are exactly the same (apart from trailing whitespace).
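If you want a quantitative check rather than eyeballing the strings, you can re-encode the reconstructed sentences and compare them to the original embeddings (a sketch):

import torch.nn.functional as F

re_embeddings = t2vec_model.predict(reconstructed, source_lang="eng_Latn")

# Cosine similarity between each original embedding and its reconstruction.
print(F.cosine_similarity(embeddings, re_embeddings, dim=-1))
# values close to 1.0 indicate a near-perfect round trip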
Construct the Model
Now that we have all the components of an LCM model, it's time to assemble them. First, I create a config class to store the arguments.
class LCMConfig:
    def __init__(self):
        self.device = "cuda"

        # Transformer args
        self.embd_dim = 1024  # dim of SONAR embeddings
        self.dim = 512        # dim of the transformer
        self.layers = 3
        self.heads = 8

        # SONAR args
        self.lang = "eng_Latn"
        self.max_seq_len = 512
        self.sonar_enc = "text_sonar_basic_encoder"
        self.sonar_dec = "text_sonar_basic_decoder"

        # wtpsplit args
        self.model_name = "sat-1l-sm"
        self.threshold = 0.05
Then we initialize the model class:
class LCMModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.sat_sm = SaT(config.model_name)
        print("splitter initialized")

        self.t2vec_model = TextToEmbeddingModelPipeline(
            encoder=config.sonar_enc, tokenizer=config.sonar_enc,
            device=torch.device(config.device))
        print("t2vec_model initialized")

        self.transformer = Transformer(
            config.embd_dim, config.dim, config.layers,
            config.heads, config.device).to(config.device)
        print("transformer initialized")

        self.vec2text_model = EmbeddingToTextModelPipeline(
            decoder=config.sonar_dec, tokenizer=config.sonar_dec,
            device=torch.device(config.device))
        print("vec2text_model initialized")
The model will have three functions:
- split_into_concepts: splits the text into concepts using wtpsplit.
- forward: runs the transformer.
- generate: generates the text.
class LCMModel(nn.Module):
    def split_into_concepts(self, text):
        return self.sat_sm.split(text, threshold=self.config.threshold)
Here, we use the splitter model initialized in __init__ to split the sentences into concepts.
class LCMModel(nn.Module):
    def forward(self, embeddings):
        out_embeddings = self.transformer.forward(embeddings)
        return out_embeddings
For the forward function, we simply call the forward function of the transformer decoder.
class LCMModel(nn.Module):
    def generate(self, text, num_generated_concepts=1):
        with torch.no_grad():
            concepts = self.split_into_concepts(text)
            for c in range(num_generated_concepts):
                embeddings = self.t2vec_model.predict(concepts, source_lang=self.config.lang)
                out_embeddings = self.forward(embeddings)
                # The output at the last position is the prediction for the
                # next concept, so we decode only that embedding.
                next_concept = self.vec2text_model.predict(
                    out_embeddings[-1:], target_lang=self.config.lang,
                    max_seq_len=self.config.max_seq_len)
                concepts.append(next_concept[0])
            return "".join(concepts)
Test
Now that we have our model, we can try it.
config = LCMConfig()
lcm = LCMModel(config).to("cuda")
o = lcm.generate("hello", num_generated_concepts=2)
print(o)
Output:
helloUe Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
[15 other lines of the same thing]
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue@ @ @ @ @ @ @ @ @ @ @ @ @ @ @
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
[9 other lines of the same thing]
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
We can see that it works perfectly 😂 (except for one small detail: the model is not trained, so it generates random output).
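That output is expected, since the transformer weights are random. For completeness, here is a minimal, hypothetical sketch of how the Base-LCM is trained in the paper: by regressing the next concept's embedding with an MSE loss. Here `dataset` is a placeholder for an iterable of training texts (not defined in this article), and a real setup would also add a causal mask to the decoder:

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(lcm.transformer.parameters(), lr=1e-4)

for text in dataset:  # `dataset` is a placeholder, not defined here
    concepts = lcm.split_into_concepts(text)
    if len(concepts) < 2:
        continue

    # Embed the concepts with SONAR (the encoder itself stays frozen).
    with torch.no_grad():
        embeddings = lcm.t2vec_model.predict(concepts, source_lang=lcm.config.lang)
    embeddings = embeddings.to(lcm.config.device)

    # Teacher forcing: predict concept t+1 from concepts 0..t.
    pred = lcm.transformer(embeddings[:-1])
    loss = F.mse_loss(pred, embeddings[1:])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()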
End
If you want to try it yourself, you can clone the GitHub repository that I created.
GitHub – styalai/LCM-torch: a simple and easy-to-use implementation of a Large Concept Model with PyTorch (github.com/styalai/LCM-torch)
That's the end of this article. I hope you enjoyed it; if so, you can clap and/or follow me =).
Here are some of my best articles:
Can an LLM beat you at Chess?
Use Outlines to answer this question.
pub.towardsai.net
Let's Create an Agentic Multimodal Chatbot from Scratch.
A model to generate images, understand images, generate audio, generate and understand text.
pub.towardsai.net
Implement the xLSTM paper from scratch with PyTorch
You want to implement a simple research paper? Or just find out more about xLSTM? You've come to the right place.
medium.com
Published via Towards AI