Implementing a Large Concept Model with PyTorch
Last Updated on January 14, 2025 by Editorial Team
Author(s): Arthur Lagacherie
Originally published on Towards AI.
Implement Meta's recent model, the Large Concept Model, step by step
A few days ago, Meta released a research paper titled "Large Concept Models: Language Modeling in a Sentence Representation Space". It introduces a new type of model, the LCM (Large Concept Model). In this article, I will explain how it works and implement it step by step with PyTorch.
But first: what is an LCM?
Current NLP models generally use the Transformer architecture and reason at the token level: to generate text, they simply predict the next token. The LCM is different; it predicts the next concept instead.
A concept is a part of a sentence that represents an idea. For example, take the sentence:
Tim wasn't very athletic, he thought that would change if he joined a sport, he tried out for several teams but he didn't make the cut for any of them.
The different concepts will be:
- Tim wasn't very athletic
- he thought that would change if he joined a sport
- he tried out for several teams
- but he didn't make the cut for any of them.
We can see that each concept represents an idea of the sentence.
If you have already read the paper or an article about it, you may know that the authors tried four different architectures to predict the next concept:
- Base-LCM
- One-Tower diffusion LCM
- Two-Tower Diffusion LCM
- Quantized LCM
In this article, I will implement only the Base-LCM, which is based on the Transformer decoder architecture.
The forward function of an LCM is composed of three steps. Let's implement and explain them one by one.
The embedding step
The first step of the forward function is the embedding step, which is composed of two processes: splitting the text into concepts and embedding them.
Splitting step
The first thing we need to do is split the sentences into concepts. The paper's solution to this problem is Segment any Text (SaT), which offers a suite of models and adapters that predict sentence boundaries at the token level.
The library for this model is wtpsplit (under MIT license), so let's install it.
pip install wtpsplit
There are a lot of models available, but the ones we are interested in have the "-sm" suffix. The number before the "l" indicates the number of transformer layers used by the model. For our purposes, we don't need extremely good performance, so the "sat-1l-sm" model is enough.
from wtpsplit import SaT
sat_sm = SaT("sat-1l-sm")
The model takes approximately 10 seconds to load. To split a sentence with it, you just need one line.
sat_sm.split("Tim wasn't very at[...]e cut for any of them.", threshold=0.05)
Output:
["Tim wasn't very athletic, he thought that would change ",
'if he joined a sport, he tried out for several teams ',
"but he didn't make the cut for any of them."]
You can see that it works (after some testing, I set the threshold to 0.05 to get smaller concepts).
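If you want larger concepts instead, you can raise the threshold: a higher value makes the model more conservative about inserting boundaries, so you get fewer, longer segments. A minimal sketch (the exact segments will depend on the model and text):

# Higher threshold -> fewer predicted boundaries -> longer concepts.
# (Illustrative sketch; output depends on the model and text.)
text = "Tim wasn't very athletic, he thought that would change if he joined a sport."
for threshold in (0.05, 0.25, 0.5):
    segments = sat_sm.split(text, threshold=threshold)
    print(threshold, len(segments), segments)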
SONAR Embeddings
The next step is to convert these concepts into embeddings. To do this, Meta uses the SONAR encoder, a multilingual and multimodal text-to-embedding / speech-to-embedding model.
SONAR supports 200 languages for text and 76 for speech.
To install SONAR you need to install sonar-space, not sonar. You also need fairseq2, a sequence-modeling toolkit that SONAR depends on.
pip install sonar-space
# GPU
pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cu124 --upgrade
# CPU
pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cpu
Note: for the CPU installation, you need the CPU-only build of PyTorch.
Like wtpsplit, SONAR is easy to use. You can download the model with one line and embed any text with one more.
# to download the model and import the library
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")
# run model
sentences = ["Tim wasn't very athletic, he thought that would change ",
'if he joined a sport, he tried out for several teams ',
"but he didn't make the cut for any of them."]
embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")
# Output:
# tensor of shape (3, 1024)
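As a quick sanity check, we can verify the shape and see how similar the concept embeddings are to each other with cosine similarity (a small sketch; the exact values will vary):

import torch.nn.functional as F

print(embeddings.shape)  # torch.Size([3, 1024])

# Pairwise cosine similarity between the concept embeddings.
normed = F.normalize(embeddings, dim=-1)
print(normed @ normed.T)  # 3x3 matrix with 1.0 on the diagonal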
The Transformer
The transformer is simple: just a decoder, with a PreNet and a PostNet to map between the SONAR embedding dimension and the transformer's hidden dimension.
import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, embd_dim, dim, layers, heads, device):
        super().__init__()
        self.embd_dim = embd_dim
        self.dim = dim
        self.layers = layers
        self.heads = heads

        # PreNet: map SONAR embeddings (embd_dim) to the transformer dimension.
        self.prenet = nn.Sequential(
            nn.LayerNorm(embd_dim),
            nn.Linear(embd_dim, dim)
        )
        self.decoder = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=dim, nhead=heads)
            for _ in range(layers)
        ])
        # PostNet: map back to the SONAR embedding dimension.
        self.postnet = nn.Sequential(
            nn.Linear(dim, embd_dim),
            nn.LayerNorm(embd_dim)
        )

    def forward(self, x):
        x = self.prenet(x)
        for layer in self.decoder:
            # The sequence serves as both target and memory (self-attention only).
            # Note: no causal mask is applied here; one would be needed for training.
            x = layer(x, x)
        return self.postnet(x)
In the original paper, they use a special feature normalizer, but here I just use a LayerNorm.
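As a quick shape check, we can feed a dummy sequence of SONAR-sized embeddings through the module (a minimal sketch with random data):

import torch

model = Transformer(embd_dim=1024, dim=512, layers=3, heads=8, device="cpu")

# A fake sequence of 3 "concept" embeddings of dimension 1024.
x = torch.randn(3, 1024)
out = model(x)
print(out.shape)  # torch.Size([3, 1024]) -- same shape in and out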
Transform embeddings into text
Now that we have the output embeddings, we just have to convert them back into text, and we will have all the components of the final model. To do that, we will use SONAR, the same system we used to create the embeddings.
# import and load the model
from sonar.inference_pipelines.text import EmbeddingToTextModelPipeline
vec2text_model = EmbeddingToTextModelPipeline(decoder="text_sonar_basic_decoder", tokenizer="text_sonar_basic_encoder")
# reconstruct the text from the embeddings
reconstructed = vec2text_model.predict(embeddings, target_lang="eng_Latn", max_seq_len=512)
Output:
["Tim wasn't very athletic, he thought that would change",
'if he joined a sport, he tried out for several teams',
"but he didn't make the cut for any of them."]
We can see that the reconstructed sentences are exactly the same (apart from trailing whitespace).
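If you want a quantitative check rather than eyeballing the strings, you can re-encode the reconstructed sentences and compare them to the original embeddings (a sketch):

import torch.nn.functional as F

re_embeddings = t2vec_model.predict(reconstructed, source_lang="eng_Latn")

# Cosine similarity between each original embedding and its reconstruction.
print(F.cosine_similarity(embeddings, re_embeddings, dim=-1))
# values close to 1.0 indicate a near-perfect round trip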
Construct the Model
Now that we have all the components of an LCM model, it's time to assemble them. First, I create a config class to store the arguments.
class LCMConfig:
    def __init__(self):
        self.device = "cuda"

        # Transformer args
        self.embd_dim = 1024  # dim of SONAR embeddings
        self.dim = 512        # dim of the transformer
        self.layers = 3
        self.heads = 8

        # SONAR args
        self.lang = "eng_Latn"
        self.max_seq_len = 512
        self.sonar_enc = "text_sonar_basic_encoder"
        self.sonar_dec = "text_sonar_basic_decoder"

        # wtpsplit args
        self.model_name = "sat-1l-sm"
        self.threshold = 0.05
Then we initialize the model class:
class LCMModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.sat_sm = SaT(config.model_name)
        print("splitter initialized")

        self.t2vec_model = TextToEmbeddingModelPipeline(
            encoder=config.sonar_enc, tokenizer=config.sonar_enc,
            device=torch.device(config.device))
        print("t2vec_model initialized")

        self.transformer = Transformer(
            config.embd_dim, config.dim, config.layers,
            config.heads, config.device).to(config.device)
        print("transformer initialized")

        self.vec2text_model = EmbeddingToTextModelPipeline(
            decoder=config.sonar_dec, tokenizer=config.sonar_dec,
            device=torch.device(config.device))
        print("vec2text_model initialized")
The model will have three functions:
- split_into_concepts: splits the text into concepts using wtpsplit.
- forward: runs the transformer.
- generate: generates the text.
class LCMModel(nn.Module):
    def split_into_concepts(self, text):
        return self.sat_sm.split(text, threshold=self.config.threshold)
Here, we use the splitter model initialized in __init__ to split the sentences into concepts.
class LCMModel(nn.Module):
    def forward(self, embeddings):
        out_embeddings = self.transformer.forward(embeddings)
        return out_embeddings
For the forward function, we simply call the forward function of the transformer decoder.
class LCMModel(nn.Module):
    def generate(self, text, num_generated_concepts=1):
        with torch.no_grad():
            concepts = self.split_into_concepts(text)
            for c in range(num_generated_concepts):
                embeddings = self.t2vec_model.predict(concepts, source_lang=self.config.lang)
                out_embeddings = self.forward(embeddings)
                # The output at the last position is the prediction for the
                # next concept, so we decode only that embedding.
                next_concept = self.vec2text_model.predict(
                    out_embeddings[-1:], target_lang=self.config.lang,
                    max_seq_len=self.config.max_seq_len)
                concepts.append(next_concept[0])
            return "".join(concepts)
Test
Now that we have our model, we can try it.
config = LCMConfig()
lcm = LCMModel(config).to("cuda")
o = lcm.generate("hello", num_generated_concepts=2)
print(o)
Output:
helloUe Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
[15 other lines of the same thing]
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue
Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue Ue@ @ @ @ @ @ @ @ @ @ @ @ @ @ @
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
[9 other lines of the same thing]
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
We can see that it works perfectly 😂 (except for one small detail: the model is not trained, so it generates random output).
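That output is expected, since the transformer weights are random. For completeness, here is a minimal, hypothetical sketch of how the Base-LCM is trained in the paper: by regressing the next concept's embedding with an MSE loss. Here `dataset` is a placeholder for an iterable of training texts (not defined in this article), and a real setup would also add a causal mask to the decoder:

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(lcm.transformer.parameters(), lr=1e-4)

for text in dataset:  # `dataset` is a placeholder, not defined here
    concepts = lcm.split_into_concepts(text)
    if len(concepts) < 2:
        continue

    # Embed the concepts with SONAR (the encoder itself stays frozen).
    with torch.no_grad():
        embeddings = lcm.t2vec_model.predict(concepts, source_lang=lcm.config.lang)
    embeddings = embeddings.to(lcm.config.device)

    # Teacher forcing: predict concept t+1 from concepts 0..t.
    pred = lcm.transformer(embeddings[:-1])
    loss = F.mse_loss(pred, embeddings[1:])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()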
End
If you want to try it yourself, you can clone the GitHub repository that I created.
GitHub – styalai/LCM-torch: a simple and easy-to-use implementation of a Large Concept Model with PyTorch (github.com/styalai/LCM-torch)
That's the end of this article. I hope you enjoyed it; if so, you can clap and/or follow me =).
Here are some of my best articles:
Can an LLM beat you at Chess?
Use Outlines to answer this question.
pub.towardsai.net
Let's Create an Agentic Multimodal Chatbot from Scratch.
A model to generate images, understand images, generate audio, generate and understand text.
pub.towardsai.net
Implement the xLSTM paper from scratch with PyTorch
You want to implement a simple research paper? Or just find out more about xLSTM? You've come to the right place.
medium.com
Published via Towards AI