GSoC 2022 Final Report
Make a New Weaviate Module
Weaviate is a cloud-native vector search engine and vector database. It is fully modular, and its functionality can be extended with modules. This project focuses on creating a text summarization module that lets users summarize any text on the fly.
In Short
- The Summarization (SUM) module is a Weaviate module that summarizes whole paragraphs into a short text.
- The module depends on a SUM Transformers model that should be running with Weaviate. There are pre-built models available, but you can also attach another HuggingFace Transformer or custom SUM model.
- The module adds a summary {} filter to the GraphQL _additional {} field.
- The module returns the results in the GraphQL _additional { summary {} } field.
This module was released as part of Weaviate release 1.15.
Refer to the official documentation for more details on the summarization module.
Completed Tasks
The work was carried out mainly in two repositories.
Inference Service
Tasks carried out in this repo
- Implemented the inference service for the summarization task. Transformer-based summarization models are used for inference, and the model is available in the semi-technologies Docker Hub.
- Added a smoke test that covers the overall workflow of the summarization inference service.
This repo contains the inference service for the summarization module. It is a containerized application that runs alongside a vectorizer module such as text2vec-contextionary.
Initially, the whole implementation was pushed directly to the main branch (which might not be the best practice, but the inference service was already built and working fine). After that, the repo was updated through the PRs below.
Pull Requests
- PR#1: Updated the travis.yml file and the smoke test.
- PR#2: The summarization module allows users to use their own summarization models. In that case, there needs to be a mechanism to integrate those inference services with Weaviate. This PR adds a custom Dockerfile for such scenarios.
- Other PRs: All the PRs related to this service can be found here.
Weaviate Core Module
Tasks carried out in this repo
- Implemented the core logic and functionality of the summarization module
- Added unit tests for the core functions
- Added an integration test for the whole module
This repo contains the core implementation of the Weaviate vector search engine.
The summarization module implementation can be found here.
Pull Requests
- PR#1
This PR contains all the changes made.
Remaining Work
There is no remaining work related to the summarization module.
However, some things changed from the original proposal.
The initial idea was to implement a custom text2text generator module that could cover a variety of tasks such as summarization and translation. After discussions with the mentors and the organization, the scope was narrowed down to a separate module for the summarization task. Other text2text generator tasks will be addressed separately in the future if needed.
Usage of Summarization Module
Weaviate is a vector database, so its job is to store data. To store data, there must be a defined structure, or schema. Based on that schema, users can populate the database with real-world data entities.
Depending on the user configuration and the dense retriever module, the data is stored in a high-dimensional vector space. Afterward, users can query the data and use extended functionality of Weaviate, such as text summarization.
Schema
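As an illustration of the kind of schema meant here, below is a minimal sketch written as a Python dictionary. The class name Article and the property names title and content are hypothetical, and text2vec-contextionary is used only because it is the vectorizer mentioned earlier in this report. The exact schema used in the project is not reproduced here.

```python
import json


def build_article_class(vectorizer: str = "text2vec-contextionary") -> dict:
    """Return a Weaviate class definition with a text property that can be summarized."""
    return {
        "class": "Article",  # hypothetical class name, for illustration only
        "description": "An article whose content can be summarized at query time",
        "vectorizer": vectorizer,
        "properties": [
            {"name": "title", "dataType": ["text"]},    # hypothetical property
            {"name": "content", "dataType": ["text"]},  # hypothetical property to summarize
        ],
    }


if __name__ == "__main__":
    # Print the class definition as JSON, ready to be sent to a Weaviate instance.
    print(json.dumps(build_article_class(), indent=2))
```

The resulting JSON would typically be submitted to Weaviate's schema endpoint or through a client library. That step is left out so the sketch stays independent of any particular client version.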
After adding objects that follow the schema, the user can query for the summary as follows.
Query
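A query of this kind can be sketched as a small Python helper that builds the GraphQL text. The summary {} filter inside _additional {} comes from the module description above. The properties argument and the property and result fields reflect my understanding of the released module and should be checked against the official documentation. The class name Article and the property content are hypothetical.

```python
import json
from typing import List


def build_summary_query(class_name: str, properties: List[str]) -> str:
    """Build a GraphQL Get query that asks for summaries of the given properties."""
    # json.dumps renders the list as ["content"], which is also valid GraphQL list syntax.
    prop_list = json.dumps(properties)
    return f"""{{
  Get {{
    {class_name} {{
      _additional {{
        summary(properties: {prop_list}) {{
          property
          result
        }}
      }}
    }}
  }}
}}"""


if __name__ == "__main__":
    # "Article" and "content" are placeholder names, not part of the original project.
    print(build_summary_query("Article", ["content"]))
```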
Result
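The result comes back in the _additional { summary {} } field of each returned object. The sketch below shows how such a response could be unpacked in Python. The response shape (a list of entries with property and result keys) is an assumption based on the module description, and the sample text is a placeholder, not real model output.

```python
from typing import Dict, List, Tuple


def extract_summaries(response: Dict, class_name: str) -> List[Tuple[str, str]]:
    """Collect (property, summary text) pairs from a GraphQL Get response."""
    pairs = []
    objects = response.get("data", {}).get("Get", {}).get(class_name, [])
    for obj in objects:
        # Each object is assumed to carry its summaries under _additional.summary.
        for item in obj.get("_additional", {}).get("summary", []):
            pairs.append((item["property"], item["result"]))
    return pairs


# Placeholder response: the shape is assumed and the text is not real model output.
sample_response = {
    "data": {
        "Get": {
            "Article": [
                {
                    "_additional": {
                        "summary": [
                            {
                                "property": "content",
                                "result": "<summary text returned by the model>",
                            }
                        ]
                    }
                }
            ]
        }
    }
}

if __name__ == "__main__":
    print(extract_summaries(sample_response, "Article"))
```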
Acknowledgment
I would like to thank my mentor, Marcin Antas, for the constant support he provided and for always being there to clear up any questions I had. This was an amazing experience, and I learned a lot about open source culture, vector search engines, Golang, and much more. Also, thank you to everyone at Semi-technologies who provided feedback and helped improve the project.