We're thrilled to announce the publication of our latest scientific paper, "Spanish TrOCR: Leveraging Transfer Learning for Language Adaptation," by Filipe Lauar and Valentin LAURENT!

Optical Character Recognition (OCR) has transformed text extraction from images, and at Qantev, we're pushing the boundaries with our innovative approach to multilingual OCR, particularly for Visual Rich Documents (VRDs).

📖 About the Paper: Our research delves into creating a synthetic dataset in Spanish, designed to handle the unique challenges posed by VRDs. By fine-tuning the TrOCR model on this dataset, we've achieved remarkable results, making our Spanish OCR model a leading open-source solution.

⭐️ Key Highlights:
- Creation of a synthetic VRD dataset in Spanish
- Fine-tuning TrOCR with advanced data augmentation techniques
- Benchmarking against EasyOCR and the Microsoft Azure OCR API
- Significant improvements in Character Error Rate (CER) and Word Error Rate (WER)

You can explore the full paper here: https://lnkd.in/eS_C578s

💡 Read the Blog Post: Dive deeper into our methodology and findings in our comprehensive blog post. Learn how we tackled the challenges of VRDs and fine-tuned the TrOCR model for superior performance in Spanish OCR.
🔗 Blog Post: https://lnkd.in/eaFePisF

🔍 Explore Our Resources:
- Spanish TrOCR models on Hugging Face: huggingface.co/qantev
- Dataset generation method on GitHub: https://lnkd.in/eEPZGrq2

Special thanks to our amazing authors, Filipe Lauar and Valentin LAURENT, for their invaluable contributions to this paper. We're proud to contribute to the OCR community and excited to see how our work can aid in various applications, from digitizing documents to extracting text from complex images.

#Qantev #OCR #TrOCR #MachineLearning #AI #Research #OpenSource #SpanishOCR #VRD

Feel free to reach out if you have any questions or feedback. Thank you for your support!
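For readers who want to try the released checkpoints, below is a minimal inference sketch using the TrOCR classes from Hugging Face transformers, plus a quick CER/WER check with jiwer. The model id, image path, and reference string are placeholders assumed for illustration; see huggingface.co/qantev for the exact published checkpoint names.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
import jiwer

# Placeholder model id; check huggingface.co/qantev for the actual Spanish TrOCR checkpoints.
model_id = "qantev/trocr-base-spanish"

processor = TrOCRProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

# TrOCR expects a cropped image of a single text line, not a full document page.
image = Image.open("line_crop.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
prediction = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

reference = "texto de referencia de la línea"  # ground-truth transcription for this crop
print(prediction)
print("CER:", jiwer.cer(reference, prediction))
print("WER:", jiwer.wer(reference, prediction))
```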
Qantev’s Post
More Relevant Posts
-
Our new paper at Qantev!! In this paper we show how to generate a synthetic OCR dataset for the Visual Rich Documents problem and how to fine-tune TrOCR using this generated dataset in Spanish!
Spanish TrOCR: Leveraging Transfer Learning for Language Adaptation
medium.com
-
I'm excited to announce that our recent research, titled "Uddessho: An Extensive Benchmark Dataset for Multimodal Author Intent Classification in Low-Resource Bangla Language," has been accepted for publication at The 18th International Conference on Information Technology and Applications (ICITA 2024) in Sydney, Australia!

We named the dataset "Uddessho" (উদ্দেশ্য). The dataset was carefully curated from a range of social media posts, focusing on capturing a broad spectrum of author intent. Our data collection covers six categories: Informative, Advocative, Promotive, Exhibitionist, Expressive, and Controversial. This dataset represents a significant leap forward in understanding author intent in Bangla, a low-resource language, particularly in the complex world of social media.

Here is a summary of the outcomes of our experiments:
💥 Developed a novel dataset named "Uddessho," comprising 3,048 post instances categorized into six distinct intents: Informative, Advocative, Promotive, Exhibitionist, Expressive, and Controversial. The dataset is divided into a training set (2,423 posts), a testing set (313 posts), and a validation set (312 posts).
💥 Proposed the Multimodal-based Author Bangla Intent Classification (MABIC) framework, which leverages both text and images to classify author intent in Bangla social media posts, showcasing how visual cues enhance the analysis of textual content.
💥 Demonstrated that the multimodal approach (text + images) achieved 76.19% accuracy, significantly outperforming the traditional unimodal text-only approach, which reached 64.53% accuracy, an improvement of 11.66 percentage points.
💥 Highlighted the power of fusion techniques (Early Fusion and Late Fusion) in combining textual and visual data, showing that the multimodal method provides a more comprehensive understanding of author intent compared to text-only methods.

A big thank you to my co-authors Mukaffi Bin Moin, Mahfuzur Rahman, MD MORSHED ALAM SHANTO, and Asif Iftekher Fahim!

The "Uddessho" dataset is now publicly available at https://lnkd.in/gVjjQZZm, and you can access our preprint at https://lnkd.in/gKVNtFFs.
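As a rough illustration of the fusion idea behind MABIC, here is a minimal late-fusion head in PyTorch: features from separate text and image encoders are projected, concatenated, and classified into the six intents. The feature dimensions and layer sizes are assumptions, and this is a hypothetical sketch, not the authors' released code.

```python
import torch
import torch.nn as nn

class LateFusionIntentClassifier(nn.Module):
    """Toy late-fusion head over precomputed text and image features (hypothetical)."""

    def __init__(self, text_dim=768, image_dim=768, hidden_dim=256, num_classes=6):
        super().__init__()
        self.text_head = nn.Linear(text_dim, hidden_dim)
        self.image_head = nn.Linear(image_dim, hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, text_feat, image_feat):
        t = torch.relu(self.text_head(text_feat))    # project text features
        v = torch.relu(self.image_head(image_feat))  # project image features
        return self.classifier(torch.cat([t, v], dim=-1))  # logits over the six intents

# Random tensors stand in for frozen-encoder outputs for a batch of 4 posts.
logits = LateFusionIntentClassifier()(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 6])
```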
(PDF) Uddessho: An Extensive Benchmark Dataset for Multimodal Author Intent Classification in Low-Resource Bangla Language
researchgate.net
-
📃 Scientific paper: SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

Abstract: Large language models (LLMs) have recently been extended to the vision-language realm, obtaining impressive general multi-modal capabilities. However, the exploration of multi-modal large language models (MLLMs) for remote sensing (RS) data is still in its infancy, and the performance is not satisfactory. In this work, we introduce SkyEyeGPT, a unified multi-modal large language model specifically designed for RS vision-language understanding. To this end, we meticulously curate an RS multi-modal instruction tuning dataset, including single-task and multi-task conversation instructions. After manual verification, we obtain a high-quality RS instruction-following dataset with 968k samples. Our research demonstrates that with a simple yet effective design, SkyEyeGPT works surprisingly well on considerably different tasks without the need for extra encoding modules. Specifically, after projecting RS visual features to the language domain via an alignment layer, they are fed jointly with task-specific instructions into an LLM-based RS decoder to predict answers for RS open-ended tasks. In addition, we design a two-stage tuning method to enhance instruction-following and multi-turn dialogue ability at different granularities. Experiments on 8 datasets for RS vision-language tasks demonstrate SkyEyeGPT's superiority in image-level and region-level tasks, such as captioning and visual grounding. In particular, SkyEyeGPT exhibits encouraging results compared to GPT-4V in s...

Continued on ES/IODE ➡️ https://etcse.fr/1Gx5F

-------
If you find this interesting, feel free to follow, comment and share. We need your help to enhance our visibility, so that our platform continues to serve you.
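The core mechanism described in the abstract (projecting remote-sensing visual features into the language space via an alignment layer, then feeding them to the LLM together with the instruction) can be sketched in a few lines. The dimensions and tensor shapes below are assumptions for illustration, not SkyEyeGPT's actual configuration.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                 # assumed encoder and LLM hidden sizes

align = nn.Linear(vision_dim, llm_dim)           # alignment layer: vision -> language space

visual_feats = torch.randn(1, 256, vision_dim)   # e.g. 256 patch features from an RS image encoder
instr_embeds = torch.randn(1, 32, llm_dim)       # embedded task-specific instruction tokens

visual_tokens = align(visual_feats)              # projected visual tokens, now in LLM space
llm_inputs = torch.cat([visual_tokens, instr_embeds], dim=1)  # joint sequence for the LLM decoder
print(llm_inputs.shape)                          # torch.Size([1, 288, 4096])
```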
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model
ethicseido.com
-
📣 📣 📣 [CFP: MRR 2024 Workshop @ SIGIR 2024]

We're excited to invite researchers and practitioners across the spectrum of multimodal modeling, representation learning, and retrieval to participate in our workshop on "Multimodal Representation and Retrieval." This event is proudly organized in association with SIGIR 2024.

Multimodal data is available in many applications, such as e-commerce product listings, social media posts, and short videos. However, existing algorithms dealing with these types of data still focus on uni-modal representation learning through vision-language alignment and cross-modal retrieval. In this workshop, we aim to introduce a new retrieval problem where both queries and documents are multimodal. With the popularity of vision-language modeling, large language models (LLMs), retrieval-augmented generation (RAG), and multimodal LLMs, we see many new opportunities for multimodal representation and retrieval tasks.

This will be a comprehensive half-day workshop focusing on multimodal representation and retrieval. The agenda includes keynote speeches, oral presentations, and an interactive panel discussion. Our objective is to capture the interest of researchers in this emerging challenge. By highlighting the novelty and significance of the problem, we aim to attract researchers who are eager to explore and contribute to this field.

Join us as we welcome esteemed speakers including Tat-Seng Chua (National University of Singapore) and Dinesh Manocha (University of Maryland), who will share their invaluable insights and expertise. We invite original research and industrial application papers that present work on multimodal data representation and retrieval.

🔗 Workshop website: https://lnkd.in/gcpZtrdc
📅 Submission Deadline: May 5, 2024 (11:59 pm, AOE)
📅 Workshop Date: July 18th, 2024
🌆 Workshop Location: Washington D.C., USA

Join us to shape the future of multimodal representation, retrieval, and related applications! ☀ My co-organizers Arnab Dhua (Amazon), Douglas Gray (Amazon), I. Zeki Yalniz (Meta), Tan Yu (Nvidia), Mohamed Elhoseiny (KAUST), Bryan A. Plummer (Boston University) and I look forward to hosting you.
Multimodal Representation and Retrieval
mrr-workshop.github.io
-
Large Language Models on Graphs: A Comprehensive Survey

Large language models (LLMs), such as GPT-4 and LLaMA, are driving significant advancements in natural language processing due to their robust text encoding and decoding capabilities, as well as their newly emerged reasoning abilities. Although LLMs are primarily designed to process pure text, many real-world scenarios involve text data associated with rich structural information in the form of graphs, such as academic networks or e-commerce networks. Additionally, there are scenarios where graph data are paired with rich textual information, like molecules with descriptions. While LLMs have demonstrated strong text-based reasoning capabilities, it remains underexplored whether these abilities can be generalized to graphs, specifically graph-based reasoning. A systematic review of scenarios and techniques related to large language models on graphs is therefore necessary.

Potential scenarios for adopting LLMs on graphs can be categorized into three main areas:
- Pure graphs: graphs without associated text data.
- Text-attributed graphs: graphs where nodes or edges have textual attributes.
- Text-paired graphs: graphs paired with separate textual information.

Techniques for utilizing LLMs on graphs include:
- LLM as Predictor: using LLMs to predict properties or relationships within graphs.
- LLM as Encoder: employing LLMs to encode graph structures into meaningful representations.
- LLM as Aligner: utilizing LLMs to align textual information with graph structures.

Each of these approaches has its advantages and disadvantages, which need to be compared and evaluated. Real-world applications of these methods are diverse and include areas such as network analysis, molecular chemistry, and recommendation systems. Open-source code and benchmark datasets are available to support these applications, such as those found at https://lnkd.in/eCpY9YXz.

Future research directions in this fast-growing field involve further exploring the integration of LLMs with graph data, enhancing the reasoning capabilities of LLMs on graphs, and expanding the range of real-world applications.

#LLM #Graph #textmining #text #proteingraph #moleculegraphs #graphtheory
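As a toy example of the "LLM as Encoder" pattern on a text-attributed graph, the sketch below embeds node texts with a small pre-trained text encoder (standing in for an LLM) and then averages each node with its neighbors. The checkpoint, the toy graph, and the single propagation step are assumptions for illustration, not a specific method from the survey.

```python
import torch
from sentence_transformers import SentenceTransformer

# Node texts of a tiny text-attributed graph (e.g. an academic network).
node_texts = [
    "Paper on graph neural networks for recommendation.",
    "Survey of large language models.",
    "Molecule description: aspirin, an analgesic.",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")     # small text encoder as a stand-in for an LLM
x = torch.tensor(encoder.encode(node_texts))          # (num_nodes, 384) node features

# Toy undirected edges 0-1 and 1-2, plus self-loops, as a dense adjacency matrix.
adj = torch.tensor([[1., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 1.]])
adj = adj / adj.sum(dim=1, keepdim=True)              # row-normalize

h = adj @ x                                           # one round of neighborhood averaging
print(h.shape)                                        # torch.Size([3, 384])
```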
-
As an ML enthusiast, I'm always fascinated by advancements in large language models (LLMs). But their massive size often makes them impractical for real-world applications. Enter PEFT (Parameter-Efficient Fine-Tuning)! This technique, with LoRA and QLoRA at its core, allows us to fine-tune LLMs for specific tasks without needing a ton of resources.

1. LoRA (Low-Rank Adaptation of LLMs): This technique injects trainable rank-decomposition matrices into each layer of the transformer architecture, reducing the number of trainable parameters for downstream tasks while keeping the pre-trained weights frozen.
2. QLoRA (Quantized LoRA): QLoRA is an extension of LoRA that quantizes the weights of the pre-trained LLM to 4-bit precision before training the LoRA adapters.

Benefits:
1. Faster training: By reducing the number of trainable parameters, LoRA and QLoRA can significantly speed up the fine-tuning process.
2. Improved performance: Despite training far fewer parameters, these fine-tuned models can still perform exceptionally well.

PEFT is opening doors for all sorts of exciting applications, from personalized chatbots to efficient translation tools. Want to dive deeper? Check out this: https://lnkd.in/gwmrq3w3

#PEFT #LoRA #QLoRA #LLMs #MachineLearning #generativeai
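Here is a minimal QLoRA-style setup sketch using the Hugging Face peft and bitsandbytes integrations: the base model is loaded in 4-bit and LoRA adapters are attached to the attention projections. The model id and target_modules are assumptions and depend on the architecture you fine-tune.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder base model; any causal LM works

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: keep base weights in 4-bit precision
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16,                                    # rank of the trainable decomposition matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # which frozen projections receive adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # only the small adapter matrices are trainable
```

After this, the wrapped model can be passed to a standard Trainer; only the adapter weights (a tiny fraction of the total) receive gradient updates.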
QLoRA: Fine-Tuning Large Language Models (LLM’s)
medium.com
-
📰 This paper from Microsoft Research tackles a fascinating question: what is the minimum number of parameters required for large language models to generate coherent language?

🔎 To explore this, the researchers developed a synthetic dataset called TinyStories, which includes stories written using vocabulary understandable to a 4-year-old child. They used this dataset to train small GPT-like architectures and found that models with as few as 30 million parameters could generate coherent sentences.

💡 This research is highly compelling, as it could open pathways to creating smaller, more sustainable language models.

https://lnkd.in/e77jxqDA

#AI #languagemodel #article
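To get a feel for the scale involved, the sketch below builds a GPT-2-style model in roughly the same ballpark (about 30-35M parameters) with Hugging Face transformers. The hyperparameters are illustrative guesses, not the exact TinyStories architectures from the paper.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative small GPT-2-style configuration, not the paper's exact architecture.
config = GPT2Config(
    vocab_size=50257,
    n_positions=512,
    n_embd=384,   # hidden size
    n_layer=8,    # transformer blocks
    n_head=8,     # attention heads
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # roughly 34M
```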
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
arxiv.org
-
Estonians doing wonderful work on LLMs! Keep up the good work! Our own high-quality Estonian language model is potentially super valuable - we get there faster by cooperation-cooperation-cooperation. #aiadoption #largelanguagemodels #sustainablechange
Professor of NLP, researcher, playboy philanthropist at the Institute of Computer Science, University of Tartu
Happy and proud to announce some of our papers, recently accepted to EACL, NAACL and LREC-COLING, covering question answering, LLMs, grammatical error correction, NLP for diagnostics and machine translation:

1. "No Error Left Behind: Multilingual Grammatical Error Correction with Pre-trained Translation Models" (Agnes Luhtaru, Lisa Korotkova and Mark Fishel: EACL'2024). TL;DR: using pre-trained translation models and translation data to improve error correction for Est/Ger/Cze/Eng. A follow-up to that article is currently on arXiv, where we use LLMs to generate artificial errors and also to correct errors: "To Err Is Human, but Llamas Can Learn It Too" (Agnes Luhtaru, Taido Purason, Martin Vainikko, Maksym Del and Mark Fishel): https://lnkd.in/dwHxshpi
2. "On Narrative Question Answering Skills" (Emil Kalbaliyev and Kairit Sirts: NAACL). TL;DR: proposing a new skill taxonomy for question answering.
3. "Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer" (Hele-Andra Kuulmets, Taido Purason, Agnes Luhtaru and Mark Fishel: NAACL Findings). TL;DR: how to teach Estonian to Llama 2 without it forgetting English.
4. "Multilinguality or Back-translation? A Case Study with Estonian" (Lisa Korotkova, Taido Purason, Agnes Luhtaru and Mark Fishel: LREC-COLING). TL;DR: improving machine translation from/into Estonian and releasing massive synthetic data.
5. "Analyzing Symptom-based Depression Level Estimation through the Prism of Psychiatric Expertise" (Navneet Agarwal, Kirill Milintsevich, Lucie Métivier, Maud Rothärmel, Gaël Dias and Sonia Dollfus: LREC-COLING). TL;DR: when detecting depression based on text, medical professionals and a neural network are not that different.
To Err Is Human, but Llamas Can Learn It Too
arxiv.org
-
A Touch, Vision, and Language Dataset for Multimodal Alignment (UC Berkeley, February 2024)

Paper: https://lnkd.in/dpreQQ2V

Abstract: "Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark."
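The "vision-language-aligned tactile encoder" mentioned in the abstract is trained contrastively; a generic CLIP-style InfoNCE objective between tactile and text embeddings looks roughly like the sketch below. Embedding sizes and the temperature are assumptions, and this is not the TVL authors' code.

```python
import torch
import torch.nn.functional as F

def info_nce(tactile_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; matching tactile/text pairs share a row index."""
    tactile_emb = F.normalize(tactile_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = tactile_emb @ text_emb.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(logits.size(0))              # positives sit on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Random embeddings stand in for a batch of 8 encoded touch readings and their captions.
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```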
A Touch, Vision, and Language Dataset for Multimodal Alignment
arxiv.org
-
Expanding the Boundaries of Multimodal Models with LLM2CLIP 🌐

CLIP has revolutionized how we think about multimodal learning by seamlessly linking language and vision through shared embeddings. But as incredible as it is, it has its limitations, especially with longer, more nuanced captions or cross-lingual tasks. Enter LLM2CLIP, a groundbreaking extension that combines the strengths of large language models (LLMs) with CLIP to push its capabilities further.

Here's what makes it remarkable:
- Improved textual discriminability: By fine-tuning an LLM in the caption space using contrastive learning, LLM2CLIP extracts rich textual representations that significantly enhance the output embeddings of CLIP.
- Overcoming text limitations: The fine-tuned LLM acts as a "teacher" for the visual encoder, enabling support for longer, more complex captions without the constraints of CLIP's vanilla text encoder.

🚀 State-of-the-art (SOTA) performance:
- Boosts the SOTA EVA02 model's performance by 16.5% on text retrieval tasks.
- Transforms CLIP, trained solely on English, into a SOTA cross-lingual model.
- Outperforms CLIP across almost all benchmarks when paired with models like LLaVA 1.5.

This is a big step forward for multimodal models, showing how LLMs can amplify the potential of vision-language architectures. One thing to note: the paper's results were based on PyTorch weights, and there might be discrepancies when using Hugging Face weights. Keep this in mind if you're experimenting with the model!

Read more here: https://lnkd.in/dYuZGTfU

#AI #MachineLearning #MultimodalAI #ComputerVision #NaturalLanguageProcessing #CLIP #LLMs #LLM2CLIP #DeepLearning #ArtificialIntelligence #GenAI #GenerativeAI #TextEncoder #Encoder #LargeLanguageModel #LLM #TextEmbedding #ImageEmbedding #Caption
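Conceptually, LLM2CLIP swaps CLIP's vanilla text tower for embeddings coming from a contrastively fine-tuned LLM, bridged by a small adapter into the image-embedding space. The sketch below captures only that concept with assumed dimensions; it is not the released LLM2CLIP code or weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

llm_dim, clip_dim = 4096, 768               # assumed LLM and CLIP embedding sizes

text_proj = nn.Linear(llm_dim, clip_dim)    # adapter from the LLM caption space into CLIP space

caption_emb = torch.randn(4, llm_dim)       # pooled caption features from the fine-tuned LLM
image_emb = torch.randn(4, clip_dim)        # features from the (frozen) CLIP image encoder

t = F.normalize(text_proj(caption_emb), dim=-1)
v = F.normalize(image_emb, dim=-1)
similarity = v @ t.t()                      # image-to-text retrieval scores, as in standard CLIP
print(similarity.shape)                     # torch.Size([4, 4])
```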
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
microsoft.github.io