I Spent a Week Diving Into LLMs (for the First Time) as a Data Scientist
My learnings and resources to help you get started
Frankly, it was long overdue.
But I finally gave in to the GenAI “hype” and did what I do best: Obsess.
So for the past week, I’ve been diving deep into LLMs with two main objectives in mind:
Gain a clear conceptual understanding of LLMs, enough to explain them even to non-technical people.
Learn enough to reduce the friction when deciding how to apply them to a technical project.
In this article, I want to share my learnings, along with some great resources to help you get started. I’ll also offer my thoughts on what this means for Data Scientists, especially those entering the job market for the first time or changing companies.
My Takeaways
Rather than tell you what you can easily find with a quick Google search or by asking ChatGPT, I'd like to share my key takeaways and personal insights, focusing on what I believe is essential and practical for those looking to learn more:
1 — Pre-training LLMs is a long and expensive process, which is why pre-trained open-source models like the recently released Llama 3.1 are so valuable to the community.
Pre-training private LLMs can take weeks or even months, depending on the complexity of the model and the size of the dataset. This process can be quite expensive, as efficient training requires the use of cloud-based solutions and high-performance GPUs—lots of GPUs.
At an MIT event, when Sam Altman was asked whether training GPT-4 cost $100 million, he replied, “It’s more than that” (Source).
2 — LLMs can do much more than generate simple text responses; they can translate, summarize content, and even enhance information retrieval.
Tools like Perplexity AI best exemplify LLMs’ capabilities for enhancing information retrieval, and by the looks of it, so will its soon-to-be-released competitor, SearchGPT (by OpenAI).
3 — Out of the box, LLMs suffer from many shortcomings, like generating text that is factually incorrect, nonsensical, or outright fabricated, an issue popularly known as “hallucinations”.
A study published earlier this year found that GPT-4’s hallucination rate was 28.6% on tasks as simple as citing a paper’s title, author, and year of publication.
4 — Although LLMs are great for general use cases, domain-specific LLMs can truly excel. This is what makes “domain adaptation” an important step in the training process.
There are many types of domain adaptation methods, like “fine-tuning,” which often follows the pre-training phase. This method may involve manually collected data, prioritizing quality over quantity, unlike the pre-training phase, which requires a large quantity of data (of potentially low quality). Fine-tuning is what helps transform your model into an “assistant” (in the general sense of the word).
This is why services like Amazon Mechanical Turk (MTurk), which let companies outsource tasks such as data annotation and labeling, exist. Another example is SurgeAI, the platform known for providing human data-labeling services that help power Anthropic’s Claude 3 (Source).
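To make fine-tuning concrete, here is a minimal sketch using Hugging Face’s Trainer API. It’s only an illustration under assumptions: GPT-2 stands in for a real base model, and the two toy Q&A pairs stand in for a carefully curated dataset.

```python
# A minimal fine-tuning sketch (assumptions: GPT-2 as the base model,
# two toy Q&A pairs standing in for a hand-curated dataset).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Quality over quantity: a tiny, manually curated instruction-style dataset.
examples = {"text": [
    "Q: What does churn mean? A: The rate at which customers leave a service.",
    "Q: What is ARPU? A: Average revenue per user.",
]}

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: predict the next token
    return tokens

dataset = Dataset.from_dict(examples).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()
```

In practice you’d use thousands of curated examples and a much larger base model, but the mechanics are the same.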
Another method for making LLMs more domain-specific, while also addressing their inability to (1) provide sources and (2) keep their responses up to date, is Retrieval-Augmented Generation (RAG): a framework that retrieves relevant documents from an external knowledge base and feeds them to the model alongside the prompt.
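Here is a minimal RAG sketch to show the retrieve-then-generate flow. It assumes the sentence-transformers library; the documents, the all-MiniLM-L6-v2 embedding model, and the ask_llm() helper are illustrative assumptions, not a prescribed stack.

```python
# A minimal RAG sketch: embed documents, retrieve the most relevant ones,
# and ground the LLM's answer in them. ask_llm() is a hypothetical helper.
import numpy as np
from sentence_transformers import SentenceTransformer

# Your domain-specific knowledge base (made-up examples).
docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available Monday through Friday, 9am to 5pm.",
    "Standard shipping to Canada takes 5 to 7 business days.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source embedder
doc_vectors = embedder.encode(docs)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question (cosine similarity)."""
    q = embedder.encode([question])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

question = "How long do I have to return an item?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = ask_llm(prompt)  # hypothetical: send the grounded prompt to any LLM
```

Because the model answers from retrieved documents, you can cite the sources and refresh the knowledge base without retraining.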
5 — Getting practical experience using LLMs can be as easy as making a call to the OpenAI API, but lots of free and open-source options can be found via Hugging Face.
If you are a beginner or want a quick start, the OpenAI API is likely the best choice due to its simplicity and powerful models, but it’s not free.
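For reference, a first call really is just a few lines. This sketch assumes the openai Python SDK (v1+) with an OPENAI_API_KEY environment variable set; the model name is just an example.

```python
# A minimal OpenAI API call (assumes openai>=1.0 and OPENAI_API_KEY is set;
# the model name is an example, not a recommendation).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what an LLM is in one sentence."},
    ],
)
print(response.choices[0].message.content)
```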
If you are comfortable with Python and want more control, Hugging Face provides a more flexible, open-source approach. The BART large model, fine-tuned on CNN/Daily Mail, is a great starting point for getting familiar with LLMs.
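Here is a minimal sketch of that starting point, using the transformers pipeline API with the facebook/bart-large-cnn checkpoint (BART large fine-tuned on CNN/Daily Mail); the sample text is made up.

```python
# A minimal summarization sketch with Hugging Face Transformers
# (facebook/bart-large-cnn is BART large fine-tuned on CNN/Daily Mail).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Large language models are trained on massive text corpora and can "
    "translate, summarize, and answer questions. Fine-tuned variants such "
    "as BART excel at specific tasks like news summarization."
)
print(summarizer(article, max_length=40, min_length=10, do_sample=False)[0]["summary_text"])
```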
Resources
Core resources I used during my learning:
Applied LLMs Mastery 👩🏫: A FREE 10-week course developed by Aishwarya Naresh Reganti, an ML researcher, lecturer, and Gen AI Tech Lead at AWS.
Intro to Large Language Models 🎬: A 1-hour-long intro to LLMs by Andrej Karpathy, one of the founding members of OpenAI.
What’s wrong with LLMs and what we should be building instead 🎬: A talk by Thomas G. Dietterich, one of the pioneers of the field of machine learning.
No DS Left Behind
There are many Data Scientists out there still wondering if learning about LLMs is worth the time and effort.
We know that Machine Learning is an essential skill for Data Scientists. We also know that LLMs currently play a significant role in the field, and I believe they will continue to be influential in the foreseeable future.
A fellow Data Scientist, ML Engineer, and newsletter author recently shared his thoughts on this, along with some good data on the popularity of LLMs. But this isn’t just about keeping up with trends. The way I see it, there are two good reasons for Data Scientists to add LLMs to their toolkit:
Wide range of applications: LLMs can power much more than just chatbots, and there are tons of possible use cases yet to explore.
Growing demand for NLP skills: Many companies are actively seeking data scientists with knowledge of LLMs and NLP. Having these skills can make you more competitive in the job market.
💡 If you are just getting started with learning about Data Science, your first priority should be building a solid foundation in statistics and probability and getting lots of practice doing statistical analysis.
After spending a week learning about LLMs, there is no doubt in my mind that this is a skill I need to add to my toolkit. There are lots of ways I can leverage them in my day-to-day as a Data Scientist, and being able to extract insights from unstructured text data ALWAYS comes in handy.
What I plan to do next is get my hands dirty a bit (or a lot) by applying what I learned to a technical project. But first, I need a break from typing the word LLM.
Thank you for reading! I hope this article inspires you to take on a new challenge and keep developing your skills as a Data Scientist.
- Andres
Don’t forget to hit the like ❤️ button at the bottom of this email to help support me. It really makes a difference!