OSAI more… 8th Edition — It’s all about the data… or is it?
Byline: The Editor in Chief (The EiC)
Week of June 09 - 15, 2024
Artificial intelligence and machine learning can be boiled down to software doing operations on data ... to generate more data ... about the data.
This week, we've got some juicy stories about data—where it comes from, whose data is in the mix, and how much of it is needed for an AI system to be Open Source AI. Grab your wakeboard and jump behind the speedboat; we’re going for a ride on the datalake.
OSAI News
Crescendo Jailbreak Technique: Microsoft researchers have developed an innovative multi-turn jailbreak technique called Crescendo. The technique uses a series of seemingly harmless inputs to gradually evade a model’s safety guardrails. The automated tool, Crescendomation, has successfully jailbroken several Open Source AI chatbots. Expect Crescendomation to be released as Open Source soon. Just imagine, now you can jailbreak chatbots without lifting a finger! Or maybe just a thumb.
LAION-5B Dataset Controversy: The LAION-5B AI training image dataset recently came under fire for containing real images of children, as discovered by Human Rights Watch. The maintainers promptly removed the links, but the content remains on the web. This highlights a larger issue: the data exists out there, and Open Source datasets like LAION-5B make it visible and actionable. Models trained on it, like Stability AI’s Stable Diffusion, don’t offer the same level of transparency. The takeaway? Open Source can sometimes mean open scrutiny. Remember, once on the internet, always on the internet!
Octo: The Generalist Robot AI: Researchers from UC Berkeley, Stanford, and CMU have released Octo, a generalist Open Source AI model for robot control. This neural network can manage various tasks across different robots, from picking up spoons to wiping tables. Think of it as the Swiss Army knife of robot AI. The team’s goal is to create foundation models for robot control, akin to how language models like ChatGPT serve natural language processing. Octo: Eight arms to hold your attention and a brain to match.
Timescale’s PostgreSQL Extensions: Timescale has introduced two new PostgreSQL extensions, pgvectorscale and pgai, designed to enhance AI development use cases such as retrieval-augmented generation (RAG) search and agent applications. The vendor claims they make PostgreSQL faster and cheaper than the proprietary vector databases used for AI. Looks like PostgreSQL just leveled up in the AI game! Finally, something that gets better with age.
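For the curious, here is a rough sketch (not taken from Timescale’s docs) of the kind of vector-similarity lookup a RAG app runs against PostgreSQL, written with the standard pgvector syntax that pgvectorscale builds on. The connection string, table, and data are made up for illustration, and none of pgvectorscale’s or pgai’s own features appear here.

```python
# Hypothetical sketch: a nearest-neighbor lookup of the kind RAG apps run,
# using standard pgvector syntax. Connection string, table, and data are
# invented for illustration; pgvectorscale/pgai specifics are not shown.
import psycopg

conn = psycopg.connect("dbname=rag_demo user=postgres")

with conn, conn.cursor() as cur:
    # pgvector provides the `vector` type and the `<=>` cosine-distance operator.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id        bigserial PRIMARY KEY,
            body      text,
            embedding vector(3)   -- tiny dimension, just for the sketch
        )
    """)
    cur.execute(
        "INSERT INTO documents (body, embedding) VALUES (%s, %s::vector)",
        ("hello data lake", "[0.1, 0.2, 0.3]"),
    )

    # The core RAG query: find the documents closest to the embedding of the
    # user's question, smallest cosine distance first.
    query_embedding = "[0.1, 0.25, 0.3]"
    cur.execute(
        """
        SELECT body, embedding <=> %s::vector AS distance
        FROM documents
        ORDER BY distance
        LIMIT 5
        """,
        (query_embedding,),
    )
    for body, distance in cur.fetchall():
        print(f"{distance:.4f}  {body}")
```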
Mistral’s Fundraising Success: Mistral has successfully raised €600 million from investors like IBM, Samsung, Nvidia, and Cisco. This underscores the appeal of Open Source AI to big money. However, Mistral isn't a pure Open Source player, and siliconANGLE's article mistakenly refers to Codestral as Open Source. Even Mistral knows better than to mislabel their own license. When life gives you money, make sure you read the fine print.
Deciphering RHEL AI: This article dives deep into Red Hat Enterprise Linux AI (RHEL AI), detailing how Red Hat’s platform aims to bring AI to non-technical users. It’s like giving everyone in the office a superpower. With great power comes great responsibility… to update your software.
GitHub CEO’s Keynote on Open Source: During his GitHub Universe keynote, CEO Thomas Dohmke highlighted the importance of Open Source and AI in supporting India’s rise as a global software development leader. India, it seems, is ready to take the software crown. Namaste, AI!
Databricks Unity Catalog OSS: Databricks announced Unity Catalog OSS, an Open Source project for data and AI governance. This looks like a response to Snowflake’s recent Open Source catalog. The code wasn’t live at the time of writing, so we’ll be watching for updates. Unity Catalog OSS: coming soon to a GitHub near you. Stay tuned, data fans!
Protect AI’s Security Bounty: Security firm Protect AI found several critical vulnerabilities in Open Source AI/ML tools through their bounty program. Another win for transparency and community-driven security improvements. If only finding your car keys was this rewarding.
Raspberry Pi AI Kit: Raspberry Pi launched a $70 AI kit, ideal for hobbyists and design hackers. The kit includes the Hailo-8L co-processor and pre-installed Hailo AI Tappas libraries for real-time complex AI vision processing. Finally, a kit that lets you tinker with AI without breaking the bank. After all, it’s not just about the size of your hardware; it’s how you use it.
Raspberry Pi: The AI Tinkerer’s Dream: This article argues why the Raspberry Pi, with its AI kit and 13 TOPS of processing power, is the perfect companion for AI enthusiasts. Because every tinkerer needs a trusty sidekick.
Common Crawl and Data Licensing: The certainty of Open Source licensing stands in contrast to the ambiguous licensing of web data. The Common Crawl project, which has been collecting web snapshots since 2007, is under increasing scrutiny. This Wired post dives into the legal complexities, while the Mozilla research report provides the backstory. It’s a tangled web of data and legalities. Why did the spider look for a job in tech? It was a natural at debugging. What kind of role did the spider actually land? Senior Web Crawler.
Yandex’s YaFSDP: Yandex has released YaFSDP, an Open Source sharded data-parallelism framework that reduces GPU resource consumption during model training. Looks like training just got a bit cheaper and greener. Finally, a training program that doesn’t make you sweat.
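We are not showing YaFSDP’s own API here; as a rough illustration of what sharded data parallelism means in practice, here is a minimal sketch using PyTorch’s built-in FSDP, which follows the same general idea: each GPU keeps only a shard of the parameters, gradients, and optimizer state instead of a full replica. The model, hyperparameters, and launch setup are placeholders.

```python
# Illustrative only: PyTorch's built-in FSDP, not YaFSDP itself, used to show
# where sharded data parallelism saves memory: parameters, gradients, and
# optimizer state are sharded across ranks rather than fully replicated.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main() -> None:
    # Assumes launch via `torchrun --nproc_per_node=<num_gpus> this_file.py`,
    # which sets RANK / LOCAL_RANK / WORLD_SIZE for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # Wrapping shards the parameters across ranks; full weights are gathered
    # only when needed for forward/backward compute, then freed again.
    sharded_model = FSDP(model)
    optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-4)

    for _ in range(10):
        batch = torch.randn(8, 1024, device="cuda")
        loss = sharded_model(batch).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```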
BAAI’s Open Source AI Models: The Beijing Academy of Artificial Intelligence (BAAI) has launched a suite of Open Source AI models and tools. We weren’t entirely sure we found all the right repos, so let us know if you find any others! When in doubt, just keep searching the repo.
OSAI Opines
AI Security and Open Source: Derek Zimmer from the Open Source Technology Improvement Fund (OSTIF) lays out a sharp and on-point argument for why the industry needs to prioritize the security of Open Source AI. He also emphasizes the importance of OWASP’s extension on LLM application vulnerabilities. OSTIF is thinking about building a barn with a smart lock on the door, before you put the horses away.
Learning from Kubernetes: In an InfoWorld opinion piece, a seasoned product management expert from the Open Source community explains why the Open Source AI world should look to the Kubernetes project for guidance on development, governance, funding, and support. This post is full of clear wisdom, friends. Kubernetes: not just a buzzword, but a playbook.
A Red Hat for AI?: Another InfoWorld piece suggests there needs to be “a Red Hat for AI.” The subtitle seems to question if Red Hat has answered this at their recent summit: “We’re still waiting for a trusted vendor to spare enterprises from the confusion and guesswork of artificial intelligence.” Spoiler alert: Red Hat might be closer than you think. Certainly they think so, anyway.
Funding Open Source AI with Crypto: This Web3 perspective proposes funding Open Source AI model training using tokens on a blockchain. The author makes some interesting arguments but also lumps restrictive LLaMA-licensed models in with Open Source ones. A “new open-source generative AI license” with questionable provisions is suggested. Sometimes the old ways are the best ways.
Go Small or Go Home: CIO magazine advises IT leaders to go for smaller, purpose-built language models for their customized needs. Because sometimes, bigger isn't always better.
Legal Insights on Generative AI: This set of otherwise generic legal suggestions about generative AI does include a clear and accurate description of Open Source. A lawyer, a robot, and a programmer walk into a bar… and only one of them knows the Open Source Definition.
OpenChessRobot: Play Chess with Robots: Researchers at Delft University of Technology have developed OpenChessRobot, a robotic system for human-robot interaction in chess. The code and datasets are available on GitHub. Finally, a robot that won’t cheat at chess—unless it’s programmed to.
OSAI FUD
Nothing on the teletype this week! Sometimes, no news is good news.
Eagle Eye on: OSAI Legislation and Policy
EU’s AI Regulations: This post from The Critic has strong opinions about the future of AI in Europe under current EU regulations. The sweeping legislation could have negative impacts on Open Source AI and has drawn criticism from across the political and social spectrum. When in Europe, don’t let your AI take a vacation.
California SB1047 Update: This article covers the background and status of California SB1047, noting that the bill’s author has revised it to provide some exemption for Open Source developers from accountability for users of their platforms. Looks like even lawmakers know when to backpedal.
Gift to Big Tech? This post discusses the issues with SB1047, arguing that the current version favors large tech companies the bill is supposedly designed to control. The author mistakenly views LLaMA and other models as Open Source. Big Tech always seems to get the best gifts.
International Perspective on SB1047: A short post for an international audience in Korea, making the common mistake of viewing Meta as creating Open Source AI. No, Meta’s not in the Open Source AI club, not yet.
Andreessen’s AI Regulation Concerns: At a recent Stanford University meeting, venture capitalist Marc Andreessen warned about the dangers of AI regulation, even suggesting “stealth bombers” might target data centers. Sounds like a plot for a new tech thriller.
Colorado’s AI Law Revision: The governor and attorney general of Colorado announced they will work to revise the recently passed AI law SB205 in response to public and business criticism. Even laws need a version update sometimes.
SB1047 Resources: For those tracking SB1047, here are the California Senate press release, the full bill, and a link to compare versions of the bill as it was amended. Useful for figuring out who influenced what.
Open Source AI Definition Update
Eleventh Edition of OSAID Townhall: Catch up on the latest discussions from the OSAID Townhall. Because sometimes, the best ideas come from a good ol’ town meeting.
Data Requirements Discussion: Ongoing discussions about data requirements in the OSAID focus on how much data, and what kinds of data from the training set, are necessary for replication. The discussion expanded this week to include explaining what is meant by “Data information” in the OSAID. Data, data, data — it’s a conversation that keeps on giving.
Harvard’s Judicial AI Study: A recent paper from Harvard found that current judicial AI technologies perform worse than human judges at predicting recidivism. The key ingredient? An Open Source AI for empirical evaluation and analysis. AI: still learning to judge a book by its cover.
OSAI WTFaux?
Modello Italia’s Licensing Mystery: iGenius is touting their Modello Italia as 100% Open Source, but we’re skeptical. Digging through the terms and conditions reveals restrictions that don’t align with the Open Source Definition. It’s like promising pizza and delivering a cardboard cutout.
Nvidia’s Nemotron-4 340B: Nvidia released Nemotron-4 340B, a family of models for generating synthetic data for LLM training. The “NVIDIA Open Model License Agreement” might actually be OSD-compliant!? If so, it’s a good step for Open Source AI. Let’s hope this license isn’t too good to be true.
HUSKY: The Methodical Language Agent: Researchers have built HUSKY, a language agent designed to handle diverse, complex tasks. It methodically generates actions, then uses domain-specific models and tools to assist in performing the actions. Currently, there’s no license file in the GitHub repository. Fingers crossed it’s just an oversight.
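To make that pattern concrete, here is a toy sketch of the generate-an-action, route-it-to-a-tool loop described above. It is not HUSKY’s code or API; every function name and tool below is a hypothetical stand-in.

```python
# Hypothetical sketch of a generate-an-action / route-to-a-tool agent loop.
# Nothing here is HUSKY's real API; the "model" and tool registry are
# stand-ins for illustration only.
from dataclasses import dataclass

@dataclass
class Step:
    action: str            # e.g. "search", "calculate", "finish"
    argument: str          # input handed to the chosen tool
    observation: str = ""  # what the tool returned

def propose_next_step(task: str, history: list[Step]) -> Step:
    """Stand-in for the action-generator model: decide what to do next."""
    if not history:
        return Step(action="calculate", argument="2 + 2")
    return Step(action="finish", argument=history[-1].observation)

# Stand-ins for the domain-specific models and tools the agent routes to.
TOOLS = {
    "calculate": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy math tool
    "search": lambda query: f"(pretend search results for {query!r})",
}

def run_agent(task: str, max_steps: int = 5) -> str:
    history: list[Step] = []
    for _ in range(max_steps):
        step = propose_next_step(task, history)
        if step.action == "finish":
            return step.argument
        step.observation = TOOLS[step.action](step.argument)
        history.append(step)
    return "gave up"

print(run_agent("What is 2 + 2?"))  # -> 4
```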
Fauxpen AI
Stable Audio Open: The Openwashing Continues: Stability AI’s Stable Audio Open claims to be Open Source, but it actually relies on their NOT Open Source models on Hugging Face. Stability explains how they parse their licensing model: “This open model provides a glimpse into generative AI for sound design while prioritising responsible development alongside creative communities.” (Emphasis added.) It’s a freemium model — free-ish, closed-ish, and open-ish. Stability AI, pick a side!
Helping Our Fellow Journalists
Emerge’s Misstep: Stable Diffusion is NOT Open Source. Sorry, Emerge, but that’s a big “nope.”
Maeil Business Newspaper: Meta’s AI models are not Open Source. Just because you can see them doesn’t mean you can use them freely.
Fierce Network’s Mix-up: Stability AI has a mix of Open Source and non-Open licenses in their repo. Be cautious of company press releases as sources.
Benzinga’s Zinger: Neither Meta’s LLaMa models nor Alibaba’s Qwen2 are Open Source. That’s some classic openwashing at work.
siliconANGLE and Emerge’s Error: The Stability AI license isn’t Open Source. Publicly viewable model weights don’t equate to Open Source. They don’t create a frictionless place to collaborate, and they don’t assure developers that it is safe to incorporate the models into their work the way genuine Open Source does.
PassionateGeekz Got It Right: They correctly pointed out Stability AI’s “claimed to be” status. Nicely done!
siliconANGLE’s Dream Machine Confusion: We’re beginning to worry about the writers and fact checkers at siliconANGLE magazine. They seem to consider Open Source to be a loosely defined concept for anything that isn’t a black box product. Here they describe a classic freemium model as an “open-source approach.” It’s like calling Pixar movie trailers an “Open Source approach to releasing movies.”
Analytics Insight’s Error: Starting your list of “best Open Source” LLMs with GPT-3, which is closed, is a surefire way to lose credibility.
-30-