Run ChatGPT-like LLMs on your laptop in 3 lines of code

141 points by amaiya 2 years ago on hackernews | 36 comments

gitgud | 2 years ago

It's a little bit ironic that the package is called "onprem" but the second line imports an external model from huggingface...

    from onprem import LLM
    url = 'https://huggingface.co/TheBloke/CodeUp-Llama.....'
    llm = LLM(url, n_gpu_layers=43) # see below for GPU information
Anyway looks like a great little project, nice work!

keyle | 2 years ago

I'm not the OP, but I believe it is correct. It says 'run' an LLM, meaning you have to get it from somewhere.

Training your own from nothing is a monumental task, I don't think many of us can realistically do it from scratch.

scubbo | 2 years ago

It's still running on-prem if the execution happens on compute hardware that you control. If importing code from somewhere else disqualified something from being on-prem, then no application would ever be on-prem - when's the last time a meaningful application had no dependencies!?

paxys | 2 years ago

Code that runs on prem doesn't magically show up on your server, it still has to be fetched from somewhere.

ralphc | 2 years ago

How big are these models that are downloaded? Is 7B 7 gigabytes of data?

notpublic | 2 years ago

If the model is stored as 32-bit floats, the file will be roughly 4 bytes per parameter, i.e. ~4x the parameter count.

You can see the actual size on Hugging Face. For example, https://huggingface.co/WizardLM/WizardLM-7B-V1.0/tree/main

For sizes with quantization, see https://github.com/ggerganov/llama.cpp#quantization

raincole | 2 years ago

It's actually very straightforward.

7B = 7 Billions parameters.

If 1 parameter takes 1 byte then a 7B model would be about 7GB in size.

Usually 1 parameter takes 4 bytes though (a 32-bit float), so it would be about 28GB.

You can also use 16-bit floats, or quantize down to 8-bit or even 4-bit.
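The arithmetic above can be sketched in a few lines of Python (real quantized files run slightly larger because of metadata and mixed-precision layers):

```python
def model_size_gb(params_billion, bits_per_param):
    """Rough model size in GB: parameter count times bytes per parameter."""
    bytes_per_param = bits_per_param / 8
    return params_billion * 1e9 * bytes_per_param / 1e9

# a 7B model at common precisions
print(model_size_gb(7, 32))  # 28.0 (float32)
print(model_size_gb(7, 16))  # 14.0 (float16)
print(model_size_gb(7, 4))   # 3.5  (4-bit quantized)
```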

artursapek | 2 years ago

wow, this looks approachable. will have to try it tomorrow

satvikpendem | 2 years ago

Ollama.ai is pretty good too, any differences with this one?

Seems like all of these open source wrappers, just like the closed source ones, are in a race to the bottom.

jmorgan | 2 years ago

I work on Ollama. It's a good question since there are quite a few tools emerging in this space.

The focus for Ollama is to make downloading and serving a model easy – there's an included `ollama` CLI but it's all powered by a REST API. Hopefully, it's a way to support really cool applications of LLMs like OP's onprem tool.
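As a rough sketch of that design: the `ollama` CLI talks to a local server, and other tools can hit the same REST endpoint directly. A minimal Python client might look like this, assuming an Ollama server is already running on its default port (the model name below is just an example):

```python
import json
import urllib.request

def generate_request(prompt, model="llama2", host="http://localhost:11434"):
    """Build the POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )

def ollama_generate(prompt, **kwargs):
    """Send the prompt to a locally running Ollama server, return the text."""
    with urllib.request.urlopen(generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```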

OP's tool is more focused on ingesting and analyzing data. There seems to be quite a bit of interesting opportunity as an application of LLMs – e.g. analyzing not only local docs but data in a remote data store.

kacesensitive | 2 years ago

I've built a couple projects that use Ollama. Thanks for making such a cool tool!

[Deleted] | 2 years ago

[deleted]

d4rkp4ttern | 2 years ago

Related: say I’ve written code that uses OpenAI API, and code that handles streaming, retries, function-calls. And now I want to switch it to using a local non-API-based model such as llama2, without changing too much code.

Is there a library that offers a layer on top of local models that simulates the OpenAI API?
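One common pattern (a sketch, not a recommendation of any particular tool): run a local server that speaks the OpenAI wire format, e.g. llama-cpp-python's `python -m llama_cpp.server`, and point existing client code at it by swapping the base URL. The port and model name below are placeholders:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # local OpenAI-compatible server

def chat_request(prompt, model="local-model"):
    """Build an OpenAI-style /chat/completions request for the local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt, model="local-model"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(chat_request(prompt, model)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Since the wire format matches, streaming code written against the OpenAI API should mostly carry over, though function-calling depends on what the local model actually supports.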

Daviey | 2 years ago

d4rkp4ttern | 2 years ago

Thanks, this was hard to find

dangus | 2 years ago

[flagged]

local_person | 2 years ago

[dead]

froggertoaster | 2 years ago

Wonderful. I love the advent of open source LLMs, and love the turnkey nature of this product.

What sold me on ChatGPT was its efficacy combined with its ease of use. As the owner of a consultancy, I find time to do technical exploration to be more and more scarce - stuff like this that makes it super easy for me to run an LLM is most welcome.

rawoke083600 | 2 years ago

We have come far; not too long ago it was 'sudoku solver in 5 lines of bash'!

Lol, but for real: today, for the first time, I found myself browsing new laptops looking for high VRAM because of LLMs.

[Deleted] | 2 years ago

[deleted]

moneywoes | 2 years ago

What is the most effective local llm model?

tommek4077 | 2 years ago

Totally depends on your use case. There are models optimized for stories, code, knowledge...

eriky | 2 years ago

[dead]

[Deleted] | 2 years ago

[deleted]

keyle | 2 years ago

Saving you some time: if you have a MacBook Pro M1/M2 with 32GB of RAM (I presume a lot of HN folks do), you can comfortably run the `34B` models on CPU or GPU.

And... if you'd like a more hands-on approach, here is how to get llama running locally by hand

    - https://github.com/ggerganov/llama.cpp 
    - follow instructions to build it (note the `METAL` flag)
    - https://huggingface.co/models?sort=trending&search=gguf
    - pick any `gguf` model that tickles your fancy, download instructions will be there
and a little script like this will get it running swimmingly

   ./main -m ./models/<file>.gguf --color --keep -1 -n -1 -ngl 32 --repeat_penalty 1.1 -i -ins
Enjoy the next hours of digging through flags and the wonderful pit of time ahead of you.

NOTE: I'm new at this stuff, feedback welcome.

seaal | 2 years ago

Note that the OP repo doesn't yet support GGUF format.

keyle | 2 years ago

Yes it does. Or do you mean the OP's github repo?

seaal | 2 years ago

Yeah I was referring to OP, oops

>We currently support models in GGML format. However, the GGML format has now been superseded by GGUF.

>Future versions of OnPrem.LLM will use the newer GGUF format.

kgwgk | 2 years ago

You can do the first step a lot faster with nix: `nix shell github:ggerganov/llama.cpp -c llama`

MuffinFlavored | 2 years ago

> GGUF

GGUF is a new format introduced by the llama.cpp team on August 21st, 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. The key benefit of GGUF is that it is an extensible, future-proof format which stores more information about the model as metadata.

zitterbewegung | 2 years ago

With an M1 or M2 Max with 64GB of RAM and up, you can use llama.cpp to run the original 65B LLaMA model from Facebook.

Here is the starting output of running Llama 65B in a gist

https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...

phren0logy | 2 years ago

I learned about ollama here on HN, and have found that to be super easy. Worth a look to compare with this one if you are looking to run LLMs locally.

3abiton | 2 years ago

What's the difference with OP?

geepytee | 2 years ago

link?

carbocation | 2 years ago

brianjking | 2 years ago

https://ollama.ai/

There's also plenty of other local LLM type tools like GPT4All, LMStudio, Simon Wilsons LLM, privateGPT, and about a million other setups.

zyl1n | 2 years ago

Just yesterday, I found out that Simon Wilson's name is Simon Willison.

jmorgan | 2 years ago

Love how simple of an interface this has. Local LLM tooling can be super daunting, but reducing it to a simple ingest() and then prompt() is really neat.

By chance, have you checked out Ollama (https://github.com/jmorganca/ollama) as a way to run the models like Llama 2 under the hood?

One of the goals of the project is to make it easy to download and run GPU-accelerated models, ideally with everything pre-compiled so it's easy to get up and running. It has an API that can be used by tools like this – would love to know if it would be helpful (or not!)

There's a LangChain model integration for it and a PrivateGPT example as well that might be a good pointer on using the LangChain integration: https://github.com/jmorganca/ollama/tree/main/examples/priva.... There's also a LangChain PR open to add support for generating embeddings, although there's a bit more work to do to support the major embedding models.

Best of luck with the project!