SymmetricalDataSecurity: TextSynth Server

Friday, March 17, 2023

TextSynth Server

ts_server is a web server that exposes a REST API to large language models. The models can be used, for example, for text completion, question answering, classification, chat, translation, and image generation.

It has the following characteristics:

Supports many Transformer variants (GPT-J, GPT-NeoX, GPT-Neo, OPT, Fairseq GPT, M2M100, CodeGen, GPT2, T5, RWKV, LLAMA) and Stable Diffusion.
Integrated REST JSON API for text completion, translation and image generation. It is used by textsynth.com.
Very high performance for small and large batches on CPU and GPU.
Efficient custom 8-bit and 4-bit quantization.
Larger models work optimally on lower-cost GPUs (e.g. RTX 3090, RTX A6000) thanks to efficient quantization.
Everything is included in a single binary. There are very few external dependencies (Python is not needed), so installation is easy on most Linux distributions.
Uses the LibNC library for simple tensor manipulation using the C language.
Simple command line tools (ts_test, ts_sd) are provided to test the various models.
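
As a rough illustration of the REST JSON API listed above, the sketch below builds and sends a text completion request using only the Python standard library. The base URL (http://localhost:8080), the endpoint path (/v1/engines/<model>/completions), and the "prompt" and "max_tokens" fields are assumptions modeled on the public textsynth.com API and may not match a given ts_server setup; the function names are hypothetical.

```python
import json
import urllib.request

# Assumed value: adjust to match the actual ts_server configuration.
BASE_URL = "http://localhost:8080"


def build_completion_request(model, prompt, max_tokens=50, base_url=BASE_URL):
    """Build (without sending) a POST request for a text completion."""
    # Assumed endpoint path, modeled on the textsynth.com API.
    url = f"{base_url}/v1/engines/{model}/completions"
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, max_tokens=50, base_url=BASE_URL):
    """Send the request and return the decoded JSON response as a dict."""
    req = build_completion_request(model, prompt, max_tokens, base_url)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a server running, a call such as complete("gptj_6B_q8", "The quick brown fox") would return the parsed JSON reply. Building the request separately from sending it makes the payload easy to inspect without a live server.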

The CPU version is released as binary code under the MIT license. The GPU version is commercial software. Please contact the author for the exact terms.

Download

Benchmarks

CPU: the speed is measured on an AMD Epyc 7313 CPU using 8 threads (ts_test -T 8). 100 tokens are generated.
GPU: the speed is measured on an RTX A6000 GPU. 100 tokens are generated.

Model⁽³⁾	CPU Speed (tokens/s)	GPU Speed (tokens/s)
gptj_6B_q8	12.6	84.2
gptneox_20B_q4	4.3	40.8
gptneox_20B_q8	3.5	27.4
llama_65B_q4	1.4	13.8
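
To make these figures easier to compare, the short sketch below derives the GPU-to-CPU speed ratio and the time needed to generate 100 tokens. The numbers are copied from the table; the variable and function names are illustrative only.

```python
# (CPU tokens/s, GPU tokens/s), copied from the benchmark table.
BENCHMARKS = {
    "gptj_6B_q8": (12.6, 84.2),
    "gptneox_20B_q4": (4.3, 40.8),
    "gptneox_20B_q8": (3.5, 27.4),
    "llama_65B_q4": (1.4, 13.8),
}


def summarize(benchmarks, n_tokens=100):
    """Return {model: (GPU/CPU speed ratio, CPU seconds, GPU seconds)}."""
    out = {}
    for model, (cpu, gpu) in benchmarks.items():
        out[model] = (gpu / cpu, n_tokens / cpu, n_tokens / gpu)
    return out


for model, (ratio, cpu_s, gpu_s) in summarize(BENCHMARKS).items():
    print(f"{model}: GPU is {ratio:.1f}x faster; "
          f"100 tokens take {cpu_s:.1f} s on CPU, {gpu_s:.1f} s on GPU")
```

On these figures the GPU run is between about 6.7 and 9.9 times faster than the 8-thread CPU run.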

Available Models

The model files that can be used with TextSynth Server are provided here. Each model was evaluated with lm-evaluation-harness, running on TextSynth Server on an RTX A6000 GPU.

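Language Models:

Model	RAM (GB)	lambada (ppl)	lambada (acc)	hellaswag (acc_norm)	winogrande (acc)	piqa (acc)	coqa (f1)	average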
bloom_560M	2	29.176	36.8%	35.8%	51.4%	63.7%	36.0%	44.7%
codegen_6B_mono_q4	5	69.409	28.0%	35.7%	51.1%	60.2%	38.0%	42.6%
codegen_6B_mono_q8	8	67.262	28.1%	35.8%	50.8%	60.1%	39.1%	42.8%
fairseq_gpt_13B	27	3.567	71.9%	72.7%	67.5%	77.6%	70.1%	71.9%
fairseq_gpt_13B_bf4	9	3.646	71.2%	72.5%	67.6%	77.4%	70.6%	71.9%
fairseq_gpt_13B_bf8	15	3.565	71.8%	72.7%	67.2%	77.7%	70.0%	71.9%
flan_t5_base	1	12.891	54.2%	36.5%	54.7%	65.8%	62.1%	54.7%
flan_t5_base_q8	1	13.098	54.2%	36.4%	54.2%	65.7%	61.8%	54.5%
flan_t5_small	1	23.343	46.7%	29.2%	50.0%	62.4%	47.9%	47.2%
flan_t5_small_q8	1	23.449	46.7%	29.2%	49.7%	62.4%	48.2%	47.2%
flan_t5_xxl_q4	7	3.010	77.7%	71.5%	73.4%	77.6%	71.8%	74.4%
flan_t5_xxl_q8	13	3.049	77.8%	72.1%	75.1%	77.8%	73.1%	75.2%
gpt2_117M	1	40.110	32.9%	31.1%	52.1%	62.9%	27.3%	41.3%
gpt2_1558M	4	10.637	51.3%	50.8%	58.4%	70.8%	53.2%	56.9%
gpt2_1558M_q8	2	10.655	51.2%	50.8%	58.6%	70.8%	53.2%	56.9%
gpt2_345M	1	18.272	43.5%	39.4%	53.3%	67.7%	43.1%	49.4%
gpt2_345M_q8	1	18.452	43.1%	39.4%	53.1%	67.5%	41.9%	49.0%
gpt2_774M	2	12.966	47.8%	45.4%	55.6%	70.4%	48.5%	53.5%
gpt2_774M_q8	1	12.928	47.9%	45.4%	55.3%	70.3%	48.2%	53.4%
gptj_6B	13	4.124	69.0%	66.2%	64.8%	75.5%	66.9%	68.5%
gptj_6B_q4	4	4.153	68.9%	65.7%	63.9%	74.4%	67.0%	68.0%
gptj_6B_q8	7	4.122	69.1%	66.2%	64.4%	75.4%	66.4%	68.3%
gptneox_20B	43	3.657	72.6%	71.4%	65.5%	77.5%	73.3%	72.0%
gptneox_20B_q4	13	3.711	72.0%	69.3%	64.8%	76.7%	70.8%	70.7%
gptneox_20B_q8	23	3.659	72.6%	71.3%	65.8%	77.3%	72.9%	72.0%
llama_13B_q4	8	3.130	77.1%	78.6%	72.2%	78.3%	77.8%	76.8%
llama_13B_q8	15	3.178	76.5%	79.1%	73.2%	79.1%	77.1%	77.0%
llama_30B_q4	20	2.877	77.5%	82.4%	75.7%	80.2%	80.2%	79.2%
llama_30B_q8	36	2.853	77.7%	82.7%	76.3%	80.3%	80.4%	79.5%
llama_65B_q4	39	2.760	78.5%	83.9%	76.6%	81.4%	83.2%	80.7%
llama_7B	14	3.463	73.6%	76.2%	70.4%	78.1%	75.4%	74.7%
llama_7B_q4	5	3.549	73.2%	75.5%	70.4%	78.0%	74.7%	74.4%
llama_7B_q8	8	3.453	73.7%	76.1%	70.2%	78.0%	75.5%	74.7%
opt_125M	1	26.028	37.9%	31.3%	50.2%	63.2%	23.4%	41.2%
opt_30B_q4	19	3.656	71.5%	72.1%	68.0%	77.4%	69.9%	71.8%
opt_30B_q8	34	3.628	71.6%	72.3%	68.2%	77.7%	71.4%	72.3%
opt_66B_q4	40	3.308	73.4%	74.4%	68.4%	78.5%	75.0%	73.9%
pythia_deduped_1.4B	3	6.546	63.1%	52.2%	57.1%	72.7%	52.6%	59.5%
pythia_deduped_1.4B_q8	2	6.577	63.3%	52.1%	55.7%	73.1%	53.0%	59.4%
pythia_deduped_12B	25	3.854	70.9%	69.2%	63.9%	76.3%	70.8%	70.2%
pythia_deduped_12B_q4	8	4.187	69.2%	68.5%	63.1%	76.4%	69.6%	69.4%
pythia_deduped_12B_q8	14	3.857	70.9%	69.2%	64.2%	76.1%	70.9%	70.3%
pythia_deduped_160M	1	26.380	36.9%	32.3%	51.4%	63.8%	23.2%	41.5%
pythia_deduped_1B	3	7.273	58.5%	49.0%	54.5%	71.0%	49.9%	56.6%
pythia_deduped_1B_q8	2	7.286	58.4%	49.0%	54.9%	70.9%	49.0%	56.5%
pythia_deduped_2.8B	6	4.787	67.1%	61.6%	60.9%	74.4%	65.5%	65.9%
pythia_deduped_2.8B_q8	4	4.778	66.9%	61.5%	61.2%	74.5%	65.6%	66.0%
pythia_deduped_410M	1	10.827	51.7%	40.8%	54.0%	67.2%	43.0%	51.4%
pythia_deduped_410M_q8	1	10.729	51.8%	40.7%	53.8%	67.1%	42.7%	51.2%
pythia_deduped_6.9B	15	4.195	69.1%	65.7%	63.9%	75.1%	66.1%	68.0%
pythia_deduped_6.9B_q4	5	4.344	68.3%	65.0%	62.5%	75.3%	66.3%	67.5%
pythia_deduped_6.9B_q8	8	4.187	69.4%	65.7%	63.6%	75.5%	66.8%	68.2%
pythia_deduped_70M	1	96.126	25.6%	28.3%	54.4%	60.4%	13.1%	36.3%
rwkv_14B	29	3.819	71.6%	70.2%	63.1%	77.5%	47.2%	65.9%
rwkv_14B_q4	9	4.076	68.3%	69.8%	63.1%	77.1%	45.0%	64.7%
rwkv_14B_q8	16	3.806	71.9%	70.2%	63.0%	77.5%	47.1%	65.9%
rwkv_7B	16	4.396	67.5%	65.6%	61.9%	75.6%	39.7%	62.1%
rwkv_7B_q4	5	4.939	64.7%	64.8%	61.2%	75.4%	38.4%	60.9%
rwkv_7B_q8	9	4.395	67.5%	65.6%	61.6%	75.9%	40.2%	62.2%
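
One way to use this table is to pick the highest-scoring model that fits a given memory budget, for example the 24 GB of an RTX 3090. The sketch below does this for a handful of rows copied from the table, reading the second column as the approximate RAM requirement in GB and ranking by the last column, which matches the mean of the five accuracy scores before it. The budget, the choice of rows, and the helper names are illustrative assumptions, and a real deployment would need some headroom beyond the listed figure.

```python
# A few rows copied from the table: name, RAM (GB), lambada ppl,
# five benchmark scores, and a final column holding their mean.
ROWS = [
    "gptj_6B_q4 4 4.153 68.9% 65.7% 63.9% 74.4% 67.0% 68.0%",
    "gptneox_20B_q4 13 3.711 72.0% 69.3% 64.8% 76.7% 70.8% 70.7%",
    "llama_13B_q8 15 3.178 76.5% 79.1% 73.2% 79.1% 77.1% 77.0%",
    "llama_30B_q4 20 2.877 77.5% 82.4% 75.7% 80.2% 80.2% 79.2%",
    "llama_30B_q8 36 2.853 77.7% 82.7% 76.3% 80.3% 80.4% 79.5%",
    "llama_65B_q4 39 2.760 78.5% 83.9% 76.6% 81.4% 83.2% 80.7%",
]


def parse_row(line):
    """Split one whitespace-separated table row into typed fields."""
    name, ram, ppl, *scores = line.split()
    pct = [float(s.rstrip("%")) for s in scores]
    return {"name": name, "ram_gb": int(ram), "ppl": float(ppl),
            "scores": pct[:-1], "average": pct[-1]}


def best_under_budget(rows, budget_gb):
    """Return the row with the highest average that fits, or None."""
    fitting = [r for r in map(parse_row, rows) if r["ram_gb"] <= budget_gb]
    return max(fitting, key=lambda r: r["average"]) if fitting else None


best = best_under_budget(ROWS, 24)
print(best["name"], best["average"])
```

Licensing is not modeled here; the notes on restrictive licenses still apply to whichever model is selected.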

Additional Models:

Model	RAM (GB)	Description
m2m100_1_2B_q8	2	Translation between 100 languages
sd-v1-4	3	Stable Diffusion text-to-image version 1.4

SHA256 of all the models: sha256.txt.
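
A downloaded model file can be checked against the published digests with a few lines of Python. This is a minimal sketch that assumes sha256.txt uses the common sha256sum layout (a hex digest, whitespace, then the file name on each line); the actual layout is not shown here, and the function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def parse_digest_list(text):
    """Parse '<hex digest>  <file name>' lines (assumed layout) into a dict."""
    digests = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # sha256sum marks binary mode with a leading '*' on the name.
            digests[name.strip().lstrip("*")] = digest.lower()
    return digests


def verify(path, name, digest_text):
    """Return True if the file at `path` matches the digest listed for `name`."""
    expected = parse_digest_list(digest_text).get(name)
    return expected is not None and sha256_of_file(path) == expected
```

A typical call would read sha256.txt into a string and pass it as digest_text together with the local path and the listed file name.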

Notes:

(1) Some models have restrictive licenses. In particular, OPT and LLAMA cannot be used commercially. BLOOM and Stable Diffusion can be used commercially but have use limitations.
(2) For the larger models, the unquantized version is not provided when it is too large for consumer GPUs or when the quantized version gives the same performance as the unquantized version.
(3) The q8 suffix indicates that the model was 8-bit quantized. The q4 suffix indicates that the model was 4-bit quantized. Unquantized models use either float16 or bfloat16 parameters.
(4) The second column of each model table gives the approximate amount of CPU or GPU RAM, in GB, needed to run the model. It is also the approximate size of the model file.
(5) The lambada perplexity (ppl) is comparable only between models using the same tokenizer, so the lambada accuracy (acc) should be used when comparing all models.
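
The suffix convention and the memory figures can be related with simple arithmetic: the weights alone take roughly (number of parameters) x (bits per parameter) / 8 bytes. The sketch below applies that lower bound, assuming 16 bits per parameter for unquantized float16 or bfloat16 models. It ignores quantization scales, activations, and any other overhead, so the values listed in the tables are expected to be somewhat higher. Parameter counts are supplied by the caller, and the function names are illustrative.

```python
def bits_per_param(model_name):
    """Map the q8/q4 suffixes described in the notes to a bit width.

    Other suffixes that appear in the table (such as _bf4 and _bf8) are not
    described in the notes and are not handled here; any name without a
    _q8 or _q4 suffix is treated as a 16-bit float16/bfloat16 model.
    """
    if model_name.endswith("_q8"):
        return 8
    if model_name.endswith("_q4"):
        return 4
    return 16


def weights_gb(n_params, model_name):
    """Lower-bound size of the weights in GB (10^9 bytes), overhead excluded."""
    return n_params * bits_per_param(model_name) / 8 / 1e9


# Nominal parameter counts taken from the model names.
for name, n in [("gptj_6B", 6e9), ("gptneox_20B_q4", 20e9), ("llama_65B_q4", 65e9)]:
    print(f"{name}: at least {weights_gb(n, name):.1f} GB for the weights")
```

For gptneox_20B_q4 this gives a floor of 10 GB, against 13 GB listed in the table, which is consistent with the estimate being a lower bound.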

Fabrice Bellard - https://bellard.org/

