4.1 Low-Precision Training and Downloading Large Models


Contents

1 Background

1.1 What makes large-model training hard?

1.2 GPU memory usage during training

1.3 How to reduce memory usage during training

1.4 Memory footprint of the model itself

2 Downloading models

2.1 Setting up the environment

2.2 Downloading from ModelScope

3 Loading a model (using ChatGLM as an example)


1 Background

1.1 What makes large-model training hard?

1.2 GPU memory usage during training

1.3 How to reduce memory usage during training

1.4 Memory footprint of the model itself

The memory-saving techniques we covered earlier all target the memory consumed during training. So how do we reduce the memory footprint of the model itself? That is where low precision comes in.
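For intuition, here is a minimal sketch of my own (the 6B parameter count is an illustrative assumption, roughly the size of ChatGLM2-6B) estimating how much memory the weights alone occupy at different precisions:

def param_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to store the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3

num_params = 6e9  # roughly a ChatGLM2-6B-sized model (illustrative assumption)

for name, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name:>9}: {param_memory_gib(num_params, nbytes):.1f} GiB")
# fp32 ~ 22.4 GiB, fp16 ~ 11.2 GiB, int8 ~ 5.6 GiB, int4 ~ 2.8 GiB

Dropping from fp32 to fp16 already halves the weight memory, which is exactly the kind of saving the following sections rely on when downloading and loading checkpoints.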

2 Downloading models

You can think of ModelScope as a transformers-style library that is easier to reach from within China. It is nowhere near as complete as transformers, but it works perfectly well for downloading models.

2.1 Setting up the environment

pip install modelscope jupyterlab

2.2 Downloading from ModelScope

from modelscope.hub.snapshot_download import snapshot_download

# model_id: the model's id on ModelScope
# cache_dir: local cache directory
# ignore_file_pattern: files that should not be downloaded
snapshot_download(
    model_id="Shanghai_AI_Laboratory/internlm-20b",
    cache_dir=r"F:\Modelscope_models",  # raw string so the Windows backslash is not treated as an escape
)
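A small follow-up sketch of my own (assuming the usual modelscope behavior that snapshot_download returns the local directory the files were downloaded into), so the path can be captured and reused when loading the model later:

model_dir = snapshot_download(
    model_id="Shanghai_AI_Laboratory/internlm-20b",
    cache_dir=r"F:\Modelscope_models",
)
print(model_dir)  # the local directory containing the downloaded weights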

3 Loading a model (using ChatGLM as an example)

Because transformers has no official implementation of ChatGLM, you have to pass trust_remote_code=True so that the custom modeling code shipped with the checkpoint is loaded.

Hugging Face transformers can load models downloaded from ModelScope, but not every model works this way.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("d:/Pretrained_models/ZhipuAI/chatglm2-6b/", trust_remote_code=True)
tokenizer

ChatGLMTokenizer(name_or_path='d:/Pretrained_models/ZhipuAI/chatglm2-6b/', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
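A quick usage sketch of my own, continuing with the tokenizer loaded above (the example sentence is arbitrary):

text = "你好,请介绍一下你自己。"
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"].shape)                 # torch.Size([1, sequence_length])
print(tokenizer.decode(inputs["input_ids"][0]))  # should recover the original sentence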

model = AutoModel.from_pretrained("d:/Pretrained_models/ZhipuAI/chatglm2-6b/", trust_remote_code=True)
model
ChatGLMForConditionalGeneration(
  (transformer): ChatGLMModel(
    (embedding): Embedding(
      (word_embeddings): Embedding(65024, 4096)
    )
    (rotary_pos_emb): RotaryEmbedding()
    (encoder): GLMTransformer(
      (layers): ModuleList(
        (0): GLMBlock(
          (input_layernorm): RMSNorm()
          (self_attention): SelfAttention(
            (query_key_value): Linear(in_features=4096, out_features=4608, bias=True)
            (core_attention): CoreAttention(
              (attention_dropout): Dropout(p=0.0, inplace=False)
            )
            (dense): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (post_attention_layernorm): RMSNorm()
          (mlp): MLP(
            (dense_h_to_4h): Linear(in_features=4096, out_features=27392, bias=False)
            (dense_4h_to_h): Linear(in_features=13696, out_features=4096, bias=False)
          )
        )
        (1)-(27): 27 further GLMBlock modules, identical to (0)
      )
      (final_layernorm): RMSNorm()
    )
    (output_layer): Linear(in_features=4096, out_features=65024, bias=False)
  )
)
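To tie this back to the low-precision theme of the chapter, a hedged sketch of my own: the same checkpoint can be loaded with its weights kept in fp16 via the standard torch_dtype argument of from_pretrained, which roughly halves the weight memory compared with fp32 (the local path is the one used above):

import torch
from transformers import AutoModel

model_fp16 = AutoModel.from_pretrained(
    "d:/Pretrained_models/ZhipuAI/chatglm2-6b/",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # keep the weights in half precision
)
print(next(model_fp16.parameters()).dtype)  # expected: torch.float16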
Note: this article is reposted from the blog.csdn.net post by 笨笨sg: https://blog.csdn.net/a131529/article/details/143018921. Copyright remains with the original author.