How to convert markdown to wxml

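I wanted to build a mini program version of my technical blog. The blog runs on hexo + github-page, and the posts are of course written in markdown, the format tech people love. Mini programs, however, use their own template language, WXML. Converting the existing posts to wxml by hand was out of the question, and I could not find a mature conversion library online, so I decided to write a parser myself.

The core of a parser is string pattern matching, and string matching means regular expressions. Fortunately, regular expressions are one of my strengths.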

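Regular expressions

Regular expressions in JavaScript

The JavaScript regular expression features the parser relies on

RegExp constructor properties. Of these, lastMatch and rightContext are especially useful when slicing strings. They are legacy, non-standard features that engines keep for compatibility.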

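Long name	Short name	Replacement pattern	Description
input	$_	-	The string most recently matched against. Not implemented in Opera
lastMatch	$&	$&	The most recent match. Not implemented in Opera
lastParen	$+	-	The most recently matched capture group. Not implemented in Opera
leftContext	$`	$`	The text before lastMatch in the input string
rightContext	$'	$'	The text after lastMatch in the input string
multiline	$*	-	Boolean indicating whether all expressions use multiline mode. Not implemented in IE and Opera
-	$n	$n	The nth capture group
-	-	$$	An escaped (literal) $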

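The test method and the RegExp constructor

Once test has been called, the properties above are populated on RegExp. The short property names are not recommended because they hurt readability. Here is an example: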

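var text = "this has been a short summer";
var pattern = /(.)hort/g;
if (pattern.test(text)) {
    console.log(RegExp.input);        // this has been a short summer
    console.log(RegExp.leftContext);  // this has been a
    console.log(RegExp.rightContext); //  summer
    console.log(RegExp.lastMatch);    // short
    console.log(RegExp.lastParen);    // s
    console.log(RegExp.multiline);    // false (undefined in engines that have dropped this property)
}

// Every long property name can be replaced by its short name. Most short names are not
// valid ECMAScript identifiers, so they must be accessed with bracket syntax.
// Reset lastIndex first: with the g flag, a second test() would resume after "short" and fail.
pattern.lastIndex = 0;
if (pattern.test(text)) {
    console.log(RegExp.$_);
    console.log(RegExp["$`"]);
    console.log(RegExp["$'"]);
    console.log(RegExp["$&"]);
    console.log(RegExp["$+"]);
    console.log(RegExp["$*"]);
}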
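
Since the static RegExp properties are legacy features, it can help to see how the same information is available from an ordinary match object. The following is a minimal sketch, not part of the original parser; `matchContext` is a hypothetical helper name introduced here for illustration.

```javascript
// Hypothetical helper (not from the article): derives the equivalents of
// RegExp.lastMatch, RegExp.leftContext and RegExp.rightContext from the
// array returned by exec(), without touching the static properties.
function matchContext(pattern, input) {
  const m = pattern.exec(input);
  if (m === null) return null;
  const end = m.index + m[0].length;
  return {
    lastMatch: m[0],                      // the matched text
    groups: m.slice(1),                   // capture groups, like $1..$n
    leftContext: input.slice(0, m.index), // text before the match
    rightContext: input.slice(end),       // text after the match
  };
}

const ctx = matchContext(/(.)hort/, "this has been a short summer");
console.log(ctx);
```

The rest of the article keeps using the static properties, as the original parser does; this variant is only an alternative for code that wants to avoid them.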

The replace method

The simple form without a callback is what you usually see, but the callback form is the heavy weapon and is extremely powerful

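// Simple replacement. By default replace substitutes only the first match. With the global flag,
// every matching substring is replaced and the fully substituted string is returned.
var regex = /(\d{4})-(\d{2})-(\d{2})/;
"2011-11-11".replace(regex, "$2/$3/$1"); // "11/11/2011"

// Custom replacement with a callback. Use the global flag g so that matching keeps moving forward
// through the whole string; without it only the first match reaches the callback.
// match is the text just matched, index is its position in the string, str is the original string.
// With capture groups, the group contents come in between: match, group1 ... groupN, index, str
"one two three".replace(/\bt[a-zA-Z]+\b/g, function (match, index, str) {
    // Uppercase every word that starts with "t"
    console.log(match, index, str);
    return match.toUpperCase();
}); // "one TWO THREE"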
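
To make the argument order of the callback concrete when capture groups are present, here is a small sketch. The function name `reformatDates` and the sample text are made up for this illustration.

```javascript
// Rewrites every ISO-style date (YYYY-MM-DD) in a string as MM/DD/YYYY.
// The callback receives: the full match, then one argument per capture
// group, then the match index, then the original string.
function reformatDates(text) {
  const isoDate = /(\d{4})-(\d{2})-(\d{2})/g;
  return text.replace(isoDate, (match, year, month, day, index) => {
    console.log(`found ${match} at index ${index}`);
    return `${month}/${day}/${year}`;
  });
}

console.log(reformatDates("from 2010-11-10 to 2012-12-12"));
```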

The match method

Global and non-global mode behave very differently. In global mode match works much like the exec method.

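// If the argument is a substring or a regex without the global flag, match() performs a single
// match from the start. It returns null when nothing matches. Otherwise it returns an array whose
// element 0 is the matched text, followed by any capture groups, plus the properties index and
// input, which hold the start index of the match and the original string.
var str = '1a2b3c4d5e';
console.log(str.match(/b/)); // ["b", index: 3, input: "1a2b3c4d5e"]

// If the argument is a regex with the global flag, match() keeps matching from the start to the end.
// It returns null when nothing matches. Otherwise it returns an array of all matching substrings,
// without the index and input properties and without capture groups.
str.match(/h/g);  // null
str.match(/\d/g); // ["1", "2", "3", "4", "5"]

var pattern = /\d{4}-\d{2}-\d{2}/g;
var str = "2010-11-10 2012-12-12";
var matchArray = str.match(pattern);
for (var i = 0; i < matchArray.length; i++) {
    console.log(matchArray[i]);
}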

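The exec method

exec resembles match in global mode but is more powerful: its result carries full match information, whereas global match returns only the matched substrings.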

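// Extract matches one at a time, including capture groups. This requires the global flag g.
// Each call returns an array (with the capture groups) on success, or null.
// After every successful match the regex stores the end position in lastIndex,
// and the next call starts matching from lastIndex.
// Without the global flag, this while loop would never terminate.
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;
var str2 = "2011-11-11 2013-13-13";
var matchArray;
while ((matchArray = pattern.exec(str2)) != null) {
    console.log("date: " + matchArray[0] + " starts at: " + matchArray.index + " ends at: " + pattern.lastIndex);
    console.log(",year: " + matchArray[1]);
    console.log(",month: " + matchArray[2]);
    console.log(",day: " + matchArray[3]);
}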
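
For comparison, String.prototype.matchAll (ES2020) yields the same per-match details as the exec loop without managing lastIndex by hand. This is a sketch only; `findDates` is a name introduced here, and the parser in this article does not use matchAll.

```javascript
// Collects every date-like match together with its position and groups.
// matchAll requires a regex with the g flag and returns an iterator of
// match arrays, each shaped like the result of exec().
function findDates(text) {
  const pattern = /(\d{4})-(\d{2})-(\d{2})/g;
  return Array.from(text.matchAll(pattern), (m) => ({
    text: m[0],
    start: m.index,
    end: m.index + m[0].length,
    year: m[1],
    month: m[2],
    day: m[3],
  }));
}

console.log(findDates("2011-11-11 2013-12-13"));
```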

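The two simpler methods, search and split, are not covered here

Advanced regular expression concepts

Normally a regex matches one character at a time from left to right, moving forward after every matched character until the whole string has been consumed. That is the matching process: the regex matches characters and consumes them. The basics are not repeated here. What follows is a concept that differs from character matching: assertions.

Assertions

Most regex constructs match characters. Assertions are different: they match no characters and consume none, and only check whether the text to the left or right of a position meets a condition. Elements that match positions like this can be called "anchors", and they fall into three groups: word boundaries, start and end positions, and lookaround.

The word boundary \b is a position with a word character on one side and a non-word character on the other, as the following sample strings show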

\brow\b   //row
\brow     //row, rowdy
row\b     //row, tomorrow
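
The three patterns above can be checked directly. This is a small sketch; the word list and the helper name `matching` are chosen for this illustration.

```javascript
const words = ["row", "rowdy", "tomorrow"];

// Returns the words that the pattern matches somewhere.
// The patterns carry no g flag, so test() keeps no state between calls.
function matching(pattern) {
  return words.filter((w) => pattern.test(w));
}

console.log(matching(/\brow\b/)); // [ 'row' ]
console.log(matching(/\brow/));   // [ 'row', 'rowdy' ]
console.log(matching(/row\b/));   // [ 'row', 'tomorrow' ]
```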

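^ matches the start of the input; in multiline mode it also matches the position right after each line break, i.e. the start of a line
$ matches the end of the input; in multiline mode it also matches the position right before each line break, i.e. the end of a line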

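// In js, $ matches only the very end of the string; it does not also match before a trailing
// line break. With multiline mode (m) enabled, ^ and $ also match around the line breaks in the
// middle. The examples below confirm this:

// With only the global flag, ^ and $ match the very start and end of the text and ignore the line break in between
'hello\nword'.replace(/^|$/g,'<p>')
"<p>hello
word<p>"

// In multiline mode, the positions around the inner line breaks match as well
'hello\nword\nhi'.replace(/^|$/mg,'<p>')
"<p>hello<p>
<p>word<p>
<p>hi<p>"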
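
Building on the multiline behaviour, a capture group between ^ and $ makes it possible to wrap each line in a proper pair of tags. This is an illustrative sketch; `wrapLines` is not part of the parser described in this article.

```javascript
// Wraps every non-empty line in <p>...</p>.
// With the m flag, ^ and $ match at each line start and end; since "."
// does not match line breaks, (.+) captures exactly one line of text.
function wrapLines(text) {
  return text.replace(/^(.+)$/gm, "<p>$1</p>");
}

console.log(wrapLines("hello\nword\nhi"));
```

Empty lines are left untouched because `.+` needs at least one character.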

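Lookaround

Lookaround is the most powerful kind of assertion. Like the boundary symbols \b, ^ and $, it consumes no characters and extracts none, and matches only a specific position in the text. It is more powerful because it lets you pick a position and attach a condition that is verified looking forward or backward from that position.

The defining trait of lookaround is that it takes up no width (it consumes no matched characters), which is why it is also called a zero-width assertion. "Zero width" can be understood like this:
- whatever the lookaround matches is not included in the result;
- the text a lookaround has examined can still be matched by whatever comes next.
Lookaround comes in two directions, lookahead and lookbehind; javascript has supported lookbehind only since ES 2018
- (?=) positive lookahead, checks the text to the right
- (?!) negative lookahead
- (?<=) positive lookbehind, checks the text to the left
- (?<!) negative lookbehind
Here are some concrete examples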
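```
// Get the file name of a .exe file. No capture group is needed: because the lookahead
// matches without consuming anything, the result excludes the .exe suffix
'asd.exe'.match(/.+(?=\.exe)/)
=> ["asd", index: 0, input: "asd.exe", groups: undefined]

// Negative lookahead variant: match html tags except the specific tags p/a/img
</?(?!p|a|img)([^> /]+)[^>]*/?>

// Plain lookbehind, again relying on the fact that lookaround consumes no characters
/(?<=\$)\d+/.exec('Benjamin Franklin is on the $100 bill')  // ["100", index: 29, ...]
/(?<!\$)\d+/.exec('it’s is worth about €90')                // ["90", index: 21, ...]

// Lookaround matches a position without consuming characters: insert thousands separators
'12345678'.replace(/\B(?=(\d{3})+$)/g , ',')
=> "12,345,678"
```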
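
One detail of the tag-exclusion pattern is worth checking: written as `(?!p|a|img)`, the lookahead rejects any tag whose name merely starts with p or a, such as pre. The sketch below compares it with a stricter variant that adds a word boundary; the names `loose`, `strict` and `tagName` are made up for this comparison, and the stricter pattern is a suggestion rather than something from the original parser.

```javascript
// Pattern as written in the example: excludes every tag name that
// begins with "p", "a" or "img".
const loose = /<\/?(?!p|a|img)([^> \/]+)[^>]*\/?>/;

// Variant with \b inside the lookahead: excludes only the exact tag
// names p, a and img.
const strict = /<\/?(?!(?:p|a|img)\b)([^> \/]+)[^>]*\/?>/;

// Returns the captured tag name, or null when the pattern does not match.
function tagName(pattern, html) {
  const m = pattern.exec(html);
  return m ? m[1] : null;
}

console.log(tagName(loose, "<pre>"));           // null
console.log(tagName(strict, "<pre>"));          // pre
console.log(tagName(strict, "<p class='x'>"));  // null
console.log(tagName(strict, "<div id='x'>"));   // div
```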

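Writing the parser

That was a lot about regular expressions, but sharpening the axe does not delay the woodcutting. On to the main topic.

The markdown format

The markdown files generated by hexo look like the sample below. The parser's job is to turn such a file into json output that the mini program can render as wxml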

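---
title: Learning Haskell - functor
date: 2018-08-15 21:27:15
tags: [haskell]
categories: Technology
banner: https://upload-images.jianshu.io/upload_images/127924-be9013350ffc4b88.jpg
---
<!-- Original post: [Learning Haskell - functor](https://edwardzhong.github.io/2018/08/15/haskellc/) -->
## What is a Functor
A **functor** is an object that supports the map operation. It is like an expression with extra semantics attached, and a box is a good metaphor for it. The definition of **functor** can be read like this: given a function from a to b and a box holding an a, the result is a box holding a b. **fmap** can be seen as a function that takes a function and a **functor** and applies the function to every element of the **functor** (a mapping).
```haskell
-- Definition of Functor
class Functor f where
    fmap :: (a -> b) -> f a -> f b
```
<!-- more -->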

Entry point

Use node for the file handling, then call the parser to generate the json files

const { readdirSync, readFileSync, writeFile } = require("fs");
const path = require("path");
const parse = require("./parse");

const files = readdirSync(path.join(__dirname, "posts"));
for (let p of files) {
    // Read as a utf8 string so the parser works on text rather than a Buffer
    let md = readFileSync(path.join(__dirname, "posts", p), "utf8");
    const objs = parse(md);
    writeFile(path.join(__dirname, "json", p.replace('.md', '.json')), JSON.stringify(objs), function (err) {
        err && console.log(err);
    });
}

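Now for the entry point of the parser itself. It handles three kinds of content: the summary block, code blocks, and markdown text. Comments and whitespace in the text are filtered out, but comments inside code blocks are kept.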

module.exports = function analyze(str) {
    let ret = { summary: {}, lines: [] };
    while (str) {
        // Whitespace
        if (/^([\s\t\r\n]+)/.test(str)) {
            str = RegExp.rightContext;
        }
        // summary block
        if (/^(\-{3})[\r\n]?([\s\S]+?)\1[\r\n]?/.test(str)) {
            str = RegExp.rightContext;
            ret.summary = summaryParse(RegExp.$2);
            ret.num = new Date(ret.summary.date).getTime();
        }
        // code
        if (/^`{3}(\w+)?([\s\S]+?)`{3}/.test(str)) {
            const codeStr = RegExp.$2 || RegExp.$1;
            const fn = (RegExp.$2 && codeParse[RegExp.$1]) ? codeParse[RegExp.$1] : codeParse.javascript;
            str = RegExp.rightContext;
            ret.lines.push({ type: "code", child: fn(codeStr) });
        }
        // Comment line
        if (/^<!--[\s\S]*?-->/.test(str)) {
            str = RegExp.rightContext;
        }
        // Extract one line of text, relying on the fact that . does not match line breaks
        if (/^(.+)[\r\n]?/.test(str)) {
            str = RegExp.rightContext;
            ret.lines.push(textParse(RegExp.$1));
        }
    }
    return ret;
};
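
The entry point calls summaryParse, whose implementation is not shown in this article. The following is only a guess at what such a function could look like, assuming the summary block holds simple `key: value` lines and bracketed lists as in the sample above; the real implementation may differ.

```javascript
// Hypothetical sketch of summaryParse (the article does not show the
// original). Turns "key: value" lines into an object; values written
// as [a, b] become arrays of strings.
function summaryParse(block) {
  const summary = {};
  for (const line of block.split(/\r?\n/)) {
    // Split at the first colon only, so URLs and times stay intact
    const m = /^(\w+):\s*(.*)$/.exec(line);
    if (!m) continue;
    const key = m[1];
    const value = m[2].trim();
    const list = /^\[(.*)\]$/.exec(value);
    summary[key] = list
      ? list[1].split(",").map((s) => s.trim()).filter(Boolean)
      : value;
  }
  return summary;
}

const sample = [
  "title: Learning Haskell - functor",
  "date: 2018-08-15 21:27:15",
  "tags: [haskell]",
  "banner: https://upload-images.jianshu.io/upload_images/127924-be9013350ffc4b88.jpg",
].join("\n");
console.log(summaryParse(sample));
```

Keeping date as a plain string matches how the entry point uses it, since it passes `ret.summary.date` to `new Date(...)`.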

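Extracting text content

Extracting the summary block is simple, so it is skipped here. Let's look at parsing the markdown text instead. The parser matches the common markdown constructs such as lists, headings (h), links (a) and images (img). The result is a list whose items may contain nested sub-lists. In essence, regular expressions extract content piece by piece until the whole line has been consumed.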

function textParse(s) {
    const trans = /^\\(\S)/; // escaped character
    const italy = /^(\*)(.+?)\1/; // italic
    const bold = /^(\*{2})(.+?)\1/; // bold
    const italyBold = /^(\*{3})(.+?)\1/; // bold italic
    const headLine = /^(\#{1,6})\s+/; // h1-6
    const unsortList = /^([*\-+])\s+/; // unordered list
    const sortList = /^(\d+)\.\s+/; // ordered list
    const link = /^\*?\[(.+)\]\(([^()]+)\)\*?/; // link
    const img = /^(?:!\[([^\]]+)\]\(([^)]+)\)|<img(\s+)src="([^"]+)")/; // image
    const text = /^[^\\\s*]+/; // plain text

    if (headLine.test(s)) return { type: "h" + RegExp.$1.length, text: RegExp.rightContext };
    if (sortList.test(s)) return { type: "sl", num: RegExp.$1, child: lineParse(RegExp.rightContext) };
    if (unsortList.test(s)) return { type: "ul", num: RegExp.$1, child: lineParse(RegExp.rightContext) };
    if (img.test(s)) return { type: "img", src: RegExp.$2 || RegExp.$4, alt: RegExp.$1 || RegExp.$3 };
    if (link.test(s)) return { type: "link", href: RegExp.$2, text: RegExp.$1 };
    return { type: "text", child: lineParse(s) };

    function lineParse(line) {
        let ws = [];
        while (line) {
            const before = line;
            if (/^[\s]+/.test(line)) {
                ws.push({ type: "text", text: "&nbsp;" });
                line = RegExp.rightContext;
            }
            if (trans.test(line)) {
                ws.push({ type: "text", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (sortList.test(line)) {
                return { child: lineParse(RegExp.rightContext) };
            }
            if (unsortList.test(line)) {
                return { child: lineParse(RegExp.rightContext) };
            }
            if (link.test(line)) {
                ws.push({ type: "link", href: RegExp.$2, text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (italyBold.test(line)) {
                ws.push({ type: "italybold", text: RegExp.$2 });
                line = RegExp.rightContext;
            }
            if (bold.test(line)) {
                ws.push({ type: "bold", text: RegExp.$2 });
                line = RegExp.rightContext;
            }
            if (italy.test(line)) {
                ws.push({ type: "italy", text: RegExp.$2 });
                line = RegExp.rightContext;
            }
            if (text.test(line)) {
                ws.push({ type: "text", text: RegExp.lastMatch });
                line = RegExp.rightContext;
            }
            // Guard: if no rule consumed anything (e.g. an unpaired "*" or a trailing "\"),
            // emit one character as text so the loop always makes progress
            if (line === before) {
                ws.push({ type: "text", text: line[0] });
                line = line.slice(1);
            }
        }
        return ws;
    }
}
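
The consume-a-prefix loop used in lineParse can also be written as a rule table, without the static RegExp properties. The sketch below is a simplified stand-in, not the original parser: it covers only bold, italic and plain text, and the names `inlineRules`, `firstMatch` and `tokenizeInline` are introduced here for illustration. The token types follow the naming used in this article.

```javascript
// Rules are tried in order and the first one that matches at the start
// of the remaining text wins, so *** must come before ** and *.
const inlineRules = [
  { type: "italybold", re: /^\*{3}(.+?)\*{3}/ },
  { type: "bold", re: /^\*{2}(.+?)\*{2}/ },
  { type: "italy", re: /^\*(.+?)\*/ },
  { type: "text", re: /^([^*]+)/ },
];

// Returns the first rule that matches the start of `rest`, or null.
function firstMatch(rest) {
  for (const { type, re } of inlineRules) {
    const m = re.exec(rest);
    if (m) return { type, m };
  }
  return null;
}

// Consumes the line from the left until nothing remains.
function tokenizeInline(line) {
  const tokens = [];
  let rest = line;
  while (rest) {
    const hit = firstMatch(rest);
    if (hit) {
      tokens.push({ type: hit.type, text: hit.m[1] });
      rest = rest.slice(hit.m[0].length);
    } else {
      // No rule matched (e.g. an unpaired "*"): emit one character as
      // plain text so the loop always makes progress.
      tokens.push({ type: "text", text: rest[0] });
      rest = rest.slice(1);
    }
  }
  return tokens;
}

console.log(tokenizeInline("a **b** *c* 2*3"));
```

The fallback branch is the important design choice: a tokenizer that only advances on a successful match can loop forever on input that no rule accepts.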

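Rendering code blocks

Parsing only the text content would be easy, but a technical blog cannot do without code blocks. To colour the key tokens of the code and make it easier to read, the parsing has to go further. I wrote a parser for nearly every language my blog currently uses. Some of them can be shared: the style function, for example, works not only for css but also for similar preprocessor languages such as scss and less, and html likewise applies to similar markup languages.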

const codeParse = {
    haskell(str) {},
    javascript(str) {},
    html: html,
    css: style
};
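
One way to share a parser between similar languages is a small alias table placed in front of the parser lookup. The sketch below uses stub parsers and hypothetical names (`parsers`, `aliases`, `resolveParser`); the original code instead looks up codeParse directly and falls back to the javascript parser.

```javascript
// Stub parsers standing in for the real ones; each just reports which
// parser handled the source.
const parsers = new Map([
  ["javascript", (src) => ({ parsedBy: "javascript", src })],
  ["css", (src) => ({ parsedBy: "css", src })],
  ["html", (src) => ({ parsedBy: "html", src })],
]);

// Languages that can reuse an existing parser (assumed mapping).
const aliases = new Map([
  ["js", "javascript"],
  ["scss", "css"],
  ["less", "css"],
  ["xml", "html"],
]);

// Resolves a fence language to a parser, falling back to javascript
// for unknown or missing languages, like the entry point does.
function resolveParser(lang) {
  const name = String(lang || "").toLowerCase();
  const key = aliases.get(name) || name;
  return parsers.get(key) || parsers.get("javascript");
}

console.log(resolveParser("scss")(".a { color: red }").parsedBy); // css
console.log(resolveParser("ruby")("puts 1").parsedBy);            // javascript
```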

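Let's look at the JavaScript parser, which is fairly representative. It does not split the text into an array of lines on the line break (\n), because some constructs have to be inferred across several lines, such as block comments and method names. The whole block of text has to be matched and consumed bit by bit with regular expressions. The final result resembles the text parsing result above, a nested list, whose types are language keywords, common built-in methods, strings, numbers, special symbols and so on.

This parser could be extended and abstracted one step further into a basic framework for the C family of languages. Passing in the regular expression rules of a given language would then be enough to parse other languages such as C#, java, C++ and GO.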

javascript(str) {
    const comReg = /^\/{2,}.*/;
    const keyReg = /^(import|from|extends|new|var|let|const|return|if|else|switch|case|break|continue|of|for|in|Array|Object|Number|Boolean|String|RegExp|Date|Error|undefined|null|true|false|this|alert|console)(?=([\s.,;(]|$))/;
    const typeReg = /^(window|document|location|sessionStorage|localStorage|Math|this)(?=[,.;\s])/;
    const regReg = /^\/\S+\/[gimuys]?/;
    const sysfunReg = /^(forEach|map|filter|reduce|some|every|splice|slice|split|shift|unshift|push|pop|substr|substring|call|apply|bind|match|exec|test|search|replace)(?=[\s\(])/;
    const funReg = /^(function|class)\s+(\w+)(?=[\s({])/;
    const methodReg = /^(\w+?)\s*?(\([^()]*\)\s*?{)/;
    const symbolReg = /^([!><?|\^$&~%*/+\-]+)/;
    // Note: inside a character class \1 is not a backreference; the lazy quantifier
    // followed by the real \1 is what stops the match at the closing quote
    const strReg = /^([`'"])([^\1]*?)\1/;
    const numReg = /^(\d+\.\d+|\d+)(?!\w)/;

    // Split a block comment into one "comm" entry per line
    const parseComment = s => {
        const ret = [];
        const lines = s.split(/[\r\n]/g);
        for (let line of lines) {
            ret.push({ type: "comm", text: line });
        }
        return ret;
    };

    let ret = [];
    while (str) {
        // Block comment, possibly spanning several lines
        if (/^\s*\/\*([\s\S]+?)\*\//.test(str)) {
            str = RegExp.rightContext;
            const coms = parseComment(RegExp.lastMatch);
            ret = ret.concat(coms);
        }
        // One line of code
        if (/^(?!\/\*).+/.test(str)) {
            str = RegExp.rightContext;
            ret.push({ type: "text", child: lineParse(RegExp.lastMatch) });
        }
        // Line breaks
        if (/^[\r\n]+/.test(str)) {
            str = RegExp.rightContext;
            ret.push({ type: 'text', text: RegExp.lastMatch });
        }
    }
    return ret;

    function lineParse(line) {
        let ws = [];
        while (line) {
            const before = line;
            if (/^([\s\t\r\n]+)/.test(line)) {
                ws.push({ type: "text", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (comReg.test(line)) {
                ws.push({ type: "comm", text: line });
                break;
            }
            if (regReg.test(line)) {
                ws.push({ type: "fun", text: RegExp.lastMatch });
                line = RegExp.rightContext;
            }
            if (symbolReg.test(line)) {
                ws.push({ type: "keyword", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (keyReg.test(line)) {
                ws.push({ type: "keyword", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (funReg.test(line)) {
                ws.push({ type: "keyword", text: RegExp.$1 });
                ws.push({ type: "text", text: "&nbsp;" });
                ws.push({ type: "fun", text: RegExp.$2 });
                line = RegExp.rightContext;
            }
            if (methodReg.test(line)) {
                ws.push({ type: "fun", text: RegExp.$1 });
                ws.push({ type: "text", text: "&nbsp;" });
                ws.push({ type: "text", text: RegExp.$2 });
                line = RegExp.rightContext;
            }
            if (typeReg.test(line)) {
                ws.push({ type: "fun", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (sysfunReg.test(line)) {
                ws.push({ type: "var", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (strReg.test(line)) {
                ws.push({ type: "var", text: RegExp.$1 + RegExp.$2 + RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (numReg.test(line)) {
                ws.push({ type: "var", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (/^\w+/.test(line)) {
                ws.push({ type: "text", text: RegExp.lastMatch });
                line = RegExp.rightContext;
            }
            if (/^[^`'"!><?|\^$&~%*/+\-\w]+/.test(line)) {
                ws.push({ type: "text", text: RegExp.lastMatch });
                line = RegExp.rightContext;
            }
            // Guard: if no rule consumed anything (e.g. an unclosed quote),
            // emit one character as text so the loop always makes progress
            if (line === before) {
                ws.push({ type: "text", text: line[0] });
                line = line.slice(1);
            }
        }
        return ws;
    }
}
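
The idea of passing language-specific rules into a shared framework could be sketched as follows. This is an assumption about how such a generalisation might look, not code from the original parser: `makeTokenizer`, `goRules` and the tiny Go rule set are invented for illustration, and the token types (comm, keyword, var, text) reuse the names from the JavaScript parser above.

```javascript
// Builds a tokenizer from an ordered list of { type, re } rules.
// Every regex must use the sticky flag (y) so that it matches exactly
// at the current position instead of searching ahead.
function makeTokenizer(rules) {
  return function tokenize(source) {
    const tokens = [];
    let pos = 0;
    while (pos < source.length) {
      let matched = false;
      for (const { type, re } of rules) {
        re.lastIndex = pos; // sticky regexes start matching at lastIndex
        const m = re.exec(source);
        if (m && m[0].length > 0) {
          tokens.push({ type, text: m[0] });
          pos += m[0].length;
          matched = true;
          break;
        }
      }
      if (!matched) {
        // Unknown character: emit it as plain text and move on
        tokens.push({ type: "text", text: source[pos] });
        pos += 1;
      }
    }
    return tokens;
  };
}

// A deliberately small, incomplete rule set for Go (illustrative only).
const goRules = [
  { type: "comm", re: /\/\/.*/y },
  { type: "var", re: /"(?:[^"\\]|\\.)*"/y },
  { type: "keyword", re: /(?:func|package|import|return|var|if|else|for)\b/y },
  { type: "var", re: /\d+(?:\.\d+)?/y },
  { type: "text", re: /\w+/y },
  { type: "text", re: /\s+/y },
];

const tokenizeGo = makeTokenizer(goRules);
console.log(tokenizeGo("return 42 // done"));
```

Supporting another C-like language would then mostly be a matter of supplying a different rule list, at the cost of losing the cross-line heuristics of the hand-written parser.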

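Rendering WXML

Finally, running the parser produces the json file for each markdown post. Load the json into the cloud database of the WeChat mini program and the mini program takes care of the rest. Below is the rendering part, written in jsx with taro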

<View className='article'>
    {lines.map(l => (
        <Block>
            <View className='line'>
                {l.type.search("h") == 0 && ( <Text className={l.type}>{l.text}</Text> )}
                {l.type == "link" && ( <Navigator className='link' url={l.href}> {l.text} </Navigator> )}
                {l.type == "img" && ( <Image className='pic' mode='widthFix' src={l.src} /> )}
                {l.type == "sl" && (
                    <Block>
                        <Text decode className='num'> {l.num}.{" "} </Text>
                        <TextChild list={l.child} />
                    </Block>
                )}
                {l.type == "ul" && (
                    <Block>
                        <Text decode className='num'> {" "} &bull;{" "} </Text>
                        <TextChild list={l.child} />
                    </Block>
                )}
                {l.type == "text" && l.child.length && ( <TextChild list={l.child} /> )}
            </View>
            {l.type == "code" && (
                <View className='code'>
                    {l.child.map(c => (
                        <View className='code-line'>
                            {c.type == 'comm' && <Text decode className='comm'> {c.text} </Text>}
                            {/* Line-break entries have no child list, so check before mapping */}
                            {c.type == 'text' && c.child && c.child.map(i => (
                                <Block>
                                    {i.type == "comm" && ( <Text decode className='comm'> {i.text} </Text> )}
                                    {i.type == "keyword" && ( <Text decode className='keyword'> {i.text} </Text> )}
                                    {i.type == "var" && ( <Text decode className='var'> {i.text} </Text> )}
                                    {i.type == "fun" && ( <Text decode className='fun'> {i.text} </Text> )}
                                    {i.type == "text" && ( <Text decode className='text'> {i.text} </Text> )}
                                </Block>
                            ))}
                        </View>
                    ))}
                </View>
            )}
        </Block>
    ))}
</View>
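
For readers who are not using taro, the same json could in principle be turned into a static wxml string. The sketch below is an illustration under that assumption and is not how the original project renders: `toWxml`, `renderLine`, `renderInline` and the escape helpers are hypothetical names, only headings, links, images and inline text are handled, and the class names mirror the jsx above.

```javascript
// Escapes text content, leaving existing entities such as &nbsp; intact
// because the parser emits them on purpose.
function escapeText(s) {
  return String(s)
    .replace(/&(?!#?\w+;)/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Escapes a value placed inside a double-quoted attribute.
function escapeAttr(s) {
  return escapeText(s).replace(/"/g, "&quot;");
}

// Renders inline tokens ({ type, text }) as decoded <text> nodes.
function renderInline(children) {
  return children
    .map((c) => `<text decode="{{true}}" class="${c.type}">${escapeText(c.text)}</text>`)
    .join("");
}

// Renders one parsed line; unsupported types produce an empty string.
function renderLine(l) {
  if (/^h[1-6]$/.test(l.type)) {
    return `<text class="${l.type}">${escapeText(l.text)}</text>`;
  }
  if (l.type === "link") {
    return `<navigator class="link" url="${escapeAttr(l.href)}">${escapeText(l.text)}</navigator>`;
  }
  if (l.type === "img") {
    return `<image class="pic" mode="widthFix" src="${escapeAttr(l.src)}" />`;
  }
  if (l.type === "text" && Array.isArray(l.child)) {
    return renderInline(l.child);
  }
  return "";
}

// Wraps every line in a <view class="line"> element.
function toWxml(lines) {
  return lines.map((l) => `<view class="line">${renderLine(l)}</view>`).join("\n");
}

console.log(toWxml([
  { type: "h2", text: "What is a Functor" },
  { type: "text", child: [{ type: "bold", text: "functor" }, { type: "text", text: "&nbsp;" }] },
]));
```

Lists and code blocks would follow the same pattern, with one extra level of nesting for the code lines.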