Description
Since chumsky supports imperative-style parsing through custom, a custom parser that cannot look back at already-consumed input is arguably not fully imperative.
I found slice(), but I am not able to construct the input parameters it needs. The idea would be:
- Starting from the current cursor location, find the byte boundary of the previous UTF-8 char.
- Create the range covering the previous character we want to look back at.
- Use InputRef::slice() with that range to look back.
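The first two steps can be sketched on a plain `&str`, outside chumsky. This is only an illustration under the assumption that the cursor is a byte offset into the string; `prev_char_range` is a hypothetical helper name, and the final slicing uses ordinary string indexing rather than InputRef::slice().

```rust
use std::ops::Range;

/// Byte range of the character that ends at `cursor`, or `None` at the start
/// of the input or when `cursor` is not on a character boundary.
/// (Hypothetical helper, not part of chumsky.)
fn prev_char_range(input: &str, cursor: usize) -> Option<Range<usize>> {
    if cursor == 0 || cursor > input.len() || !input.is_char_boundary(cursor) {
        return None;
    }
    // A UTF-8 character is at most 4 bytes long, so the start of the previous
    // character is at most 4 bytes behind the cursor.
    let lowest = cursor.saturating_sub(4);
    let start = (lowest..cursor).rev().find(|&i| input.is_char_boundary(i))?;
    Some(start..cursor)
}

fn main() {
    let input = "a€^{2}";
    // 'a' takes 1 byte and '€' takes 3 bytes, so '^' sits at byte offset 4.
    let cursor = input.find('^').unwrap();
    let range = prev_char_range(input, cursor).unwrap();
    println!("{:?} -> {:?}", range.clone(), &input[range]);
}
```

With a range like this in hand, the only missing piece is a way to turn the two byte offsets into the cursor values that slice() expects.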
Background: I am parsing org-mode files, where the previous char is used to validate a pattern. See https://orgmode.org/worg/org-syntax.html:
In a few cases, an instance of an element or object must be preceded or succeeded by a certain pattern, which is not itself part of the element or object. These patterns are specified using the PRE and POST tokens respectively, like so:
Methods tried:
- any().then(parser): this does not work. Parsing depends on the previous character (PRE), but PRE is not part of the current object and may belong to the previous object, so parsing with pre.then(object) is not feasible. For example, in [[https://a.b][foo]]^{2}, the "]" before "^" is part of the Link in the AST.
- Maintaining the PRE (previous char) state manually: this works, but the code is very messy. Every sub-parser implementation has to decide whether to update PRE, RollbackState has to be used so that backtracking does not corrupt the state, and performance is poor.
- An imperative-style parser built with custom: the problem is that chumsky does not seem to offer a way to look back at a character that has already been consumed.
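To make the look-behind requirement concrete, here is a minimal sketch on a plain `&str`, independent of chumsky. It assumes the cursor is a byte offset. `prev_char` and `pre_is_non_whitespace` are hypothetical names, and the "previous char must be non-whitespace" rule is only an assumed stand-in for a real PRE condition.

```rust
/// Returns the character immediately before byte offset `cursor`, if any.
/// (Hypothetical helper, not part of chumsky.)
fn prev_char(input: &str, cursor: usize) -> Option<char> {
    // `get` returns None when `cursor` is out of range or not on a char
    // boundary, so no unsafe code is needed.
    input.get(..cursor)?.chars().next_back()
}

/// Assumed PRE rule, for illustration only: a previous character must exist
/// and must not be whitespace.
fn pre_is_non_whitespace(prev: Option<char>) -> bool {
    matches!(prev, Some(c) if !c.is_whitespace())
}

fn main() {
    let input = "[[https://a.b][foo]]^{2}";
    let caret = input.find('^').unwrap();

    // The "]" belongs to the link, yet it can still be inspected here
    // without being consumed a second time.
    let prev = prev_char(input, caret);
    println!("prev = {:?}, PRE ok = {}", prev, pre_is_non_whitespace(prev));
}
```

The check reads the previous character without consuming it, which is the property that pre.then(object) cannot provide.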
I can't find any chumsky API that returns the previous char, but the logic is simple:
- Starting from the cursor, find the boundary of the previous UTF-8 char, i.e. a byte that starts a character and matches one of the patterns 0zzz_zzzz / 110y_yyyy / 1110_xxxx / 1111_0www.
- Since a UTF-8 char is at most 4 bytes long, we look back at most 4 bytes.
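The boundary test behind these patterns can be sketched as follows. A byte starts a character exactly when it is not a continuation byte of the form 10xx_xxxx. The function names are hypothetical, and the signed comparison is the same trick used in the code further down.

```rust
/// True when `b` can start a UTF-8 character, i.e. it is not a
/// continuation byte of the form 10xx_xxxx.
fn is_char_start(b: u8) -> bool {
    // Reinterpreted as i8, continuation bytes 0x80..=0xBF map to -128..=-65,
    // so everything >= -0x40 (-64) is a start byte.
    (b as i8) >= -0x40
}

/// The same test written without the signed-integer trick.
fn is_char_start_plain(b: u8) -> bool {
    b < 0x80 || b >= 0xC0
}

fn main() {
    // Both formulations agree on every possible byte value.
    for b in 0u8..=255 {
        assert_eq!(is_char_start(b), is_char_start_plain(b));
    }

    // '€' is encoded as E2 82 AC: one start byte, two continuation bytes.
    let flags: Vec<bool> = "€".bytes().map(is_char_start).collect();
    println!("{:?}", flags);
}
```

In a valid `&str`, scanning backwards from the cursor until this test succeeds therefore stops within 4 bytes.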
In my code, the final parser is prev_valid_parser, implemented with custom(). To make it work, I changed one line of the InputRef struct in input.rs in my local copy of the chumsky crate:
/// Internal type representing an input as well as all the necessary context for parsing.
pub struct InputRef<'src, 'parse, I: Input<'src>, E: ParserExtra<'src, I>> {
    cursor: I::Cursor,
    // pub(crate) cache: &'parse mut I::Cache,
    pub cache: &'parse mut I::Cache, // <--------------------------------
    pub(crate) errors: &'parse mut Errors<I::Cursor, E::Error>,
    pub(crate) state: &'parse mut E::State,
    pub(crate) ctx: &'parse E::Context,
    #[cfg(feature = "memoization")]
    pub(crate) memos: &'parse mut HashMap<(usize, usize), Option<Located<I::Cursor, E::Error>>>,
}

pub trait PrevInput<'src>: Input<'src> {
    unsafe fn prev(cache: &mut Self::Cache, cursor: &Self::Cursor) -> Option<Self::Token>;
}
impl<'src> PrevInput<'src> for &'src str {
    #[inline(always)]
    unsafe fn prev(this: &mut Self::Cache, cursor: &Self::Cursor) -> Option<Self::Token> {
        let idx_byte_current = *cursor;
        let mut prev_char = None;
        for i in 1..5 {
            if idx_byte_current < i { // at the start of Self::Cache
                break;
            }
            let idx_byte = idx_byte_current - i;
            // from is_utf8_char_boundary()
            // This is bit magic equivalent to: b < 128 || b >= 192
            if ((this.as_bytes()[idx_byte]) as i8) >= -0x40 {
                let c = this.get_unchecked(idx_byte..)
                    .chars()
                    .next()
                    .unwrap_unchecked();
                prev_char = Some(c);
                break;
            }
        }
        prev_char
    }
}
pub trait PrevInputRef<'src, I> {
    fn prev(&mut self) -> Option<I::Token> where I: PrevInput<'src>;
}

impl<'src, 'parse, I: Input<'src>, E: extra::ParserExtra<'src, I>> PrevInputRef<'src, I> for chumsky::input::InputRef<'src, 'parse, I, E> {
    #[inline(always)]
    fn prev(&mut self) -> Option<I::Token>
    where
        I: PrevInput<'src>,
    {
        // Bind the cursor to a local first: borrowing from the temporary
        // returned by `self.cursor()` fails with E0716.
        let binding = self.cursor();
        let a = binding.inner();
        let token = unsafe { I::prev(self.cache, a) };
        token
    }
}
// Validate the previous char using `f`.
pub(crate) fn prev_valid_parser<'a, C: 'a, F: Fn(Option<char>) -> bool + Clone>(
    f: F
) -> impl Parser<'a, &'a str, (), MyExtra<'a, C>> + Clone {
    custom(move |inp| {
        let before = inp.cursor();
        let maybe_prev = inp.prev();
        if f(maybe_prev) {
            Ok(())
        } else {
            Err(Rich::custom(
                inp.span_since(&before),
                format!("invalid PRE: {maybe_prev:?}"),
            ))
        }
    })
}
I know that making cache public in InputRef is not a good idea. Could an API be added that lets InputRef look back? By the way, I am not able to create a Cursor to use with InputRef::slice().
Maybe something like Cursor::new(cursor: usize), so that the slice() API could be used?