
Suggestion: InputRef should have a public API to visit the raw input, such as full_slice() #943

@cnglen


Since chumsky supports an imperative style through custom, a custom parser that cannot look back at already-consumed input is arguably not fully imperative.

I found slice(), but I am not able to construct the arguments it needs. What I want to do is:

  • Using the current cursor location, find the byte boundary of the previous UTF-8 char.
  • Create the range covering the previous character we want to look back at.
  • Use InputRef::slice() with that range to look back.
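For reference, the three steps above can be sketched on a plain &str, independent of chumsky. This is only an illustration of the intended logic: the names prev_char_range and prev_char are hypothetical and not part of any chumsky API, and the cursor is assumed to be a byte offset into the input.

```rust
use std::ops::Range;

/// Byte range of the character that ends at `cursor` (a byte offset).
/// Returns `None` at the start of the input or when `cursor` is not
/// on a char boundary. Hypothetical helper, not a chumsky API.
fn prev_char_range(input: &str, cursor: usize) -> Option<Range<usize>> {
    if cursor == 0 || !input.is_char_boundary(cursor) {
        return None;
    }
    // Step 1: walk back to the previous char boundary. A UTF-8 char is at
    // most 4 bytes long, so this loop runs at most 3 extra times, and
    // offset 0 is always a boundary, so it terminates.
    let mut start = cursor - 1;
    while !input.is_char_boundary(start) {
        start -= 1;
    }
    // Step 2: the range covering the previous character.
    Some(start..cursor)
}

/// Step 3: slice with that range and decode the single character in it.
fn prev_char(input: &str, cursor: usize) -> Option<char> {
    let range = prev_char_range(input, cursor)?;
    input[range].chars().next()
}
```

With the input [[https://a.b][foo]]^{2} and the cursor placed at the byte offset of "^", prev_char returns Some(']').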

Background: parsing org-mode files

The previous char is used to validate a pattern. See https://orgmode.org/worg/org-syntax.html

In a few cases, an instance of an element or object must be preceded or succeeded by a certain pattern, which is not itself part of the element or object. These patterns are specified using the PRE and POST tokens respectively, like so:

Methods tried:

  1. any().then(parser): this has problems and is not valid. For example:
  • Parsing relies on the previous character (PRE). Note that PRE is not part of the current object and may belong to the previous object, so parsing with pre.then(object) is not feasible. For example, in [[https://a.b][foo]]^{2}, the "]" before "^" is part of the Link in the AST.
  2. Maintain the state of PRE (the previous char) manually and very carefully:
  • This method works, but the code is very dirty: every sub-parser implementation has to decide whether to update PRE.
  • RollbackState has to be used to avoid corrupting the state when backtracking.
  • Performance is slow.
  3. Use an imperative-style parser via custom. The problem is that, in chumsky's implementation, there seems to be no way to look back at a previous character that has already been consumed.

I can't find any chumsky API to get the previous char, but the logic is simple.

  • Starting from the cursor, scan backwards for the first byte of the previous UTF-8 char, which matches one of the patterns 0zzz_zzzz / 110y_yyyy / 1110_xxxx / 1111_0www (continuation bytes are 10xx_xxxx).
  • Since a UTF-8 char has at most 4 bytes, we look back at most 4 bytes.
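The byte-pattern test can be written out in isolation. The sketch below uses hypothetical helper names and assumes the input is valid UTF-8 given as raw bytes. It shows the mask form of the check, the signed-comparison form used further down, and the bounded lookback of at most 4 bytes.

```rust
/// True when `b` starts a UTF-8 sequence (ASCII or a lead byte), i.e. it is
/// not a continuation byte of the form 10xx_xxxx.
fn is_sequence_start(b: u8) -> bool {
    (b & 0b1100_0000) != 0b1000_0000
}

/// The same test as a signed comparison: equivalent to b < 128 || b >= 192.
fn is_sequence_start_signed(b: u8) -> bool {
    (b as i8) >= -0x40
}

/// Length in bytes of the character that ends at `cursor`, found by looking
/// back at most 4 bytes. Returns `None` at the start of the input, when
/// `cursor` is out of range, or when no sequence start is found in range.
fn prev_char_len(bytes: &[u8], cursor: usize) -> Option<usize> {
    if cursor > bytes.len() {
        return None;
    }
    (1..=4usize)
        .take_while(|&i| i <= cursor) // never step past the start
        .find(|&i| is_sequence_start(bytes[cursor - i]))
}
```

The two predicates agree on all 256 byte values, so the signed form is purely a micro-optimization. On valid UTF-8 the lookback always succeeds within 4 steps unless the cursor is at offset 0.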

In my code, the final parser is prev_valid_parser, implemented with custom(). To make it work, I changed one line of the InputRef struct in input.rs in my local copy of the chumsky crate:

/// Internal type representing an input as well as all the necessary context for parsing.
pub struct InputRef<'src, 'parse, I: Input<'src>, E: ParserExtra<'src, I>> {
    cursor: I::Cursor,
    // pub(crate) cache: &'parse mut I::Cache,
    pub cache: &'parse mut I::Cache, // <--------------------------------
    
    pub(crate) errors: &'parse mut Errors<I::Cursor, E::Error>,
    pub(crate) state: &'parse mut E::State,
    pub(crate) ctx: &'parse E::Context,
    #[cfg(feature = "memoization")]
    pub(crate) memos: &'parse mut HashMap<(usize, usize), Option<Located<I::Cursor, E::Error>>>,
}
pub trait PrevInput<'src>: Input<'src> {
    unsafe fn prev(cache: &mut Self::Cache, cursor: & Self::Cursor) -> Option<Self::Token>;
}

impl<'src> PrevInput<'src> for &'src str {
    #[inline(always)]
    unsafe fn prev(this: &mut Self::Cache, cursor: & Self::Cursor) -> Option<Self::Token> {
        let idx_byte_current = *cursor;
        let mut prev_char = None;
        for i in 1..5 {
            if idx_byte_current<i { // at the start of Self::Cache
                break;
            }
            let idx_byte = idx_byte_current-i;

            // from is_utf8_char_boundary() 
            // This is bit magic equivalent to: b < 128 || b >= 192
            if ((this.as_bytes()[idx_byte]) as i8) >= -0x40 {
                let c = this.get_unchecked(idx_byte..)
                    .chars()
                    .next()
                    .unwrap_unchecked();
                prev_char = Some(c);
                break;
            }
        }
        prev_char
    }
}

pub trait PrevInputRef<'src, I> {
    fn prev(&mut self) -> Option<I::Token> where I: PrevInput<'src>;
}

impl<'src, 'parse, I: Input<'src>, E: extra::ParserExtra<'src, I>> PrevInputRef<'src, I> for  chumsky::input::InputRef<'src, 'parse, I, E> {
    #[inline(always)]
    fn prev(&mut self) -> Option<I::Token>
    where
        I: PrevInput<'src>,
    {
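        // Bind the cursor to a local first; borrowing from the temporary
        // returned by `self.cursor()` directly fails with E0716.
        let binding = self.cursor();
        let a = binding.inner();
        // SAFETY: the `&str` impl assumes the cursor is a char boundary inside the cache.
        unsafe { I::prev(self.cache, a) }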
    }
}

// Validate the previous char using `f`; consumes no input.
pub(crate) fn prev_valid_parser<'a, C: 'a, F: Fn(Option<char>)->bool + Clone>(
    f: F
) -> impl Parser<'a, &'a str, (), MyExtra<'a, C>> + Clone {
    custom(move |inp| {
        let before = inp.cursor();
        let maybe_prev = inp.prev();
        if f(maybe_prev) {
            Ok(())
        } else {
            Err(Rich::custom(
                inp.span_since(&before),
                format!("invalid PRE: {maybe_prev:?}"),
            ))            
        }
    })
}

I know that making cache public in InputRef is not a good idea. Can we add an API that lets InputRef look back? By the way, I'm not able to create a Cursor to use with InputRef::slice().

Maybe something like Cursor::new(cursor: usize), so that we can use the slice() API?
